index out of range error when corpus size is in thousands #57

Description

@manas007

Hi, thank you for your work on this.
I run into an index out of range error when my corpus is around 1k-2k documents.

I have precomputed embeddings stored in my data.

import numpy as np
import umap
from toponymy import ToponymyClusterer

# Stack the precomputed per-document embeddings into one array
document_vectors = np.stack(df['EMBEDDINGS'].values)
# Reduce the embeddings to a low-dimensional map used for clustering
document_map = umap.UMAP(metric='cosine').fit_transform(document_vectors)

clusterer = ToponymyClusterer(min_clusters=6)
clusterer.fit(clusterable_vectors=document_map, embedding_vectors=document_vectors)
for i, layer in enumerate(clusterer.cluster_layers_):
    print(f'{len(np.unique(layer.cluster_labels))-1} clusters in layer {i}')

Output:
111 clusters in layer 0
36 clusters in layer 1
9 clusters in layer 2

So far, all good.
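
A side note on the cluster counts printed above: the `- 1` presumably accounts for a noise label of `-1` among the cluster labels, which is an assumption on my part and only correct for layers that actually contain that label. A minimal sketch of a count that does not depend on it, using a hypothetical helper `count_clusters` (not part of toponymy) and plain Python lists standing in for `layer.cluster_labels`:

```python
def count_clusters(labels, noise_label=-1):
    """Count distinct cluster labels, ignoring the noise label if it occurs.

    Hypothetical helper for illustration; the noise label of -1 is an assumption.
    """
    distinct = {int(label) for label in labels}
    distinct.discard(noise_label)  # no error if the noise label is absent
    return len(distinct)


# Made-up label vectors, for illustration only.
with_noise = [0, 1, 1, -1, 2, -1]
without_noise = [0, 1, 1, 2]
print(count_clusters(with_noise))
print(count_clusters(without_noise))
```

Both calls report three clusters, whereas subtracting one unconditionally would undercount the second list.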

from sentence_transformers import SentenceTransformer
from toponymy import Toponymy, KeyphraseBuilder
from toponymy.llm_wrappers import HuggingFace

embedding_model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")
llm = HuggingFace("Qwen/Qwen2.5-1.5B-Instruct")

text = df['TITLE_TEXT'].values

topic_model = Toponymy(
    llm_wrapper=llm,
    text_embedding_model=embedding_model,
    clusterer=clusterer,
)
topic_model.fit(text, document_vectors, document_map)

topic_names = topic_model.topic_names_

IndexError: list index out of range
File , line 8
1 text = df['TITLE_TEXT'].values
3 topic_model = Toponymy(
4 llm_wrapper=llm,
5 text_embedding_model=embedding_model,
6 clusterer=clusterer,
7 )
----> 8 topic_model.fit(text, document_vectors, document_map)
10 topic_names = topic_model.topic_names_
11 topics_per_document = [cluster_layer.topic_name_vector for cluster_layer in topic_model.cluster_layers_]
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-bf30105a-4657-4047-8f9f-0e0ae989b858/lib/python3.11/site-packages/toponymy/cluster_layer.py:229, in ClusterLayer._update_topic_names(self, new_topic_names, topic_indices)
225 """
226 Update the topic names for the specified indices.
227 """
228 for i, topic_index in enumerate(topic_indices):
--> 229 self.topic_names[topic_index] = new_topic_names[i]
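
The failing line indexes two lists at once, so the IndexError could come from either side: `new_topic_names` may hold fewer entries than `topic_indices`, or a `topic_index` may fall outside `self.topic_names`. The traceback alone does not show which one applies here. A standalone sketch of the same loop with explicit checks that report which case occurred (`update_topic_names_checked` is a hypothetical helper, not toponymy's code):

```python
def update_topic_names_checked(topic_names, new_topic_names, topic_indices):
    """Mirror the loop from the traceback, but fail with a descriptive message.

    Hypothetical diagnostic helper; not part of toponymy.
    """
    # Case 1: fewer new names than indices to update.
    if len(new_topic_names) < len(topic_indices):
        raise ValueError(
            f"got {len(new_topic_names)} new names for {len(topic_indices)} indices"
        )
    for i, topic_index in enumerate(topic_indices):
        # Case 2: an index that does not exist in the current name list.
        # Negative indices are rejected here, unlike plain list indexing.
        if not 0 <= topic_index < len(topic_names):
            raise ValueError(
                f"topic index {topic_index} is outside a list of length {len(topic_names)}"
            )
        topic_names[topic_index] = new_topic_names[i]
    return topic_names


# Made-up data, for illustration only.
names = ["a", "b", "c"]
print(update_topic_names_checked(names, ["x"], [1]))
```

Running the same checks on the real `topic_names`, `new_topic_names`, and `topic_indices` (for example from a debugger at line 229) should show which list is too short.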

I am running this on Databricks with the following cluster configuration:
Databricks runtime: 15.4 LTS (includes Apache Spark 3.5.0, Scala 2.12)
Node type: Standard_DS3_v2
