Hi, thank you for your work on this.
I run into an "IndexError: list index out of range" when my corpus is around 1k-2k documents.
I have precomputed embeddings stored in my data.
import numpy as np
import umap
from toponymy import ToponymyClusterer

# df is my DataFrame; each row of the EMBEDDINGS column holds one precomputed vector
document_vectors = np.stack(df['EMBEDDINGS'].values)
document_map = umap.UMAP(metric='cosine').fit_transform(document_vectors)

clusterer = ToponymyClusterer(min_clusters=6)
clusterer.fit(clusterable_vectors=document_map, embedding_vectors=document_vectors)
for i, layer in enumerate(clusterer.cluster_layers_):
    # subtract 1 so the noise label is not counted as a cluster
    print(f'{len(np.unique(layer.cluster_labels))-1} clusters in layer {i}')
Output:
111 clusters in layer 0
36 clusters in layer 1
9 clusters in layer 2
So far, all good.
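One caveat on the counts above: subtracting 1 from the number of unique labels assumes every layer contains at least one point with the noise label -1. A minimal sketch of a count that does not depend on that assumption is below; count_clusters is a hypothetical helper written for this post, not part of toponymy, and the label lists are made-up examples.

```python
def count_clusters(labels):
    """Count distinct cluster labels, ignoring the noise label -1 if present."""
    # Convert to plain ints so this works for lists and NumPy arrays alike.
    distinct = {int(label) for label in labels}
    return len(distinct - {-1})


# Made-up label vectors for illustration only.
labels_with_noise = [-1, 0, 0, 1, 2, -1]
labels_without_noise = [0, 0, 1, 2]

print(count_clusters(labels_with_noise))
print(count_clusters(labels_without_noise))

# The "unique minus one" shortcut undercounts when no point is labelled -1.
print(len(set(labels_without_noise)) - 1)
```

If a layer has no noise points, the shortcut reports one cluster fewer than actually exist, which matters when comparing these counts with the number of topic names produced later.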
from sentence_transformers import SentenceTransformer
from toponymy import Toponymy, KeyphraseBuilder
from toponymy.llm_wrappers import HuggingFace
embedding_model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")
llm = HuggingFace("Qwen/Qwen2.5-1.5B-Instruct")
text = df['TITLE_TEXT'].values
topic_model = Toponymy(
    llm_wrapper=llm,
    text_embedding_model=embedding_model,
    clusterer=clusterer,
)
topic_model.fit(text, document_vectors, document_map)
topic_names = topic_model.topic_names_
IndexError: list index out of range
File , line 8
1 text = df['TITLE_TEXT'].values
3 topic_model = Toponymy(
4 llm_wrapper=llm,
5 text_embedding_model=embedding_model,
6 clusterer=clusterer,
7 )
----> 8 topic_model.fit(text, document_vectors, document_map)
10 topic_names = topic_model.topic_names_
11 topics_per_document = [cluster_layer.topic_name_vector for cluster_layer in topic_model.cluster_layers_]
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-bf30105a-4657-4047-8f9f-0e0ae989b858/lib/python3.11/site-packages/toponymy/cluster_layer.py:229, in ClusterLayer._update_topic_names(self, new_topic_names, topic_indices)
225 """
226 Update the topic names for the specified indices.
227 """
228 for i, topic_index in enumerate(topic_indices):
--> 229 self.topic_names[topic_index] = new_topic_names[i]
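For anyone triaging this: the failing line can raise this IndexError in two ways. Either new_topic_names has fewer entries than topic_indices, or one of the topic_indices falls outside topic_names. I have not confirmed which one applies here. The sketch below mirrors the loop from the traceback on plain lists and adds checks that distinguish the two cases; checked_update_topic_names is a hypothetical stand-alone function for illustration, not toponymy's implementation.

```python
def checked_update_topic_names(topic_names, new_topic_names, topic_indices):
    """Same assignment loop as in the traceback, with explicit checks up front."""
    # Case 1: fewer (or more) new names than indices to update.
    if len(new_topic_names) != len(topic_indices):
        raise ValueError(
            f"got {len(new_topic_names)} new names "
            f"for {len(topic_indices)} topic indices"
        )
    # Case 2: an index that does not exist in topic_names.
    n = len(topic_names)
    out_of_range = [idx for idx in topic_indices if not -n <= idx < n]
    if out_of_range:
        raise ValueError(
            f"indices {out_of_range} are out of range for {n} topic names"
        )
    for i, topic_index in enumerate(topic_indices):
        topic_names[topic_index] = new_topic_names[i]
    return topic_names


# Made-up example data.
names = ["a", "b", "c"]
print(checked_update_topic_names(names, ["x"], [1]))

try:
    # Two indices but only one new name: the unchecked loop would raise IndexError.
    checked_update_topic_names(names, ["y"], [0, 2])
except ValueError as err:
    print(err)
```

Logging the two lengths just before the failing line in a local copy of cluster_layer.py should show which case is being hit.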
I am running this on Databricks with the following cluster config:
Databricks runtime: 15.4 LTS (includes Apache Spark 3.5.0, Scala 2.12)
Node type: Standard_DS3_v2