This repository was archived by the owner on Jun 11, 2025. It is now read-only.

Feature/cohere embedding #2

Open
AndreasThinks wants to merge 5 commits into main from feature/cohere_embedding

Conversation

@AndreasThinks
Contributor

No description provided.

@alexmoore-iai (Contributor) left a comment


Looks good. I have added a couple of comments with questions to better understand the process. I'm probably not the best person to approve this, but I'm happy to do so if you want to get this merged.

Comment thread: core_utils.py

```python
added_docs += len(batch)
print("added batch", i + 1)
break
except Exception:
```


Out of interest, is there one particular type of error that causes this process to fail?
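For context on the question, the excerpt above looks like a per-batch retry loop. Here is a minimal, hedged sketch of that pattern; the names (`vectorstore`, `batches`, `MAX_RETRIES`) and the logging of the exception type are assumptions for illustration, not the PR's actual code:

```python
import time

MAX_RETRIES = 3  # assumed retry budget, not from the PR

def add_batches(vectorstore, batches):
    """Add document batches, retrying each batch on transient failures."""
    added_docs = 0
    for i, batch in enumerate(batches):
        for attempt in range(MAX_RETRIES):
            try:
                vectorstore.add_documents(batch)
                added_docs += len(batch)
                print("added batch", i + 1)
                break
            except Exception as exc:
                # Logging the error type would answer the question above:
                # e.g. rate-limit errors vs. connection resets.
                print(f"batch {i + 1} failed ({type(exc).__name__}), retrying")
                time.sleep(2 ** attempt)  # simple exponential backoff
    return added_docs
```

A bare `except Exception` hides which failure mode dominates; naming the exception type in the log (as sketched) would make that visible.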

Comment thread: core_utils.py

```python
def delete_duplicate_urls_from_store(vectorstore):
    """Looks for duplicate source urls in the Opensearch vectorstore, and removes them, keeping only the most recent based on metadata.time_scraped"""

def delete_duplicate_chunks_from_store(vectorstore):
    """Looks for duplicate source urls and text chunks in the Opensearch vectorstore, and removes them, keeping only the most recent based on metadata.time_scraped"""
```

Am I right in thinking the process follows these steps:

  • scrape website
  • upload new scrape to vectorstore
  • check for matching chunks in vector store
  • delete old chunks if a new chunk has been found.

Are the errors you encounter when uploading documents to the vector store the main reason you don't delete all entries from the old scrape of the same website before uploading the new scrape?
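The "keep only the most recent per source URL" step described in the docstrings above can be sketched in plain Python. This is a hedged illustration only: the hit structure (`id`, `source`, `time_scraped` keys) is an assumption, and the actual functions query OpenSearch rather than operate on an in-memory list:

```python
from collections import defaultdict

def find_ids_to_delete(hits):
    """Return ids of all but the most recently scraped doc per source URL.

    `hits` is assumed to be a list of dicts like
    {"id": ..., "source": ..., "time_scraped": ...} (illustrative shape).
    """
    by_url = defaultdict(list)
    for hit in hits:
        by_url[hit["source"]].append(hit)

    to_delete = []
    for docs in by_url.values():
        # Newest first; everything after the first entry is a stale duplicate.
        docs.sort(key=lambda d: d["time_scraped"], reverse=True)
        to_delete.extend(d["id"] for d in docs[1:])
    return to_delete
```

The real implementation would presumably feed these ids into a bulk delete against the OpenSearch index.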

Comment thread: chunking.py

```python
# remove markdown index links on all the content
document.page_content = remove_markdown_index_links(document.page_content)

split_long_document_list = text_splitter.split_documents(list_of_too_long_docs)
```

I think you may not need to separate the long and short documents. If you set the chunk size parameter in RecursiveCharacterTextSplitter relative to max_tokens, then all the shorter docs will just be skipped. Although this would require you to apply remove_markdown_index_links to all documents.
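The suggestion above can be sketched as a single pass over all documents. This is an illustrative stand-in, not the PR's code: `remove_markdown_index_links` is stubbed out, the chars-per-token ratio is an assumed heuristic, and a naive fixed-size split stands in for RecursiveCharacterTextSplitter:

```python
def remove_markdown_index_links(text):
    # Stub for the PR's helper; the real one strips markdown index links.
    return text

def split_all(documents, max_tokens, chars_per_token=4):
    """Clean and split every document in one pass.

    Documents shorter than the chunk size come back as a single chunk,
    so there is no need to pre-sort documents into long and short lists.
    """
    chunk_size = max_tokens * chars_per_token  # rough token-to-char estimate
    chunks = []
    for text in documents:
        text = remove_markdown_index_links(text)
        # Naive fixed-size split; the PR uses RecursiveCharacterTextSplitter,
        # which additionally respects separators like paragraphs and sentences.
        chunks.extend(text[i:i + chunk_size]
                      for i in range(0, len(text), chunk_size))
    return chunks
```

The trade-off the reviewer notes still holds: the link-removal step now runs on every document, not just the long ones.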
