Conversation
alexmoore-iai
left a comment
Looks good, I have added a couple of comments with questions to better understand the process. Probably not the best person to approve this but happy to do so if you want to get this merged.
        added_docs += len(batch)
        print("added batch", i + 1)
        break
    except Exception:
Out of interest, is there one particular type of error that causes this process to fail?
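If it's a transient transport error, narrowing the `except` clause would avoid masking real bugs. A minimal pure-Python sketch of the retry pattern in the quoted snippet — `add_batch` is a hypothetical stand-in for the vectorstore upload call, and `ConnectionError` is just an example of a narrower exception to catch:

```python
import time

def add_documents_with_retry(add_batch, batches, max_retries=3, delay=1.0):
    """Upload each batch, retrying only on the error type that
    actually fails in practice (ConnectionError here is a placeholder)."""
    added = 0
    for i, batch in enumerate(batches):
        for attempt in range(max_retries):
            try:
                add_batch(batch)
                added += len(batch)
                print("added batch", i + 1)
                break
            except ConnectionError:  # narrow this to the error you observe
                if attempt == max_retries - 1:
                    raise  # give up after the last attempt
                time.sleep(delay)
    return added
```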
def delete_duplicate_urls_from_store(vectorstore):
    """Looks for duplicate source urls in the Opensearch vectorstore, and removes them, keeping only the most recent based on metadata.time_scraped"""
def delete_duplicate_chunks_from_store(vectorstore):
    """Looks for duplicate source urls and text chunks in the Opensearch vectorstore, and removes them, keeping only the most recent based on metadata.time_scraped"""
Am I right in thinking the process follows these steps:
- scrape website
- upload new scrape to vectorstore
- check for matching chunks in vector store
- delete old chunks if a new chunk has been found.
Is the risk of errors while uploading documents to the vector store the main reason you don't simply delete all of the old scrape's entries for a website before uploading the new scrape?
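For reference, the dedup rule the docstrings describe (keep only the newest entry per url by `metadata.time_scraped`) could be sketched in pure Python like this — the `docs` mapping and the returned id list are illustrative stand-ins for the real OpenSearch queries and delete calls:

```python
def duplicate_ids_to_delete(docs):
    """Given {doc_id: {"url": ..., "time_scraped": ...}}, return the ids
    of duplicates to delete, keeping only the newest entry per url."""
    newest = {}  # url -> id of the most recent doc seen so far
    for doc_id, doc in docs.items():
        best = newest.get(doc["url"])
        if best is None or doc["time_scraped"] > docs[best]["time_scraped"]:
            newest[doc["url"]] = doc_id
    # everything that is not the newest entry for its url gets deleted
    return [doc_id for doc_id, doc in docs.items()
            if newest[doc["url"]] != doc_id]
```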
# remove markdown index links on all the content
document.page_content = remove_markdown_index_links(document.page_content)

split_long_document_list = text_splitter.split_documents(list_of_too_long_docs)
I think you may not need to separate the long and short documents. If you set the chunk size parameter in RecursiveCharacterTextSplitter relative to max_tokens, the shorter docs will simply pass through unsplit, although this would require you to apply remove_markdown_index_links to all documents.
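To illustrate the idea without pulling in the library, here's a toy character-based splitter showing why the long/short partition becomes unnecessary: documents already at or under the limit come back as a single chunk, so splitting everything is harmless (the real RecursiveCharacterTextSplitter behaves analogously, though it splits on separators rather than fixed offsets):

```python
def split_all(docs, max_len):
    """Split every document at max_len; short docs pass through unchanged.
    Toy stand-in for running one splitter over the whole document list."""
    chunks = []
    for text in docs:
        if len(text) <= max_len:
            chunks.append(text)  # short doc: single chunk, no special-casing
        else:
            chunks.extend(text[i:i + max_len]
                          for i in range(0, len(text), max_len))
    return chunks
```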