Conversation
alexmoore-iai
left a comment
Looks good, I have added a couple of comments with questions to better understand the process. Probably not the best person to approve this but happy to do so if you want to get this merged.
        added_docs += len(batch)
        print("added batch", i + 1)
        break
    except Exception:
Out of interest, is there one particular type of error that causes this process to fail?
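If it's a transient transport error, narrowing the `except` clause would avoid masking real bugs. A minimal pure-Python sketch of the retry pattern in the quoted snippet — `add_batch` is a hypothetical stand-in for the vectorstore upload call, and `ConnectionError` is just an example of a narrower exception to catch:

```python
import time

def add_documents_with_retry(add_batch, batches, max_retries=3, delay=1.0):
    """Upload each batch, retrying only on the error type that
    actually fails in practice (ConnectionError here is a placeholder)."""
    added = 0
    for i, batch in enumerate(batches):
        for attempt in range(max_retries):
            try:
                add_batch(batch)
                added += len(batch)
                print("added batch", i + 1)
                break
            except ConnectionError:  # narrow this to the error you observe
                if attempt == max_retries - 1:
                    raise  # give up after the last attempt
                time.sleep(delay)
    return added
```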
def delete_duplicate_urls_from_store(vectorstore):
    """Looks for duplicate source urls in the Opensearch vectorstore, and removes them, keeping only the most recent based on metadata.time_scraped"""
def delete_duplicate_chunks_from_store(vectorstore):
    """Looks for duplicate source urls and text chunks in the Opensearch vectorstore, and removes them, keeping only the most recent based on metadata.time_scraped"""
Am I right in thinking the process follows these steps:
- scrape website
- upload new scrape to vectorstore
- check for matching chunks in vector store
- delete old chunks if a new chunk has been found.
Is the risk of errors while uploading documents to the vector store the main reason you don't simply delete all of the old scrape's entries for a website before uploading the new scrape?
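For reference, the dedup rule the docstrings describe (keep only the newest entry per url by `metadata.time_scraped`) could be sketched in pure Python like this — the `docs` mapping and the returned id list are illustrative stand-ins for the real OpenSearch queries and delete calls:

```python
def duplicate_ids_to_delete(docs):
    """Given {doc_id: {"url": ..., "time_scraped": ...}}, return the ids
    of duplicates to delete, keeping only the newest entry per url."""
    newest = {}  # url -> id of the most recent doc seen so far
    for doc_id, doc in docs.items():
        best = newest.get(doc["url"])
        if best is None or doc["time_scraped"] > docs[best]["time_scraped"]:
            newest[doc["url"]] = doc_id
    # everything that is not the newest entry for its url gets deleted
    return [doc_id for doc_id, doc in docs.items()
            if newest[doc["url"]] != doc_id]
```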
# remove markdown index links on all the content
document.page_content = remove_markdown_index_links(document.page_content)

split_long_document_list = text_splitter.split_documents(list_of_too_long_docs)
I think you may not need to separate the long and short documents. If you set the chunk size parameter in RecursiveCharacterTextSplitter relative to max_tokens, the shorter docs will simply pass through unsplit, although this would require you to apply remove_markdown_index_links to all documents.
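To illustrate the idea without pulling in the library, here's a toy character-based splitter showing why the long/short partition becomes unnecessary: documents already at or under the limit come back as a single chunk, so splitting everything is harmless (the real RecursiveCharacterTextSplitter behaves analogously, though it splits on separators rather than fixed offsets):

```python
def split_all(docs, max_len):
    """Split every document at max_len; short docs pass through unchanged.
    Toy stand-in for running one splitter over the whole document list."""
    chunks = []
    for text in docs:
        if len(text) <= max_len:
            chunks.append(text)  # short doc: single chunk, no special-casing
        else:
            chunks.extend(text[i:i + max_len]
                          for i in range(0, len(text), max_len))
    return chunks
```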