Conversation
duncanjbrown
left a comment
I can't speak to the search methodology, but this looks great. We could probably evolve in the direction of having a scripted opensearch config step where we define search capabilities/pipelines, and push any site-specific scraping rules down into the scraper?
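A sketch of what that scripted config step could look like: the index mapping and the hybrid-search pipeline declared as data, pushed idempotently with the opensearch-py client. All names here (`INDEX_BODY`, `SEARCH_PIPELINE`, `apply_config`, the 384-dim embedding field, the 0.3/0.7 weights) are illustrative assumptions, not this PR's code.

```python
# Declarative OpenSearch setup: index mapping + hybrid search pipeline.
# (Illustrative sketch only; field names and weights are assumptions.)

INDEX_BODY = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "embedding": {"type": "knn_vector", "dimension": 384},
        }
    },
}

SEARCH_PIPELINE = {
    "phase_results_processors": [
        {
            "normalization-processor": {
                "normalization": {"technique": "min_max"},
                "combination": {
                    "technique": "arithmetic_mean",
                    # lexical vs. semantic weighting
                    "parameters": {"weights": [0.3, 0.7]},
                },
            }
        }
    ]
}


def apply_config(client, index_name: str, pipeline_name: str) -> None:
    """Idempotently push the declarative config to a cluster."""
    if not client.indices.exists(index=index_name):
        client.indices.create(index=index_name, body=INDEX_BODY)
    # The search-pipeline API isn't wrapped by every client version,
    # so go via the transport layer (as this PR already does elsewhere).
    client.transport.perform_request(
        "PUT", f"/_search/pipeline/{pipeline_name}", body=SEARCH_PIPELINE
    )
```

Run once at deploy time, and engines can then reference the pipeline by name instead of re-creating it on construction.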
```python
self.client.indices.create(index=self.index_name, body=index_body)
```

```python
async def async_bulk_upload(self, file_path: str, domain: str = "citizen-advice"):
    json_files = glob.glob(os.path.join(file_path, "scrape_result_*.json"))
```
Intermediate disk step: we could also stream out of the scraper!
That's a cool idea; we could get rid of the distinction between the scraper and document_manager and just stream everything.
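One way the streaming version could look: the scraper yields bulk actions directly, and something like `opensearchpy.helpers.async_streaming_bulk` consumes them, so documents never hit disk. This is a sketch; the document shape and the `to_bulk_actions` name are assumptions, not the PR's code.

```python
from typing import Iterable, Iterator


def to_bulk_actions(docs: Iterable[dict], index_name: str) -> Iterator[dict]:
    """Turn scraped documents directly into OpenSearch bulk actions.

    Each input dict is assumed to carry at least "url" and "text".
    """
    for doc in docs:
        yield {
            "_index": index_name,
            "_id": doc["url"],  # keying on URL de-duplicates re-scrapes
            "_source": {
                "text": doc["text"],
                "metadata": doc.get("metadata", {}),
            },
        }
```

Wired up as, roughly, `async for ok, item in async_streaming_bulk(client, to_bulk_actions(scraped_docs, index_name)): ...`, which replaces both the `scrape_result_*.json` dump and the glob-and-upload step.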
```python
# Citizen Advice Specific Logic to remove low quality docs
docs = [d for d in docs if d.metadata['markdown_length'] > 1000]
docs = [d for d in docs if "cymraeg" not in d.metadata['source']]
```
Being site-specific, this could be in the scraper in due course I guess?
100%, it's included here so we have a semi-permanent record of it (before the scraper refactor).
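If the rule does move into the scraper, it could become a named per-site predicate rather than inline list comprehensions. A hypothetical shape (the `Doc` class and function name are illustrative, not from the PR):

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for the scraped document type (assumed shape)."""
    metadata: dict = field(default_factory=dict)


def keep_citizen_advice(doc: Doc) -> bool:
    """Site rule: drop short pages and Welsh-language ('cymraeg') URLs."""
    return (
        doc.metadata["markdown_length"] > 1000
        and "cymraeg" not in doc.metadata["source"]
    )
```

The scraper would then hold a registry of such predicates per domain, keeping site knowledge out of the document manager.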
```python
}

try:
    self.client.transport.perform_request(
```
We're currently configuring this each time we construct an engine, but isn't this preliminary setup for the OpenSearch instance?
Definitely not wedded to where this happens. In this PR I tried to make a distinction between upload (document-management) and retrieval (query_engine), and this felt more closely related to the latter. The only benefit of keeping it in the query_engine is the ability to tweak hybrid search weightings at retrieval time. I agree that once we have stopped experimenting, it makes sense to move this to a separate configuration process.
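For what it's worth, retrieval-time weight tweaking doesn't require owning pipeline creation: OpenSearch accepts a temporary `search_pipeline` inside the search request body, so the query engine could set weights per query even after setup moves elsewhere. A sketch (the `hybrid`/`knn` query shape and field names are assumptions based on OpenSearch's hybrid-search feature, not this PR's code):

```python
def hybrid_query(text: str, vector: list, lexical_weight: float) -> dict:
    """Build a hybrid query whose lexical/semantic balance is set per request."""
    return {
        "query": {
            "hybrid": {
                "queries": [
                    {"match": {"text": {"query": text}}},
                    {"knn": {"embedding": {"vector": vector, "k": 10}}},
                ]
            }
        },
        # Temporary pipeline: overrides any named pipeline for this request only.
        "search_pipeline": {
            "phase_results_processors": [
                {
                    "normalization-processor": {
                        "normalization": {"technique": "min_max"},
                        "combination": {
                            "technique": "arithmetic_mean",
                            "parameters": {
                                "weights": [lexical_weight, 1 - lexical_weight]
                            },
                        },
                    }
                }
            ]
        },
    }
```

That would let the separate configuration process own the default pipeline while experiments keep their knob.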
Initial experiments around setting up OpenSearch for hybrid search. I have used the OpenSearch Python client. There is also a REST API, which I think might be better (or at least, not all features are available through the Python client). This includes:
This was all prototyped against a local Docker container running OpenSearch; to replicate, run:

```shell
docker run -p 9200:9200 -p 9600:9600 \
  -e "DISABLE_SECURITY_PLUGIN=true" \
  -e "discovery.type=single-node" \
  -e "OPENSEARCH_INITIAL_ADMIN_PASSWORD=Caddy_14211" \
  opensearchproject/opensearch:latest
```