This repository was archived by the owner on Jun 11, 2025. It is now read-only.

OpenSearch Hybrid Experiments#6

Open
alexmoore-iai wants to merge 3 commits into main from feature/opensearch

Conversation

@alexmoore-iai
Contributor

Initial experiments around setting up OpenSearch for hybrid search. I have used the OpenSearch Python client; there is also a REST API, which I think might be a better fit (at the least, not all features are available through the Python client). This includes:

  • A simple Python helper class to upload documents to OpenSearch. I create the embeddings before uploading (OpenSearch encourages you to create an ingestion pipeline that will create these automatically, but it was a bit tricky to set up with the Python client).
  • A Query class that implements the hybrid search methodology.
  • A simple Python script showing how this all fits together.

This was all prototyped with a local Docker container running OpenSearch; to replicate, run:
docker run -p 9200:9200 -p 9600:9600 -e "DISABLE_SECURITY_PLUGIN=true" -e "discovery.type=single-node" -e "OPENSEARCH_INITIAL_ADMIN_PASSWORD=Caddy_14211" opensearchproject/opensearch:latest
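For context on what the Query class builds, here is a minimal sketch (not the PR's actual code) of the request body OpenSearch expects for hybrid search: a "hybrid" query combining a lexical match sub-query with a k-NN vector sub-query. The field names ("text", "embedding") are illustrative assumptions.

```python
def build_hybrid_query(query_text, query_vector, k=10):
    """Combine a lexical (BM25 match) and a vector (knn) sub-query
    into one OpenSearch hybrid query body."""
    return {
        "size": k,
        "query": {
            "hybrid": {
                "queries": [
                    # Lexical sub-query over the text field
                    {"match": {"text": {"query": query_text}}},
                    # Vector sub-query over the embedding field
                    {"knn": {"embedding": {"vector": query_vector, "k": k}}},
                ]
            }
        },
    }

body = build_hybrid_query("universal credit", [0.1, 0.2, 0.3], k=5)
```

The body would then be passed to `client.search(...)` along with the search pipeline that normalises and combines the two score sets.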


@duncanjbrown duncanjbrown left a comment


I can't speak to the search methodology, but this looks great. We could probably evolve in the direction of having a scripted opensearch config step where we define search capabilities/pipelines, and push any site-specific scraping rules down into the scraper?

self.client.indices.create(index=self.index_name, body=index_body)

async def async_bulk_upload(self, file_path: str, domain: str = "citizen-advice"):
json_files = glob.glob(os.path.join(file_path, "scrape_result_*.json"))
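The snippet above globs `scrape_result_*.json` files off disk. A minimal sketch of turning those files into bulk index actions, assuming each file holds a JSON list of document dicts (the file layout and index name are assumptions; the action format matches what opensearch-py's bulk helpers consume):

```python
import glob
import json
import os
import tempfile

def iter_bulk_actions(file_path, index_name="docs", domain="citizen-advice"):
    """Yield bulk index actions from scrape_result_*.json files on disk."""
    for path in sorted(glob.glob(os.path.join(file_path, "scrape_result_*.json"))):
        with open(path) as f:
            for doc in json.load(f):
                # Tag each document with its domain before indexing
                yield {"_index": index_name, "_source": {**doc, "domain": domain}}

# Usage, with a temporary directory standing in for the scraper's output:
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "scrape_result_001.json"), "w") as f:
        json.dump([{"text": "hello", "metadata": {"source": "example"}}], f)
    actions = list(iter_bulk_actions(d))
```

A generator like this can be fed lazily to a bulk helper rather than materialising everything in memory first.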

intermediate disk step — we could also stream out of the scraper!

Contributor Author


That's a cool idea; we could get rid of the distinction between scraper/document_manager and just stream everything.
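The streaming idea above could look something like this: adapt the scraper's output into bulk actions lazily, with no intermediate disk step. The `scrape()` generator here is hypothetical, standing in for the real scraper.

```python
def stream_actions(scrape, index_name="docs"):
    """Lazily adapt a scraper generator into OpenSearch bulk actions,
    skipping the intermediate JSON-on-disk step."""
    for doc in scrape():
        yield {"_index": index_name, "_source": doc}

def fake_scrape():
    # Stand-in for the real scraper generator
    yield {"text": "page one"}
    yield {"text": "page two"}

actions = list(stream_actions(fake_scrape))
```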


# Citizen Advice Specific Logic to remove low quality docs
docs = [d for d in docs if d.metadata['markdown_length'] > 1000]
docs = [d for d in docs if "cymraeg" not in d.metadata['source']]
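The two list comprehensions above can be packaged as a testable function. A sketch, using a stand-in `Doc` dataclass for the real document type:

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    # Stand-in for the real document type; only metadata matters here
    metadata: dict = field(default_factory=dict)

def filter_citizen_advice(docs, min_markdown_length=1000):
    """Citizen Advice-specific filtering: drop short pages and
    Welsh-language ("cymraeg") pages."""
    docs = [d for d in docs if d.metadata["markdown_length"] > min_markdown_length]
    docs = [d for d in docs if "cymraeg" not in d.metadata["source"]]
    return docs

docs = [
    Doc({"markdown_length": 1500, "source": "/benefits"}),
    Doc({"markdown_length": 200, "source": "/debt"}),            # too short
    Doc({"markdown_length": 2000, "source": "/cymraeg/budd"}),   # Welsh page
]
kept = filter_citizen_advice(docs)
```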

Being site-specific, this could be in the scraper in due course I guess?

Contributor Author


100%, it's included so we have a semi-permanent record of it (before the scraper refactor).

}

try:
self.client.transport.perform_request(
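For readers without the full diff: hybrid search in OpenSearch needs a search pipeline with a normalization-processor to combine the lexical and vector scores, and that is the kind of definition a `perform_request` call like the one above would register. A hypothetical sketch of the pipeline body (the weights and techniques are illustrative, not the PR's actual values):

```python
def hybrid_pipeline_body(lexical_weight=0.3, vector_weight=0.7):
    """Build a search-pipeline definition that min-max normalises the
    lexical and vector sub-query scores, then combines them as a
    weighted arithmetic mean."""
    return {
        "description": "Normalise and combine lexical + vector scores",
        "phase_results_processors": [
            {
                "normalization-processor": {
                    "normalization": {"technique": "min_max"},
                    "combination": {
                        "technique": "arithmetic_mean",
                        "parameters": {"weights": [lexical_weight, vector_weight]},
                    },
                }
            }
        ],
    }

body = hybrid_pipeline_body()
# Registered against the cluster with something like:
# client.transport.perform_request("PUT", "/_search/pipeline/hybrid-pipeline", body=body)
```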

We're currently configuring this each time we construct an engine, but isn't this preliminary setup for the OpenSearch instance?

Contributor Author


Definitely not wedded to where this happens. In this PR I tried to make a distinction between upload (document-management) and retrieval (query_engine), and this felt more closely related to the latter. The only benefit to keeping it in the query_engine is the ability to tweak hybrid search weightings at retrieval time. I agree that once we have stopped experimenting, it makes sense to move this to a separate configuration process.
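One way the retrieval-time weight tweaking mentioned above could work is to re-register the search pipeline with new weights before querying. A hypothetical sketch returning the pieces of such a `perform_request` call (the pipeline id and processor shape are assumptions):

```python
def reweight_request(lexical_weight, vector_weight, pipeline_id="hybrid-pipeline"):
    """Return (method, url, body) for a perform_request call that
    re-registers the hybrid search pipeline with new score weights."""
    body = {
        "phase_results_processors": [
            {
                "normalization-processor": {
                    "normalization": {"technique": "min_max"},
                    "combination": {
                        "technique": "arithmetic_mean",
                        "parameters": {"weights": [lexical_weight, vector_weight]},
                    },
                }
            }
        ]
    }
    return "PUT", f"/_search/pipeline/{pipeline_id}", body

method, url, body = reweight_request(0.5, 0.5)
# client.transport.perform_request(method, url, body=body)
```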
