Conversation
duncanjbrown
left a comment
I can't speak to the search methodology, but this looks great. We could probably evolve in the direction of having a scripted opensearch config step where we define search capabilities/pipelines, and push any site-specific scraping rules down into the scraper?
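A sketch of what that scripted config step could look like: the index mapping and the hybrid-search pipeline declared as data, pushed idempotently with the opensearch-py client. All names here (`INDEX_BODY`, `SEARCH_PIPELINE`, `apply_config`, the 384-dim embedding field, the 0.3/0.7 weights) are illustrative assumptions, not this PR's code.

```python
# Declarative OpenSearch setup: index mapping + hybrid search pipeline.
# (Illustrative sketch only; field names and weights are assumptions.)

INDEX_BODY = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "embedding": {"type": "knn_vector", "dimension": 384},
        }
    },
}

SEARCH_PIPELINE = {
    "phase_results_processors": [
        {
            "normalization-processor": {
                "normalization": {"technique": "min_max"},
                "combination": {
                    "technique": "arithmetic_mean",
                    # lexical vs. semantic weighting
                    "parameters": {"weights": [0.3, 0.7]},
                },
            }
        }
    ]
}


def apply_config(client, index_name: str, pipeline_name: str) -> None:
    """Idempotently push the declarative config to a cluster."""
    if not client.indices.exists(index=index_name):
        client.indices.create(index=index_name, body=INDEX_BODY)
    # The search-pipeline API isn't wrapped by every client version,
    # so go via the transport layer (as this PR already does elsewhere).
    client.transport.perform_request(
        "PUT", f"/_search/pipeline/{pipeline_name}", body=SEARCH_PIPELINE
    )
```

Run once at deploy time, and engines can then reference the pipeline by name instead of re-creating it on construction.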
```python
self.client.indices.create(index=self.index_name, body=index_body)
```

```python
async def async_bulk_upload(self, file_path: str, domain: str = "citizen-advice"):
    json_files = glob.glob(os.path.join(file_path, "scrape_result_*.json"))
```
Intermediate disk step: we could also stream out of the scraper!
That's a cool idea; we could get rid of the distinction between the scraper and document_manager and just stream everything.
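One way the streaming version could look: the scraper yields bulk actions directly, and something like `opensearchpy.helpers.async_streaming_bulk` consumes them, so documents never hit disk. This is a sketch; the document shape and the `to_bulk_actions` name are assumptions, not the PR's code.

```python
from typing import Iterable, Iterator


def to_bulk_actions(docs: Iterable[dict], index_name: str) -> Iterator[dict]:
    """Turn scraped documents directly into OpenSearch bulk actions.

    Each input dict is assumed to carry at least "url" and "text".
    """
    for doc in docs:
        yield {
            "_index": index_name,
            "_id": doc["url"],  # keying on URL de-duplicates re-scrapes
            "_source": {
                "text": doc["text"],
                "metadata": doc.get("metadata", {}),
            },
        }
```

Wired up as, roughly, `async for ok, item in async_streaming_bulk(client, to_bulk_actions(scraped_docs, index_name)): ...`, which replaces both the `scrape_result_*.json` dump and the glob-and-upload step.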
```python
# Citizen Advice Specific Logic to remove low quality docs
docs = [d for d in docs if d.metadata['markdown_length'] > 1000]
docs = [d for d in docs if "cymraeg" not in d.metadata['source']]
```
Being site-specific, this could be in the scraper in due course I guess?
100%, it's included here so we have a semi-permanent record of it (before the scraper refactor).
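If the rule does move into the scraper, it could become a named per-site predicate rather than inline list comprehensions. A hypothetical shape (the `Doc` class and function name are illustrative, not from the PR):

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for the scraped document type (assumed shape)."""
    metadata: dict = field(default_factory=dict)


def keep_citizen_advice(doc: Doc) -> bool:
    """Site rule: drop short pages and Welsh-language ('cymraeg') URLs."""
    return (
        doc.metadata["markdown_length"] > 1000
        and "cymraeg" not in doc.metadata["source"]
    )
```

The scraper would then hold a registry of such predicates per domain, keeping site knowledge out of the document manager.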
```python
}

try:
    self.client.transport.perform_request(
```
We're currently configuring this each time we construct an engine, but isn't this preliminary setup for the OpenSearch instance?
Definitely not wedded to where this happens. In this PR I tried to make a distinction between upload (document-management) and retrieval (query_engine), and this felt more closely related to the latter. The only benefit of keeping it in the query_engine is the ability to tweak hybrid search weightings at retrieval time. I agree that once we have stopped experimenting, it makes sense to move this to a separate configuration process.
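For what it's worth, retrieval-time weight tweaking doesn't require owning pipeline creation: OpenSearch accepts a temporary `search_pipeline` inside the search request body, so the query engine could set weights per query even after setup moves elsewhere. A sketch (the `hybrid`/`knn` query shape and field names are assumptions based on OpenSearch's hybrid-search feature, not this PR's code):

```python
def hybrid_query(text: str, vector: list, lexical_weight: float) -> dict:
    """Build a hybrid query whose lexical/semantic balance is set per request."""
    return {
        "query": {
            "hybrid": {
                "queries": [
                    {"match": {"text": {"query": text}}},
                    {"knn": {"embedding": {"vector": vector, "k": 10}}},
                ]
            }
        },
        # Temporary pipeline: overrides any named pipeline for this request only.
        "search_pipeline": {
            "phase_results_processors": [
                {
                    "normalization-processor": {
                        "normalization": {"technique": "min_max"},
                        "combination": {
                            "technique": "arithmetic_mean",
                            "parameters": {
                                "weights": [lexical_weight, 1 - lexical_weight]
                            },
                        },
                    }
                }
            ]
        },
    }
```

That would let the separate configuration process own the default pipeline while experiments keep their knob.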
Initial experiments around setting up OpenSearch for hybrid search. I have used the OpenSearch Python client. There is also a REST API, which I think might be better (or at least, not all features are available through the Python client). This includes:
This was all prototyped against a local Docker container running OpenSearch; to replicate, run:

```shell
docker run -p 9200:9200 -p 9600:9600 \
  -e "DISABLE_SECURITY_PLUGIN=true" \
  -e "discovery.type=single-node" \
  -e "OPENSEARCH_INITIAL_ADMIN_PASSWORD=Caddy_14211" \
  opensearchproject/opensearch:latest
```