Do you need to ask a question?
Your Question
Hi LightRAG team,
We’re working on an ontology-based RAG project using supply-chain data and official USTR documents. We are using LightRAG as a base and rely heavily on its graph engine to model relationships between documents and the domain, which are connected by event nodes that hold date information.
We have some questions about best practices for modeling documents in the graph:
1. Using chunk_id as source_id instead of document_id
In LightRAG’s example (example/insert_custom_kg.py), each chunk is assigned a unique source_id. We’re currently following this pattern, but we wonder whether it is ideal for our case.
Since our downstream queries and reasoning are often document-level (e.g., linking supply-chain events to official documents), would it make more sense to assign the source_id based on the document instead of each chunk?
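To make the trade-off concrete, here is a minimal sketch of one option we have in mind: keep chunk-level source_id values but encode the document in them, so document-level views can be derived afterwards. The "<doc_id>#chunk-<n>" naming, the helper functions, and the sample records are all hypothetical and written in plain Python; none of this is LightRAG's API or a LightRAG convention.

```python
from collections import defaultdict

# Assumption: chunk ids are built as "<doc_id>#chunk-<n>".
# This naming scheme is ours, for illustration only.
def make_chunk_id(doc_id: str, index: int) -> str:
    return f"{doc_id}#chunk-{index}"

def doc_id_of(chunk_id: str) -> str:
    # Recover the document id from a chunk-level source_id.
    return chunk_id.rsplit("#chunk-", 1)[0]

# Made-up entity records that keep chunk-level provenance.
entities = [
    {"entity_name": "Entity A", "source_id": make_chunk_id("doc-1", 0)},
    {"entity_name": "Entity A", "source_id": make_chunk_id("doc-1", 3)},
    {"entity_name": "Entity B", "source_id": make_chunk_id("doc-2", 1)},
]

def documents_by_entity(records):
    # Roll chunk-level source ids up to a document-level view.
    grouped = defaultdict(set)
    for record in records:
        grouped[record["entity_name"]].add(doc_id_of(record["source_id"]))
    return {name: sorted(docs) for name, docs in grouped.items()}

print(documents_by_entity(entities))
```

With this layout the chunk-level provenance is preserved, and document-level links are a cheap post-processing step rather than a change to how source_id is assigned.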
2. Reuse an existing entity node or create a new one?
We currently extract entities and relationships from each chunk of a USTR document. Naturally, entities with the same name are extracted from multiple chunks with slightly different descriptions.
We are considering the two approaches below and would love your advice:
Approach 1. Reuse the existing entity node
Pros: Avoids redundancy; easier to count nodes by filtering on entity name
Cons: Hard to reconcile differing entity descriptions; risk of losing context
Approach 2. Create a new entity node per chunk
Pros: Keeps contextual info intact; no metadata conflicts
Cons: Causes duplication; harder to analyze globally
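A possible middle ground between the two approaches, sketched below in plain Python: one node per normalized entity name, with every chunk's description kept under its chunk id instead of being overwritten. The EntityNode class, the merge_extractions helper, and the lowercase-and-strip normalization are our own assumptions for illustration, not LightRAG behavior.

```python
from dataclasses import dataclass, field

@dataclass
class EntityNode:
    name: str
    # chunk_id -> description extracted from that chunk
    descriptions: dict = field(default_factory=dict)

def merge_extractions(extractions):
    """Merge (name, chunk_id, description) tuples into one node per name.

    Names are matched case-insensitively after stripping whitespace;
    this normalization rule is an assumption for the sketch.
    """
    nodes = {}
    for name, chunk_id, description in extractions:
        key = name.strip().lower()
        node = nodes.setdefault(key, EntityNode(name=name.strip()))
        # Keep each chunk's description so no context is lost.
        node.descriptions[chunk_id] = description
    return nodes

# Made-up sample extractions.
sample = [
    ("USTR", "chunk-1", "Agency that publishes the notice"),
    (" ustr ", "chunk-2", "Agency that sets the tariff schedule"),
    ("Port", "chunk-1", "Entry point for shipments"),
]
nodes = merge_extractions(sample)
print(len(nodes), sorted(nodes["ustr"].descriptions))
```

This keeps global counting simple (one node per name) while retaining per-chunk context, at the cost of larger nodes and a need to decide later how the stored descriptions are summarized.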
Thank you so much in advance :)
Your insights would be helpful as we scale our RAG project.
Best Regards,
Byeonggyu