[Feat] Line-level KB citations #1523
@CREDO23 is attempting to deploy a commit to the Rohan Verma's projects Team on Vercel. A member of the Team first needs to authorize it.
Summary
Replaces opaque chunk-id citations with verifiable, line-level references into a document's canonical source_markdown. The agent cites [citation:d<docId>#L<a>-<b>]; clicking a citation opens the editor in source view and scrolls to and highlights the exact lines.
What changed
- Read paths serve source_markdown only and never reconstruct the body from chunks.
- Chunks gain nullable start_char/end_char columns; a lossless span-aware chunker records offsets into source_markdown, persisted on index and refreshed on incremental reconcile (NOTE writes share the same builder).
- The agent receives numbered source_markdown reads with a citation preamble (doc id + matched line ranges); prompts updated for web/KB/legacy channels.
- The frontend parser recognizes d<docId>#L<a>-<b> tokens and resolves them to the source view.
Compatibility
Test plan
- [citation:d<id>#L...] -> click -> editor scrolls + highlights (desktop + mobile)
- citations_enabled flag in target environment
High-level PR Summary
This PR introduces line-level citations for knowledge-base documents, replacing opaque chunk-id references with verifiable line ranges into a document's canonical source_markdown. The agent now cites passages as [citation:d<docId>#L<start>-<end>], and clicking a citation opens the editor scrolled to and highlighting the exact source lines. The implementation:
- adds nullable start_char/end_char columns to chunks (recording their offset into source_markdown),
- updates the chunker to track these spans losslessly,
- modifies search and retrieval to surface line metadata,
- teaches the agent to read and cite numbered source views,
- updates the frontend parser to recognize line citations, and
- ensures the editor reveals and highlights cited lines.
All document read paths now serve source_markdown as the canonical body rather than reconstructing it from chunks. Backward compatibility is maintained through nullable span columns: existing documents degrade gracefully to legacy chunk-id citations and accumulate spans as they are reindexed.
⏱️ Estimated Review Time: 1-3 hours
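To make the relationship between chunk character spans and citation tokens concrete, here is a minimal sketch of how a span into source_markdown might be turned into a line range and a citation token, and how such tokens might be parsed back. This is not the PR's code: the function names, the regex, the half-open [start_char, end_char) convention, and the 1-based inclusive line numbering are all assumptions made for illustration.

```python
import re
from bisect import bisect_right

# Hypothetical pattern for the token shape quoted in the summary:
# [citation:d<docId>#L<start>-<end>]
CITATION_RE = re.compile(r"\[citation:d(\d+)#L(\d+)-(\d+)\]")


def char_span_to_lines(source, start_char, end_char):
    """Map an assumed half-open [start_char, end_char) span to 1-based inclusive lines."""
    # Offsets at which each line begins: 0, then one past every newline.
    line_starts = [0] + [i + 1 for i, ch in enumerate(source) if ch == "\n"]
    first = bisect_right(line_starts, start_char)
    # The last character of the span decides the final line.
    last = bisect_right(line_starts, max(start_char, end_char - 1))
    return first, last


def format_citation(doc_id, first, last):
    """Build a token in the format the summary describes."""
    return f"[citation:d{doc_id}#L{first}-{last}]"


def parse_citations(text):
    """Return (doc_id, start_line, end_line) for every line-level token in text."""
    return [tuple(int(g) for g in m.groups()) for m in CITATION_RE.finditer(text)]


source = "alpha\nbeta\ngamma\n"
first, last = char_span_to_lines(source, 6, 16)  # covers "beta\ngamma"
token = format_citation(42, first, last)
```

Precomputing the line-start offsets once per document and using binary search keeps each lookup cheap when many chunks of the same document are resolved; whether the PR does this is not stated in the description.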
💡 Review Order Suggestion
1. surfsense_backend/alembic/versions/166_add_chunk_char_spans.py
2. surfsense_backend/app/db.py
3. surfsense_backend/app/indexing_pipeline/document_chunker.py
4. surfsense_backend/app/utils/text_spans.py
5. surfsense_backend/app/indexing_pipeline/cache/cached_indexing.py
6. surfsense_backend/app/indexing_pipeline/indexing_pipeline_service.py
7. surfsense_backend/app/agents/chat/multi_agent_chat/main_agent/middleware/kb_persistence/middleware.py
8. surfsense_backend/app/indexing_pipeline/chunk_reconciler.py
9. surfsense_backend/app/retriever/chunks_hybrid_search.py
10. surfsense_backend/app/agents/chat/multi_agent_chat/main_agent/tools/search_knowledge_base.py
11. surfsense_backend/app/agents/chat/multi_agent_chat/shared/middleware/filesystem/backends/numbered_document.py
12. surfsense_backend/app/agents/chat/multi_agent_chat/shared/middleware/filesystem/backends/kb_postgres.py
13. surfsense_backend/app/routes/documents_routes.py
14. surfsense_backend/app/routes/editor_routes.py
15. surfsense_backend/app/schemas/chunks.py
16. surfsense_backend/app/schemas/documents.py
17. surfsense_backend/app/agents/chat/multi_agent_chat/main_agent/system_prompt/prompts/citations/on.md
18. surfsense_backend/app/agents/chat/multi_agent_chat/subagents/builtins/knowledge_base/system_prompt_cloud.md
19. surfsense_backend/app/agents/chat/multi_agent_chat/subagents/builtins/knowledge_base/system_prompt_readonly_cloud.md
20. surfsense_backend/app/agents/chat/multi_agent_chat/subagents/builtins/knowledge_base/system_prompt_desktop.md
21. surfsense_web/lib/citations/citation-parser.ts
22. surfsense_web/components/citations/citation-renderer.tsx
23. surfsense_web/components/assistant-ui/inline-citation.tsx
24. surfsense_web/components/editor/plugins/citation-kit.tsx
25. surfsense_web/atoms/editor/editor-panel.atom.ts
26. surfsense_web/components/citation-panel/citation-panel.tsx
27. surfsense_web/components/editor/source-code-editor.tsx
28. surfsense_web/components/editor-panel/editor-panel.tsx
29. surfsense_web/components/layout/ui/right-panel/RightPanel.tsx
30. surfsense_web/app/globals.css
31. surfsense_web/contracts/types/document.types.ts
32. surfsense_backend/app/config/__init__.py
33. surfsense_backend/tests/unit/utils/test_text_spans.py
34. surfsense_backend/tests/unit/indexing_pipeline/test_chunk_markdown_with_spans.py
35. surfsense_backend/tests/unit/middleware/test_numbered_document.py
36. surfsense_backend/tests/unit/agents/multi_agent_chat/tools/test_search_knowledge_base.py
37. surfsense_backend/tests/integration/agents/multi_agent_chat/test_kb_persistence_spans.py
38. surfsense_backend/tests/integration/indexing_pipeline/test_index_spans.py
39. surfsense_backend/tests/integration/test_documents_by_chunk_route.py
40. surfsense_backend/tests/integration/test_editor_routes.py
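For reviewers starting with the chunker and text-span utilities near the top of the list, the "lossless" property can be stated as an invariant: every chunk's text equals the slice of source_markdown at its recorded offsets, and the chunks tile the document with no gaps. The sketch below illustrates that invariant with a deliberately simple line-packing chunker. It is an assumption-laden stand-in, not the PR's chunker: SpanChunk, chunk_with_spans, and the max_chars parameter are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class SpanChunk:
    text: str
    start_char: int  # offset of the first character in the source
    end_char: int    # offset one past the last character (half-open, assumed)


def chunk_with_spans(source, max_chars=200):
    """Pack whole lines into chunks of at most max_chars, recording offsets.

    A single line longer than max_chars becomes its own oversized chunk.
    """
    chunks = []
    start = 0  # where the chunk being built begins
    pos = 0    # current read position in the source
    for line in source.splitlines(keepends=True):
        # Close the current chunk if adding this line would exceed the limit.
        if pos > start and (pos - start) + len(line) > max_chars:
            chunks.append(SpanChunk(source[start:pos], start, pos))
            start = pos
        pos += len(line)
    if pos > start:
        chunks.append(SpanChunk(source[start:pos], start, pos))
    return chunks


source = "# Title\n\nFirst paragraph.\nSecond line.\n\nTail"
chunks = chunk_with_spans(source, max_chars=20)
```

Because each chunk is cut directly from the source by offset, nothing is normalized or dropped, which is what lets a stored span be resolved back to exact lines later.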