[Feat] Line-level KB citations #1523

Merged: MODSetter merged 48 commits into MODSetter:dev from CREDO23:fix/chat-citations on Jun 20, 2026

@CREDO23 CREDO23 commented Jun 19, 2026

Collaborator

Summary

Replaces opaque chunk-id citations with verifiable, line-level references into a document's canonical source_markdown. The agent cites [citation:d<docId>#L<a>-<b>]; clicking a citation opens the editor in source view and scrolls to/highlights the exact lines.
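For reviewers who want to sanity-check the token shape, here is a minimal Python sketch of parsing `[citation:d<docId>#L<a>-<b>]`. It is not the parser shipped in this PR (that lives in the frontend, in citation-parser.ts); the names `CITATION_RE`, `LineCitation`, and `parse_line_citations` are hypothetical, and the sketch assumes line numbers are 1-based and inclusive and that only the range form is used.

```python
import re
from typing import NamedTuple

# Assumed grammar: [citation:d<docId>#L<a>-<b>], 1-based inclusive line numbers.
CITATION_RE = re.compile(r"\[citation:d(\d+)#L(\d+)-(\d+)\]")


class LineCitation(NamedTuple):
    doc_id: int
    start_line: int
    end_line: int


def parse_line_citations(text: str) -> list[LineCitation]:
    """Return every well-formed line citation found in text, in order."""
    found = []
    for match in CITATION_RE.finditer(text):
        doc_id, start, end = (int(group) for group in match.groups())
        # Skip ranges that cannot point at real lines (zero or reversed).
        if start >= 1 and end >= start:
            found.append(LineCitation(doc_id, start, end))
    return found
```

Dropping reversed or zero-based ranges at parse time is one possible choice; a stricter variant could surface them as errors instead.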

What changed

  • Canonical body: editor read paths (get/download/export) serve source_markdown only and never reconstruct the body from chunks.
  • Chunk spans: new nullable start_char/end_char columns; a lossless span-aware chunker records offsets into source_markdown, persisted on index and refreshed on incremental reconcile (NOTE writes share the same builder).
  • Honest retrieval: hybrid search returns chunk spans; the search tool renders the matched passage with line metadata when spans exist.
  • Resolve -> UI: by-chunk API derives the cited line range; the citation panel shows it and the editor reveals/highlights those lines.
  • Agent reads: numbered source_markdown reads with a citation preamble (doc id + matched line ranges); prompts updated for web/KB/legacy channels.
  • Frontend: parser/renderer recognize d<docId>#L<a>-<b> tokens and resolve them to the source view.
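To make the "lossless span-aware chunker" bullet concrete, the following is a rough Python sketch of the idea, not the code in document_chunker.py. It assumes a chunk is a verbatim slice of source_markdown, that `start_char` is inclusive and `end_char` is exclusive, and that chunks are packed greedily by whole lines; `SpanChunk`, `chunk_with_spans`, and `max_chars` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpanChunk:
    text: str
    start_char: int  # inclusive offset into source_markdown
    end_char: int    # exclusive offset into source_markdown


def chunk_with_spans(source_markdown: str, max_chars: int = 500) -> list[SpanChunk]:
    """Greedy line-packing chunker that only slices, never rewrites, the source."""
    chunks: list[SpanChunk] = []
    start = 0  # offset where the current chunk begins
    pos = 0    # offset just past the last line consumed
    for line in source_markdown.splitlines(keepends=True):
        # Close the current chunk before this line would push it over budget.
        if pos > start and (pos + len(line)) - start > max_chars:
            chunks.append(SpanChunk(source_markdown[start:pos], start, pos))
            start = pos
        pos += len(line)
    if pos > start:
        chunks.append(SpanChunk(source_markdown[start:pos], start, pos))
    return chunks
```

Because every chunk is a slice, two properties hold by construction: `source_markdown[c.start_char:c.end_char] == c.text` for each chunk, and concatenating the chunks reproduces the source. A single line longer than `max_chars` becomes its own oversized chunk in this sketch.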

Compatibility

  • Span columns are nullable; no backfill. Existing docs degrade gracefully to the legacy chunk-id/XML path; spans accumulate as docs are reindexed.
  • Single linear Alembic head; migration is non-blocking.
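The degradation rule described above could be expressed roughly as follows. This is an illustrative sketch only: `ChunkRow` and `citation_mode` are hypothetical, and the sketch assumes a chunk qualifies for line-level citation only when both span columns are populated and describe a non-empty range.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChunkRow:
    chunk_id: int
    doc_id: int
    start_char: Optional[int] = None  # NULL until the doc is reindexed
    end_char: Optional[int] = None


def citation_mode(chunk: ChunkRow) -> str:
    """Return 'line' when a usable span exists, else 'legacy'."""
    if chunk.start_char is None or chunk.end_char is None:
        return "legacy"  # pre-migration row: fall back to chunk-id citations
    if chunk.end_char <= chunk.start_char:
        return "legacy"  # defensive: treat an empty or reversed span as missing
    return "line"
```

Since no backfill runs, both modes are expected to coexist until older documents are reindexed.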

Test plan

  • Integration suite green (citation routes, retriever, document_upload)
  • Unit tests: span chunker, char->line helper, read preamble, search hits
  • Manual: KB question -> agent cites [citation:d<id>#L...] -> click -> editor scrolls + highlights (desktop + mobile)
  • Confirm citations_enabled flag in target environment
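The char->line helper listed under unit tests is the piece that turns a stored span into a citable line range. A minimal sketch of such a mapping is below; it is not the code in text_spans.py. The function name is hypothetical, and it assumes half-open character spans, 1-based inclusive line numbers, and "\n" as the line terminator.

```python
from bisect import bisect_right


def char_span_to_line_range(source: str, start_char: int, end_char: int) -> tuple[int, int]:
    """Map a half-open char span [start_char, end_char) to 1-based inclusive lines."""
    if not 0 <= start_char < end_char <= len(source):
        raise ValueError("span must be non-empty and inside the source")
    # Offsets at which each line begins; line 1 always starts at offset 0.
    line_starts = [0] + [i + 1 for i, ch in enumerate(source) if ch == "\n"]
    first = bisect_right(line_starts, start_char)
    # Use the last character actually covered, so a span that ends right
    # after a newline does not spill onto the following line.
    last = bisect_right(line_starts, end_char - 1)
    return first, last
```

The `end_char - 1` step is the edge case worth a dedicated test: a chunk that ends exactly on a trailing newline should cite only the lines it contains.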

High-level PR Summary

This PR introduces line-level citations for knowledge-base documents, replacing opaque chunk-id references with verifiable line ranges into a document's canonical source_markdown. The agent now cites passages as [citation:d<docId>#L<start>-<end>], and clicking a citation opens the editor scrolled to and highlighting the exact source lines. The implementation adds nullable start_char/end_char columns to chunks (recording their offset into source_markdown), updates the chunker to track these spans losslessly, modifies search and retrieval to surface line metadata, teaches the agent to read and cite numbered source views, updates the frontend parser to recognize line citations, and ensures the editor reveals/highlights cited lines. All document read paths now serve source_markdown as the canonical body rather than reconstructing it from chunks. Backward compatibility is maintained through nullable span columns: existing documents degrade gracefully to legacy chunk-id citations and accumulate spans as they are reindexed.
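As an illustration of what a "numbered source view" with a citation preamble might look like to the agent, here is a small Python sketch. The preamble field names, the `N| text` gutter format, and the function `render_numbered_document` are all assumptions made for the example; the actual layout produced by numbered_document.py is not shown in this PR description.

```python
def render_numbered_document(
    doc_id: int,
    source_markdown: str,
    matched_ranges: list[tuple[int, int]],
) -> str:
    """Prefix each source line with its 1-based number and add a short preamble."""
    lines = source_markdown.split("\n")
    width = len(str(len(lines)))  # pad numbers so the gutter stays aligned
    ranges = ", ".join(f"L{a}-{b}" for a, b in matched_ranges) or "none"
    preamble = [
        f"document_id: {doc_id}",          # hypothetical field name
        f"matched_lines: {ranges}",        # hypothetical field name
        f"cite as: [citation:d{doc_id}#L<a>-<b>]",
        "",
    ]
    body = [f"{n:>{width}}| {text}" for n, text in enumerate(lines, start=1)]
    return "\n".join(preamble + body)
```

Numbering the same text the editor displays is what lets a cited range be checked by eye against the source view.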

⏱️ Estimated Review Time: 1-3 hours

💡 Review Order Suggestion
Order File Path
1 surfsense_backend/alembic/versions/166_add_chunk_char_spans.py
2 surfsense_backend/app/db.py
3 surfsense_backend/app/indexing_pipeline/document_chunker.py
4 surfsense_backend/app/utils/text_spans.py
5 surfsense_backend/app/indexing_pipeline/cache/cached_indexing.py
6 surfsense_backend/app/indexing_pipeline/indexing_pipeline_service.py
7 surfsense_backend/app/agents/chat/multi_agent_chat/main_agent/middleware/kb_persistence/middleware.py
8 surfsense_backend/app/indexing_pipeline/chunk_reconciler.py
9 surfsense_backend/app/retriever/chunks_hybrid_search.py
10 surfsense_backend/app/agents/chat/multi_agent_chat/main_agent/tools/search_knowledge_base.py
11 surfsense_backend/app/agents/chat/multi_agent_chat/shared/middleware/filesystem/backends/numbered_document.py
12 surfsense_backend/app/agents/chat/multi_agent_chat/shared/middleware/filesystem/backends/kb_postgres.py
13 surfsense_backend/app/routes/documents_routes.py
14 surfsense_backend/app/routes/editor_routes.py
15 surfsense_backend/app/schemas/chunks.py
16 surfsense_backend/app/schemas/documents.py
17 surfsense_backend/app/agents/chat/multi_agent_chat/main_agent/system_prompt/prompts/citations/on.md
18 surfsense_backend/app/agents/chat/multi_agent_chat/subagents/builtins/knowledge_base/system_prompt_cloud.md
19 surfsense_backend/app/agents/chat/multi_agent_chat/subagents/builtins/knowledge_base/system_prompt_readonly_cloud.md
20 surfsense_backend/app/agents/chat/multi_agent_chat/subagents/builtins/knowledge_base/system_prompt_desktop.md
21 surfsense_web/lib/citations/citation-parser.ts
22 surfsense_web/components/citations/citation-renderer.tsx
23 surfsense_web/components/assistant-ui/inline-citation.tsx
24 surfsense_web/components/editor/plugins/citation-kit.tsx
25 surfsense_web/atoms/editor/editor-panel.atom.ts
26 surfsense_web/components/citation-panel/citation-panel.tsx
27 surfsense_web/components/editor/source-code-editor.tsx
28 surfsense_web/components/editor-panel/editor-panel.tsx
29 surfsense_web/components/layout/ui/right-panel/RightPanel.tsx
30 surfsense_web/app/globals.css
31 surfsense_web/contracts/types/document.types.ts
32 surfsense_backend/app/config/__init__.py
33 surfsense_backend/tests/unit/utils/test_text_spans.py
34 surfsense_backend/tests/unit/indexing_pipeline/test_chunk_markdown_with_spans.py
35 surfsense_backend/tests/unit/middleware/test_numbered_document.py
36 surfsense_backend/tests/unit/agents/multi_agent_chat/tools/test_search_knowledge_base.py
37 surfsense_backend/tests/integration/agents/multi_agent_chat/test_kb_persistence_spans.py
38 surfsense_backend/tests/integration/indexing_pipeline/test_index_spans.py
39 surfsense_backend/tests/integration/test_documents_by_chunk_route.py
40 surfsense_backend/tests/integration/test_editor_routes.py

CREDO23 added 30 commits June 18, 2026 19:23
vercel Bot commented Jun 19, 2026

@CREDO23 is attempting to deploy a commit to the Rohan Verma's projects Team on Vercel.

A member of the Team first needs to authorize it.

coderabbitai Bot commented Jun 19, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 631dff71-b790-42c7-9f3e-004968904bd2

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

@MODSetter MODSetter merged commit cd22421 into MODSetter:dev Jun 20, 2026
6 of 12 checks passed
MODSetter added a commit that referenced this pull request on Jun 23, 2026:
This reverts commit cd22421, reversing changes made to a4bb0a5.