Test/blocknote to markdown unit tests by Glodykajabika · Pull Request #1533 · MODSetter/SurfSense

Glodykajabika · 2026-06-22T22:08:38Z

Description

Adds comprehensive unit tests for the blocknote_to_markdown() function as specified in issue #1519. Tests cover all block types, inline styles, edge cases, and nested content with proper indentation.

The new test file tests/unit/utils/test_blocknote_to_markdown.py includes test cases for:

Headings (levels 1-6 with clamping)
Inline styles (bold, italic, code, strikethrough with code priority)
Lists (bullet, numbered with props.start, checklists checked/unchecked)
Code blocks (with/without language tags)
Tables (dict and list row shapes, mixed cell types)
Images and links nested within blocks
Nested children with indentation at multiple levels
Edge cases (None, empty lists, unknown block types)

Motivation and Context

The blocknote_to_markdown() function is a critical utility for converting BlockNote editor JSON to Markdown, but it had no unit tests. This PR provides comprehensive test coverage to ensure the function works correctly across all supported block types and edge cases.

closes#1519

Screenshots

N/A (unit tests, no visual changes)

API Changes

This PR includes API changes

Change Type

Testing Performed

All tests pass locally with:

uv run pytest -m unit tests/unit/utils/test_blocknote_to_markdown.py -v
# 33 passed, 0 failed
uv run ruff check .
uv run ruff format --check .
# Both passed clean

Tested locally
Manual/QA verification

Checklist

Follows project coding standards and conventions
Documentation updated as needed
Dependencies updated as needed
No lint/build errors or new warnings
All relevant tests are passing

Summary by CodeRabbit

New Features
- Added numbered knowledge-base line citations ([citation:d<id>#L<start>-<end>]) and clickable inline tokens.
- Clicking a line citation opens the editor in read-only source view with the referenced line range highlighted.
Improvements
- Knowledge-base chunking, retrieval, and citation rendering now align to character-span offsets for more accurate copyable line-range tokens.
- Editor/citation UI updated to label and highlight cited ranges when available.
- Editor and export now use the canonical stored document body, returning clearer errors when it’s missing or unavailable.

vercel · 2026-06-22T22:08:42Z

@Glodykajabika is attempting to deploy a commit to the Rohan Verma's projects Team on Vercel.

A member of the Team first needs to authorize it.

coderabbitai · 2026-06-22T22:09:07Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 335d5682-2fb1-40cd-b7f5-d7cab25a56d3

📥 Commits

Reviewing files that changed from the base of the PR and between 21dfdd5 and 5625ca1.

📒 Files selected for processing (1)

surfsense_backend/tests/unit/utils/test_blocknote_to_markdown.py

🚧 Files skipped from review as they are similar to previous changes (1)

surfsense_backend/tests/unit/utils/test_blocknote_to_markdown.py

📝 Walkthrough

Walkthrough

This PR implements a complete chunk character span system for precise source line citations. The backend adds start_char/end_char columns to store half-open character offsets, replaces string chunking with lossless span-addressed chunk_markdown_with_spans, propagates spans through indexing, retrieval, and KB filesystem reads. The KB search tool renders matched passages with line-range citations, and the KB persistence middleware stores spans on chunks. Agent prompts updated for the new [citation:d<id>#L<start>-<end>] format. The frontend adds a LineCitation component, extends the citation parser to recognize line citations, and implements Monaco editor line decorations and reveal behavior triggered from citation clicks.

Changes

End-to-end chunk character span and line-citation feature

Layer / File(s)	Summary
Core data contracts: ChunkSlice, DB schema, schemas, and text_spans utility `surfsense_backend/app/indexing_pipeline/document_chunker.py`, `surfsense_backend/app/db.py`, `surfsense_backend/alembic/versions/166_add_chunk_char_spans.py`, `surfsense_backend/app/schemas/chunks.py`, `surfsense_backend/app/schemas/documents.py`, `surfsense_backend/app/utils/text_spans.py`, `surfsense_backend/app/indexing_pipeline/chunk_reconciler.py`, `surfsense_backend/tests/unit/utils/test_text_spans.py`	Introduces `ChunkSlice(text, start_char, end_char)` dataclass for lossless span-addressed markdown slices. Adds `start_char`/`end_char` nullable integer columns to `Chunk` via Alembic migration and ORM model. Extends `ChunkRead` and `DocumentWithChunksRead` schemas with optional citation span fields. Updates `ExistingChunk` to carry loaded span metadata. Adds the shared `char_span_to_line_range` utility function to convert character offsets to 1-based inclusive line ranges.
Span-aware chunker and embedding cache pipeline `surfsense_backend/app/indexing_pipeline/cache/cached_indexing.py`, `surfsense_backend/app/config/__init__.py`, `surfsense_backend/tests/unit/indexing_pipeline/test_chunk_markdown_with_spans.py`, `surfsense_backend/tests/unit/indexing_pipeline/test_index_batch_parallel.py`	Replaces the prior string-returning `chunk_text_hybrid` with `chunk_markdown_with_spans`, which produces a lossless, contiguous partition of `ChunkSlice` objects where tables are emitted whole and other regions are chunked without stripping whitespace. Refactors `build_chunk_embeddings` to precompute slices and align cache hits by text content, returning `(summary_emb, list[SliceEmbedding])` pairs. Bumps `EMBEDDING_CACHE_CHUNKER_VERSION` to 2 to invalidate old caches.
Indexing pipeline service: scratch and incremental reindex with spans `surfsense_backend/app/indexing_pipeline/indexing_pipeline_service.py`, `surfsense_backend/tests/integration/indexing_pipeline/test_index_spans.py`, `surfsense_backend/tests/integration/indexing_pipeline/test_index_editions.py`, `surfsense_backend/tests/integration/document_upload/conftest.py`, `surfsense_backend/tests/integration/conftest.py`	Scratch reindex persists `start_char`/`end_char` extracted from `SliceEmbedding` pairs on new chunks. Incremental reindex uses `chunk_slices` and a new `_kept_row_span_updates` helper to compute `UPDATE` payloads for kept chunks when span boundaries shift, avoiding unnecessary re-embeddings. `_load_existing_chunks` now fetches stored span columns for reconciliation.
Retrieval and document/editor API routes `surfsense_backend/app/retriever/chunks_hybrid_search.py`, `surfsense_backend/app/routes/documents_routes.py`, `surfsense_backend/app/routes/editor_routes.py`, `surfsense_backend/tests/integration/retriever/conftest.py`, `surfsense_backend/tests/integration/retriever/test_optimized_chunk_retriever.py`, `surfsense_backend/tests/integration/test_documents_by_chunk_route.py`, `surfsense_backend/tests/integration/test_editor_routes.py`	`hybrid_search` selects and returns `start_char`/`end_char` per chunk in result dictionaries. `get_document_by_chunk_id` endpoint computes `cited_start_line`/`cited_end_line` from chunk character spans via `char_span_to_line_range` when both spans and `source_markdown` are available. Editor routes remove chunk-based markdown reconstruction and introduce `_raise_no_canonical_body` to raise typed HTTP errors (400/409/422) based on document processing state when `source_markdown` is absent.
KB filesystem read path: numbered preamble, kb_postgres, and tool adaptations `surfsense_backend/app/agents/chat/multi_agent_chat/shared/middleware/filesystem/backends/numbered_document.py`, `surfsense_backend/app/agents/chat/multi_agent_chat/shared/middleware/filesystem/backends/kb_postgres.py`, `surfsense_backend/app/agents/chat/multi_agent_chat/shared/middleware/filesystem/tools/edit_file/index.py`, `surfsense_backend/app/agents/chat/multi_agent_chat/shared/middleware/filesystem/tools/move_file/helpers.py`, `surfsense_backend/app/agents/chat/multi_agent_chat/shared/middleware/filesystem/tools/read_file/index.py`, `surfsense_backend/app/agents/chat/multi_agent_chat/shared/middleware/filesystem/tools/rm/helpers.py`, `surfsense_backend/tests/unit/middleware/test_numbered_document.py`, `surfsense_backend/tests/unit/middleware/test_kb_persistence_filesystem_parity.py`, `surfsense_backend/tests/unit/middleware/test_b_filesystem_rm_rmdir_cloud.py`	New `numbered_document.py` module computes matched line ranges from chunk character spans and builds an XML `<document_metadata>` preamble. `_load_file_data` returns a 3-tuple `(file_data, doc_id, preamble)` and the `read_file` tool prepends the preamble when `offset == 0`. Edit, move, read, and rm filesystem tools updated to unpack the 3-tuple return value.
Search KB tool, KB persistence middleware, and agent system prompts `surfsense_backend/app/agents/chat/multi_agent_chat/main_agent/tools/search_knowledge_base.py`, `surfsense_backend/app/agents/chat/multi_agent_chat/main_agent/middleware/kb_persistence/middleware.py`, `surfsense_backend/app/agents/chat/multi_agent_chat/main_agent/system_prompt/prompts/citations/on.md`, `surfsense_backend/app/agents/chat/multi_agent_chat/subagents/builtins/knowledge_base/system_prompt_cloud.md`, `surfsense_backend/app/agents/chat/multi_agent_chat/subagents/builtins/knowledge_base/system_prompt_desktop.md`, `surfsense_backend/app/agents/chat/multi_agent_chat/subagents/builtins/knowledge_base/system_prompt_readonly_cloud.md`, `surfsense_backend/tests/unit/agents/multi_agent_chat/tools/test_search_knowledge_base.py`, `surfsense_backend/tests/integration/agents/multi_agent_chat/test_kb_persistence_spans.py`	`search_knowledge_base` tool adds `_resolve_doc_context`, `_render_passage`, `_citation_token`, and matched-passage rendering helpers to emit formatted KB search results with line-range citations (`[citation:dID#Lstart-end]`). KB persistence middleware replaces the prior `asyncio.to_thread(embed_texts, chunk_text(...))` flow with `build_chunk_embeddings`, persisting `start_char`/`end_char` on all created and updated chunks. Agent system prompts updated to define three citation channels with explicit line-range citation format and stricter accuracy constraints.
Frontend: citation parser, LineCitation component, editor highlighting, and utilities `surfsense_web/lib/citations/citation-parser.ts`, `surfsense_web/atoms/editor/editor-panel.atom.ts`, `surfsense_web/components/assistant-ui/inline-citation.tsx`, `surfsense_web/components/citations/citation-renderer.tsx`, `surfsense_web/components/editor/plugins/citation-kit.tsx`, `surfsense_web/components/editor/source-code-editor.tsx`, `surfsense_web/components/citation-panel/citation-panel.tsx`, `surfsense_web/components/editor-panel/editor-panel.tsx`, `surfsense_web/components/layout/ui/right-panel/RightPanel.tsx`, `surfsense_web/contracts/types/document.types.ts`, `surfsense_web/app/globals.css`, `surfsense_backend/tests/unit/utils/test_blocknote_to_markdown.py`	Citation parser extends `CITATION_REGEX` to match `d<documentId>`#L`<start>-<end>` line citations and emits a new `"line"` token type. New `LineCitation` component renders clickable line citations that open the editor panel with `highlightLines`/`forceSourceView` flags set. `EditorLineRange` state type added to `editor-panel.atom.ts` and wired through `EditorPanelContent`, desktop/mobile panel variants, and `RightPanel` (desktop-only). `SourceCodeEditor` accepts `highlightLines` and applies Monaco whole-line decorations with viewport reveal on mount and layout changes. `CitationElementNode` and `makeCitationElement` support the new `line` variant. Adds comprehensive `blocknote_to_markdown` utility test suite covering all block types and edge cases.

Sequence Diagram(s)

sequenceDiagram
  participant Agent
  participant search_knowledge_base
  participant DB
  participant kb_postgres._load_file_data
  participant numbered_document
  participant LLM

  Agent->>search_knowledge_base: query
  search_knowledge_base->>DB: hybrid_search(query)
  DB-->>search_knowledge_base: hits with start_char/end_char per chunk
  search_knowledge_base->>DB: _resolve_doc_context(doc_ids)
  DB-->>search_knowledge_base: document bodies & metadata
  search_knowledge_base->>search_knowledge_base: render matched passages + compute line ranges
  search_knowledge_base-->>Agent: formatted KB results with [citation:dID#Lx-y]

  Agent->>kb_postgres._load_file_data: read_file(path, offset=0, matched_chunk_ids={101,102})
  kb_postgres._load_file_data->>DB: fetch source_markdown + chunk spans
  kb_postgres._load_file_data->>numbered_document: compute_matched_line_ranges(chunks, {101,102})
  numbered_document-->>kb_postgres._load_file_data: [(5,8), (12,15)]
  kb_postgres._load_file_data->>numbered_document: build_read_preamble(doc_id=42, ranges)
  numbered_document-->>kb_postgres._load_file_data: XML preamble with metadata
  kb_postgres._load_file_data-->>Agent: (file_data, 42, preamble)
  Agent->>LLM: [preamble] + numbered source markdown
  LLM-->>Agent: response citing [citation:d42#L5-8]

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~90 minutes

Possibly related issues

[Test] Add unit tests for blocknote_to_markdown #1519: The main issue adds comprehensive unit tests for the blocknote_to_markdown function that directly implement the feature request outlined in the retrieved issue, covering all requested test cases (headings, inline styles, lists, code blocks, tables, images, links, nested children, and edge cases).

Possibly related PRs

MODSetter/SurfSense#286: Both PRs wire citations to "jump to source" behavior via the /documents/by-chunk/{chunk_id} endpoint; this PR extends that flow with backend start_char/end_char → cited_start_line/cited_end_line mapping and corresponding frontend line citation parsing and highlightLines editor panel navigation.
MODSetter/SurfSense#1489: Both PRs modify KB persistence middleware and the build_chunk_embeddings(...) embedding pipeline in surfsense_backend/app/indexing_pipeline/cache/cached_indexing.py, with this PR adding span-aware slice-embedding pairs to the caching and incremental reindexing logic.

Suggested reviewers

MODSetter

Poem

🐇✨ A rabbit found the spans one day,
Each chunk a slice with offsets—hooray!
"Start here at char, end over there,
Now line five through eight—citations laid bare!"
The Monaco editor lit up each line,
[citation:d42#L5-8] working divine! 🎯📖

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 49.09% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'Test/blocknote to markdown unit tests' is directly related to the primary change: adding a new unit test module for the blocknote_to_markdown function with 33 test cases.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 7

🧹 Nitpick comments (1)

surfsense_backend/tests/integration/indexing_pipeline/test_index_spans.py (1)

32-37: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

Assert span ordering/contiguity explicitly in the integration invariant helper.

Current checks can miss mispositioned offsets when chunk text repeats. Add a cursor-based monotonic/contiguous assertion to validate absolute positions, not just content equality.

🔧 Suggested strengthening

 def _assert_spans_address_body(chunks: list[Chunk], body: str) -> None:
+    cursor = 0
     for chunk in chunks:
         assert chunk.start_char is not None and chunk.end_char is not None
+        assert chunk.start_char == cursor
+        assert chunk.end_char >= chunk.start_char
         assert body[chunk.start_char : chunk.end_char] == chunk.content
+        cursor = chunk.end_char
     assert "".join(c.content for c in chunks) == body
+    assert cursor == len(body)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@surfsense_backend/tests/integration/indexing_pipeline/test_index_spans.py`
around lines 32 - 37, The _assert_spans_address_body function needs to add
explicit validation for chunk ordering and contiguity to catch mispositioned
offsets. Add assertions that verify: chunks are ordered with monotonically
increasing start_char positions, each chunk is contiguous with the previous one
(each chunk's start_char equals the previous chunk's end_char), the first chunk
begins at position 0, and the final chunk ends at the length of the body. Use a
cursor variable that tracks the expected position of each successive chunk to
validate these invariants.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@surfsense_backend/app/agents/chat/multi_agent_chat/shared/middleware/filesystem/backends/kb_postgres.py`:
- Around line 530-534: The query in the select statement is unnecessarily
fetching Chunk.content when canonical reads only require id, start_char, and
end_char for line-range mapping, causing wasteful DB I/O and memory overhead.
Remove Chunk.content from the select call at lines 530-534 to fetch only the
required fields (id, start_char, end_char, and position for ordering). Apply the
same fix to the similar queries at lines 544-547 and 558-563 that fetch
Chunk.content unnecessarily.
- Around line 523-524: The issue is that line 523 coerces None to an empty
string for source_markdown, and then the truthy check at line 543 treats empty
strings as falsy, incorrectly routing documents with intentionally empty
canonical markdown to the legacy XML fallback. To fix this, preserve the
distinction between None (no canonical body provided) and "" (intentionally
empty canonical body) by removing the coercion at line 523 that converts None to
"". Then update the conditional logic at lines 543-557 and 558-574 from truthy
checks to explicit None checks, so that documents with empty canonical markdown
are properly handled and not incorrectly routed to the legacy XML fallback.

In `@surfsense_backend/tests/integration/conftest.py`:
- Around line 163-167: The mock for chunk_markdown_with_spans is returning a
static chunk slice that doesn't depend on the actual input markdown, violating
the function contract. Modify the monkeypatched mock to accept the input
markdown as a parameter and dynamically generate ChunkSlice objects based on the
actual spans within that input, rather than returning a hardcoded fixed chunk
slice regardless of what markdown is passed to chunk_markdown_with_spans.

In `@surfsense_backend/tests/integration/document_upload/conftest.py`:
- Around line 291-295: The mock for chunk_markdown_with_spans in the monkeypatch
call returns a fixed ChunkSlice regardless of the markdown input, which breaks
span semantics and reduces test reliability. Modify the MagicMock to be a
function that accepts the markdown argument and derives the ChunkSlice from that
actual input instead of returning hardcoded values, ensuring the mock properly
processes different markdown inputs and maintains correct span semantics for the
integration test.

In `@surfsense_backend/tests/unit/utils/test_blocknote_to_markdown.py`:
- Around line 16-20: The test_block fixture and other shared mutable fixtures
defined at class level (including those at lines 101-104, 145-148, 170-177,
210-214, and 239-242) are being mutated across test methods, causing cross-test
state leakage and order-dependent behavior. Convert these class-level fixtures
to per-test fixtures by either using pytest fixture decorators with function
scope or by creating fresh local objects directly within each test method that
needs them. This ensures each test gets its own independent copy of the data
without mutations affecting other tests.
- Around line 154-157: The assert statements at lines 154-157, 165-168, 184-188,
and 202-206 are currently asserting generator objects (which are always truthy)
rather than the comparison results themselves, causing the tests to pass even
when blocknote_to_markdown() is wrong. Wrap each generator expression with the
all() function to properly validate that every comparison in the generator
evaluates to True, ensuring these assertions actually test the output from
blocknote_to_markdown() instead of just confirming a generator object exists.

In `@surfsense_web/lib/citations/citation-parser.ts`:
- Around line 102-109: In the code block where lineMatch is checked and a line
segment is pushed to the segments array, add validation to ensure the parsed
line-citation bounds follow the 1-based inclusive contract. Before pushing the
segment object in the segments.push() call, validate that startLine is greater
than or equal to 1 and that endLine is greater than or equal to startLine. Only
push the segment to the array if both validation conditions pass, otherwise skip
it to prevent invalid citations from propagating bad highlight ranges.

---

Nitpick comments:
In `@surfsense_backend/tests/integration/indexing_pipeline/test_index_spans.py`:
- Around line 32-37: The _assert_spans_address_body function needs to add
explicit validation for chunk ordering and contiguity to catch mispositioned
offsets. Add assertions that verify: chunks are ordered with monotonically
increasing start_char positions, each chunk is contiguous with the previous one
(each chunk's start_char equals the previous chunk's end_char), the first chunk
begins at position 0, and the final chunk ends at the length of the body. Use a
cursor variable that tracks the expected position of each successive chunk to
validate these invariants.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5a78f101-6f1e-4428-8d65-d0fb7cd5f164

📥 Commits

Reviewing files that changed from the base of the PR and between ee241e0 and 53368ea.

📒 Files selected for processing (54)

surfsense_backend/alembic/versions/166_add_chunk_char_spans.py
surfsense_backend/app/agents/chat/multi_agent_chat/main_agent/middleware/kb_persistence/middleware.py
surfsense_backend/app/agents/chat/multi_agent_chat/main_agent/system_prompt/prompts/citations/on.md
surfsense_backend/app/agents/chat/multi_agent_chat/main_agent/tools/search_knowledge_base.py
surfsense_backend/app/agents/chat/multi_agent_chat/shared/middleware/filesystem/backends/kb_postgres.py
surfsense_backend/app/agents/chat/multi_agent_chat/shared/middleware/filesystem/backends/numbered_document.py
surfsense_backend/app/agents/chat/multi_agent_chat/shared/middleware/filesystem/tools/edit_file/index.py
surfsense_backend/app/agents/chat/multi_agent_chat/shared/middleware/filesystem/tools/move_file/helpers.py
surfsense_backend/app/agents/chat/multi_agent_chat/shared/middleware/filesystem/tools/read_file/index.py
surfsense_backend/app/agents/chat/multi_agent_chat/shared/middleware/filesystem/tools/rm/helpers.py
surfsense_backend/app/agents/chat/multi_agent_chat/subagents/builtins/knowledge_base/system_prompt_cloud.md
surfsense_backend/app/agents/chat/multi_agent_chat/subagents/builtins/knowledge_base/system_prompt_desktop.md
surfsense_backend/app/agents/chat/multi_agent_chat/subagents/builtins/knowledge_base/system_prompt_readonly_cloud.md
surfsense_backend/app/config/__init__.py
surfsense_backend/app/db.py
surfsense_backend/app/indexing_pipeline/cache/cached_indexing.py
surfsense_backend/app/indexing_pipeline/chunk_reconciler.py
surfsense_backend/app/indexing_pipeline/document_chunker.py
surfsense_backend/app/indexing_pipeline/indexing_pipeline_service.py
surfsense_backend/app/retriever/chunks_hybrid_search.py
surfsense_backend/app/routes/documents_routes.py
surfsense_backend/app/routes/editor_routes.py
surfsense_backend/app/schemas/chunks.py
surfsense_backend/app/schemas/documents.py
surfsense_backend/app/utils/text_spans.py
surfsense_backend/tests/integration/agents/multi_agent_chat/test_kb_persistence_spans.py
surfsense_backend/tests/integration/conftest.py
surfsense_backend/tests/integration/document_upload/conftest.py
surfsense_backend/tests/integration/indexing_pipeline/adapters/test_file_upload_adapter.py
surfsense_backend/tests/integration/indexing_pipeline/test_index_editions.py
surfsense_backend/tests/integration/indexing_pipeline/test_index_spans.py
surfsense_backend/tests/integration/retriever/conftest.py
surfsense_backend/tests/integration/retriever/test_optimized_chunk_retriever.py
surfsense_backend/tests/integration/test_documents_by_chunk_route.py
surfsense_backend/tests/integration/test_editor_routes.py
surfsense_backend/tests/unit/agents/multi_agent_chat/tools/test_search_knowledge_base.py
surfsense_backend/tests/unit/indexing_pipeline/test_chunk_markdown_with_spans.py
surfsense_backend/tests/unit/indexing_pipeline/test_index_batch_parallel.py
surfsense_backend/tests/unit/middleware/test_b_filesystem_rm_rmdir_cloud.py
surfsense_backend/tests/unit/middleware/test_kb_persistence_filesystem_parity.py
surfsense_backend/tests/unit/middleware/test_numbered_document.py
surfsense_backend/tests/unit/utils/test_blocknote_to_markdown.py
surfsense_backend/tests/unit/utils/test_text_spans.py
surfsense_web/app/globals.css
surfsense_web/atoms/editor/editor-panel.atom.ts
surfsense_web/components/assistant-ui/inline-citation.tsx
surfsense_web/components/citation-panel/citation-panel.tsx
surfsense_web/components/citations/citation-renderer.tsx
surfsense_web/components/editor-panel/editor-panel.tsx
surfsense_web/components/editor/plugins/citation-kit.tsx
surfsense_web/components/editor/source-code-editor.tsx
surfsense_web/components/layout/ui/right-panel/RightPanel.tsx
surfsense_web/contracts/types/document.types.ts
surfsense_web/lib/citations/citation-parser.ts

coderabbitai · 2026-06-22T22:17:00Z

+            source_markdown = document.source_markdown or ""
+            document_type = (


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Do not treat empty canonical markdown as “no canonical body.”

Line 523 coerces None to "", and Line 543 uses a truthy check. A document with intentionally empty canonical content is incorrectly routed to the legacy XML fallback, which breaks the canonical-read contract.

💡 Suggested fix

- source_markdown = document.source_markdown or "" + source_markdown = document.source_markdown @@ - if source_markdown: + if source_markdown is not None: ranges = compute_matched_line_ranges( source_markdown, [(r.id, r.start_char, r.end_char) for r in chunk_records], matched, )

Also applies to: 543-557, 558-574

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@surfsense_backend/app/agents/chat/multi_agent_chat/shared/middleware/filesystem/backends/kb_postgres.py` around lines 523 - 524, The issue is that line 523 coerces None to an empty string for source_markdown, and then the truthy check at line 543 treats empty strings as falsy, incorrectly routing documents with intentionally empty canonical markdown to the legacy XML fallback. To fix this, preserve the distinction between None (no canonical body provided) and "" (intentionally empty canonical body) by removing the coercion at line 523 that converts None to "". Then update the conditional logic at lines 543-557 and 558-574 from truthy checks to explicit None checks, so that documents with empty canonical markdown are properly handled and not incorrectly routed to the legacy XML fallback.

coderabbitai · 2026-06-22T22:17:00Z

            chunk_rows = await session.execute(
-                select(Chunk.id, Chunk.content)
+                select(Chunk.id, Chunk.content, Chunk.start_char, Chunk.end_char)
                .where(Chunk.document_id == document.id)
                .order_by(Chunk.position, Chunk.id)
            )


🚀 Performance & Scalability | 🟠 Major | ⚡ Quick win

Avoid loading full chunk text on canonical reads.

Line 531 always fetches Chunk.content, but canonical reads only consume id/start_char/end_char for line-range mapping. This adds avoidable DB I/O and memory pressure on large documents.

💡 Suggested optimization

- chunk_rows = await session.execute( - select(Chunk.id, Chunk.content, Chunk.start_char, Chunk.end_char) - .where(Chunk.document_id == document.id) - .order_by(Chunk.position, Chunk.id) - ) + if document.source_markdown is not None: + chunk_rows = await session.execute( + select(Chunk.id, Chunk.start_char, Chunk.end_char) + .where(Chunk.document_id == document.id) + .order_by(Chunk.position, Chunk.id) + ) + else: + chunk_rows = await session.execute( + select(Chunk.id, Chunk.content, Chunk.start_char, Chunk.end_char) + .where(Chunk.document_id == document.id) + .order_by(Chunk.position, Chunk.id) + )

Also applies to: 544-547, 558-563

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@surfsense_backend/app/agents/chat/multi_agent_chat/shared/middleware/filesystem/backends/kb_postgres.py` around lines 530 - 534, The query in the select statement is unnecessarily fetching Chunk.content when canonical reads only require id, start_char, and end_char for line-range mapping, causing wasteful DB I/O and memory overhead. Remove Chunk.content from the select call at lines 530-534 to fetch only the required fields (id, start_char, end_char, and position for ordering). Apply the same fix to the similar queries at lines 544-547 and 558-563 that fetch Chunk.content unnecessarily.

coderabbitai · 2026-06-22T22:17:00Z

+    text = "Test chunk content."
+    mock = MagicMock(return_value=[ChunkSlice(text, 0, len(text))])
    monkeypatch.setattr(
-        "app.indexing_pipeline.cache.cached_indexing.chunk_text_hybrid",
+        "app.indexing_pipeline.cache.cached_indexing.chunk_markdown_with_spans",
        mock,


🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Mocked chunk slices don’t reflect input markdown spans.

This fixture returns a fixed chunk/span pair regardless of input, which breaks the ChunkSlice contract and weakens span-sensitive integration assertions.

✅ Contract-preserving mock

- text = "Test chunk content." - mock = MagicMock(return_value=[ChunkSlice(text, 0, len(text))]) + def _mock_chunker(markdown: str, *args, **kwargs): + return [ChunkSlice(markdown, 0, len(markdown))] + + mock = MagicMock(side_effect=_mock_chunker)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

text = "Test chunk content."

mock = MagicMock(return_value=[ChunkSlice(text, 0, len(text))])

monkeypatch.setattr(

"app.indexing_pipeline.cache.cached_indexing.chunk_text_hybrid",

"app.indexing_pipeline.cache.cached_indexing.chunk_markdown_with_spans",

mock,

def _mock_chunker(markdown: str, *args, **kwargs):

return [ChunkSlice(markdown, 0, len(markdown))]

mock = MagicMock(side_effect=_mock_chunker)

monkeypatch.setattr(

"app.indexing_pipeline.cache.cached_indexing.chunk_markdown_with_spans",

mock,

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@surfsense_backend/tests/integration/conftest.py` around lines 163 - 167, The mock for chunk_markdown_with_spans is returning a static chunk slice that doesn't depend on the actual input markdown, violating the function contract. Modify the monkeypatched mock to accept the input markdown as a parameter and dynamically generate ChunkSlice objects based on the actual spans within that input, rather than returning a hardcoded fixed chunk slice regardless of what markdown is passed to chunk_markdown_with_spans.

coderabbitai · 2026-06-22T22:17:00Z

+    chunk = "Test chunk content."
    monkeypatch.setattr(
-        "app.indexing_pipeline.cache.cached_indexing.chunk_text",
-        MagicMock(return_value=["Test chunk content."]),
+        "app.indexing_pipeline.cache.cached_indexing.chunk_markdown_with_spans",
+        MagicMock(return_value=[ChunkSlice(chunk, 0, len(chunk))]),
    )


🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Chunking mock should derive ChunkSlice from the incoming markdown.

Returning a fixed chunk/span pair for every call can invalidate span semantics and make integration checks less trustworthy.

✅ Contract-preserving mock

- chunk = "Test chunk content." monkeypatch.setattr( "app.indexing_pipeline.cache.cached_indexing.chunk_markdown_with_spans", - MagicMock(return_value=[ChunkSlice(chunk, 0, len(chunk))]), + MagicMock( + side_effect=lambda markdown, *args, **kwargs: [ + ChunkSlice(markdown, 0, len(markdown)) + ] + ), )

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

chunk = "Test chunk content."

monkeypatch.setattr(

"app.indexing_pipeline.cache.cached_indexing.chunk_text",

MagicMock(return_value=["Test chunk content."]),

"app.indexing_pipeline.cache.cached_indexing.chunk_markdown_with_spans",

MagicMock(return_value=[ChunkSlice(chunk, 0, len(chunk))]),

)

monkeypatch.setattr(

"app.indexing_pipeline.cache.cached_indexing.chunk_markdown_with_spans",

MagicMock(

side_effect=lambda markdown, *args, **kwargs: [

ChunkSlice(markdown, 0, len(markdown))

]

),

)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@surfsense_backend/tests/integration/document_upload/conftest.py` around lines 291 - 295, The mock for chunk_markdown_with_spans in the monkeypatch call returns a fixed ChunkSlice regardless of the markdown input, which breaks span semantics and reduces test reliability. Modify the MagicMock to be a function that accepts the markdown argument and derives the ChunkSlice from that actual input instead of returning hardcoded values, ensuring the mock properly processes different markdown inputs and maintains correct span semantics for the integration test.

coderabbitai · 2026-06-22T22:17:00Z

+		const lineMatch = LINE_CITATION_REGEX.exec(captured);
+		if (lineMatch) {
+			segments.push({
+				kind: "line",
+				documentId: Number.parseInt(lineMatch[1], 10),
+				startLine: Number.parseInt(lineMatch[2], 10),
+				endLine: Number.parseInt(lineMatch[3], 10),
+			});


🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Validate line-citation bounds before emitting a "line" token.

At Line 102–109, parsed values are accepted without enforcing the 1-based inclusive contract (startLine >= 1 and endLine >= startLine). Invalid citations can propagate bad highlight ranges.

Suggested fix

const lineMatch = LINE_CITATION_REGEX.exec(captured); if (lineMatch) { + const documentId = Number.parseInt(lineMatch[1], 10); + const startLine = Number.parseInt(lineMatch[2], 10); + const endLine = Number.parseInt(lineMatch[3], 10); + if (startLine < 1 || endLine < startLine) { + segments.push(match[0]); + lastIndex = match.index + match[0].length; + match = CITATION_REGEX.exec(text); + continue; + } segments.push({ kind: "line", - documentId: Number.parseInt(lineMatch[1], 10), - startLine: Number.parseInt(lineMatch[2], 10), - endLine: Number.parseInt(lineMatch[3], 10), + documentId, + startLine, + endLine, }); } else if (captured.startsWith("http://") || captured.startsWith("https://")) {

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@surfsense_web/lib/citations/citation-parser.ts` around lines 102 - 109, In the code block where lineMatch is checked and a line segment is pushed to the segments array, add validation to ensure the parsed line-citation bounds follow the 1-based inclusive contract. Before pushing the segment object in the segments.push() call, validate that startLine is greater than or equal to 1 and that endLine is greater than or equal to startLine. Only push the segment to the array if both validation conditions pass, otherwise skip it to prevent invalid citations from propagating bad highlight ranges.

CREDO23 · 2026-06-23T04:47:07Z

@Glodykajabika can you please target the dev branch ?

coderabbitai Bot reviewed Jun 22, 2026

View reviewed changes

Glodykajabika marked this pull request as draft June 22, 2026 22:25

Glodykajabika marked this pull request as ready for review June 22, 2026 22:59

Glodykajabika marked this pull request as draft June 22, 2026 23:31

test: add unit tests for blocknote_to_markdown

5625ca1

Glodykajabika force-pushed the test/blocknote-to-markdown-unit-tests branch from 21dfdd5 to 5625ca1 Compare June 22, 2026 23:44

Glodykajabika marked this pull request as ready for review June 22, 2026 23:45

Glodykajabika changed the base branch from main to dev June 23, 2026 06:06

MODSetter merged commit cb9da1b into MODSetter:dev Jun 23, 2026
1 of 2 checks passed

Glodykajabika deleted the test/blocknote-to-markdown-unit-tests branch June 23, 2026 09:29

		source_markdown = document.source_markdown or ""
		document_type = (

Uh oh!

Conversation

Glodykajabika commented Jun 22, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Motivation and Context

Screenshots

API Changes

Change Type

Testing Performed

Checklist

Summary by CodeRabbit

Uh oh!

vercel Bot commented Jun 22, 2026

Uh oh!

coderabbitai Bot commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related issues

Possibly related PRs

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

CREDO23 commented Jun 23, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Glodykajabika commented Jun 22, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 22, 2026 •

edited

Loading