Skip to content

Test/blocknote to markdown unit tests#1533

Merged
MODSetter merged 1 commit into
MODSetter:devfrom
Glodykajabika:test/blocknote-to-markdown-unit-tests
Jun 23, 2026
Merged

Test/blocknote to markdown unit tests#1533
MODSetter merged 1 commit into
MODSetter:devfrom
Glodykajabika:test/blocknote-to-markdown-unit-tests

Conversation

@Glodykajabika

@Glodykajabika Glodykajabika commented Jun 22, 2026

Copy link
Copy Markdown

Description

Adds comprehensive unit tests for the blocknote_to_markdown() function as specified in issue #1519. Tests cover all block types, inline styles, edge cases, and nested content with proper indentation.

The new test file tests/unit/utils/test_blocknote_to_markdown.py includes test cases for:

  • Headings (levels 1-6 with clamping)
  • Inline styles (bold, italic, code, strikethrough with code priority)
  • Lists (bullet, numbered with props.start, checklists checked/unchecked)
  • Code blocks (with/without language tags)
  • Tables (dict and list row shapes, mixed cell types)
  • Images and links nested within blocks
  • Nested children with indentation at multiple levels
  • Edge cases (None, empty lists, unknown block types)

Motivation and Context

The blocknote_to_markdown() function is a critical utility for converting BlockNote editor JSON to Markdown, but it had no unit tests. This PR provides comprehensive test coverage to ensure the function works correctly across all supported block types and edge cases.

closes#1519

Screenshots

N/A (unit tests, no visual changes)

API Changes

  • This PR includes API changes

Change Type

  • Bug fix
  • New feature
  • Performance improvement
  • Refactoring
  • Documentation
  • Dependency/Build system
  • Breaking change
  • Other (specify): test coverage

Testing Performed

All tests pass locally with:

uv run pytest -m unit tests/unit/utils/test_blocknote_to_markdown.py -v
# 33 passed, 0 failed
uv run ruff check .
uv run ruff format --check .
# Both passed clean
  • Tested locally
  • Manual/QA verification

Checklist

  • Follows project coding standards and conventions
  • Documentation updated as needed
  • Dependencies updated as needed
  • No lint/build errors or new warnings
  • All relevant tests are passing

Summary by CodeRabbit

  • New Features

    • Added numbered knowledge-base line citations ([citation:d<id>#L<start>-<end>]) and clickable inline tokens.
    • Clicking a line citation opens the editor in read-only source view with the referenced line range highlighted.
  • Improvements

    • Knowledge-base chunking, retrieval, and citation rendering now align to character-span offsets for more accurate copyable line-range tokens.
    • Editor/citation UI updated to label and highlight cited ranges when available.
    • Editor and export now use the canonical stored document body, returning clearer errors when it’s missing or unavailable.

@vercel

vercel Bot commented Jun 22, 2026

Copy link
Copy Markdown

@Glodykajabika is attempting to deploy a commit to the Rohan Verma's projects Team on Vercel.

A member of the Team first needs to authorize it.

@coderabbitai

coderabbitai Bot commented Jun 22, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 335d5682-2fb1-40cd-b7f5-d7cab25a56d3

📥 Commits

Reviewing files that changed from the base of the PR and between 21dfdd5 and 5625ca1.

📒 Files selected for processing (1)
  • surfsense_backend/tests/unit/utils/test_blocknote_to_markdown.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • surfsense_backend/tests/unit/utils/test_blocknote_to_markdown.py

📝 Walkthrough

Walkthrough

This PR implements a complete chunk character span system for precise source line citations. The backend adds start_char/end_char columns to store half-open character offsets, replaces string chunking with lossless span-addressed chunk_markdown_with_spans, propagates spans through indexing, retrieval, and KB filesystem reads. The KB search tool renders matched passages with line-range citations, and the KB persistence middleware stores spans on chunks. Agent prompts updated for the new [citation:d<id>#L<start>-<end>] format. The frontend adds a LineCitation component, extends the citation parser to recognize line citations, and implements Monaco editor line decorations and reveal behavior triggered from citation clicks.

Changes

End-to-end chunk character span and line-citation feature

Layer / File(s) Summary
Core data contracts: ChunkSlice, DB schema, schemas, and text_spans utility
surfsense_backend/app/indexing_pipeline/document_chunker.py, surfsense_backend/app/db.py, surfsense_backend/alembic/versions/166_add_chunk_char_spans.py, surfsense_backend/app/schemas/chunks.py, surfsense_backend/app/schemas/documents.py, surfsense_backend/app/utils/text_spans.py, surfsense_backend/app/indexing_pipeline/chunk_reconciler.py, surfsense_backend/tests/unit/utils/test_text_spans.py
Introduces ChunkSlice(text, start_char, end_char) dataclass for lossless span-addressed markdown slices. Adds start_char/end_char nullable integer columns to Chunk via Alembic migration and ORM model. Extends ChunkRead and DocumentWithChunksRead schemas with optional citation span fields. Updates ExistingChunk to carry loaded span metadata. Adds the shared char_span_to_line_range utility function to convert character offsets to 1-based inclusive line ranges.
Span-aware chunker and embedding cache pipeline
surfsense_backend/app/indexing_pipeline/cache/cached_indexing.py, surfsense_backend/app/config/__init__.py, surfsense_backend/tests/unit/indexing_pipeline/test_chunk_markdown_with_spans.py, surfsense_backend/tests/unit/indexing_pipeline/test_index_batch_parallel.py
Replaces the prior string-returning chunk_text_hybrid with chunk_markdown_with_spans, which produces a lossless, contiguous partition of ChunkSlice objects where tables are emitted whole and other regions are chunked without stripping whitespace. Refactors build_chunk_embeddings to precompute slices and align cache hits by text content, returning (summary_emb, list[SliceEmbedding]) pairs. Bumps EMBEDDING_CACHE_CHUNKER_VERSION to 2 to invalidate old caches.
Indexing pipeline service: scratch and incremental reindex with spans
surfsense_backend/app/indexing_pipeline/indexing_pipeline_service.py, surfsense_backend/tests/integration/indexing_pipeline/test_index_spans.py, surfsense_backend/tests/integration/indexing_pipeline/test_index_editions.py, surfsense_backend/tests/integration/document_upload/conftest.py, surfsense_backend/tests/integration/conftest.py
Scratch reindex persists start_char/end_char extracted from SliceEmbedding pairs on new chunks. Incremental reindex uses chunk_slices and a new _kept_row_span_updates helper to compute UPDATE payloads for kept chunks when span boundaries shift, avoiding unnecessary re-embeddings. _load_existing_chunks now fetches stored span columns for reconciliation.
Retrieval and document/editor API routes
surfsense_backend/app/retriever/chunks_hybrid_search.py, surfsense_backend/app/routes/documents_routes.py, surfsense_backend/app/routes/editor_routes.py, surfsense_backend/tests/integration/retriever/conftest.py, surfsense_backend/tests/integration/retriever/test_optimized_chunk_retriever.py, surfsense_backend/tests/integration/test_documents_by_chunk_route.py, surfsense_backend/tests/integration/test_editor_routes.py
hybrid_search selects and returns start_char/end_char per chunk in result dictionaries. get_document_by_chunk_id endpoint computes cited_start_line/cited_end_line from chunk character spans via char_span_to_line_range when both spans and source_markdown are available. Editor routes remove chunk-based markdown reconstruction and introduce _raise_no_canonical_body to raise typed HTTP errors (400/409/422) based on document processing state when source_markdown is absent.
KB filesystem read path: numbered preamble, kb_postgres, and tool adaptations
surfsense_backend/app/agents/chat/multi_agent_chat/shared/middleware/filesystem/backends/numbered_document.py, surfsense_backend/app/agents/chat/multi_agent_chat/shared/middleware/filesystem/backends/kb_postgres.py, surfsense_backend/app/agents/chat/multi_agent_chat/shared/middleware/filesystem/tools/edit_file/index.py, surfsense_backend/app/agents/chat/multi_agent_chat/shared/middleware/filesystem/tools/move_file/helpers.py, surfsense_backend/app/agents/chat/multi_agent_chat/shared/middleware/filesystem/tools/read_file/index.py, surfsense_backend/app/agents/chat/multi_agent_chat/shared/middleware/filesystem/tools/rm/helpers.py, surfsense_backend/tests/unit/middleware/test_numbered_document.py, surfsense_backend/tests/unit/middleware/test_kb_persistence_filesystem_parity.py, surfsense_backend/tests/unit/middleware/test_b_filesystem_rm_rmdir_cloud.py
New numbered_document.py module computes matched line ranges from chunk character spans and builds an XML <document_metadata> preamble. _load_file_data returns a 3-tuple (file_data, doc_id, preamble) and the read_file tool prepends the preamble when offset == 0. Edit, move, read, and rm filesystem tools updated to unpack the 3-tuple return value.
Search KB tool, KB persistence middleware, and agent system prompts
surfsense_backend/app/agents/chat/multi_agent_chat/main_agent/tools/search_knowledge_base.py, surfsense_backend/app/agents/chat/multi_agent_chat/main_agent/middleware/kb_persistence/middleware.py, surfsense_backend/app/agents/chat/multi_agent_chat/main_agent/system_prompt/prompts/citations/on.md, surfsense_backend/app/agents/chat/multi_agent_chat/subagents/builtins/knowledge_base/system_prompt_cloud.md, surfsense_backend/app/agents/chat/multi_agent_chat/subagents/builtins/knowledge_base/system_prompt_desktop.md, surfsense_backend/app/agents/chat/multi_agent_chat/subagents/builtins/knowledge_base/system_prompt_readonly_cloud.md, surfsense_backend/tests/unit/agents/multi_agent_chat/tools/test_search_knowledge_base.py, surfsense_backend/tests/integration/agents/multi_agent_chat/test_kb_persistence_spans.py
search_knowledge_base tool adds _resolve_doc_context, _render_passage, _citation_token, and matched-passage rendering helpers to emit formatted KB search results with line-range citations ([citation:dID#Lstart-end]). KB persistence middleware replaces the prior asyncio.to_thread(embed_texts, chunk_text(...)) flow with build_chunk_embeddings, persisting start_char/end_char on all created and updated chunks. Agent system prompts updated to define three citation channels with explicit line-range citation format and stricter accuracy constraints.
Frontend: citation parser, LineCitation component, editor highlighting, and utilities
surfsense_web/lib/citations/citation-parser.ts, surfsense_web/atoms/editor/editor-panel.atom.ts, surfsense_web/components/assistant-ui/inline-citation.tsx, surfsense_web/components/citations/citation-renderer.tsx, surfsense_web/components/editor/plugins/citation-kit.tsx, surfsense_web/components/editor/source-code-editor.tsx, surfsense_web/components/citation-panel/citation-panel.tsx, surfsense_web/components/editor-panel/editor-panel.tsx, surfsense_web/components/layout/ui/right-panel/RightPanel.tsx, surfsense_web/contracts/types/document.types.ts, surfsense_web/app/globals.css, surfsense_backend/tests/unit/utils/test_blocknote_to_markdown.py
Citation parser extends CITATION_REGEX to match d<documentId>#L<start>-<end> line citations and emits a new "line" token type. New LineCitation component renders clickable line citations that open the editor panel with highlightLines/forceSourceView flags set. EditorLineRange state type added to editor-panel.atom.ts and wired through EditorPanelContent, desktop/mobile panel variants, and RightPanel (desktop-only). SourceCodeEditor accepts highlightLines and applies Monaco whole-line decorations with viewport reveal on mount and layout changes. CitationElementNode and makeCitationElement support the new line variant. Adds comprehensive blocknote_to_markdown utility test suite covering all block types and edge cases.

Sequence Diagram(s)

sequenceDiagram
  participant Agent
  participant search_knowledge_base
  participant DB
  participant kb_postgres._load_file_data
  participant numbered_document
  participant LLM

  Agent->>search_knowledge_base: query
  search_knowledge_base->>DB: hybrid_search(query)
  DB-->>search_knowledge_base: hits with start_char/end_char per chunk
  search_knowledge_base->>DB: _resolve_doc_context(doc_ids)
  DB-->>search_knowledge_base: document bodies & metadata
  search_knowledge_base->>search_knowledge_base: render matched passages + compute line ranges
  search_knowledge_base-->>Agent: formatted KB results with [citation:dID#Lx-y]

  Agent->>kb_postgres._load_file_data: read_file(path, offset=0, matched_chunk_ids={101,102})
  kb_postgres._load_file_data->>DB: fetch source_markdown + chunk spans
  kb_postgres._load_file_data->>numbered_document: compute_matched_line_ranges(chunks, {101,102})
  numbered_document-->>kb_postgres._load_file_data: [(5,8), (12,15)]
  kb_postgres._load_file_data->>numbered_document: build_read_preamble(doc_id=42, ranges)
  numbered_document-->>kb_postgres._load_file_data: XML preamble with metadata
  kb_postgres._load_file_data-->>Agent: (file_data, 42, preamble)
  Agent->>LLM: [preamble] + numbered source markdown
  LLM-->>Agent: response citing [citation:d42#L5-8]
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~90 minutes

Possibly related issues

  • [Test] Add unit tests for blocknote_to_markdown #1519: The main issue adds comprehensive unit tests for the blocknote_to_markdown function that directly implement the feature request outlined in the retrieved issue, covering all requested test cases (headings, inline styles, lists, code blocks, tables, images, links, nested children, and edge cases).

Possibly related PRs

  • MODSetter/SurfSense#286: Both PRs wire citations to "jump to source" behavior via the /documents/by-chunk/{chunk_id} endpoint; this PR extends that flow with backend start_char/end_charcited_start_line/cited_end_line mapping and corresponding frontend line citation parsing and highlightLines editor panel navigation.
  • MODSetter/SurfSense#1489: Both PRs modify KB persistence middleware and the build_chunk_embeddings(...) embedding pipeline in surfsense_backend/app/indexing_pipeline/cache/cached_indexing.py, with this PR adding span-aware slice-embedding pairs to the caching and incremental reindexing logic.

Suggested reviewers

  • MODSetter

Poem

🐇✨ A rabbit found the spans one day,
Each chunk a slice with offsets—hooray!
"Start here at char, end over there,
Now line five through eight—citations laid bare!"
The Monaco editor lit up each line,
[citation:d42#L5-8] working divine! 🎯📖

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 49.09% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Test/blocknote to markdown unit tests' is directly related to the primary change: adding a new unit test module for the blocknote_to_markdown function with 33 test cases.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

🧹 Nitpick comments (1)
surfsense_backend/tests/integration/indexing_pipeline/test_index_spans.py (1)

32-37: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

Assert span ordering/contiguity explicitly in the integration invariant helper.

Current checks can miss mispositioned offsets when chunk text repeats. Add a cursor-based monotonic/contiguous assertion to validate absolute positions, not just content equality.

🔧 Suggested strengthening
 def _assert_spans_address_body(chunks: list[Chunk], body: str) -> None:
+    cursor = 0
     for chunk in chunks:
         assert chunk.start_char is not None and chunk.end_char is not None
+        assert chunk.start_char == cursor
+        assert chunk.end_char >= chunk.start_char
         assert body[chunk.start_char : chunk.end_char] == chunk.content
+        cursor = chunk.end_char
     assert "".join(c.content for c in chunks) == body
+    assert cursor == len(body)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@surfsense_backend/tests/integration/indexing_pipeline/test_index_spans.py`
around lines 32 - 37, The _assert_spans_address_body function needs to add
explicit validation for chunk ordering and contiguity to catch mispositioned
offsets. Add assertions that verify: chunks are ordered with monotonically
increasing start_char positions, each chunk is contiguous with the previous one
(each chunk's start_char equals the previous chunk's end_char), the first chunk
begins at position 0, and the final chunk ends at the length of the body. Use a
cursor variable that tracks the expected position of each successive chunk to
validate these invariants.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@surfsense_backend/app/agents/chat/multi_agent_chat/shared/middleware/filesystem/backends/kb_postgres.py`:
- Around line 530-534: The query in the select statement is unnecessarily
fetching Chunk.content when canonical reads only require id, start_char, and
end_char for line-range mapping, causing wasteful DB I/O and memory overhead.
Remove Chunk.content from the select call at lines 530-534 to fetch only the
required fields (id, start_char, end_char, and position for ordering). Apply the
same fix to the similar queries at lines 544-547 and 558-563 that fetch
Chunk.content unnecessarily.
- Around line 523-524: The issue is that line 523 coerces None to an empty
string for source_markdown, and then the truthy check at line 543 treats empty
strings as falsy, incorrectly routing documents with intentionally empty
canonical markdown to the legacy XML fallback. To fix this, preserve the
distinction between None (no canonical body provided) and "" (intentionally
empty canonical body) by removing the coercion at line 523 that converts None to
"". Then update the conditional logic at lines 543-557 and 558-574 from truthy
checks to explicit None checks, so that documents with empty canonical markdown
are properly handled and not incorrectly routed to the legacy XML fallback.

In `@surfsense_backend/tests/integration/conftest.py`:
- Around line 163-167: The mock for chunk_markdown_with_spans is returning a
static chunk slice that doesn't depend on the actual input markdown, violating
the function contract. Modify the monkeypatched mock to accept the input
markdown as a parameter and dynamically generate ChunkSlice objects based on the
actual spans within that input, rather than returning a hardcoded fixed chunk
slice regardless of what markdown is passed to chunk_markdown_with_spans.

In `@surfsense_backend/tests/integration/document_upload/conftest.py`:
- Around line 291-295: The mock for chunk_markdown_with_spans in the monkeypatch
call returns a fixed ChunkSlice regardless of the markdown input, which breaks
span semantics and reduces test reliability. Modify the MagicMock to be a
function that accepts the markdown argument and derives the ChunkSlice from that
actual input instead of returning hardcoded values, ensuring the mock properly
processes different markdown inputs and maintains correct span semantics for the
integration test.

In `@surfsense_backend/tests/unit/utils/test_blocknote_to_markdown.py`:
- Around line 16-20: The test_block fixture and other shared mutable fixtures
defined at class level (including those at lines 101-104, 145-148, 170-177,
210-214, and 239-242) are being mutated across test methods, causing cross-test
state leakage and order-dependent behavior. Convert these class-level fixtures
to per-test fixtures by either using pytest fixture decorators with function
scope or by creating fresh local objects directly within each test method that
needs them. This ensures each test gets its own independent copy of the data
without mutations affecting other tests.
- Around line 154-157: The assert statements at lines 154-157, 165-168, 184-188,
and 202-206 are currently asserting generator objects (which are always truthy)
rather than the comparison results themselves, causing the tests to pass even
when blocknote_to_markdown() is wrong. Wrap each generator expression with the
all() function to properly validate that every comparison in the generator
evaluates to True, ensuring these assertions actually test the output from
blocknote_to_markdown() instead of just confirming a generator object exists.

In `@surfsense_web/lib/citations/citation-parser.ts`:
- Around line 102-109: In the code block where lineMatch is checked and a line
segment is pushed to the segments array, add validation to ensure the parsed
line-citation bounds follow the 1-based inclusive contract. Before pushing the
segment object in the segments.push() call, validate that startLine is greater
than or equal to 1 and that endLine is greater than or equal to startLine. Only
push the segment to the array if both validation conditions pass, otherwise skip
it to prevent invalid citations from propagating bad highlight ranges.

---

Nitpick comments:
In `@surfsense_backend/tests/integration/indexing_pipeline/test_index_spans.py`:
- Around line 32-37: The _assert_spans_address_body function needs to add
explicit validation for chunk ordering and contiguity to catch mispositioned
offsets. Add assertions that verify: chunks are ordered with monotonically
increasing start_char positions, each chunk is contiguous with the previous one
(each chunk's start_char equals the previous chunk's end_char), the first chunk
begins at position 0, and the final chunk ends at the length of the body. Use a
cursor variable that tracks the expected position of each successive chunk to
validate these invariants.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5a78f101-6f1e-4428-8d65-d0fb7cd5f164

📥 Commits

Reviewing files that changed from the base of the PR and between ee241e0 and 53368ea.

📒 Files selected for processing (54)
  • surfsense_backend/alembic/versions/166_add_chunk_char_spans.py
  • surfsense_backend/app/agents/chat/multi_agent_chat/main_agent/middleware/kb_persistence/middleware.py
  • surfsense_backend/app/agents/chat/multi_agent_chat/main_agent/system_prompt/prompts/citations/on.md
  • surfsense_backend/app/agents/chat/multi_agent_chat/main_agent/tools/search_knowledge_base.py
  • surfsense_backend/app/agents/chat/multi_agent_chat/shared/middleware/filesystem/backends/kb_postgres.py
  • surfsense_backend/app/agents/chat/multi_agent_chat/shared/middleware/filesystem/backends/numbered_document.py
  • surfsense_backend/app/agents/chat/multi_agent_chat/shared/middleware/filesystem/tools/edit_file/index.py
  • surfsense_backend/app/agents/chat/multi_agent_chat/shared/middleware/filesystem/tools/move_file/helpers.py
  • surfsense_backend/app/agents/chat/multi_agent_chat/shared/middleware/filesystem/tools/read_file/index.py
  • surfsense_backend/app/agents/chat/multi_agent_chat/shared/middleware/filesystem/tools/rm/helpers.py
  • surfsense_backend/app/agents/chat/multi_agent_chat/subagents/builtins/knowledge_base/system_prompt_cloud.md
  • surfsense_backend/app/agents/chat/multi_agent_chat/subagents/builtins/knowledge_base/system_prompt_desktop.md
  • surfsense_backend/app/agents/chat/multi_agent_chat/subagents/builtins/knowledge_base/system_prompt_readonly_cloud.md
  • surfsense_backend/app/config/__init__.py
  • surfsense_backend/app/db.py
  • surfsense_backend/app/indexing_pipeline/cache/cached_indexing.py
  • surfsense_backend/app/indexing_pipeline/chunk_reconciler.py
  • surfsense_backend/app/indexing_pipeline/document_chunker.py
  • surfsense_backend/app/indexing_pipeline/indexing_pipeline_service.py
  • surfsense_backend/app/retriever/chunks_hybrid_search.py
  • surfsense_backend/app/routes/documents_routes.py
  • surfsense_backend/app/routes/editor_routes.py
  • surfsense_backend/app/schemas/chunks.py
  • surfsense_backend/app/schemas/documents.py
  • surfsense_backend/app/utils/text_spans.py
  • surfsense_backend/tests/integration/agents/multi_agent_chat/test_kb_persistence_spans.py
  • surfsense_backend/tests/integration/conftest.py
  • surfsense_backend/tests/integration/document_upload/conftest.py
  • surfsense_backend/tests/integration/indexing_pipeline/adapters/test_file_upload_adapter.py
  • surfsense_backend/tests/integration/indexing_pipeline/test_index_editions.py
  • surfsense_backend/tests/integration/indexing_pipeline/test_index_spans.py
  • surfsense_backend/tests/integration/retriever/conftest.py
  • surfsense_backend/tests/integration/retriever/test_optimized_chunk_retriever.py
  • surfsense_backend/tests/integration/test_documents_by_chunk_route.py
  • surfsense_backend/tests/integration/test_editor_routes.py
  • surfsense_backend/tests/unit/agents/multi_agent_chat/tools/test_search_knowledge_base.py
  • surfsense_backend/tests/unit/indexing_pipeline/test_chunk_markdown_with_spans.py
  • surfsense_backend/tests/unit/indexing_pipeline/test_index_batch_parallel.py
  • surfsense_backend/tests/unit/middleware/test_b_filesystem_rm_rmdir_cloud.py
  • surfsense_backend/tests/unit/middleware/test_kb_persistence_filesystem_parity.py
  • surfsense_backend/tests/unit/middleware/test_numbered_document.py
  • surfsense_backend/tests/unit/utils/test_blocknote_to_markdown.py
  • surfsense_backend/tests/unit/utils/test_text_spans.py
  • surfsense_web/app/globals.css
  • surfsense_web/atoms/editor/editor-panel.atom.ts
  • surfsense_web/components/assistant-ui/inline-citation.tsx
  • surfsense_web/components/citation-panel/citation-panel.tsx
  • surfsense_web/components/citations/citation-renderer.tsx
  • surfsense_web/components/editor-panel/editor-panel.tsx
  • surfsense_web/components/editor/plugins/citation-kit.tsx
  • surfsense_web/components/editor/source-code-editor.tsx
  • surfsense_web/components/layout/ui/right-panel/RightPanel.tsx
  • surfsense_web/contracts/types/document.types.ts
  • surfsense_web/lib/citations/citation-parser.ts

Comment on lines +523 to +524
source_markdown = document.source_markdown or ""
document_type = (

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Do not treat empty canonical markdown as “no canonical body.”

Line 523 coerces None to "", and Line 543 uses a truthy check. A document with intentionally empty canonical content is incorrectly routed to the legacy XML fallback, which breaks the canonical-read contract.

💡 Suggested fix
-            source_markdown = document.source_markdown or ""
+            source_markdown = document.source_markdown
@@
-        if source_markdown:
+        if source_markdown is not None:
             ranges = compute_matched_line_ranges(
                 source_markdown,
                 [(r.id, r.start_char, r.end_char) for r in chunk_records],
                 matched,
             )

Also applies to: 543-557, 558-574

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@surfsense_backend/app/agents/chat/multi_agent_chat/shared/middleware/filesystem/backends/kb_postgres.py`
around lines 523 - 524, The issue is that line 523 coerces None to an empty
string for source_markdown, and then the truthy check at line 543 treats empty
strings as falsy, incorrectly routing documents with intentionally empty
canonical markdown to the legacy XML fallback. To fix this, preserve the
distinction between None (no canonical body provided) and "" (intentionally
empty canonical body) by removing the coercion at line 523 that converts None to
"". Then update the conditional logic at lines 543-557 and 558-574 from truthy
checks to explicit None checks, so that documents with empty canonical markdown
are properly handled and not incorrectly routed to the legacy XML fallback.

Comment on lines 530 to 534
chunk_rows = await session.execute(
select(Chunk.id, Chunk.content)
select(Chunk.id, Chunk.content, Chunk.start_char, Chunk.end_char)
.where(Chunk.document_id == document.id)
.order_by(Chunk.position, Chunk.id)
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀 Performance & Scalability | 🟠 Major | ⚡ Quick win

Avoid loading full chunk text on canonical reads.

Line 531 always fetches Chunk.content, but canonical reads only consume id/start_char/end_char for line-range mapping. This adds avoidable DB I/O and memory pressure on large documents.

💡 Suggested optimization
-            chunk_rows = await session.execute(
-                select(Chunk.id, Chunk.content, Chunk.start_char, Chunk.end_char)
-                .where(Chunk.document_id == document.id)
-                .order_by(Chunk.position, Chunk.id)
-            )
+            if document.source_markdown is not None:
+                chunk_rows = await session.execute(
+                    select(Chunk.id, Chunk.start_char, Chunk.end_char)
+                    .where(Chunk.document_id == document.id)
+                    .order_by(Chunk.position, Chunk.id)
+                )
+            else:
+                chunk_rows = await session.execute(
+                    select(Chunk.id, Chunk.content, Chunk.start_char, Chunk.end_char)
+                    .where(Chunk.document_id == document.id)
+                    .order_by(Chunk.position, Chunk.id)
+                )

Also applies to: 544-547, 558-563

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@surfsense_backend/app/agents/chat/multi_agent_chat/shared/middleware/filesystem/backends/kb_postgres.py`
around lines 530 - 534, The query in the select statement is unnecessarily
fetching Chunk.content when canonical reads only require id, start_char, and
end_char for line-range mapping, causing wasteful DB I/O and memory overhead.
Remove Chunk.content from the select call at lines 530-534 to fetch only the
required fields (id, start_char, end_char, and position for ordering). Apply the
same fix to the similar queries at lines 544-547 and 558-563 that fetch
Chunk.content unnecessarily.

Comment on lines +163 to 167
text = "Test chunk content."
mock = MagicMock(return_value=[ChunkSlice(text, 0, len(text))])
monkeypatch.setattr(
"app.indexing_pipeline.cache.cached_indexing.chunk_text_hybrid",
"app.indexing_pipeline.cache.cached_indexing.chunk_markdown_with_spans",
mock,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Mocked chunk slices don’t reflect input markdown spans.

This fixture returns a fixed chunk/span pair regardless of input, which breaks the ChunkSlice contract and weakens span-sensitive integration assertions.

✅ Contract-preserving mock
-    text = "Test chunk content."
-    mock = MagicMock(return_value=[ChunkSlice(text, 0, len(text))])
+    def _mock_chunker(markdown: str, *args, **kwargs):
+        return [ChunkSlice(markdown, 0, len(markdown))]
+
+    mock = MagicMock(side_effect=_mock_chunker)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
text = "Test chunk content."
mock = MagicMock(return_value=[ChunkSlice(text, 0, len(text))])
monkeypatch.setattr(
"app.indexing_pipeline.cache.cached_indexing.chunk_text_hybrid",
"app.indexing_pipeline.cache.cached_indexing.chunk_markdown_with_spans",
mock,
def _mock_chunker(markdown: str, *args, **kwargs):
return [ChunkSlice(markdown, 0, len(markdown))]
mock = MagicMock(side_effect=_mock_chunker)
monkeypatch.setattr(
"app.indexing_pipeline.cache.cached_indexing.chunk_markdown_with_spans",
mock,
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@surfsense_backend/tests/integration/conftest.py` around lines 163 - 167, The
mock for chunk_markdown_with_spans is returning a static chunk slice that
doesn't depend on the actual input markdown, violating the function contract.
Modify the monkeypatched mock to accept the input markdown as a parameter and
dynamically generate ChunkSlice objects based on the actual spans within that
input, rather than returning a hardcoded fixed chunk slice regardless of what
markdown is passed to chunk_markdown_with_spans.

Comment on lines +291 to 295
chunk = "Test chunk content."
monkeypatch.setattr(
"app.indexing_pipeline.cache.cached_indexing.chunk_text",
MagicMock(return_value=["Test chunk content."]),
"app.indexing_pipeline.cache.cached_indexing.chunk_markdown_with_spans",
MagicMock(return_value=[ChunkSlice(chunk, 0, len(chunk))]),
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Chunking mock should derive ChunkSlice from the incoming markdown.

Returning a fixed chunk/span pair for every call can invalidate span semantics and make integration checks less trustworthy.

✅ Contract-preserving mock
-    chunk = "Test chunk content."
     monkeypatch.setattr(
         "app.indexing_pipeline.cache.cached_indexing.chunk_markdown_with_spans",
-        MagicMock(return_value=[ChunkSlice(chunk, 0, len(chunk))]),
+        MagicMock(
+            side_effect=lambda markdown, *args, **kwargs: [
+                ChunkSlice(markdown, 0, len(markdown))
+            ]
+        ),
     )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
chunk = "Test chunk content."
monkeypatch.setattr(
"app.indexing_pipeline.cache.cached_indexing.chunk_text",
MagicMock(return_value=["Test chunk content."]),
"app.indexing_pipeline.cache.cached_indexing.chunk_markdown_with_spans",
MagicMock(return_value=[ChunkSlice(chunk, 0, len(chunk))]),
)
monkeypatch.setattr(
"app.indexing_pipeline.cache.cached_indexing.chunk_markdown_with_spans",
MagicMock(
side_effect=lambda markdown, *args, **kwargs: [
ChunkSlice(markdown, 0, len(markdown))
]
),
)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@surfsense_backend/tests/integration/document_upload/conftest.py` around lines
291 - 295, The mock for chunk_markdown_with_spans in the monkeypatch call
returns a fixed ChunkSlice regardless of the markdown input, which breaks span
semantics and reduces test reliability. Modify the MagicMock to be a function
that accepts the markdown argument and derives the ChunkSlice from that actual
input instead of returning hardcoded values, ensuring the mock properly
processes different markdown inputs and maintains correct span semantics for the
integration test.

Comment thread surfsense_backend/tests/unit/utils/test_blocknote_to_markdown.py Outdated
Comment thread surfsense_backend/tests/unit/utils/test_blocknote_to_markdown.py Outdated
Comment on lines +102 to +109
const lineMatch = LINE_CITATION_REGEX.exec(captured);
if (lineMatch) {
segments.push({
kind: "line",
documentId: Number.parseInt(lineMatch[1], 10),
startLine: Number.parseInt(lineMatch[2], 10),
endLine: Number.parseInt(lineMatch[3], 10),
});

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Validate line-citation bounds before emitting a "line" token.

At Line 102–109, parsed values are accepted without enforcing the 1-based inclusive contract (startLine >= 1 and endLine >= startLine). Invalid citations can propagate bad highlight ranges.

Suggested fix
 		const lineMatch = LINE_CITATION_REGEX.exec(captured);
 		if (lineMatch) {
+			const documentId = Number.parseInt(lineMatch[1], 10);
+			const startLine = Number.parseInt(lineMatch[2], 10);
+			const endLine = Number.parseInt(lineMatch[3], 10);
+			if (startLine < 1 || endLine < startLine) {
+				segments.push(match[0]);
+				lastIndex = match.index + match[0].length;
+				match = CITATION_REGEX.exec(text);
+				continue;
+			}
 			segments.push({
 				kind: "line",
-				documentId: Number.parseInt(lineMatch[1], 10),
-				startLine: Number.parseInt(lineMatch[2], 10),
-				endLine: Number.parseInt(lineMatch[3], 10),
+				documentId,
+				startLine,
+				endLine,
 			});
 		} else if (captured.startsWith("http://") || captured.startsWith("https://")) {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@surfsense_web/lib/citations/citation-parser.ts` around lines 102 - 109, In
the code block where lineMatch is checked and a line segment is pushed to the
segments array, add validation to ensure the parsed line-citation bounds follow
the 1-based inclusive contract. Before pushing the segment object in the
segments.push() call, validate that startLine is greater than or equal to 1 and
that endLine is greater than or equal to startLine. Only push the segment to the
array if both validation conditions pass, otherwise skip it to prevent invalid
citations from propagating bad highlight ranges.

@Glodykajabika Glodykajabika marked this pull request as draft June 22, 2026 22:25
@Glodykajabika Glodykajabika marked this pull request as ready for review June 22, 2026 22:59
@Glodykajabika Glodykajabika marked this pull request as draft June 22, 2026 23:31
@Glodykajabika Glodykajabika force-pushed the test/blocknote-to-markdown-unit-tests branch from 21dfdd5 to 5625ca1 Compare June 22, 2026 23:44
@Glodykajabika Glodykajabika marked this pull request as ready for review June 22, 2026 23:45
@CREDO23

CREDO23 commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator

@Glodykajabika can you please target the dev branch ?

@Glodykajabika Glodykajabika changed the base branch from main to dev June 23, 2026 06:06
@MODSetter MODSetter merged commit cb9da1b into MODSetter:dev Jun 23, 2026
1 of 2 checks passed
@Glodykajabika Glodykajabika deleted the test/blocknote-to-markdown-unit-tests branch June 23, 2026 09:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants