Releases · mohsin1218/LongParser · GitHub

05 May 07:08

v0.1.5 Latest

Added

Semantic chunking: embedding-based boundary detection in HybridChunker using all-MiniLM-L6-v2 (optional, enabled via use_semantic_chunking).
Cross-reference resolution: linear-time, $O(N)$, resolution of explicit ("Figure 3") and implicit ("the table below") references based on spatial proximity.
Summary chunks: an asynchronous ARQ background worker (enrich_summaries_job) that generates LLM section summaries for hierarchical RAG retrieval.
Chunk quality scorer: zero-ML, heuristic chunk scoring based on block token confidences, dictionary word coverage (/usr/share/dict/words), and fastText language-ID validation.
PII redaction: a hybrid approach using fast regex matching plus Luhn checks (emails, phone numbers, SSNs, credit card numbers, IP addresses) and optional spaCy NER (en_core_web_sm) for names, organizations, and locations. Original values are preserved in secure block metadata for human-in-the-loop (HITL) review.
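
To illustrate the regex-plus-Luhn idea behind PII redaction, here is a minimal sketch in plain Python. It is not LongParser's implementation: the patterns, the function names (luhn_valid, redact), and the placeholder strings are all assumptions made for this example, and it covers only emails and credit card numbers.

```python
import re

# Hypothetical patterns for illustration; not LongParser's actual regexes.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 13:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact(text: str) -> tuple[str, list[dict]]:
    """Replace emails and Luhn-valid card numbers with placeholders.

    Returns the redacted text plus the original values, which a caller
    could keep in block metadata for later human review.
    """
    found: list[dict] = []

    def _email(match: re.Match) -> str:
        found.append({"type": "EMAIL", "value": match.group(0)})
        return "[EMAIL]"

    def _card(match: re.Match) -> str:
        # Digit runs that fail the checksum are left untouched.
        if not luhn_valid(match.group(0)):
            return match.group(0)
        found.append({"type": "CREDIT_CARD", "value": match.group(0)})
        return "[CREDIT_CARD]"

    text = EMAIL_RE.sub(_email, text)
    text = CARD_RE.sub(_card, text)
    return text, found
```

The checksum step is what keeps arbitrary long digit runs (order numbers, IDs) from being redacted as card numbers: only candidates that pass Luhn are replaced, and the originals are returned separately so they are not lost.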

Changed

Bumped marker-pdf version support in dependencies.
Added an optional ner dependency group (spacy>=3.7.0) in pyproject.toml.
Expanded ChunkingConfig and ProcessingConfig with new semantic, summary, and PII toggle options.
Marked Phase 1 as officially complete in the roadmap.

22 Apr 12:44

v0.1.4

Release v0.1.4: Add fast PDF extractor, auto-language detection, AGPL…
