Releases: mohsin1218/LongParser
v0.1.5
Added
- Semantic chunking: `all-MiniLM-L6-v2` embedding-based boundary detection in `HybridChunker` (optional via `use_semantic_chunking`).
- Cross-reference resolution: Highly efficient $O(N)$ resolution for explicit ("Figure 3") and implicit ("the table below") references via spatial proximity.
- Summary chunks: Asynchronous ARQ background worker (`enrich_summaries_job`) to auto-generate LLM section summaries for hierarchical RAG retrieval.
- Chunk quality scorer: Zero-ML, heuristic-based chunk scoring using block token confidences, Dictionary Word Coverage (`/usr/share/dict/words`), and fastText Lang-ID validation.
- PII redaction: Hybrid approach using fast Regex+Luhn (Emails, Phones, SSNs, CCs, IPs) and optional spaCy NER (`en_core_web_sm`) for names, organizations, and locations. Preserves original values in secure block metadata for HITL.
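For readers unfamiliar with the Regex+Luhn idea behind the PII redaction entry, here is a minimal, standalone sketch. It is not LongParser's implementation: the names `luhn_valid`, `redact`, `CARD_RE`, and `EMAIL_RE` are hypothetical, the patterns are deliberately simplified, and the returned list of findings only stands in for the "secure block metadata" mentioned above. It covers emails and credit card numbers only.

```python
import re

# Simplified patterns, assumed for illustration only (not LongParser's own).
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")      # 13-19 digits, optional space/hyphen separators
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact(text: str) -> tuple[str, list[dict]]:
    """Replace emails and Luhn-valid card numbers with placeholders.

    Returns the redacted text plus the original values, so a reviewer
    (human in the loop) could restore them later.
    """
    found: list[dict] = []

    def _card(match: re.Match) -> str:
        # Only redact digit runs that pass the checksum; others are kept as-is.
        if luhn_valid(match.group()):
            found.append({"type": "CC", "value": match.group()})
            return "[CC]"
        return match.group()

    def _email(match: re.Match) -> str:
        found.append({"type": "EMAIL", "value": match.group()})
        return "[EMAIL]"

    text = CARD_RE.sub(_card, text)
    text = EMAIL_RE.sub(_email, text)
    return text, found


sample = "Card 4111 1111 1111 1111, mail jane@example.com, order 1234 5678 9012 3456."
redacted, findings = redact(sample)
print(redacted)
```

The Luhn check is what keeps the regex from redacting arbitrary 16-digit strings such as order numbers: in the sample, the second digit run fails the checksum and is left untouched.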
Changed
- Bumped `marker-pdf` version support in dependencies.
- Added `ner` optional dependency group (`spacy>=3.7.0`) in `pyproject.toml`.
- Expanded `ChunkingConfig` and `ProcessingConfig` with new semantic, summary, and PII toggle options.
- Marked Phase 1 as officially complete in Roadmap.
v0.1.4
Release v0.1.4: Add fast PDF extractor, auto-language detection, AGPL…