Releases: ENDEVSOLS/LongParser
v0.1.5
Added
- Semantic chunking: `all-MiniLM-L6-v2` embedding-based boundary detection in `HybridChunker` (optional via `use_semantic_chunking`).
- Cross-reference resolution: highly efficient $O(N)$ resolution for explicit ("Figure 3") and implicit ("the table below") references via spatial proximity.
- Summary chunks: asynchronous ARQ background worker (`enrich_summaries_job`) to auto-generate LLM section summaries for hierarchical RAG retrieval.
- Chunk quality scorer: zero-ML, heuristic-based chunk scoring using block token confidences, Dictionary Word Coverage (`/usr/share/dict/words`), and fastText Lang-ID validation.
- PII redaction: hybrid approach using fast Regex + Luhn (emails, phones, SSNs, CCs, IPs) and optional spaCy NER (`en_core_web_sm`) for names, organizations, and locations. Preserves original values in secure block metadata for HITL.
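
The Regex + Luhn part of PII redaction can be illustrated with a minimal sketch. This is not LongParser's implementation: the patterns, the `redact` helper, the placeholder strings, and the shape of the returned metadata are all assumptions made for illustration. It shows only the general idea of matching candidate card numbers with a regex, confirming them with the Luhn checksum, and keeping the original values aside for a reviewer.

```python
from __future__ import annotations

import re

# Illustrative patterns only; real-world PII patterns are usually stricter.
# Candidate card numbers: 13 to 19 digits, optionally separated by spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact(text: str) -> tuple[str, list[dict]]:
    """Replace emails and Luhn-valid card numbers with placeholders.

    Returns the redacted text plus the original values (hypothetical
    metadata layout) so a human reviewer could inspect them later.
    """
    found: list[dict] = []

    def _card(match: re.Match) -> str:
        if not luhn_valid(match.group()):
            return match.group()  # digit run that is not a valid card number
        found.append({"type": "CC", "value": match.group()})
        return "[REDACTED_CC]"

    def _email(match: re.Match) -> str:
        found.append({"type": "EMAIL", "value": match.group()})
        return "[REDACTED_EMAIL]"

    text = CARD_RE.sub(_card, text)
    text = EMAIL_RE.sub(_email, text)
    return text, found
```

The Luhn check keeps arbitrary long digit runs (order numbers, IDs) from being redacted as credit cards, which is the usual reason to pair it with a regex.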
Changed
- Bumped `marker-pdf` version support in dependencies.
- Added `ner` optional dependency group (`spacy>=3.7.0`) in `pyproject.toml`.
- Expanded `ChunkingConfig` and `ProcessingConfig` with new semantic, summary, and PII toggle options.
- Marked Phase 1 as officially complete in Roadmap.
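
For readers wondering what the semantic toggle changes, embedding-based boundary detection can be sketched roughly as below. This is an assumption-laden illustration, not the `HybridChunker` code: the function names and the `threshold` value are hypothetical, and the toy 2-D vectors stand in for real sentence embeddings (which in this release would come from `all-MiniLM-L6-v2`). The idea is to start a new chunk wherever the cosine similarity between adjacent sentences drops below a threshold.

```python
from __future__ import annotations

import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_boundaries(embeddings: list[list[float]], threshold: float = 0.5) -> list[int]:
    """Return each index i where sentence i should start a new chunk."""
    return [
        i
        for i in range(1, len(embeddings))
        if cosine(embeddings[i - 1], embeddings[i]) < threshold
    ]


def split_by_meaning(
    sentences: list[str], embeddings: list[list[float]], threshold: float = 0.5
) -> list[list[str]]:
    """Group consecutive sentences, cutting at low-similarity transitions."""
    if not sentences:
        return []
    chunks, start = [], 0
    for boundary in semantic_boundaries(embeddings, threshold):
        chunks.append(sentences[start:boundary])
        start = boundary
    chunks.append(sentences[start:])
    return chunks
```

A fixed threshold is the simplest choice; percentile-based or windowed thresholds are common alternatives when similarity scales vary between documents.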
v0.1.4
Release v0.1.4: Add fast PDF extractor, auto-language detection, AGPL…
v0.1.3
Fixed
- Source code: Added `DocumentPipeline` as a public alias for `PipelineOrchestrator`; docs, quickstart, and all examples now use this name consistently
- Documentation: Fixed wrong coverage path `long_parser` → `longparser` in `CONTRIBUTING.md`
- Documentation: Replaced stale `cleanrag-api` reference in Docker deployment docs
- Documentation: Standardized Gemini API key env var to `GOOGLE_API_KEY` across all docs
- Source code: Updated default LLM model fallback from `gpt-4o` to `gpt-5.3` in `schemas.py`, `llm_chain.py`, and `engine.py`
- Source code: Renamed stale `cleanrag:` Redis key prefix to `longparser:` in embeddings
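
As a note on what a public alias means in Python, the sketch below uses a stand-in class; the `describe` method is hypothetical and exists only to make the example observable. An alias is a second name bound to the same class object, so existing code that imports the original name keeps working.

```python
class PipelineOrchestrator:
    """Stand-in for the real class; this body is hypothetical."""

    def describe(self) -> str:
        # Reports the class's own __name__, which an alias does not change.
        return type(self).__name__


# A public alias: a second name for the same class object, not a subclass.
DocumentPipeline = PipelineOrchestrator
```

Because both names point to one object, `isinstance` checks pass in either direction, while tracebacks and `repr` output still show the original class name.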
Changed
- Python 3.13 added to CI matrix, badges, and installation docs
- `SECURITY.md` updated with Redis rate-limiting and CORS threat mitigations
v0.1.2
Changed
- Project logo added to documentation site, README, and PyPI page
- Documentation site header updated — logo replaces text title
- Installation guide restructured for clarity
v0.1.1
Added
- CPU / GPU install separation: dedicated `[cpu]` and `[gpu]` meta-extras for clean one-command installs
- `faiss-gpu` extra (`faiss-gpu>=1.7`) as a distinct option from `faiss-cpu`
- Granular torch-based extras: `embeddings-cpu`, `embeddings-gpu`, `latex-ocr-cpu`, `latex-ocr-gpu` for fine-grained dependency control
Fixed
- Package metadata: license field updated to SPDX expression format per PEP 639
- Documentation site build reliability improvements
Changed
- `[gpu]` is now the recommended default install: one command, works on both GPU and CPU machines
- `[cpu]` documented as the advanced path for size-constrained environments (Docker, edge, CI)
- `[all]` now resolves to `[cpu]` as a safe, dependency-minimal default
v0.1.0
🎉 Initial Public Release
LongParser is the open-source document intelligence engine built by ENDEVSOLS
for production RAG pipelines.
Added
- 5-stage extraction pipeline: Extract → Validate → HITL Review → Chunk → Embed → Index
- Multi-format extraction: PDF, DOCX, PPTX, XLSX, CSV via Docling
- `HybridChunker`: token-aware, heading-hierarchy-aware, table-aware chunking
- Human-in-the-Loop (HITL) review: approve / edit / reject blocks and chunks via LangGraph `interrupt()` before embedding
- 3-layer memory chat: short-term turns + rolling summary + long-term facts, powered by LCEL chains
- Multi-provider LLM support: OpenAI (`gpt-4o`), Gemini (`gemini-2.0-flash`), Groq (`llama-3.3-70b-versatile`), OpenRouter
- Multi-backend vector stores: Chroma, FAISS, Qdrant
- Async-first REST API: FastAPI + Motor (MongoDB) + ARQ (Redis job queue)
- `LongParserRetriever`: drop-in LangChain `BaseRetriever` adapter
- `LongParserLoader`: LangChain document loader integration
- `LongParserReader`: LlamaIndex `BaseReader` integration
- `LongParserCallbackHandler`: observability callbacks for LangChain chains
- Built-in citation validation: chunk IDs verified against retrieved set before any answer is returned
- Privacy-first: all processing runs locally; no data leaves your infrastructure
- `py.typed` marker: full PEP 561 typing support
- Unit test suite: `test_schemas.py` (22 passing), `test_llm_chain.py`, `test_chat_utils.py`
- GitHub Actions CI: lint (`ruff`), tests across Python 3.10 / 3.11 / 3.12, coverage reporting
- GitHub Actions publish: PyPI trusted publishing triggered on GitHub releases
- `pyproject.toml` with `server`, `langchain`, `llamaindex`, `embeddings`, `chroma`, `faiss`, `qdrant` optional extras
- `Dockerfile` and `docker-compose.yml` for one-command local deployment
- `CONTRIBUTING.md`, `SECURITY.md`, `.env.example`: full OSS scaffolding
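
The citation validation item above (chunk IDs verified against the retrieved set) can be pictured with a small sketch. The `[chunk:<id>]` citation syntax and the `unknown_citations` helper are assumptions for illustration; the release notes do not specify how LongParser encodes or checks citations.

```python
from __future__ import annotations

import re

# Hypothetical inline citation syntax, e.g. "[chunk:c1]".
CITATION_RE = re.compile(r"\[chunk:([A-Za-z0-9_-]+)\]")


def unknown_citations(answer: str, retrieved_ids: set[str]) -> list[str]:
    """Return cited chunk IDs that were not in the retrieved set.

    IDs are returned in order of first appearance, without duplicates.
    An empty list means every citation points at a retrieved chunk.
    """
    unknown: list[str] = []
    for chunk_id in CITATION_RE.findall(answer):
        if chunk_id not in retrieved_ids and chunk_id not in unknown:
            unknown.append(chunk_id)
    return unknown
```

A caller could refuse to return, or regenerate, any answer for which this list is non-empty, so that a citation never points at a chunk the retriever did not supply.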