Releases · ENDEVSOLS/LongParser

05 May 07:20

github-actions

v0.1.5

58c688f

v0.1.5 Latest

Latest

Added

Semantic chunking — all-MiniLM-L6-v2 embedding-based boundary detection in HybridChunker (optional via use_semantic_chunking).
Cross-reference resolution — Highly efficient $O(N)$ resolution for explicit ("Figure 3") and implicit ("the table below") references via spatial proximity.
Summary chunks — Asynchronous ARQ background worker (enrich_summaries_job) to auto-generate LLM section summaries for hierarchical RAG retrieval.
Chunk quality scorer — Zero-ML, heuristic-based chunk scoring using block token confidences, Dictionary Word Coverage (/usr/share/dict/words), and fastText Lang-ID validation.
PII redaction — Hybrid approach using fast Regex+Luhn (Emails, Phones, SSNs, CCs, IPs) and optional spaCy NER (en_core_web_sm) for names, organizations, and locations. Preserves original values in secure block metadata for HITL.

Changed

Bumped marker-pdf version support in dependencies.
Added ner optional dependency group (spacy>=3.7.0) in pyproject.toml.
Expanded ChunkingConfig and ProcessingConfig with new semantic, summary, and PII toggle options.
Marked Phase 1 as officially complete in Roadmap.

Assets 2

23 Apr 06:19

github-actions

v0.1.4

ae967c9

v0.1.4

Release v0.1.4: Add fast PDF extractor, auto-language detection, AGPL…

Assets 2

13 Apr 05:07

github-actions

v0.1.3

d4ae5c1

v0.1.3

Fixed

Source code: Added DocumentPipeline as a public alias for PipelineOrchestrator —
docs, quickstart, and all examples now use this name consistently
Documentation: Fixed wrong coverage path long_parser → longparser in CONTRIBUTING.md
Documentation: Replaced stale cleanrag-api reference in Docker deployment docs
Documentation: Standardized Gemini API key env var to GOOGLE_API_KEY across all docs
Source code: Updated default LLM model fallback from gpt-4o to gpt-5.3 in
schemas.py, llm_chain.py, and engine.py
Source code: Renamed stale cleanrag: Redis key prefix to longparser: in embeddings

Changed

Python 3.13 added to CI matrix, badges, and installation docs
SECURITY.md updated with Redis rate-limiting and CORS threat mitigations

Assets 2

05 Apr 05:19

github-actions

v0.1.2

8d4af41

v0.1.2

Changed

Project logo added to documentation site, README, and PyPI page
Documentation site header updated — logo replaces text title
Installation guide restructured for clarity

Assets 2

04 Apr 18:17

github-actions

v0.1.1

51dbd6b

v0.1.1

Added

CPU / GPU install separation — dedicated [cpu] and [gpu] meta-extras for clean one-command installs
faiss-gpu extra (faiss-gpu>=1.7) as a distinct option from faiss-cpu
Granular torch-based extras — embeddings-cpu, embeddings-gpu, latex-ocr-cpu, latex-ocr-gpu for fine-grained dependency control

Fixed

Package metadata: license field updated to SPDX expression format per PEP 639
Documentation site build reliability improvements

Changed

[gpu] is now the recommended default install — one command, works on both GPU and CPU machines
[cpu] documented as the advanced path for size-constrained environments (Docker, edge, CI)
[all] now resolves to [cpu] as a safe, dependency-minimal default

Assets 2

04 Apr 16:35

github-actions

v0.1.0

849ad54

v0.1.0

🎉 Initial Public Release

LongParser is the open-source document intelligence engine built by ENDEVSOLS
for production RAG pipelines.

Added

5-stage extraction pipeline — Extract → Validate → HITL Review → Chunk → Embed → Index
Multi-format extraction — PDF, DOCX, PPTX, XLSX, CSV via Docling
HybridChunker — token-aware, heading-hierarchy-aware, table-aware chunking
Human-in-the-Loop (HITL) review — approve / edit / reject blocks and chunks
via LangGraph interrupt() before embedding
3-layer memory chat — short-term turns + rolling summary + long-term facts,
powered by LCEL chains
Multi-provider LLM support — OpenAI (gpt-4o), Gemini (gemini-2.0-flash),
Groq (llama-3.3-70b-versatile), OpenRouter
Multi-backend vector stores — Chroma, FAISS, Qdrant
Async-first REST API — FastAPI + Motor (MongoDB) + ARQ (Redis job queue)
LongParserRetriever — drop-in LangChain BaseRetriever adapter
LongParserLoader — LangChain document loader integration
LongParserReader — LlamaIndex BaseReader integration
LongParserCallbackHandler — observability callbacks for LangChain chains
Built-in citation validation — chunk IDs verified against retrieved set
before any answer is returned
Privacy-first — all processing runs locally; no data leaves your infrastructure
py.typed marker — full PEP 561 typing support
Unit test suite — test_schemas.py (22 passing), test_llm_chain.py,
test_chat_utils.py
GitHub Actions CI — lint (ruff), tests across Python 3.10 / 3.11 / 3.12,
coverage reporting
GitHub Actions publish — PyPI trusted publishing triggered on GitHub releases
pyproject.toml with server, langchain, llamaindex, embeddings,
chroma, faiss, qdrant optional extras
Dockerfile and docker-compose.yml for one-command local deployment
CONTRIBUTING.md, SECURITY.md, .env.example — full OSS scaffolding

Assets 2

Uh oh!

Releases: ENDEVSOLS/LongParser

v0.1.5

Added

Changed

Uh oh!

v0.1.4

Uh oh!

v0.1.3

Fixed

Changed

Uh oh!

v0.1.2

Changed

Uh oh!

v0.1.1

Added

Fixed

Changed

Uh oh!

v0.1.0

🎉 Initial Public Release

Added

Uh oh!