# Document Intake -> Review -> Intelligent Fill -> Completeness
Submission note: this repository update is an additional AI-assisted implementation pass for the take-home; it is not intended to replace prior baseline work or history.
A Docker-first proof of concept for medical/regulatory document workflows:
- Upload one or more PDF/DOCX/TXT files
- Detect document type
- Extract structured fields with evidence/provenance
- Review/edit fields
- Accept/edit/reject suggestions
- Compute completeness (document + package + output-template matrix)
- Generate draft outputs (TXT/DOCX)
- Maintain audit trail and evolving ontology
- 1. Quick Start
- 2. What This Project Covers
- 3. Architecture
- 4. Data Model and Ontology
- 5. Extraction Pipeline (Section + RAG)
- 6. Completeness Model
- 7. API Endpoints
- 8. UI Walkthrough
- 9. Running End-to-End Validation
- 10. AI vs Deterministic Logic
- 11. Tradeoffs and Assumptions
- 12. Scaling Bottlenecks
- 13. Troubleshooting
- 14. Project Structure
- 15. GitHub Push Steps
## 1. Quick Start

Prerequisites:
- Docker Desktop / Docker Engine running
- Ports `3000` and `8000` free
- Optional: Groq API key for better extraction quality
```bash
cd ~/Docintelligent_POC
cp .env.example .env
# edit .env and set GROQ_API_KEY=... (optional but recommended)
docker compose up --build -d
```

Stop the stack:

```bash
docker compose down
```

## 2. What This Project Covers

Implemented:
- Dockerized full-stack app (React + FastAPI)
- Ingestion with deduplication (SHA-256)
- Extraction with Groq + fallback heuristics
- Section-aware extraction and section-level coverage
- Suggestion workflow: accept / edit / reject
- Document/package/output-template completeness
- Output generation and download (`.txt`, `.docx`)
- Audit log and schema/ontology registry

Not covered:
- Authentication/authorization
- Load/performance testing
- Production-grade OCR for image-only scans
- Formal graph DB (uses JSON graph store)
## 3. Architecture

```text
Browser (React)
  -> Upload, field review, suggestions, generation, traceability
  -> Calls REST API

FastAPI Backend
  -> Ingestion + dedup
  -> Type detection
  -> Section/chunking + retrieval + extraction
  -> Suggestions + completeness + output generation
  -> Audit logging
  -> JSON ontology store

Storage
  -> ontology_data/ontology_db.json
  -> logs/audit.jsonl
  -> uploads/, outputs/
```
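The audit trail lives in an append-only JSONL file. A minimal sketch of appending one event per line (the event field names here are illustrative, not the backend's actual schema):

```python
import json
import time
from pathlib import Path

AUDIT_LOG = Path("logs/audit.jsonl")

def append_audit_event(action: str, detail: dict) -> dict:
    """Append one audit event as a single JSON line (append-only log)."""
    event = {"ts": time.time(), "action": action, "detail": detail}
    AUDIT_LOG.parent.mkdir(parents=True, exist_ok=True)
    with AUDIT_LOG.open("a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")
    return event

append_audit_event("document_uploaded", {"filename": "ifu_v2.pdf"})
```

One-line-per-event JSONL keeps writes atomic enough for a POC and makes the log greppable with standard tools.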
Groq is used when available. If Groq errors or rate-limits, the backend fails open to heuristic extraction (upload still succeeds).
## 4. Data Model and Ontology

The ontology is a self-evolving JSON graph in `ontology_data/ontology_db.json`. Top-level collections:

- `documents`: uploaded docs and summary metadata
- `elements`: extracted fields with provenance/evidence
- `templates`: ingested sample/output templates
- `relationships`:
  - `has_field` (document -> element)
  - `template_defines_field` (template -> field name)
  - `field_related` (element -> element semantic relation)
- `schema_registry`: observed/evolving field catalog
Each element carries provenance:

- `source` (e.g. `groq_llm`, `groq_rag`, `heuristic_regex`, `heuristic_rag`, `suggestion_accept`)
- `confidence`
- `evidence`
- `section` (where the value came from)
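Putting those provenance keys together, one extracted element might look like this (a hypothetical record; the exact key names and example values are illustrative):

```python
# Hypothetical shape of one element in ontology_db.json
element = {
    "field": "device_name",              # illustrative field name
    "value": "Acme Cardiac Stent",       # illustrative value
    "source": "groq_llm",                # or groq_rag / heuristic_regex / ...
    "confidence": 0.91,
    "evidence": "Device Name: Acme Cardiac Stent",
    "section": "Device Description",
}
```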
## 5. Extraction Pipeline (Section + RAG)

Implemented in the modular backend package `backend/meddoc`:

- `schemas.py`: doc schemas + section field hints
- `chunking.py`: heading detection + section extraction + chunking
- `rag.py`: in-memory lexical retrieval scoring
- `pipeline.py`: section-aware extraction orchestration
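The retrieval in `rag.py` is lexical (term overlap), not vector search. A toy version of such scoring, assuming simple whitespace tokenization (the real scorer may weight terms differently):

```python
def lexical_score(query: str, chunk: str) -> float:
    """Fraction of query terms that also appear in the chunk."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0

def top_chunks(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return the k chunks with the highest lexical overlap with the query."""
    return sorted(chunks, key=lambda ch: lexical_score(query, ch), reverse=True)[:k]
```

Because scoring is pure set intersection, it runs in memory with no index, which is why it stays lightweight but misses semantic matches (synonyms, paraphrases).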
- Parse document text into sections
- Create chunks per section
- Run broad extraction (Groq + heuristic fallback)
- For missing fields, retrieve top relevant chunks and run targeted extraction
- Build section summary:
- expected required coverage
- actual extracted fields by section
- misaligned required fields
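The heuristic fallback in the broad-extraction step can be approximated with a line-oriented `Field Name: value` regex pass. A toy sketch (the actual patterns in `pipeline.py` are richer):

```python
import re

def broad_extract(text: str) -> dict:
    """Toy heuristic fallback: collect 'Field Name: value' pairs line by line."""
    fields = {}
    for m in re.finditer(r"^([A-Za-z ]{2,40}):\s*(.+)$", text, re.M):
        key = m.group(1).strip().lower().replace(" ", "_")
        fields[key] = m.group(2).strip()
    return fields
```

Lines without a `key: value` shape are ignored, which is why fields missed here get a second, RAG-targeted pass.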
Added fields include:
- `biocompatibility_standard`
- `biological_endpoints_evaluated`
- `biocompatibility_assessment`
- `cytotoxicity_result`
- `sensitization_result`
- `irritation_result`
## 6. Completeness Model

Document completeness for the selected document type:

```text
required_present / required_total * 100
```

Package completeness:
- Output doc type coverage across the uploaded package
Per template row:
- matched doc
- required fields present/total
- missing field count/list
Each template row also exposes `completeness_pct`, `matched_doc_filename`, and `matched_doc_uploaded_at`.
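The completeness arithmetic above can be sketched as a pure function (a minimal sketch; empty values counting as missing is an assumption):

```python
def completeness_pct(required_fields: list[str], extracted: dict) -> float:
    """required_present / required_total * 100, rounded to one decimal place."""
    if not required_fields:
        return 100.0
    # A field counts as present only if it has a truthy extracted value.
    present = sum(1 for f in required_fields if extracted.get(f))
    return round(present / len(required_fields) * 100, 1)
```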
The UI supports a "Show only incomplete rows" toggle in the package card.
## 7. API Endpoints

System:
- `GET /api/health`

Documents:
- `POST /api/documents/upload`
- `GET /api/documents`
- `GET /api/documents/{id}`
- `PATCH /api/documents/{id}/fields`
- `POST /api/documents/{id}/reprocess`

Suggestions:
- `POST /api/documents/{id}/suggestions/{sid}`

Outputs:
- `POST /api/documents/{id}/generate`
- `GET /api/documents/{id}/download`
- `POST /api/documents/{id}/generate-docx`
- `GET /api/documents/{id}/download-docx`

Meta:
- `GET /api/audit`
- `GET /api/schema`
- `GET /api/package/completeness`
- `GET /api/templates`
- `POST /api/templates/ingest`

Demo:
- `POST /api/demo/load-sample`
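Suggestions can also be actioned over the REST API. A stdlib sketch that builds (but does not send) the request; the `{"action": ...}` body shape is an assumption, not the confirmed request schema:

```python
import json
import urllib.request

BASE = "http://localhost:8000"

def build_suggestion_request(doc_id: str, sid: str, action: str) -> urllib.request.Request:
    """Build the POST that actions (accepts/edits/rejects) a suggestion."""
    body = json.dumps({"action": action}).encode()  # assumed body shape
    return urllib.request.Request(
        f"{BASE}/api/documents/{doc_id}/suggestions/{sid}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending it is then `urllib.request.urlopen(build_suggestion_request(doc_id, sid, "accept"))` against a running stack.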
## 8. UI Walkthrough

- Upload zone
- Package completeness card + output-template rows
- Document list (newest first)
- Audit panel
Document tabs:
- Fields: extracted values + confidence + source + section
- Sections: section coverage and extracted-here lists
- Suggestions: pending + resolved (accept/edit/reject)
- Missing: missing required fields
- Templates: template-derived field coverage
- Traceability: hash, status, provenance
Behavior improvements:
- Auto-selects first/newest document
- New upload auto-focuses on newest document
## 9. Running End-to-End Validation

- Start the stack: `docker compose up --build -d`
- Check health (`/api/health`)
- Upload a document in the UI
- Open document -> inspect fields/suggestions/sections/missing
- Action suggestions (accept/edit/reject)
- Edit at least one field
- Generate output and download
- Verify audit events
- Re-upload same file to verify dedup
Spot-check via API:

```bash
curl -s http://localhost:8000/api/health
curl -s http://localhost:8000/api/documents
curl -s http://localhost:8000/api/package/completeness
curl -s http://localhost:8000/api/schema
```

## 10. AI vs Deterministic Logic

AI-driven:
- Groq extraction for structured field values
- Groq-based missing field suggestion generation
- RAG-targeted extraction for missing fields from relevant chunks
Deterministic:
- SHA-256 dedup
- regex/heuristic extraction fallback
- completeness scoring arithmetic
- output assembly
- audit logging
- ontology append/update semantics
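The SHA-256 dedup check from the deterministic list can be sketched as: hash the uploaded bytes, and reject any digest seen before (in-memory set here; the real store persists digests with the documents):

```python
import hashlib

_seen_hashes: set[str] = set()

def is_duplicate_upload(data: bytes) -> bool:
    """SHA-256 dedup: identical bytes hash identically, so re-uploads are caught."""
    digest = hashlib.sha256(data).hexdigest()
    if digest in _seen_hashes:
        return True
    _seen_hashes.add(digest)
    return False
```

This is why re-uploading the same file in the validation checklist produces no new document.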
## 11. Tradeoffs and Assumptions

Tradeoffs:
- JSON file store is simple but not high-scale
- In-memory retrieval is lightweight but not semantic vector retrieval
- Template mining from raw PDFs can be noisy without stronger normalization
Assumptions:
- No auth (POC)
- English-language docs
- Extractable text is available (not image-only scan)
- One file primarily maps to one doc type
## 12. Scaling Bottlenecks

- JSON ontology file growth (I/O contention)
- Groq rate limits / cost
- No async job queue for heavy extraction
- In-memory retrieval not cross-document semantic search
- PDF table extraction still heuristic for complex layouts
Suggested production path:
- PostgreSQL + vector DB (pgvector/Qdrant)
- Worker queue (Celery/RQ)
- robust OCR/table parsing layer
- auth + multi-tenant scoping
## 13. Troubleshooting

UI does not show a new upload:
- Hard refresh (`Cmd+Shift+R`)
- Ensure the newest doc appears in the left list
- Check the dedup response (same bytes -> no new document)
Groq errors or rate limits:
- The backend falls back to heuristic extraction automatically
- Check `logs/audit.jsonl` for `groq_request_error`
No tabs visible:
- Click a document in the left list
- Main tabs appear once a doc is selected
Container status and logs:

```bash
docker compose ps
docker compose logs backend --tail=120
docker compose logs frontend --tail=120
```

## 14. Project Structure

```text
Docintelligent_POC/
├── backend/
│   ├── app.py
│   ├── requirements.txt
│   └── meddoc/
│       ├── __init__.py
│       ├── routes/
│       │   ├── system.py
│       │   └── meta.py
│       ├── storage.py
│       ├── models.py
│       ├── schemas.py
│       ├── chunking.py
│       ├── rag.py
│       └── pipeline.py
├── frontend/
│   └── src/
│       ├── App.js
│       ├── App.css
│       └── ...
├── sample_docs/
├── uploads/
├── outputs/
├── logs/
├── Dockerfile.backend
├── Dockerfile.frontend
├── docker-compose.yml
└── README.md
```
## 15. GitHub Push Steps

Run from the project root:

```bash
cd ~/Docintelligent_POC
git status
git add .
git commit -m "Finalize MedDoc intake POC: modular extraction, completeness matrix, UI fixes, README"
```

If the remote is not set:

```bash
git remote add origin <YOUR_GITHUB_REPO_URL>
```

Push:

```bash
git branch -M main
git push -u origin main
```

If you prefer a feature branch:

```bash
git checkout -b codex/readme-final
git push -u origin codex/readme-final
```