# Document Intake -> Review -> Intelligent Fill -> Completeness
Submission note: this repository update is an additional AI-assisted implementation pass for the take-home; it is not intended to replace prior baseline work or history.
A Docker-first proof of concept for medical/regulatory document workflows:
- Upload one or more PDF/DOCX/TXT files
- Detect document type
- Extract structured fields with evidence/provenance
- Review/edit fields
- Accept/edit/reject suggestions
- Compute completeness (document + package + output-template matrix)
- Generate draft outputs (TXT/DOCX)
- Maintain audit trail and evolving ontology
- 1. Quick Start
- 2. What This Project Covers
- 3. Architecture
- 4. Data Model and Ontology
- 5. Extraction Pipeline (Section + RAG)
- 6. Completeness Model
- 7. API Endpoints
- 8. UI Walkthrough
- 9. Running End-to-End Validation
- 10. AI vs Deterministic Logic
- 11. Tradeoffs and Assumptions
- 12. Scaling Bottlenecks
- 13. Troubleshooting
- 14. Project Structure
- 15. GitHub Push Steps
## 1. Quick Start

Prerequisites:
- Docker Desktop / Docker Engine running
- Ports `3000` and `8000` free
- Optional: Groq API key for better extraction quality
```bash
cd ~/Docintelligent_POC
cp .env.example .env
# edit .env and set GROQ_API_KEY=... (optional but recommended)
docker compose up --build -d
```

Stop the stack:

```bash
docker compose down
```

## 2. What This Project Covers

Implemented:
- Dockerized full-stack app (React + FastAPI)
- Ingestion with deduplication (SHA-256)
- Extraction with Groq + fallback heuristics
- Section-aware extraction and section-level coverage
- Suggestion workflow: accept / edit / reject
- Document/package/output-template completeness
- Output generation and download (`.txt`, `.docx`)
- Audit log and schema/ontology registry

Not covered:
- Authentication/authorization
- Load/performance testing
- Production-grade OCR for image-only scans
- Formal graph DB (uses JSON graph store)
## 3. Architecture

```text
Browser (React)
  -> Upload, field review, suggestions, generation, traceability
  -> Calls REST API

FastAPI Backend
  -> Ingestion + dedup
  -> Type detection
  -> Section/chunking + retrieval + extraction
  -> Suggestions + completeness + output generation
  -> Audit logging
  -> JSON ontology store

Storage
  -> ontology_data/ontology_db.json
  -> logs/audit.jsonl
  -> uploads/, outputs/
```
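The audit trail lives in an append-only JSONL file. A minimal sketch of appending one event per line (the event field names here are illustrative, not the backend's actual schema):

```python
import json
import time
from pathlib import Path

AUDIT_LOG = Path("logs/audit.jsonl")

def append_audit_event(action: str, detail: dict) -> dict:
    """Append one audit event as a single JSON line (append-only log)."""
    event = {"ts": time.time(), "action": action, "detail": detail}
    AUDIT_LOG.parent.mkdir(parents=True, exist_ok=True)
    with AUDIT_LOG.open("a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")
    return event

append_audit_event("document_uploaded", {"filename": "ifu_v2.pdf"})
```

One-line-per-event JSONL keeps writes atomic enough for a POC and makes the log greppable with standard tools.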
Groq is used when available. If Groq errors or rate-limits, the backend fails open to heuristic extraction (upload still succeeds).
## 4. Data Model and Ontology

The ontology is a self-evolving JSON graph in `ontology_data/ontology_db.json`. Top-level collections:

- `documents`: uploaded docs and summary metadata
- `elements`: extracted fields with provenance/evidence
- `templates`: ingested sample/output templates
- `relationships`:
  - `has_field` (document -> element)
  - `template_defines_field` (template -> field name)
  - `field_related` (element -> element semantic relation)
- `schema_registry`: observed/evolving field catalog
Each element carries provenance:

- `source` (e.g. `groq_llm`, `groq_rag`, `heuristic_regex`, `heuristic_rag`, `suggestion_accept`)
- `confidence`
- `evidence`
- `section` (where the value came from)
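Putting those provenance keys together, one extracted element might look like this (a hypothetical record; the exact key names and example values are illustrative):

```python
# Hypothetical shape of one element in ontology_db.json
element = {
    "field": "device_name",              # illustrative field name
    "value": "Acme Cardiac Stent",       # illustrative value
    "source": "groq_llm",                # or groq_rag / heuristic_regex / ...
    "confidence": 0.91,
    "evidence": "Device Name: Acme Cardiac Stent",
    "section": "Device Description",
}
```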
## 5. Extraction Pipeline (Section + RAG)

Implemented in the modular backend package `backend/meddoc`:

- `schemas.py`: doc schemas + section field hints
- `chunking.py`: heading detection + section extraction + chunking
- `rag.py`: in-memory lexical retrieval scoring
- `pipeline.py`: section-aware extraction orchestration
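The retrieval in `rag.py` is lexical (term overlap), not vector search. A toy version of such scoring, assuming simple whitespace tokenization (the real scorer may weight terms differently):

```python
def lexical_score(query: str, chunk: str) -> float:
    """Fraction of query terms that also appear in the chunk."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0

def top_chunks(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return the k chunks with the highest lexical overlap with the query."""
    return sorted(chunks, key=lambda ch: lexical_score(query, ch), reverse=True)[:k]
```

Because scoring is pure set intersection, it runs in memory with no index, which is why it stays lightweight but misses semantic matches (synonyms, paraphrases).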
- Parse document text into sections
- Create chunks per section
- Run broad extraction (Groq + heuristic fallback)
- For missing fields, retrieve top relevant chunks and run targeted extraction
- Build section summary:
- expected required coverage
- actual extracted fields by section
- misaligned required fields
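The heuristic fallback in the broad-extraction step can be approximated with a line-oriented `Field Name: value` regex pass. A toy sketch (the actual patterns in `pipeline.py` are richer):

```python
import re

def broad_extract(text: str) -> dict:
    """Toy heuristic fallback: collect 'Field Name: value' pairs line by line."""
    fields = {}
    for m in re.finditer(r"^([A-Za-z ]{2,40}):\s*(.+)$", text, re.M):
        key = m.group(1).strip().lower().replace(" ", "_")
        fields[key] = m.group(2).strip()
    return fields
```

Lines without a `key: value` shape are ignored, which is why fields missed here get a second, RAG-targeted pass.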
Added fields include:
- `biocompatibility_standard`
- `biological_endpoints_evaluated`
- `biocompatibility_assessment`
- `cytotoxicity_result`
- `sensitization_result`
- `irritation_result`
## 6. Completeness Model

Document completeness for the selected document type:

```text
required_present / required_total * 100
```

Package completeness:
- Output doc type coverage across the uploaded package
Per template row:
- matched doc
- required fields present/total
- missing field count/list
Each template row also exposes `completeness_pct`, `matched_doc_filename`, and `matched_doc_uploaded_at`.
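The completeness arithmetic above can be sketched as a pure function (a minimal sketch; empty values counting as missing is an assumption):

```python
def completeness_pct(required_fields: list[str], extracted: dict) -> float:
    """required_present / required_total * 100, rounded to one decimal place."""
    if not required_fields:
        return 100.0
    # A field counts as present only if it has a truthy extracted value.
    present = sum(1 for f in required_fields if extracted.get(f))
    return round(present / len(required_fields) * 100, 1)
```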
The UI supports a "Show only incomplete rows" toggle in the package card.
## 7. API Endpoints

System:
- `GET /api/health`

Documents:
- `POST /api/documents/upload`
- `GET /api/documents`
- `GET /api/documents/{id}`
- `PATCH /api/documents/{id}/fields`
- `POST /api/documents/{id}/reprocess`

Suggestions:
- `POST /api/documents/{id}/suggestions/{sid}`

Outputs:
- `POST /api/documents/{id}/generate`
- `GET /api/documents/{id}/download`
- `POST /api/documents/{id}/generate-docx`
- `GET /api/documents/{id}/download-docx`

Meta:
- `GET /api/audit`
- `GET /api/schema`
- `GET /api/package/completeness`
- `GET /api/templates`
- `POST /api/templates/ingest`

Demo:
- `POST /api/demo/load-sample`
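Suggestions can also be actioned over the REST API. A stdlib sketch that builds (but does not send) the request; the `{"action": ...}` body shape is an assumption, not the confirmed request schema:

```python
import json
import urllib.request

BASE = "http://localhost:8000"

def build_suggestion_request(doc_id: str, sid: str, action: str) -> urllib.request.Request:
    """Build the POST that actions (accepts/edits/rejects) a suggestion."""
    body = json.dumps({"action": action}).encode()  # assumed body shape
    return urllib.request.Request(
        f"{BASE}/api/documents/{doc_id}/suggestions/{sid}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending it is then `urllib.request.urlopen(build_suggestion_request(doc_id, sid, "accept"))` against a running stack.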
## 8. UI Walkthrough

- Upload zone
- Package completeness card + output-template rows
- Document list (newest first)
- Audit panel
Document tabs:
- Fields: extracted values + confidence + source + section
- Sections: section coverage and extracted-here lists
- Suggestions: pending + resolved (accept/edit/reject)
- Missing: missing required fields
- Templates: template-derived field coverage
- Traceability: hash, status, provenance
Behavior improvements:
- Auto-selects first/newest document
- New upload auto-focuses on newest document
## 9. Running End-to-End Validation

- Start the stack: `docker compose up --build -d`
- Check health (`/api/health`)
- Upload a document in the UI
- Open document -> inspect fields/suggestions/sections/missing
- Action suggestions (accept/edit/reject)
- Edit at least one field
- Generate output and download
- Verify audit events
- Re-upload same file to verify dedup
Spot-check via API:

```bash
curl -s http://localhost:8000/api/health
curl -s http://localhost:8000/api/documents
curl -s http://localhost:8000/api/package/completeness
curl -s http://localhost:8000/api/schema
```

## 10. AI vs Deterministic Logic

AI-driven:
- Groq extraction for structured field values
- Groq-based missing field suggestion generation
- RAG-targeted extraction for missing fields from relevant chunks
Deterministic:
- SHA-256 dedup
- regex/heuristic extraction fallback
- completeness scoring arithmetic
- output assembly
- audit logging
- ontology append/update semantics
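The SHA-256 dedup check from the deterministic list can be sketched as: hash the uploaded bytes, and reject any digest seen before (in-memory set here; the real store persists digests with the documents):

```python
import hashlib

_seen_hashes: set[str] = set()

def is_duplicate_upload(data: bytes) -> bool:
    """SHA-256 dedup: identical bytes hash identically, so re-uploads are caught."""
    digest = hashlib.sha256(data).hexdigest()
    if digest in _seen_hashes:
        return True
    _seen_hashes.add(digest)
    return False
```

This is why re-uploading the same file in the validation checklist produces no new document.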
## 11. Tradeoffs and Assumptions

Tradeoffs:
- JSON file store is simple but not high-scale
- In-memory retrieval is lightweight but not semantic vector retrieval
- Template mining from raw PDFs can be noisy without stronger normalization
Assumptions:
- No auth (POC)
- English-language docs
- Extractable text is available (not image-only scan)
- One file primarily maps to one doc type
## 12. Scaling Bottlenecks

- JSON ontology file growth (I/O contention)
- Groq rate limits / cost
- No async job queue for heavy extraction
- In-memory retrieval not cross-document semantic search
- PDF table extraction still heuristic for complex layouts
Suggested production path:
- PostgreSQL + vector DB (pgvector/Qdrant)
- Worker queue (Celery/RQ)
- robust OCR/table parsing layer
- auth + multi-tenant scoping
## 13. Troubleshooting

UI does not show a new upload:
- Hard refresh (`Cmd+Shift+R`)
- Ensure the newest doc appears in the left list
- Check the dedup response (same bytes -> no new document)
Groq errors or rate limits:
- The backend falls back to heuristic extraction automatically
- Check `logs/audit.jsonl` for `groq_request_error`
No tabs visible:
- Click a document in the left list
- Main tabs appear once a doc is selected
Container status and logs:

```bash
docker compose ps
docker compose logs backend --tail=120
docker compose logs frontend --tail=120
```

## 14. Project Structure

```text
Docintelligent_POC/
├── backend/
│   ├── app.py
│   ├── requirements.txt
│   └── meddoc/
│       ├── __init__.py
│       ├── routes/
│       │   ├── system.py
│       │   └── meta.py
│       ├── storage.py
│       ├── models.py
│       ├── schemas.py
│       ├── chunking.py
│       ├── rag.py
│       └── pipeline.py
├── frontend/
│   └── src/
│       ├── App.js
│       ├── App.css
│       └── ...
├── sample_docs/
├── uploads/
├── outputs/
├── logs/
├── Dockerfile.backend
├── Dockerfile.frontend
├── docker-compose.yml
└── README.md
```
## 15. GitHub Push Steps

Run from the project root:

```bash
cd ~/Docintelligent_POC
git status
git add .
git commit -m "Finalize MedDoc intake POC: modular extraction, completeness matrix, UI fixes, README"
```

If the remote is not set:

```bash
git remote add origin <YOUR_GITHUB_REPO_URL>
```

Push:

```bash
git branch -M main
git push -u origin main
```

If you prefer a feature branch:

```bash
git checkout -b codex/readme-final
git push -u origin codex/readme-final
```