A robust, evidence-grounded evaluation system for non-canonical medical journal extraction using Semantic N-gram IoU and LLMs.
🎯 Goal
Build an objective, reproducible evaluation harness for extracting messy, non-canonical health data (Symptoms, Food, Emotion, Mind) from user journals. This pipeline uses Semantic Jaccard (Soft-IoU) scoring with multilingual embeddings to handle synonyms, Hinglish, and free-text evidence spans while strictly penalizing hallucinations.
- Evidence-Grounded Extraction: The LLM extracts evidence_span (verbatim quotes) alongside structured attributes so every prediction is traceable to text.
- Semantic N-gram Scorer: Soft-IoU using multilingual sentence-transformers to match semantically equivalent phrases (e.g., dard ≈ pain).
- Restraint Mechanism: Penalizes verbose or hallucinated outputs by increasing the union in the Jaccard denominator.
- Safety First: Polarity accuracy (present vs absent) is tracked separately to avoid false positives.
- Observability: Produces per-journal JSONL reports for fine-grained debugging.
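The evidence-grounding idea above can be checked mechanically. The following is a minimal sketch, not the project's actual schema: the `Extraction` dataclass, its field names other than `evidence_span`, and the `is_grounded` helper are all hypothetical, and it uses a stdlib dataclass in place of the Pydantic parser. It tests whether a quoted span really occurs in the journal text, ignoring case and whitespace differences.

```python
from dataclasses import dataclass


@dataclass
class Extraction:
    """One extracted item. Field names are illustrative assumptions."""
    domain: str         # e.g. "symptom", "food", "emotion", "mind"
    evidence_span: str  # quote copied from the journal text
    polarity: str       # "present" or "absent"


def _normalize(text: str) -> str:
    # Lowercase and collapse whitespace so line wraps do not cause misses.
    return " ".join(text.lower().split())


def is_grounded(item: Extraction, journal_text: str) -> bool:
    """Return True if the evidence span occurs in the journal text."""
    span = _normalize(item.evidence_span)
    # An empty span is never considered grounded.
    return bool(span) and span in _normalize(journal_text)


journal = "Aaj subah se pet mein dard hai, skipped lunch."
print(is_grounded(Extraction("symptom", "pet mein dard", "present"), journal))
print(is_grounded(Extraction("symptom", "severe headache", "present"), journal))
```

A check like this could run before scoring, so that spans the model invented are flagged as hallucinations rather than passed to the semantic scorer.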
graph TD
classDef input fill:#ff9,stroke:#333,stroke-width:2px,color:black;
classDef process fill:#00f7ff,stroke:#000,stroke-width:2px,color:black;
classDef logic fill:#39ff14,stroke:#000,stroke-width:2px,color:black;
classDef output fill:#ff00ff,stroke:#000,stroke-width:2px,color:white;
Input(["Input Data<br/>(journals.jsonl + gold.jsonl)"]) --> Stage1
subgraph Stage1 ["Stage 1: Extraction"]
LLM["LLM<br/>(src/components/llm.py)"]
Parser["Pydantic Parser"]
LLM --> Parser
end
Stage1 --> Extracted["extracted_gold.jsonl"]
subgraph Stage2 ["Stage 2: Scoring"]
Extracted --> Scorer["Scorer Engine<br/>(src/components/scorer.py)"]
Gold["Gold Reference"] --> Scorer
subgraph Logic ["Core Logic"]
Ngram["Bi-gram Decomposition"]
BERT["Multilingual BERT"]
Jaccard["Semantic Jaccard (Soft-IoU)"]
Ngram --> BERT --> Jaccard
end
Scorer --> Logic
end
subgraph Stage3 ["Stage 3: Reporting"]
Logic --> Metrics["Final Metrics"]
end
Metrics --> Summary["score_summary.json"]
Metrics --> Detail["per_journal_scores.jsonl"]
class Input input;
class Stage1,Stage2,Stage3 process;
class LLM,Parser,Scorer,Logic logic;
class Summary,Detail output;
- Python 3.10+
- A modern CPU (GPU optional but recommended for faster embeddings/LLM)
- Hugging Face token (for any hosted models) if using remote endpoints
Dependencies are listed in requirements.txt.
# Clone the repository
git clone https://github.com/Nossks/EviScore
cd EviScore
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
Create a .env file in the project root with your Hugging Face token (if required):
HUGGINGFACEHUB_API_TOKEN=hf_xxxxxxxxxxxxxxxxx
Other configuration options (model selection, scoring thresholds) live in ASSUMPTIONS.md and in the src/components/scorer.py constants.
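As a small illustration of the token setup, the sketch below reads `HUGGINGFACEHUB_API_TOKEN` from the process environment and fails early when a hosted model needs it. The helper name `get_hf_token` is hypothetical, and the sketch assumes the variable has already been exported or loaded from `.env` by whatever loader the project uses; `os.environ` does not read `.env` files by itself.

```python
import os
from typing import Optional


def get_hf_token(required: bool = False) -> Optional[str]:
    """Return the Hugging Face token from the environment, if present."""
    token = os.environ.get("HUGGINGFACEHUB_API_TOKEN", "").strip()
    if required and not token:
        # Fail early with an actionable message instead of a late HTTP error.
        raise RuntimeError(
            "HUGGINGFACEHUB_API_TOKEN is not set; add it to .env or export it."
        )
    return token or None
```

Calling `get_hf_token(required=True)` at startup surfaces a missing token before any extraction work begins.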
Execute the full extraction + scoring loop from the CLI entrypoint:
python main.py
All outputs are written to out/:
score_summary.json: high-level metrics
{
"f1_score": 0.88,
"precision": 0.92,
"recall": 0.85,
"polarity_accuracy": 1.0,
"bucket_accuracy": 0.78
}
per_journal_scores.jsonl: row-level debug data
{"journal_id": "J006", "f1": 0.45, "note": "Failed on Hindi text"}
{"journal_id": "J007", "f1": 1.0, "note": "Perfect extraction"}
Why not exact IoU? Simple word overlap fails for semantically equivalent phrases such as stomach ache vs tummy pain.
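To make the failure concrete, here is a short sketch of plain word-level Jaccard (exact IoU). The function name `exact_iou` is illustrative and is not taken from the repository.

```python
def exact_iou(gold: str, pred: str) -> float:
    """Plain Jaccard overlap on lowercase word sets."""
    g, p = set(gold.lower().split()), set(pred.lower().split())
    if not g and not p:
        return 1.0  # two empty strings are trivially identical
    return len(g & p) / len(g | p)


# No shared words, so the score is 0.0 even though the meaning is the same.
print(exact_iou("stomach ache", "tummy pain"))
# One shared word out of three distinct words.
print(exact_iou("stomach ache", "stomach pain"))
```

The first pair scores zero despite describing the same symptom, which is the gap the semantic scorer is meant to close.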
Let G be gold N-grams and P predicted N-grams. Using precomputed multilingual embeddings and cosine similarity, define Soft-IoU:
S-IoU = (sum_{g in G} max_{p in P} cosine_sim(g, p)) / (|G| + |P| - Intersection)
- N-Gram decomposition uses bi-grams (N=2) to preserve local context.
- Embeddings are produced with paraphrase-multilingual-MiniLM-L12-v2 (or equivalent).
- Intersection is approximated by the sum of matched similarities above a threshold.
- Restraint Mechanism: Hallucinated tokens enlarge |P|, increasing the denominator and lowering the score.
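The steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the code in scorer.py: `bigrams`, `toy_embed`, `cosine`, and `soft_iou` are hypothetical names, the 0.5 threshold is arbitrary, and `toy_embed` is a character-trigram stand-in that does not capture synonyms. A real run would swap in vectors from a multilingual sentence-transformers model and a matching cosine function. Capping the intersection at min(|G|, |P|) is also an assumption of this sketch, added so the score stays within [0, 1].

```python
import math
from collections import Counter
from typing import Callable, Set


def bigrams(text: str) -> Set[str]:
    """Split text into distinct word bi-grams; a lone word is kept as-is."""
    tokens = text.lower().split()
    if len(tokens) < 2:
        return set(tokens)
    return {" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)}


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: character trigram counts (NOT a semantic model)."""
    padded = f" {text} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def soft_iou(gold: str, pred: str, embed: Callable = toy_embed,
             threshold: float = 0.5) -> float:
    """Soft-IoU over bi-grams: matched similarity divided by a soft union."""
    G, P = bigrams(gold), bigrams(pred)
    if not G and not P:
        return 1.0
    if not G or not P:
        return 0.0
    pred_vecs = [embed(p) for p in P]
    matched = 0.0
    for g in G:
        g_vec = embed(g)
        best = max(cosine(g_vec, p_vec) for p_vec in pred_vecs)
        if best >= threshold:  # only confident matches count
            matched += best
    # Assumption: cap the intersection so the score cannot exceed 1.
    intersection = min(matched, len(G), len(P))
    return intersection / (len(G) + len(P) - intersection)


gold = "mild stomach pain"
print(soft_iou(gold, gold))                             # identical spans
print(soft_iou(gold, gold + " with fever and chills"))  # padded prediction
```

The second call shows the restraint effect: the extra predicted bi-grams enlarge |P| without adding matches, so the score drops well below that of the exact prediction.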
Polarity (present/absent) is scored separately: a wrong polarity yields a strict penalty.
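A separate polarity score could be computed along these lines. The sketch assumes gold and predicted items have already been paired by the matcher; the function name and the pairing format are hypothetical.

```python
from typing import Iterable, Optional, Tuple


def polarity_accuracy(pairs: Iterable[Tuple[str, str]]) -> Optional[float]:
    """Fraction of matched items whose predicted polarity equals gold.

    `pairs` holds (gold_polarity, predicted_polarity) tuples.
    Returns None when there is nothing to score.
    """
    pairs = list(pairs)
    if not pairs:
        return None
    correct = sum(1 for gold, pred in pairs if gold == pred)
    return correct / len(pairs)


print(polarity_accuracy([("present", "present"), ("absent", "present")]))
```

Keeping this number apart from the Soft-IoU score means a span that is matched well but marked present instead of absent still shows up as an error.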
├── out/                  # Generated outputs (JSON/JSONL)
├── data/                 # Input journals and gold labels
├── src/
│   ├── components/
│   │   ├── llm.py        # LLM extraction logic
│   │   └── scorer.py     # N-gram Semantic Jaccard logic
│   ├── pipelines/
│   │   ├── extraction.py
│   │   └── evaluation.py
│   ├── logger.py         # Custom logging setup
│   └── exception.py      # Error handling
├── requirements.txt      # Project dependencies
├── main.py               # Combination of pipelines
├── ASSUMPTIONS.md        # Detailed logic explanation
└── README.md             # This file
- Model choices: The scorer is model-agnostic. For embeddings, prefer multilingual sentence-transformers variants.
- Thresholds: Tune cosine thresholds for "match" and the restraint penalty using a held-out validation set.
- Edge cases: Short spans (1-2 tokens) can produce noisy embeddings; consider backing off to exact-string matching for extremely short tokens.
- Performance: Use batch embedding calls (sentence-transformers supports batching) to speed scoring.
- Reproducibility: Seed all RNGs and log model versions and tokenizer checksums in score_summary.json.
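The short-span back-off mentioned in the edge-case note could look like the sketch below. The name `span_similarity`, the `min_tokens` cutoff of 3, and the `semantic_sim` callable are assumptions for illustration; in practice `semantic_sim` would wrap the embedding cosine used by the scorer.

```python
from typing import Callable


def span_similarity(a: str, b: str,
                    semantic_sim: Callable[[str, str], float],
                    min_tokens: int = 3) -> float:
    """Use exact matching for very short spans, embeddings otherwise."""
    a_norm = " ".join(a.lower().split())
    b_norm = " ".join(b.lower().split())
    shortest = min(len(a_norm.split()), len(b_norm.split()))
    if shortest < min_tokens:
        # Embeddings of 1-2 token spans are noisy; require an exact match.
        return 1.0 if a_norm == b_norm else 0.0
    return semantic_sim(a_norm, b_norm)


# Stub similarity used only to show which branch is taken.
stub = lambda x, y: 0.8
print(span_similarity("Pain", "pain", stub))
print(span_similarity("sharp pain in stomach", "stomach hurts a lot", stub))
```

The trade-off is that exact matching on short spans also rejects genuine short synonyms, so the cutoff is worth tuning on the same held-out set as the cosine thresholds.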
- ModuleNotFoundError: ensure venv is activated and pip install -r requirements.txt completed successfully.
- Embedding calls too slow: use batching or a local ONNX/accelerated model.
- Low polarity accuracy: check label format in gold.jsonl and whether polarity is annotated as present/absent or booleans.
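For the polarity-format issue, one option is to normalize labels before comparing them. The sketch below is an assumption about how that might be done, not code from the repository; the accepted string variants are illustrative.

```python
def normalize_polarity(value) -> str:
    """Map booleans and common string variants to 'present' or 'absent'."""
    if isinstance(value, bool):
        return "present" if value else "absent"
    if isinstance(value, str):
        text = value.strip().lower()
        if text in {"present", "true", "yes", "1"}:
            return "present"
        if text in {"absent", "false", "no", "0"}:
            return "absent"
    # Surface unexpected labels instead of silently scoring them as wrong.
    raise ValueError(f"Unrecognized polarity label: {value!r}")


print(normalize_polarity(True))
print(normalize_polarity(" Absent "))
```

Applying the same normalization to both gold and predicted labels rules out format mismatches as the cause of a low polarity score.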
- Fork the repo
- Create a feature branch
- Open a PR with tests and a clear description
Please follow the existing code style; aim for minimal diffs if the change affects scorer.py or llm.py.
This project is released under the MIT License. See LICENSE for details.
Aryan