LLM-based evidence quality scoring for INDRA biomedical text-mining extractions.
INDRA's NLP readers extract structured biological relationships from scientific papers. For example, from the sentence:
"The kinase-dead RSK1 mutant, however, was unable to phosphorylate YB-1 at S102."
a reader might extract: RPS6KA1 [Phosphorylation] YBX1 @S102
This scorer judges whether such extractions are correct. Here, the extraction is incorrect — the sentence describes a negative result (the mutant was unable to phosphorylate).
Native INDRA Statement + Evidence objects, resolved through ScoringRecord:
| Field | Example | Source |
|---|---|---|
| Claim | RPS6KA1 [Phosphorylation] YBX1 @S102 |
Statement type + agents + modification site |
| Evidence | "The kinase-dead RSK1 mutant..." | Source sentence from paper |
| Entity aliases | RSK1, YB-1, p90Rsk... | Gilda grounding via GroundedEntity.resolve() |
{"verdict": "correct", "confidence": "high"}Mapped to a continuous score: {correct+high: 0.95, correct+medium: 0.80, ..., incorrect+high: 0.05}.
Model: gemma-4-26b (Ollama remote or local MLX 8-bit).
The CLI default is the monolithic scorer: one deterministic LLM call per
(Statement, Evidence) pair with type-adaptive contrastive examples. The
decomposed four-probe scorer remains available for ablations with
--arch decomposed.
Tier 1: Deterministic grounding (no LLM call)
| Status | Action | Example |
|---|---|---|
| MISMATCH | Auto-reject | "RhoA" → RHOA != ARHGEF25 |
| PSEUDOGENE + AMBIGUOUS | Auto-reject | "DVL" → DVL1P1 (pseudogene) |
| AMBIGUOUS | Pass to Tier 2 | "9G8" → SRSF7/SLU7 (tied scores) |
| MATCH | Pass to Tier 2 | "FAK" → PTK2 (confirmed alias) |
Tier 2: LLM text comprehension
- Six-rule system prompt (negation, hedging, family/member equivalence, etc.)
- Seven adaptive contrastive pairs (14 examples) selected by statement type
- Single deterministic call at low temperature
The example bank has type-specific contrastive pairs. For each record, 7 pairs are selected by priority:
- Own type from bank (e.g., Activation pairs for an Activation claim)
- Adjacent types from
TYPE_ADJACENCYmap (e.g., IncreaseAmount for Activation) - Universal patterns (logical inversion, hedging scope)
- Fill from the base contrastive pair set
Types with bank examples: Activation (2 pairs), Inhibition (2), Phosphorylation, Complex, IncreaseAmount, DecreaseAmount, Dephosphorylation, Autophosphorylation, Translocation, Ubiquitination.
PYTHONPATH=src python -m indra_belief.scorers.scorer \
--model gemma-remote \
--arch monolithicEarlier iterations measured the following approaches and rejected them. If you're considering a change that resembles one of these, check the data before re-proposing:
| Approach | Outcome | Why it fails |
|---|---|---|
| Decomposed multi-call scorer | Strictly dominated on holdout_cc (F1 0.657 vs 0.751 monolithic) | Natural-language extraction can't bridge INDRA's soft ontology boundaries — requires multiple LLM probes to agree on a fuzzy contract |
| Native tool-calling (agentic lookup) | 84.9%, below baseline | Model ignores tool results after committing to a verdict in its first pass |
| Structured provenance, full population | -6.7pp accuracy | Attention dilution on 26B model outweighs disambiguation benefit — selectively enabling provenance only for flagged-grounding records preserves the signal without the cost |
| Graduated warnings for every grounding quirk | 3 regressions per 1 fix | Redirects attention from sentence comprehension; now limited to PSEUDOGENE and LOW_CONFIDENCE |
| Indirect-evidence marker in the prompt | +5pp false negatives | Prejudices model toward rejection; removed |
| LOW_CONFIDENCE auto-reject (blanket) | 53.6% precision at scale (32 false rejections on 3,754 records) | The gilda score threshold is too noisy to gate on deterministically; the signal is still available to the LLM as context |
Headline baselines measured during iteration: gemma-4-26b + adaptive bank + voting reaches ~84% accuracy on the 501-record stratified sample. Small-holdout numbers (200 records) overstate by ~4-5pp relative to large-scale evaluation (3,000+ records) — check the larger set before celebrating.
pip install gilda indra openai
# Download the benchmark corpus (460MB, not included in repo)
# Place at data/benchmark/indra_benchmark_corpus.json.gz
# Source: https://doi.org/10.5281/zenodo.7559353The scorer calls an LLM via ModelClient(model_name). Model names map to
entries in model_client.py's LOCAL_MODELS dict, or to Anthropic model
IDs (any string starting with claude-).
Local Ollama (recommended for getting started):
# Install Ollama: https://ollama.com
ollama pull gemma3:27b # or any model you prefer
ollama serve # starts on localhost:11434Then add an entry to LOCAL_MODELS in src/indra_belief/model_client.py:
"ollama-local": {
"base_url": "http://localhost:11434/v1",
"model_id": "gemma3:27b",
"reasoning_in_content": False,
"max_tokens": 1000,
"timeout": 120,
},Use it: ModelClient("ollama-local") or --model ollama-local from the CLI.
Remote Ollama (e.g., a beefy server on your network):
Same as above but point base_url at the remote host. The gemma-remote
entry in the registry shows this pattern — it targets an Ollama instance
over Tailscale.
Anthropic API:
export ANTHROPIC_API_KEY=sk-...client = ModelClient("claude-sonnet-4-20250514")Any claude-* model name routes to the Anthropic backend automatically.
Key LOCAL_MODELS fields:
| Field | Purpose |
|---|---|
base_url |
OpenAI-compatible endpoint (Ollama serves this at /v1) |
model_id |
Model name as known to the server (ollama list to check) |
reasoning_in_content |
True if CoT appears in content (Qwen CRACK); False for models with a separate reasoning_content field (Gemma 4) or no reasoning |
max_tokens |
Completion token budget — reasoning models need more (8000+) |
num_ctx |
Ollama-specific: context window size (passed via extra_body) |
timeout |
Seconds before retry — increase for large models or slow hardware |
An INDRA Statement bundles a list of Evidence objects. score_statement
mirrors that abstraction: one per-sentence verdict per evidence, returned
in order.
from indra.statements import Phosphorylation, Agent, Evidence
from indra_belief import ModelClient, score_statement
stmt = Phosphorylation(
Agent("RPS6KA1"), Agent("YBX1"),
residue="S", position="102",
)
stmt.evidence = [
Evidence(source_api="reach",
text="RSK1 phosphorylates YB-1 at S102 in response to stress."),
Evidence(source_api="sparser",
text="The kinase-dead RSK1 mutant was unable to phosphorylate YB-1 at S102."),
]
client = ModelClient("gemma-remote")
verdicts = score_statement(stmt, client)
# verdicts is list[dict], one per evidence:
# verdicts[i]["verdict"] → "correct" | "incorrect" | None
# verdicts[i]["score"] → 0.95 (correct+high) … 0.05 (incorrect+high)
# verdicts[i]["confidence"] → "high" | "medium" | "low"
# verdicts[i]["tier"] → which scoring path produced the verdictTo score just one evidence of a Statement (skipping the rest of stmt.evidence), use score_evidence(stmt, ev, client).
score_statement is the per-sentence comprehension layer. The edge-level
question — given all evidence for a statement, what is the belief? — is
answered by composing per-sentence verdicts with INDRA's parametric noise
model. The two layers chain directly:
from indra_belief import score_statement
from indra_belief.composed_scorer import ComposedBeliefScorer, EvidenceRecord
from indra_belief.noise_model import RECALIBRATED_PRIORS
verdicts = score_statement(stmt, client) # list[dict], one per stmt.evidence
records = [
EvidenceRecord(source_api=ev.source_api, verdict=v["verdict"])
for ev, v in zip(stmt.evidence, verdicts)
]
belief = ComposedBeliefScorer(priors=RECALIBRATED_PRIORS).score_edge(records)
# belief.belief → composed edge belief
# belief.parametric_only → belief before LLM gating (for ablation)
# belief.n_gated → evidence removed by the gateGate semantics: verdict="correct" passes; unscored evidence
(verdict=None) passes by default (gate_unscored=True to tighten);
"incorrect" and any other string — including "ambiguous" or parse
failures — are removed. Priors live in noise_model.py (INDRA_PRIORS,
RECALIBRATED_PRIORS). See scripts/benchmark_composition.py for the
benchmark used to pick them.
For corpora larger than a single Statement (e.g. an INDRA-native JSON dump
from rasmachine), indra_belief.corpus persists ingest, scoring, and
validity to a DuckDB file; the viewer/ SvelteKit app both browses and
drives it (via per-card [ingest] / [score] / [register truth_set]
actions that spawn the indra_belief.worker subprocess and stream SSE
progress). The Python entry points below remain canonical — the viewer
calls them under the hood.
import duckdb
from indra.statements import stmts_from_json_file
from indra_belief import ModelClient
from indra_belief.corpus import (
apply_schema, ingest_statements,
score_corpus, export_beliefs, model_card,
)
con = duckdb.connect("data/corpus.duckdb")
apply_schema(con)
stmts = stmts_from_json_file("data/corpora/latest_statements_rasmachine.json")
ingest_statements(con, stmts, source_dump_id="rasmachine_emmaa")
client = ModelClient("claude-sonnet-4-6")
run_id = score_corpus(con, stmts, client=client,
scorer_version="prod-v1", decompose=True,
cost_threshold_usd=350) # raises before spend
export_beliefs(con, run_id, f"data/exports/{run_id}_indra.json")
model_card(con, run_id, out_path=f"data/exports/{run_id}_card.json")
con.close()Estimate cost first: from indra_belief.corpus import estimate_cost
returns LLM-call counts and projected USD per model. Truth-set support
(gold pools, INDRA epistemics, source-DB curation) is foundational —
register_truth_set + load_truth_labels light up the dashboard's
P/R/F1-vs-gold panel automatically.
Browse + drive via the viewer:
cd viewer && npm install && npm run dev # http://127.0.0.1:5173The dashboard discovers files in data/corpora/ and data/benchmark/ and exposes per-card actions: [ingest], [register tag as truth_set], cost preflight, and [score] against the loaded LLM. Long ingests / scores stream SSE progress with a [cancel] button; the closeInstance coordination in viewer/src/lib/db.ts releases the cached DuckDB instance before each spawn so the worker can acquire the file lock. Dashboard reads issued during an active write fail closed with HTTP 503 + a "writer in progress" page (+error.svelte).
PYTHONPATH=src python -m indra_belief.scorers.scorer \
--model gemma-remote \
--arch monolithic \
--holdout data/benchmark/holdout_large.jsonl \
--output data/results/run.jsonl \
--resume data/results/run.jsonl # resume interrupted runsContributor-facing rules to keep the repository legible:
mainis the canonical state. Every "ship" decision ends withgit push. Local ship decisions don't count.- Version labels don't belong in source. Version numbers appear in PR titles, CHANGELOG entries, and benchmark-run output filenames (
data/results/<run>.jsonl). They do not appear in source comments, docstrings, or identifier names.scripts/check_no_version_labels.pyenforces this. - Public API is
score_statement(statement, client)+score_evidence(statement, evidence, client).score_statementmirrors INDRA's abstraction (a Statement owns a list of Evidence) and returns one dict per evidence.score_evidenceis the atomic per-sentence call.score(client, record, …)is the benchmark-harness path used byindra_belief.scorers.scorer.main; treat it as internal. - Comments explain current constraints, not past versions. If a reader needs history,
git logis the source of truth. "Provenance is selectively enabled because full-population provenance dilutes attention" is legitimate. "Removed in v12" is not.
src/indra_belief/
model_client.py # Model transport (OpenAI-compat + Anthropic)
noise_model.py # INDRA SimpleScorer (parametric belief from source priors)
composed_scorer.py # LLM verdict → hard gate over the parametric noise model
worker.py # Viewer-spawned worker: ingest / estimate-cost / score / register-truth-set
scorers/ # Monolithic default plus decomposed probe orchestrator
corpus/ # Corpus persistence + scoring orchestration (DuckDB)
schema.py # 10-table schema (statement / evidence / agent / truth_set / metric / …)
loader.py # from_indra_json + ingest_statements + register_truth_set
scoring.py # score_corpus(con, stmts, *, decompose, with_validity, cost_threshold_usd)
validity.py # compute_validity → calibration + 4a P/R/F1 + stratified MAE
export.py # aggregate_beliefs + export_beliefs + model_card
cost.py # estimate_cost + MODEL_PRICES_PER_M_TOKENS
data/
entity.py # GroundedEntity: single gilda resolution per entity
scoring_record.py # ScoringRecord: wraps INDRA Statement + Evidence
corpus.py # CorpusIndex: source_hash → Statement lookup
example_bank.json # Type-specific contrastive pairs
viewer/ # SvelteKit dashboard — browses + drives the corpus DuckDB
src/lib/
db.ts # DuckDB connection + closeInstance() for writer coordination
datasets.ts # Filesystem discovery of data/corpora + data/benchmark
pathGuard.ts # Path-arg validation (must resolve under <repoRoot>/data/)
format.ts # Cue extraction, verdict rendering, sentence formatting
probeAttribution.ts # Probe-source attribution model (substrate vs LLM)
residuals.ts # Residual histogram bucket logic
components/ # BeliefPrimitive, HeuristicCoverage, Validity
src/routes/
+page.svelte # Dashboard: focus + findings + validity + datasets + runs feed
+error.svelte # 503 writer-in-progress fallback + generic 4xx/5xx page
runs/[run_id]/+page.svelte # Per-run detail with compare-against dropdown
statements/+page.svelte # Matrix (paginated, URL-stated)
statements/[stmt_hash]/+page.svelte # Per-stmt deep-dive (9-step rail, evidence cards, truth panel)
export/[run_id]/[kind]/+server.ts # Belief + model-card download endpoints
api/datasets/ingest/+server.ts # SSE-streaming ingest (handles .json + .json.gz)
api/truth-sets/+server.ts # Truth-set registration + validity recompute
api/runs/estimate-cost/+server.ts # Per-model cost preflight
api/runs/score/+server.ts # SSE-streaming score with AbortController kill
data/
benchmark/
holdout.jsonl # 200-record balanced evaluation set
holdout_large.jsonl # 4,625-record half-corpus evaluation
example_pairs.json # Entity pairs excluded from holdouts
results/ # Evaluation results
scripts/
check_contamination.py # Pre-eval gate: examples must not overlap holdout
check_no_version_labels.py # CI guard: no v{n} labels in src, tests, scripts
.github/workflows/
ci.yml # pytest + both guards on every push and PR
- Gyori et al. (2023). "Automated assembly of molecular mechanisms at scale from text mining and curated databases." Molecular Systems Biology, e11325. Benchmark corpus: Zenodo 7559353
- Gilda — Biomedical entity grounding
- INDRA — Integrated Network and Dynamical Reasoning Assembler