indra-belief-model

LLM-based evidence quality scoring for INDRA biomedical text-mining extractions.

What this does

INDRA's NLP readers extract structured biological relationships from scientific papers. For example, from the sentence:

"The kinase-dead RSK1 mutant, however, was unable to phosphorylate YB-1 at S102."

a reader might extract: RPS6KA1 [Phosphorylation] YBX1 @S102

This scorer judges whether such extractions are correct. Here, the extraction is incorrect — the sentence describes a negative result (the mutant was unable to phosphorylate).

Input

Native INDRA Statement + Evidence objects, resolved through ScoringRecord:

Field	Example	Source
Claim	`RPS6KA1 [Phosphorylation] YBX1 @S102`	Statement type + agents + modification site
Evidence	"The kinase-dead RSK1 mutant..."	Source sentence from paper
Entity aliases	RSK1, YB-1, p90Rsk...	Gilda grounding via `GroundedEntity.resolve()`

Output

{"verdict": "correct", "confidence": "high"}

Mapped to a continuous score: {correct+high: 0.95, correct+medium: 0.80, ..., incorrect+high: 0.05}.

How it works

Model: gemma-4-26b (Ollama remote or local MLX 8-bit).

Production scoring architecture

The CLI default is the monolithic scorer: one deterministic LLM call per (Statement, Evidence) pair with type-adaptive contrastive examples. The decomposed four-probe scorer remains available for ablations with --arch decomposed.

Two-tier monolithic path

Tier 1: Deterministic grounding (no LLM call)

Status	Action	Example
MISMATCH	Auto-reject	"RhoA" → RHOA != ARHGEF25
PSEUDOGENE + AMBIGUOUS	Auto-reject	"DVL" → DVL1P1 (pseudogene)
AMBIGUOUS	Pass to Tier 2	"9G8" → SRSF7/SLU7 (tied scores)
MATCH	Pass to Tier 2	"FAK" → PTK2 (confirmed alias)

Tier 2: LLM text comprehension

Six-rule system prompt (negation, hedging, family/member equivalence, etc.)
Seven adaptive contrastive pairs (14 examples) selected by statement type
Single deterministic call at low temperature

Adaptive few-shot selection

The example bank has type-specific contrastive pairs. For each record, 7 pairs are selected by priority:

Own type from bank (e.g., Activation pairs for an Activation claim)
Adjacent types from TYPE_ADJACENCY map (e.g., IncreaseAmount for Activation)
Universal patterns (logical inversion, hedging scope)
Fill from the base contrastive pair set

Types with bank examples: Activation (2 pairs), Inhibition (2), Phosphorylation, Complex, IncreaseAmount, DecreaseAmount, Dephosphorylation, Autophosphorylation, Translocation, Ubiquitination.

Run the scorer

PYTHONPATH=src python -m indra_belief.scorers.scorer \
    --model gemma-remote \
    --arch monolithic

Design decisions we already paid for

Earlier iterations measured the following approaches and rejected them. If you're considering a change that resembles one of these, check the data before re-proposing:

Approach	Outcome	Why it fails
Decomposed multi-call scorer	Strictly dominated on holdout_cc (F1 0.657 vs 0.751 monolithic)	Natural-language extraction can't bridge INDRA's soft ontology boundaries — requires multiple LLM probes to agree on a fuzzy contract
Native tool-calling (agentic lookup)	84.9%, below baseline	Model ignores tool results after committing to a verdict in its first pass
Structured provenance, full population	-6.7pp accuracy	Attention dilution on 26B model outweighs disambiguation benefit — selectively enabling provenance only for flagged-grounding records preserves the signal without the cost
Graduated warnings for every grounding quirk	3 regressions per 1 fix	Redirects attention from sentence comprehension; now limited to PSEUDOGENE and LOW_CONFIDENCE
Indirect-evidence marker in the prompt	+5pp false negatives	Prejudices model toward rejection; removed
LOW_CONFIDENCE auto-reject (blanket)	53.6% precision at scale (32 false rejections on 3,754 records)	The gilda score threshold is too noisy to gate on deterministically; the signal is still available to the LLM as context

Headline baselines measured during iteration: gemma-4-26b + adaptive bank + voting reaches ~84% accuracy on the 501-record stratified sample. Small-holdout numbers (200 records) overstate by ~4-5pp relative to large-scale evaluation (3,000+ records) — check the larger set before celebrating.

Setup

Dependencies

pip install gilda indra openai

# Download the benchmark corpus (460MB, not included in repo)
# Place at data/benchmark/indra_benchmark_corpus.json.gz
# Source: https://doi.org/10.5281/zenodo.7559353

Model configuration

The scorer calls an LLM via ModelClient(model_name). Model names map to entries in model_client.py's LOCAL_MODELS dict, or to Anthropic model IDs (any string starting with claude-).

Local Ollama (recommended for getting started):

# Install Ollama: https://ollama.com
ollama pull gemma3:27b          # or any model you prefer
ollama serve                    # starts on localhost:11434

Then add an entry to LOCAL_MODELS in src/indra_belief/model_client.py:

"ollama-local": {
    "base_url": "http://localhost:11434/v1",
    "model_id": "gemma3:27b",
    "reasoning_in_content": False,
    "max_tokens": 1000,
    "timeout": 120,
},

Use it: ModelClient("ollama-local") or --model ollama-local from the CLI.

Remote Ollama (e.g., a beefy server on your network):

Same as above but point base_url at the remote host. The gemma-remote entry in the registry shows this pattern — it targets an Ollama instance over Tailscale.

Anthropic API:

export ANTHROPIC_API_KEY=sk-...

client = ModelClient("claude-sonnet-4-20250514")

Any claude-* model name routes to the Anthropic backend automatically.

Key LOCAL_MODELS fields:

Field	Purpose
`base_url`	OpenAI-compatible endpoint (Ollama serves this at `/v1`)
`model_id`	Model name as known to the server (`ollama list` to check)
`reasoning_in_content`	`True` if CoT appears in `content` (Qwen CRACK); `False` for models with a separate `reasoning_content` field (Gemma 4) or no reasoning
`max_tokens`	Completion token budget — reasoning models need more (8000+)
`num_ctx`	Ollama-specific: context window size (passed via `extra_body`)
`timeout`	Seconds before retry — increase for large models or slow hardware

Usage

Score a Statement's evidence

An INDRA Statement bundles a list of Evidence objects. score_statement mirrors that abstraction: one per-sentence verdict per evidence, returned in order.

from indra.statements import Phosphorylation, Agent, Evidence
from indra_belief import ModelClient, score_statement

stmt = Phosphorylation(
    Agent("RPS6KA1"), Agent("YBX1"),
    residue="S", position="102",
)
stmt.evidence = [
    Evidence(source_api="reach",
             text="RSK1 phosphorylates YB-1 at S102 in response to stress."),
    Evidence(source_api="sparser",
             text="The kinase-dead RSK1 mutant was unable to phosphorylate YB-1 at S102."),
]

client = ModelClient("gemma-remote")
verdicts = score_statement(stmt, client)
# verdicts is list[dict], one per evidence:
#   verdicts[i]["verdict"]    → "correct" | "incorrect" | None
#   verdicts[i]["score"]      → 0.95 (correct+high) … 0.05 (incorrect+high)
#   verdicts[i]["confidence"] → "high" | "medium" | "low"
#   verdicts[i]["tier"]       → which scoring path produced the verdict

To score just one evidence of a Statement (skipping the rest of stmt.evidence), use score_evidence(stmt, ev, client).

Composition with INDRA belief

score_statement is the per-sentence comprehension layer. The edge-level question — given all evidence for a statement, what is the belief? — is answered by composing per-sentence verdicts with INDRA's parametric noise model. The two layers chain directly:

from indra_belief import score_statement
from indra_belief.composed_scorer import ComposedBeliefScorer, EvidenceRecord
from indra_belief.noise_model import RECALIBRATED_PRIORS

verdicts = score_statement(stmt, client)  # list[dict], one per stmt.evidence
records = [
    EvidenceRecord(source_api=ev.source_api, verdict=v["verdict"])
    for ev, v in zip(stmt.evidence, verdicts)
]
belief = ComposedBeliefScorer(priors=RECALIBRATED_PRIORS).score_edge(records)
# belief.belief           → composed edge belief
# belief.parametric_only  → belief before LLM gating (for ablation)
# belief.n_gated          → evidence removed by the gate

Gate semantics: verdict="correct" passes; unscored evidence (verdict=None) passes by default (gate_unscored=True to tighten); "incorrect" and any other string — including "ambiguous" or parse failures — are removed. Priors live in noise_model.py (INDRA_PRIORS, RECALIBRATED_PRIORS). See scripts/benchmark_composition.py for the benchmark used to pick them.

Score a whole corpus + browse the results

For corpora larger than a single Statement (e.g. an INDRA-native JSON dump from rasmachine), indra_belief.corpus persists ingest, scoring, and validity to a DuckDB file; the viewer/ SvelteKit app both browses and drives it (via per-card [ingest] / [score] / [register truth_set] actions that spawn the indra_belief.worker subprocess and stream SSE progress). The Python entry points below remain canonical — the viewer calls them under the hood.

import duckdb
from indra.statements import stmts_from_json_file
from indra_belief import ModelClient
from indra_belief.corpus import (
    apply_schema, ingest_statements,
    score_corpus, export_beliefs, model_card,
)

con = duckdb.connect("data/corpus.duckdb")
apply_schema(con)
stmts = stmts_from_json_file("data/corpora/latest_statements_rasmachine.json")
ingest_statements(con, stmts, source_dump_id="rasmachine_emmaa")

client = ModelClient("claude-sonnet-4-6")
run_id = score_corpus(con, stmts, client=client,
                      scorer_version="prod-v1", decompose=True,
                      cost_threshold_usd=350)  # raises before spend

export_beliefs(con, run_id, f"data/exports/{run_id}_indra.json")
model_card(con, run_id, out_path=f"data/exports/{run_id}_card.json")
con.close()

Estimate cost first: from indra_belief.corpus import estimate_cost returns LLM-call counts and projected USD per model. Truth-set support (gold pools, INDRA epistemics, source-DB curation) is foundational — register_truth_set + load_truth_labels light up the dashboard's P/R/F1-vs-gold panel automatically.

Browse + drive via the viewer:

cd viewer && npm install && npm run dev  # http://127.0.0.1:5173

The dashboard discovers files in data/corpora/ and data/benchmark/ and exposes per-card actions: [ingest], [register tag as truth_set], cost preflight, and [score] against the loaded LLM. Long ingests / scores stream SSE progress with a [cancel] button; the closeInstance coordination in viewer/src/lib/db.ts releases the cached DuckDB instance before each spawn so the worker can acquire the file lock. Dashboard reads issued during an active write fail closed with HTTP 503 + a "writer in progress" page (+error.svelte).

Benchmark evaluation against a holdout file

PYTHONPATH=src python -m indra_belief.scorers.scorer \
    --model gemma-remote \
    --arch monolithic \
    --holdout data/benchmark/holdout_large.jsonl \
    --output data/results/run.jsonl \
    --resume data/results/run.jsonl  # resume interrupted runs

How we iterate

Contributor-facing rules to keep the repository legible:

main is the canonical state. Every "ship" decision ends with git push. Local ship decisions don't count.
Version labels don't belong in source. Version numbers appear in PR titles, CHANGELOG entries, and benchmark-run output filenames (data/results/<run>.jsonl). They do not appear in source comments, docstrings, or identifier names. scripts/check_no_version_labels.py enforces this.
Public API is score_statement(statement, client) + score_evidence(statement, evidence, client). score_statement mirrors INDRA's abstraction (a Statement owns a list of Evidence) and returns one dict per evidence. score_evidence is the atomic per-sentence call. score(client, record, …) is the benchmark-harness path used by indra_belief.scorers.scorer.main; treat it as internal.
Comments explain current constraints, not past versions. If a reader needs history, git log is the source of truth. "Provenance is selectively enabled because full-population provenance dilutes attention" is legitimate. "Removed in v12" is not.

Project structure

src/indra_belief/
  model_client.py          # Model transport (OpenAI-compat + Anthropic)
  noise_model.py           # INDRA SimpleScorer (parametric belief from source priors)
  composed_scorer.py       # LLM verdict → hard gate over the parametric noise model
  worker.py                # Viewer-spawned worker: ingest / estimate-cost / score / register-truth-set
  scorers/                 # Monolithic default plus decomposed probe orchestrator
  corpus/                  # Corpus persistence + scoring orchestration (DuckDB)
    schema.py              # 10-table schema (statement / evidence / agent / truth_set / metric / …)
    loader.py              # from_indra_json + ingest_statements + register_truth_set
    scoring.py             # score_corpus(con, stmts, *, decompose, with_validity, cost_threshold_usd)
    validity.py            # compute_validity → calibration + 4a P/R/F1 + stratified MAE
    export.py              # aggregate_beliefs + export_beliefs + model_card
    cost.py                # estimate_cost + MODEL_PRICES_PER_M_TOKENS
  data/
    entity.py              # GroundedEntity: single gilda resolution per entity
    scoring_record.py      # ScoringRecord: wraps INDRA Statement + Evidence
    corpus.py              # CorpusIndex: source_hash → Statement lookup
    example_bank.json      # Type-specific contrastive pairs

viewer/                    # SvelteKit dashboard — browses + drives the corpus DuckDB
  src/lib/
    db.ts                  # DuckDB connection + closeInstance() for writer coordination
    datasets.ts            # Filesystem discovery of data/corpora + data/benchmark
    pathGuard.ts           # Path-arg validation (must resolve under <repoRoot>/data/)
    format.ts              # Cue extraction, verdict rendering, sentence formatting
    probeAttribution.ts    # Probe-source attribution model (substrate vs LLM)
    residuals.ts           # Residual histogram bucket logic
    components/            # BeliefPrimitive, HeuristicCoverage, Validity
  src/routes/
    +page.svelte           # Dashboard: focus + findings + validity + datasets + runs feed
    +error.svelte          # 503 writer-in-progress fallback + generic 4xx/5xx page
    runs/[run_id]/+page.svelte             # Per-run detail with compare-against dropdown
    statements/+page.svelte                # Matrix (paginated, URL-stated)
    statements/[stmt_hash]/+page.svelte    # Per-stmt deep-dive (9-step rail, evidence cards, truth panel)
    export/[run_id]/[kind]/+server.ts      # Belief + model-card download endpoints
    api/datasets/ingest/+server.ts         # SSE-streaming ingest (handles .json + .json.gz)
    api/truth-sets/+server.ts              # Truth-set registration + validity recompute
    api/runs/estimate-cost/+server.ts      # Per-model cost preflight
    api/runs/score/+server.ts              # SSE-streaming score with AbortController kill

data/
  benchmark/
    holdout.jsonl          # 200-record balanced evaluation set
    holdout_large.jsonl    # 4,625-record half-corpus evaluation
    example_pairs.json     # Entity pairs excluded from holdouts
  results/                 # Evaluation results

scripts/
  check_contamination.py        # Pre-eval gate: examples must not overlap holdout
  check_no_version_labels.py    # CI guard: no v{n} labels in src, tests, scripts

.github/workflows/
  ci.yml                        # pytest + both guards on every push and PR

References

Gyori et al. (2023). "Automated assembly of molecular mechanisms at scale from text mining and curated databases." Molecular Systems Biology, e11325. Benchmark corpus: Zenodo 7559353
Gilda — Biomedical entity grounding
INDRA — Integrated Network and Dynamical Reasoning Assembler

Name		Name	Last commit message	Last commit date
Latest commit History 125 Commits
.githooks		.githooks
.github/workflows		.github/workflows
data		data
research		research
scripts		scripts
src		src
tests		tests
viewer		viewer
.gitignore		.gitignore
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

indra-belief-model

What this does

Input

Output

How it works

Production scoring architecture

Two-tier monolithic path

Adaptive few-shot selection

Run the scorer

Design decisions we already paid for

Setup

Dependencies

Model configuration

Usage

Score a Statement's evidence

Composition with INDRA belief

Score a whole corpus + browse the results

Benchmark evaluation against a holdout file

How we iterate

Project structure

References

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

indra-belief-model

What this does

Input

Output

How it works

Production scoring architecture

Two-tier monolithic path

Adaptive few-shot selection

Run the scorer

Design decisions we already paid for

Setup

Dependencies

Model configuration

Usage

Score a Statement's evidence

Composition with INDRA belief

Score a whole corpus + browse the results

Benchmark evaluation against a holdout file

How we iterate

Project structure

References

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages