MoCA-Fin (Market-of-Claims Agent) is a multi-agent, code-generating system for financial and tabular reasoning. Instead of free-form multi-agent debate, MoCA-Fin runs a claim market: a question is decomposed into atomic, tradable claims; specialist trader agents buy/sell those claims; the market clears a price and a confidence per claim; and a synthesizer writes an executable Python program from the market-supported claims. A sandboxed executor runs the program to produce the final answer.
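
To make the claim-market idea concrete, here is a minimal sketch of one way a market could clear. It is not the code in `mocafin/market.py`: the `Order` fields, the weighting rule, the `accept_threshold`, and the definition of confidence as the agreement margin are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Order:
    trader: str        # e.g. "extractor", "formula", "accountant", "skeptic"
    claim_id: str
    side: str          # "buy" supports the claim, "sell" shorts it
    confidence: float  # trader's own confidence in the order, in [0, 1]


def clear_market(orders, role_weights=None, accept_threshold=0.5):
    """Clear each claim to a price, a confidence, and an accept/reject flag.

    Hypothetical rule: price is the weighted share of buy mass; confidence is
    how one-sided the book is (0 = evenly split, 1 = unanimous).
    """
    role_weights = role_weights or {}
    totals = {}  # claim_id -> [weighted buy mass, weighted total mass]
    for o in orders:
        weight = role_weights.get(o.trader, 1.0) * o.confidence
        book = totals.setdefault(o.claim_id, [0.0, 0.0])
        if o.side == "buy":
            book[0] += weight
        book[1] += weight

    cleared = {}
    for claim_id, (buy, total) in totals.items():
        price = buy / total if total > 0 else 0.0
        cleared[claim_id] = {
            "price": price,
            "confidence": abs(2 * price - 1),
            "accepted": price >= accept_threshold,
        }
    return cleared


orders = [
    Order("extractor", "c1", "buy", 0.9),
    Order("formula", "c1", "buy", 0.8),
    Order("skeptic", "c1", "sell", 0.3),
    Order("extractor", "c2", "buy", 0.4),
    Order("skeptic", "c2", "sell", 0.9),
]
market = clear_market(orders)
supported = [cid for cid, c in market.items() if c["accepted"]]
```

In this toy book only `c1` clears above the threshold, so a synthesizer would build its program from `c1` alone.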
Backbone. All headline results use a single Qwen/Qwen3.6-27B backbone served via vLLM (OpenAI-compatible) for every role: the trader panel, the synthesizer, the verifier, and the selector committee. The prompts are identical across datasets — only the data loader changes.
question + table/context
         │
         ▼
┌──────────────────┐   atomic, typed, tradable claims
│ catalog builder  │   ───────────────────────────────────┐
└──────────────────┘                                      ▼
                                           ┌─────────────────────────────┐
trader panel (buy/sell orders on claims):  │        claim market         │
  • extractor  (reads cells/values)        │  clears price + confidence  │
  • formula    (proposes the computation)  │  per claim (accept/reject)  │
  • accountant (sign / unit / scale)       └─────────────────────────────┘
  • skeptic    (shorts over-confident claims)             │
                                                          ▼
                                            ┌──────────────────────────┐
                                            │       synthesizer        │  Python program
                                            │ (uses supported claims)  │ ─────────────┐
                                            └──────────────────────────┘              ▼
                                                                            ┌───────────────────┐
one market-aware repair round on failure ◀───────────────────────────────── │ sandboxed executor│
                                                                            └───────────────────┘
hybrid routing: a baseline proposer + a multi-lens selector committee + a
conflict arbiter keep the strongest grounded candidate.
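
The execute-then-repair step at the bottom of the diagram can be sketched as follows. This is a simplified illustration, not the contents of `mocafin/executor.py`: `run_program`, `answer_with_repair`, and the `synthesize`/`repair` callables are hypothetical names, and a subprocess with a timeout is only a stand-in for real sandboxing.

```python
import subprocess
import sys


def run_program(source, timeout_s=10.0):
    """Run `source` in a separate isolated interpreter.

    Returns (ok, text): stdout on success, otherwise stderr or "timeout".
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", source],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False, "timeout"
    if proc.returncode != 0:
        return False, proc.stderr.strip()
    return True, proc.stdout.strip()


def answer_with_repair(synthesize, repair, repair_rounds=1):
    """Execute the synthesized program, allowing a bounded number of repairs."""
    source = synthesize()
    ok, out = run_program(source)
    for _ in range(repair_rounds):
        if ok:
            break
        # Hypothetical hook: in a real system this would be an LLM call that
        # sees the failing program, the error, and the market state.
        source = repair(source, out)
        ok, out = run_program(source)
    return out if ok else None


# Toy programs: the first draft has a syntax error, the repair fixes it.
broken = "print(round((6100 - 5200) / 5200 * 100, 2)"
fixed = "print(round((6100 - 5200) / 5200 * 100, 2))"
result = answer_with_repair(lambda: broken, lambda src, err: fixed)
print(result)  # 17.31
```

Bounding the loop with `repair_rounds` keeps the cost of a failing example predictable: at most one extra synthesis and one extra execution with the default of 1.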
Exact-match / execution accuracy (%). Bold = best overall; prev. best is the strongest previously published number for that track (see the paper for the full leaderboards and citations).

C1 datasets:

| Dataset | MoCA-Fin | Prev. best |
|---|---|---|
| FinQA | 78.29 | 74.18 (Fino1-14B) |
| DocMath-Simplong | 70.00 | 60.00 (GPT-4o) |
| DocMath-Complong | 50.67 | 42.33 (DeepSeek-V3) |
| FinanceMath | 76.00 | 67.0 (GPT-4o PoT) |

C2 datasets:

| Dataset | MoCA-Fin | Prev. best |
|---|---|---|
| TabMWP | 96.00 | 94.70 (CREATOR) |
| WikiTableQuestions | 81.40 | 80.80 (ARTEMIS-DA) |
| HiTab | 77.27 | 79.10 (SS-CoT) |
| MultiHiertt | 71.17 | 56.78 (Fortune RL w/ CS) |

C3 dataset:

| Dataset | MoCA-Fin | Prev. best |
|---|---|---|
| ESGenius | 86.88 | 83.80 (Gemma-3 12B + IT + RAG) |

C4, FinChart-Bench by subtask:

| Subtask | MoCA-Fin | Prev. best |
|---|---|---|
| True/False | 97.27 | 97.86 (o3) |
| Multiple-Choice | 84.85 | 92.17 (o3) |
| Open QA | 74.65 | 63.59 (Claude Sonnet 4) |
| Average | 85.59 | 84.32 (Claude Sonnet 4) |
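
The MoCA-Fin figure in the Average row is consistent with an unweighted mean of the three subtask scores. A quick check, using only the numbers from the table above:

```python
# Subtask accuracies (%) for MoCA-Fin, copied from the table above.
subtasks = {"True/False": 97.27, "Multiple-Choice": 84.85, "Open QA": 74.65}

# Unweighted (macro) average over subtasks, ignoring the per-subtask sample counts.
macro_avg = sum(subtasks.values()) / len(subtasks)
print(f"{macro_avg:.2f}")  # 85.59
```

Because the mean is unweighted, each subtask counts equally regardless of how many examples it contains.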
| Cat. | Dataset | `--dataset` key | Eval N | Split | Modality |
|---|---|---|---|---|---|
| C1 | FinQA | `finqa` | 1,147 | test | text + table |
| C1 | DocMath-Simplong | `dm_simplong` | 100 | testmini | text + multi-table |
| C1 | DocMath-Complong | `dm_complong` | 300 | testmini | text + multi-table |
| C1 | FinanceMath | `financemath` | 200 | validation | text + table |
| C2 | HiTab | `hitab` | 1,584 | test | hierarchical table |
| C2 | MultiHiertt | `multihiertt` | 1,044 | dev | text + multi-hier. table |
| C2 | TabMWP | `tabmwp` | 1,000 | test (1k) | semi-structured table |
| C2 | WikiTableQuestions | `wikitq` | 4,344 | test | semi-structured table |
| C3 | ESGenius | `esgenius` | 1,136 | full | text (MCQ) |
| C4 | FinChart-Bench (TF/MC/QA) | `finchart_bench` | 2,384 / 2,350 / 2,284 | full | chart image |

See docs/datasets.md for the per-dataset data paths,
sources, and exact run commands.
git clone <this-repo> mocafin-fin && cd mocafin-fin
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
# To self-host the backbone on a GPU node, also:
pip install -r requirements-serve.txt   # vLLM

Python 3.10+ is recommended. The agent itself only needs an OpenAI-compatible
endpoint (vLLM or Azure OpenAI); vllm is required only on the machine that
serves the model.
Nothing is redistributed here — the helper fetches each benchmark from its original source (GitHub / Hugging Face).
# everything
python scripts/download_data.py --datasets all
# or a subset
python scripts/download_data.py --datasets finqa tabmwp finchart_bench

Notes:
- FinanceMath is a gated HF dataset: accept the terms on its HF page and
  run `huggingface-cli login` (or `export HF_TOKEN=...`) first.
- DocMath (`dm_simplong` + `dm_complong`): both come from the single
  `yale-nlp/DocMath-R` snapshot.

Data lands under data/raw/<dataset>/ exactly where the loaders expect it
(mocafin/data.py::COMMON_DATA_PATHS).
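
Before launching a long run it can be useful to confirm that the downloaded files are where you expect them. The helper below is a hypothetical sketch, not part of the repository; it takes explicit paths because the layout under `data/raw/` is defined by the loaders, and the two example paths are simply the ones used in the evaluation commands in this README.

```python
from pathlib import Path


def missing_paths(paths):
    """Return the paths that do not exist, or that are empty directories."""
    missing = []
    for raw in paths:
        p = Path(raw)
        if not p.exists():
            missing.append(raw)
        elif p.is_dir() and not any(p.iterdir()):
            missing.append(raw)
    return missing


if __name__ == "__main__":
    expected = [
        "data/raw/finqa/test.json",
        "data/raw/finchart_bench_qa_only",
    ]
    for path in missing_paths(expected):
        print(f"missing or empty: {path}")
```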
On a 4×H100 node:
export LLM_BACKEND=vllm
export VLLM_MODEL_NAME=Qwen/Qwen3.6-27B
CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve Qwen/Qwen3.6-27B \
--port 8000 --tensor-parallel-size 4 --max-model-len 65536 \
--gpu-memory-utilization 0.85 --reasoning-parser qwen3 \
  --trust-remote-code --limit-mm-per-prompt '{"image": 8}'

On SLURM, use scripts/serve_vllm.slurm. The same
multimodal server handles both text and the FinChart vision pass. Full notes
(cache redirection, long context, smoke tests) are in
docs/vllm_setup.md.
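
Once the server is up, a quick reachability check can save a failed evaluation run. The sketch below is an illustration rather than the repository's own smoke test: `endpoint_config` and `served_models` are hypothetical helpers, the fallback values are the ones shown in this README, and the check relies on the `GET /models` route of the OpenAI-compatible API.

```python
import json
import os
import urllib.request


def endpoint_config(env=os.environ):
    """Collect client settings, falling back to the values used in this README."""
    return {
        "base_url": env.get("VLLM_BASE_URL", "http://localhost:8000/v1").rstrip("/"),
        "api_key": env.get("VLLM_API_KEY", "EMPTY"),
        "model": env.get("VLLM_MODEL_NAME", "Qwen/Qwen3.6-27B"),
    }


def served_models(cfg, timeout_s=5.0):
    """Return the model ids reported by the server's GET /models route."""
    request = urllib.request.Request(
        f"{cfg['base_url']}/models",
        headers={"Authorization": f"Bearer {cfg['api_key']}"},
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        payload = json.load(response)
    return [model["id"] for model in payload.get("data", [])]


if __name__ == "__main__":
    cfg = endpoint_config()
    models = served_models(cfg)  # raises URLError if the server is unreachable
    status = "ok" if cfg["model"] in models else "model not served"
    print(f"{cfg['base_url']}: {status} ({models})")
```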
export LLM_BACKEND=vllm
export VLLM_BASE_URL=http://localhost:8000/v1
export VLLM_API_KEY=EMPTY
export VLLM_MODEL_NAME=Qwen/Qwen3.6-27B

Single-example demo:
python -m mocafin.demo \
--question "What is the YoY revenue growth from 2022 to 2023?" \
--table "| Year | Revenue |
|------|--------|
| 2022 | 5200 |
| 2023 | 6100 |"

Benchmark evaluation:
python -m mocafin.evaluate \
--dataset finqa \
--data_path data/raw/finqa/test.json \
--output results/finqa.jsonl \
  --workers 16

This writes per-example records to results/finqa.jsonl and a summary to
results/finqa_metrics.json (answer accuracy plus the three internal diagnostics:
code-execution rate, code–answer consistency, mean market confidence). Runs are
resumable: re-running the same command continues from the existing file.
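
Resumable evaluation over a JSONL file usually comes down to skipping ids that are already on disk. The sketch below shows that pattern under assumptions: the record fields (`id`, `answer`, `correct`) and the function names are hypothetical and are not the schema written by `mocafin.evaluate`.

```python
import json
from pathlib import Path


def load_done_ids(path):
    """Return the ids already recorded in a JSONL results file (empty if absent)."""
    p = Path(path)
    if not p.exists():
        return set()
    with p.open() as f:
        return {json.loads(line)["id"] for line in f if line.strip()}


def run_resumable(examples, solve, path):
    """Append one record per unfinished example, skipping ids already on disk."""
    done = load_done_ids(path)
    with Path(path).open("a") as f:
        for ex in examples:
            if ex["id"] in done:
                continue
            record = {"id": ex["id"], "correct": solve(ex) == ex["answer"]}
            f.write(json.dumps(record) + "\n")
            f.flush()  # keep the file usable if the job is interrupted


def accuracy(path):
    """Answer accuracy (%) over all records in the results file."""
    with Path(path).open() as f:
        rows = [json.loads(line) for line in f if line.strip()]
    return 100.0 * sum(r["correct"] for r in rows) / len(rows)
```

Appending and flushing after every example means an interrupted job loses at most the example in flight.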
FinChart-Bench (multimodal) additionally needs the vision pass:
python -m mocafin.evaluate --dataset finchart_bench \
--data_path data/raw/finchart_bench_qa_only \
--vision_transcribe question_conditioned \
  --output results/finchart_qa.jsonl --workers 4

scripts/run_eval.slurm launches vLLM, waits for it,
runs the requested benchmarks, and shuts the server down:
mkdir -p logs results
# edit the SBATCH account/partition for your cluster, then:
sbatch --account=YOUR_ACCOUNT scripts/run_eval.slurm # all text tracks
sbatch --account=YOUR_ACCOUNT \
  --export=ALL,DATASETS="finqa tabmwp" scripts/run_eval.slurm   # just two

Tunable env vars: `MODEL`, `PORT`, `TP`, `MAXLEN`, `WORKERS`, `DATASETS`, `DATA_ROOT`, `OUT`, `CACHE_ROOT`.

All configuration is via environment variables — no secrets are stored in the source. The most important ones:
| Variable | Default | Meaning |
|---|---|---|
| `LLM_BACKEND` | `azure` | `vllm` for a local OpenAI-compatible server |
| `VLLM_BASE_URL` | `http://localhost:8000/v1` | server URL |
| `VLLM_MODEL_NAME` | `Qwen/Qwen3.6-27B` | served model id |
| `VLLM_VISION_BASE_URL` / `VLLM_VISION_MODEL_NAME` | = text values | vision endpoint (FinChart) |
| `MOCAFIN_MAX_CLAIMS` | `10` | max claims per question |
| `MOCAFIN_REPAIR_ROUNDS` | `1` | market-aware code repair rounds |
| `MOCAFIN_EXECUTION_TIMEOUT` | `10` | sandbox timeout (s) |

The full set (trader role weights, market thresholds, selector/arbiter
thresholds) is in mocafin/config.py; the paper settings
are the defaults.
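
Environment-only configuration of this kind is typically read once into a typed object. The sketch below illustrates the pattern with the defaults from the table above; the `Settings` class and `load_settings` function are hypothetical and do not reproduce `mocafin/config.py`.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    llm_backend: str
    max_claims: int
    repair_rounds: int
    execution_timeout_s: float


def load_settings(env=os.environ):
    """Read settings from the environment, using the documented defaults."""
    return Settings(
        llm_backend=env.get("LLM_BACKEND", "azure"),
        max_claims=int(env.get("MOCAFIN_MAX_CLAIMS", "10")),
        repair_rounds=int(env.get("MOCAFIN_REPAIR_ROUNDS", "1")),
        execution_timeout_s=float(env.get("MOCAFIN_EXECUTION_TIMEOUT", "10")),
    )


defaults = load_settings({})
local = load_settings({"LLM_BACKEND": "vllm", "MOCAFIN_REPAIR_ROUNDS": "0"})
```

Passing the environment as an argument keeps the loader easy to test without touching the real process environment.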
To use Azure OpenAI instead of vLLM, set LLM_BACKEND=azure and provide
AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_API_KEY, AZURE_DEPLOYMENT_NAME.
mocafin-release/
├── mocafin/                  # the agent package (paper datasets only)
│   ├── agent.py              # MoCAFinAgent orchestration
│   ├── market.py             # claim market clearing
│   ├── models.py             # claim / market / program dataclasses
│   ├── prompts.py            # role prompts (catalog, traders, synth, verify)
│   ├── verifier.py           # operation-aware structural checks
│   ├── executor.py           # sandboxed Python execution + answer matching
│   ├── llm_client.py         # Azure / vLLM OpenAI-compatible client
│   ├── data.py               # loaders for the 10 paper benchmarks
│   ├── evaluate.py           # evaluation CLI
│   ├── demo.py               # single-example CLI
│   └── vision/               # image → text transcription for FinChart
├── scripts/
│   ├── download_data.py      # fetch the 10 benchmarks from upstream
│   ├── serve_vllm.slurm      # serve Qwen3.6-27B on 4×H100
│   └── run_eval.slurm        # serve + evaluate + teardown, one job
├── docs/
│   ├── vllm_setup.md
│   └── datasets.md
├── requirements.txt
├── requirements-serve.txt    # vLLM (server side only)
└── LICENSE

Code: MIT (see LICENSE). Each benchmark is governed by its own
upstream license; consult and comply with it before use. No dataset is
redistributed by this repository.

