UBC-NLP/MoCA-Agent

MoCA-Fin — Market-of-Claims Agent for Financial Reasoning

MoCA-Fin (Market-of-Claims Agent) is a multi-agent, code-generating system for financial and tabular reasoning. Instead of free-form multi-agent debate, MoCA-Fin runs a claim market: a question is decomposed into atomic, tradable claims; specialist trader agents buy/sell those claims; the market clears a price and a confidence per claim; and a synthesizer writes an executable Python program from the market-supported claims. A sandboxed executor runs the program to produce the final answer.

Backbone. All headline results use a single Qwen/Qwen3.6-27B backbone served via vLLM (OpenAI-compatible) for every role: the trader panel, the synthesizer, the verifier, and the selector committee. The prompts are identical across datasets — only the data loader changes.


Method at a glance

question + table/context
        │
        ▼
 ┌──────────────────┐   atomic, typed, tradable claims
 │  catalog builder │ ───────────────────────────────────┐
 └──────────────────┘                                     ▼
                                            ┌─────────────────────────────┐
 trader panel (buy/sell orders on claims):  │        claim market          │
   • extractor   (reads cells/values)       │  clears price + confidence   │
   • formula     (proposes the computation) │  per claim (accept/reject)   │
   • accountant  (sign / unit / scale)      └─────────────────────────────┘
   • skeptic     (shorts over-confident claims)            │
                                                            ▼
                                              ┌──────────────────────────┐
                                              │       synthesizer        │  Python program
                                              │ (uses supported claims)  │ ─────────────┐
                                              └──────────────────────────┘              ▼
                                                                              ┌───────────────────┐
        one market-aware repair round on failure  ◀───────────────────────── │ sandboxed executor│
                                                                              └───────────────────┘
 hybrid routing: a baseline proposer + a multi-lens selector committee + a
 conflict arbiter keep the strongest grounded candidate.
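
The clearing step in the diagram can be illustrated with a toy sketch. This is not the code in mocafin/market.py: the `Order` dataclass, the `clear_market` function, the volume-weighted pricing rule, and the 0.5 acceptance threshold are all assumptions chosen only to show how buy/sell orders could turn into a price, a confidence, and an accept/reject decision per claim.

```python
from dataclasses import dataclass


@dataclass
class Order:
    claim_id: str
    trader: str   # e.g. "extractor", "formula", "accountant", "skeptic"
    side: str     # "buy" supports the claim, "sell" shorts it
    size: float   # stake in [0, 1], the trader's conviction


def clear_market(orders, role_weights=None, accept_threshold=0.5):
    """Return {claim_id: (price, confidence, accepted)}.

    Assumed rule: price is the weighted buy volume divided by the total
    weighted volume; confidence is the distance of the price from the
    undecided point 0.5, rescaled to [0, 1].
    """
    role_weights = role_weights or {}
    buy, total = {}, {}
    for o in orders:
        w = role_weights.get(o.trader, 1.0) * o.size
        total[o.claim_id] = total.get(o.claim_id, 0.0) + w
        if o.side == "buy":
            buy[o.claim_id] = buy.get(o.claim_id, 0.0) + w
    cleared = {}
    for cid, vol in total.items():
        price = buy.get(cid, 0.0) / vol if vol > 0 else 0.5
        confidence = abs(price - 0.5) * 2
        cleared[cid] = (price, confidence, price > accept_threshold)
    return cleared


orders = [
    Order("rev_2023=6100", "extractor", "buy", 0.9),
    Order("rev_2023=6100", "skeptic", "sell", 0.1),
    Order("growth=rev_2023/rev_2022", "formula", "buy", 0.3),
    Order("growth=rev_2023/rev_2022", "accountant", "sell", 0.6),
    Order("growth=rev_2023/rev_2022", "skeptic", "sell", 0.6),
]
cleared = clear_market(orders)
# Only market-supported claims would be handed to the synthesizer.
supported = [cid for cid, (_, _, ok) in cleared.items() if ok]
print(supported)
```

In this toy run the extracted value clears at a high price and is kept, while the wrong growth formula is shorted by two traders and rejected before any code is written.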

Headline results

Exact-match / execution accuracy (%). Bold = best overall; prev. best is the strongest previously published number for that track (see the paper for the full leaderboards and citations).

C1 — Financial Numerical Reasoning

| Dataset | MoCA-Fin | Prev. best |
|---|---|---|
| FinQA | **78.29** | 74.18 (Fino1-14B) |
| DocMath-Simplong | **70.00** | 60.00 (GPT-4o) |
| DocMath-Complong | **50.67** | 42.33 (DeepSeek-V3) |
| FinanceMath | **76.00** | 67.0 (GPT-4o PoT) |

C2 — General Tabular Reasoning

| Dataset | MoCA-Fin | Prev. best |
|---|---|---|
| TabMWP | **96.00** | 94.70 (CREATOR) |
| WikiTableQuestions | **81.40** | 80.80 (ARTEMIS-DA) |
| HiTab | 77.27 | **79.10** (SS-CoT) |
| MultiHiertt | **71.17** | 56.78 (Fortune RL w/ CS) |

C3 — Domain Knowledge QA

| Dataset | MoCA-Fin | Prev. best |
|---|---|---|
| ESGenius | **86.88** | 83.80 (Gemma-3 12B + IT + RAG) |

C4 — Multimodal Chart Reasoning (FinChart-Bench)

| Subtask | MoCA-Fin | Prev. best |
|---|---|---|
| True/False | 97.27 | **97.86** (o3) |
| Multiple-Choice | 84.85 | **92.17** (o3) |
| Open QA | **74.65** | 63.59 (Claude Sonnet 4) |
| Average | **85.59** | 84.32 (Claude Sonnet 4) |

The ten benchmarks

| Cat. | Dataset | --dataset key | Eval N | Split | Modality |
|---|---|---|---|---|---|
| C1 | FinQA | finqa | 1,147 | test | text + table |
| C1 | DocMath-Simplong | dm_simplong | 100 | testmini | text + multi-table |
| C1 | DocMath-Complong | dm_complong | 300 | testmini | text + multi-table |
| C1 | FinanceMath | financemath | 200 | validation | text + table |
| C2 | HiTab | hitab | 1,584 | test | hierarchical table |
| C2 | MultiHiertt | multihiertt | 1,044 | dev | text + multi-hier. table |
| C2 | TabMWP | tabmwp | 1,000 | test (1k) | semi-structured table |
| C2 | WikiTableQuestions | wikitq | 4,344 | test | semi-structured table |
| C3 | ESGenius | esgenius | 1,136 | full | text (MCQ) |
| C4 | FinChart-Bench (TF/MC/QA) | finchart_bench | 2,384 / 2,350 / 2,284 | full | chart image |

See docs/datasets.md for the per-dataset data paths, sources, and exact run commands.


Installation

git clone <this-repo> mocafin-fin && cd mocafin-fin
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
# To self-host the backbone on a GPU node, also:
pip install -r requirements-serve.txt   # vLLM

Python 3.10+ is recommended. The agent itself only needs an OpenAI-compatible endpoint (vLLM or Azure OpenAI); vllm is required only on the machine that serves the model.


1. Download the data

Nothing is redistributed here — the helper fetches each benchmark from its original source (GitHub / Hugging Face).

# everything
python scripts/download_data.py --datasets all
# or a subset
python scripts/download_data.py --datasets finqa tabmwp finchart_bench

Notes:

  • FinanceMath is a gated HF dataset — accept the terms on its HF page and run huggingface-cli login (or export HF_TOKEN=...) first.
  • Both DocMath subsets (dm_simplong and dm_complong) come from the single yale-nlp/DocMath-R snapshot.

Data lands under data/raw/<dataset>/ exactly where the loaders expect it (mocafin/data.py::COMMON_DATA_PATHS).

2. Serve the backbone with vLLM

On a 4×H100 node:

export LLM_BACKEND=vllm
export VLLM_MODEL_NAME=Qwen/Qwen3.6-27B
CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve Qwen/Qwen3.6-27B \
    --port 8000 --tensor-parallel-size 4 --max-model-len 65536 \
    --gpu-memory-utilization 0.85 --reasoning-parser qwen3 \
    --trust-remote-code --limit-mm-per-prompt '{"image": 8}'

On SLURM, use scripts/serve_vllm.slurm. The same multimodal server handles both text and the FinChart vision pass. Full notes (cache redirection, long context, smoke tests) are in docs/vllm_setup.md.

3. Point the agent at the server

export LLM_BACKEND=vllm
export VLLM_BASE_URL=http://localhost:8000/v1
export VLLM_API_KEY=EMPTY
export VLLM_MODEL_NAME=Qwen/Qwen3.6-27B
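
Before running the agent, it can help to confirm that the endpoint answers. The sketch below is not part of the repository: `models_request` and `served_model_ids` are hypothetical helper names. It relies only on the standard `/v1/models` route of an OpenAI-compatible server and on the environment variables exported above, using the Python standard library.

```python
import json
import os
import urllib.request


def models_request(base_url=None, api_key=None):
    """Build a GET request for the OpenAI-compatible /models route."""
    base_url = base_url or os.environ.get("VLLM_BASE_URL", "http://localhost:8000/v1")
    api_key = api_key or os.environ.get("VLLM_API_KEY", "EMPTY")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )


def served_model_ids(timeout=5.0):
    """Return the model ids the server reports (requires a running server)."""
    with urllib.request.urlopen(models_request(), timeout=timeout) as resp:
        payload = json.load(resp)
    return [m["id"] for m in payload.get("data", [])]


# Usage, once the server from step 2 is up:
#   ids = served_model_ids()
#   assert os.environ["VLLM_MODEL_NAME"] in ids, ids
```

If the served id does not match VLLM_MODEL_NAME, requests from the agent would likely be rejected by the server, so checking this first saves a failed evaluation run.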

Single-example demo:

python -m mocafin.demo \
  --question "What is the YoY revenue growth from 2022 to 2023?" \
  --table "| Year | Revenue |
|------|--------|
| 2022 | 5200 |
| 2023 | 6100 |"
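
For reference, the arithmetic behind this demo question is small enough to check by hand. The snippet below is only an illustration of the kind of program a synthesizer could emit for it; the variable names are made up and the agent's actual output may look different.

```python
# Values read from the demo table above.
revenue = {2022: 5200, 2023: 6100}

# Year-over-year growth: (new - old) / old, expressed in percent.
growth_pct = (revenue[2023] - revenue[2022]) / revenue[2022] * 100

print(f"{growth_pct:.2f}%")  # 17.31%
```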

4. Run an evaluation

python -m mocafin.evaluate \
  --dataset finqa \
  --data_path data/raw/finqa/test.json \
  --output results/finqa.jsonl \
  --workers 16

This writes per-example records to results/finqa.jsonl and a summary to results/finqa_metrics.json (answer accuracy + the three internal diagnostics: code-execution rate, code–answer consistency, mean market confidence). Runs are resumable — re-running the same command continues from the existing file.
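
The resume behaviour can be pictured with a short sketch. It is an assumption about the general pattern, not the code in mocafin/evaluate.py: the record fields `id`, `answer`, and `correct`, the helper names, and the toy `solve` function are hypothetical and do not describe the real schema of results/finqa.jsonl.

```python
import json
import tempfile
from pathlib import Path


def finished_ids(path):
    """Collect the ids already present in a JSONL results file."""
    path = Path(path)
    if not path.exists():
        return set()
    with path.open() as f:
        return {json.loads(line)["id"] for line in f if line.strip()}


def run_resumable(examples, path, solve):
    """Append one record per unfinished example; skip ids already on disk."""
    done = finished_ids(path)
    with Path(path).open("a") as f:
        for ex in examples:
            if ex["id"] in done:
                continue
            record = {"id": ex["id"], "correct": solve(ex) == ex["answer"]}
            f.write(json.dumps(record) + "\n")
            f.flush()  # keep finished work if the run is interrupted


def accuracy(path):
    """Answer accuracy in percent over all records in the file."""
    with Path(path).open() as f:
        records = [json.loads(line) for line in f if line.strip()]
    return 100.0 * sum(r["correct"] for r in records) / len(records)


examples = [
    {"id": "a", "x": 2, "answer": 4},
    {"id": "b", "x": 3, "answer": 6},
    {"id": "c", "x": 5, "answer": 11},
]
solve = lambda ex: 2 * ex["x"]  # toy stand-in for the agent

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "toy.jsonl"
    run_resumable(examples[:2], out, solve)  # first run stops after two examples
    run_resumable(examples, out, solve)      # rerun: only "c" is new
    n_lines = len(out.read_text().splitlines())
    acc = accuracy(out)

print(n_lines, round(acc, 2))
```

Appending and flushing per example means an interrupted job loses at most the example in flight, and the rerun never writes a duplicate record.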

FinChart-Bench (multimodal) additionally needs the vision pass:

python -m mocafin.evaluate --dataset finchart_bench \
  --data_path data/raw/finchart_bench_qa_only \
  --vision_transcribe question_conditioned \
  --output results/finchart_qa.jsonl --workers 4

Reproduce everything in one SLURM job

scripts/run_eval.slurm launches vLLM, waits for it, runs the requested benchmarks, and shuts the server down:

mkdir -p logs results
# edit the SBATCH account/partition for your cluster, then:
sbatch --account=YOUR_ACCOUNT scripts/run_eval.slurm                  # all text tracks
sbatch --account=YOUR_ACCOUNT \
  --export=ALL,DATASETS="finqa tabmwp" scripts/run_eval.slurm         # just two

Tunable env vars: MODEL, PORT, TP, MAXLEN, WORKERS, DATASETS, DATA_ROOT, OUT, CACHE_ROOT.


Configuration

All configuration is via environment variables — no secrets are stored in the source. The most important ones:

| Variable | Default | Meaning |
|---|---|---|
| LLM_BACKEND | azure | vllm for a local OpenAI-compatible server |
| VLLM_BASE_URL | http://localhost:8000/v1 | server URL |
| VLLM_MODEL_NAME | Qwen/Qwen3.6-27B | served model id |
| VLLM_VISION_BASE_URL / VLLM_VISION_MODEL_NAME | = text values | vision endpoint (FinChart) |
| MOCAFIN_MAX_CLAIMS | 10 | max claims per question |
| MOCAFIN_REPAIR_ROUNDS | 1 | market-aware code repair rounds |
| MOCAFIN_EXECUTION_TIMEOUT | 10 | sandbox timeout (s) |

The full set (trader role weights, market thresholds, selector/arbiter thresholds) is in mocafin/config.py; the paper settings are the defaults.
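
To make the interplay of MOCAFIN_EXECUTION_TIMEOUT and MOCAFIN_REPAIR_ROUNDS concrete, here is a minimal sketch. It is not the implementation in mocafin/config.py or mocafin/executor.py: `load_settings`, `run_program`, `execute_with_repair`, and the `repair` callback are hypothetical names, and a plain subprocess is used as a stand-in that enforces a timeout but is not a real sandbox.

```python
import os
import subprocess
import sys


def load_settings(env=None):
    """Read the documented variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return {
        "max_claims": int(env.get("MOCAFIN_MAX_CLAIMS", "10")),
        "repair_rounds": int(env.get("MOCAFIN_REPAIR_ROUNDS", "1")),
        "execution_timeout": float(env.get("MOCAFIN_EXECUTION_TIMEOUT", "10")),
    }


def run_program(source, timeout):
    """Run source in a fresh, isolated interpreter; return (ok, output)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", source],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, f"timed out after {timeout}s"
    if proc.returncode != 0:
        return False, proc.stderr.strip()
    return True, proc.stdout.strip()


def execute_with_repair(source, repair, settings):
    """Execute once, then allow up to `repair_rounds` repaired retries."""
    ok, out = run_program(source, settings["execution_timeout"])
    for _ in range(settings["repair_rounds"]):
        if ok:
            break
        source = repair(source, out)  # hypothetical: an LLM call in practice
        ok, out = run_program(source, settings["execution_timeout"])
    return ok, out


settings = load_settings({})  # empty mapping, so the defaults apply
ok, out = execute_with_repair(
    "print(1 / 0)",                    # fails with ZeroDivisionError
    lambda src, err: "print(6100 - 5200)",  # toy repair that ignores the error
    settings,
)
print(ok, out)
```

With the default of one repair round, a program that still fails after its single retry is reported as a failure rather than retried indefinitely.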

To use Azure OpenAI instead of vLLM, set LLM_BACKEND=azure and provide AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_API_KEY, AZURE_DEPLOYMENT_NAME.


Repository layout

mocafin-release/
├── mocafin/                 # the agent package (paper datasets only)
│   ├── agent.py             # MoCAFinAgent orchestration
│   ├── market.py            # claim market clearing
│   ├── models.py            # claim / market / program dataclasses
│   ├── prompts.py           # role prompts (catalog, traders, synth, verify)
│   ├── verifier.py          # operation-aware structural checks
│   ├── executor.py          # sandboxed Python execution + answer matching
│   ├── llm_client.py        # Azure / vLLM OpenAI-compatible client
│   ├── data.py              # loaders for the 10 paper benchmarks
│   ├── evaluate.py          # evaluation CLI
│   ├── demo.py              # single-example CLI
│   └── vision/              # image → text transcription for FinChart
├── scripts/
│   ├── download_data.py     # fetch the 10 benchmarks from upstream
│   ├── serve_vllm.slurm     # serve Qwen3.6-27B on 4×H100
│   └── run_eval.slurm       # serve + evaluate + teardown, one job
├── docs/
│   ├── vllm_setup.md
│   └── datasets.md
├── requirements.txt
├── requirements-serve.txt   # vLLM (server side only)
└── LICENSE

Citation

Coming soon.

License

Code: MIT (see LICENSE). Each benchmark is governed by its own upstream license — consult and comply with it before use. No dataset is redistributed by this repository.
