Skip to content

Alee08/qrmax

Repository files navigation

QR-MAX

This repository contains the reproducibility package for:

Model-Based Reinforcement Learning in Discrete-Action Non-Markovian Reward Decision Processes

Paper: https://arxiv.org/abs/2512.14617

Reusable implementation code lives in the companion library:

https://github.com/Alee08/multiagent-rl-rm

qrmax keeps only the experiment-facing pieces: pinned dependency metadata, OfficeWorld experiment matrices, launchers, reproducibility notes, and post-processing scripts.

Dependency Freeze

This repository is pinned to the companion-library commit:

  • package: multiagent-rl-rm
  • package version: 0.3.0
  • OfficeWorld IJCAI tag: v0.3.0-ijcai2026
  • pinned commit: c1c379ce57573f075de7c25e346d665b4475ded3

The default requirements.txt installs the companion library from this frozen commit. The pinned commit includes the OfficeWorld IJCAI code, the continuous-line checks, the continuous-corridor event-aligned and transition-probed Bucket QR-MAX modes, and the continuous Frozen Lake comparison suites.

Install

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Sanity Checks

Validate configured suite sizes:

python scripts/validate_config.py

Dry-run the smoke suite:

python scripts/reproduce_officeworld.py --suite smoke --dry-run

Run a short smoke experiment:

python scripts/reproduce_officeworld.py --suite smoke

Validate the continuous-line Bucket QR-MAX suites:

python scripts/validate_continuous_line_config.py

Validate the continuous-corridor Bucket QR-MAX suites:

python scripts/validate_continuous_corridor_config.py

Validate the continuous-FrozenLake Bucket QR-MAX suites:

python scripts/validate_continuous_frozen_lake_config.py

Run the continuous-line smoke experiment:

python scripts/reproduce_continuous_line.py --suite continuous_line_smoke

Run the continuous-corridor smoke experiment:

python scripts/reproduce_continuous_corridor.py --suite continuous_corridor_smoke

Run the continuous-FrozenLake smoke experiment:

python scripts/reproduce_continuous_frozen_lake.py --suite continuous_frozen_lake_smoke

Experiment Suites

Suites are defined in configs/officeworld_discrete.json.

Suite Runs Purpose
smoke 1 Fast local/CI sanity check.
paper_main 300 Three main OfficeWorld configurations.
paper_table6 500 Five configurations summarized in the paper table.
paper_appendix_15 4500 Appendix sweep over map1-map3, exp1-exp5, 30 seeds.
officeworld_discrete 2100 Full encoded OfficeWorld sweep.

Run a suite:

python scripts/reproduce_officeworld.py --suite paper_main

Filter a larger suite without editing JSON:

python scripts/reproduce_officeworld.py \
  --suite officeworld_discrete \
  --algorithms QRMAX QRMAXRM \
  --maps map1 map2 \
  --seeds 0 1 2

The configured algorithm identifiers are the ones exposed by the frozen OfficeWorld runner: QL, QRM, RMAX, RMAXRM, QRMAX, QRMAXRM, UCBVI-sB, UCBVI-B, UCBVI-H, and OPSRL.

Continuous-Line Bucket QR-MAX

The companion library also exposes a small continuous-state NMRDP sanity check for Bucket QR-MAX. The task uses a one-dimensional continuous state, two discrete actions, and a Reward Machine sequence A -> B.

Suites are defined in configs/continuous_line_bucket_qrmax.json.

Suite Runs Purpose
continuous_line_smoke 1 Fast event-aware QR-MAX sanity check.
continuous_algorithm_comparison 3 Compare QR-MAX, Q-learning, and R-MAX.
continuous_bucket_sweep 2 Sweep successful bucket granularities for QR-MAX.
continuous_noise_sweep 4 Sweep transition noise for QR-MAX.

Run the QR-MAX bucket sweep:

python scripts/reproduce_continuous_line.py --suite continuous_bucket_sweep

The included reference sweep is paper_results/continuous_line_bucket_qrmax_reference.csv.

Continuous-Corridor Bucket QR-MAX

The companion library also exposes a 2D continuous-state corridor check for Bucket QR-MAX. The task uses state (x, y), four discrete actions, low transition noise, and a Reward Machine sequence A -> B.

Suites are defined in configs/continuous_corridor_bucket_qrmax.json.

Suite Runs Purpose
continuous_corridor_smoke 1 Fast event-aware QR-MAX sanity check.
continuous_corridor_algorithm_comparison 4 Compare Q-learning, QRM, R-MAX, and QR-MAX.
continuous_corridor_hard_algorithm_comparison 4 Hard reset/noise algorithm comparison.
continuous_corridor_bottleneck_smoke 1 Deterministic bottleneck sanity check.
continuous_corridor_bottleneck_transition_probed 1 Noisy bottleneck solved with transition-probed buckets.
continuous_corridor_abc_algorithm_comparison 3 A-B-C sequence comparison with event-aligned buckets.

Run the QR-MAX corridor algorithm comparison:

python scripts/reproduce_continuous_corridor.py --suite continuous_corridor_algorithm_comparison

Run the A-B-C stress comparison:

python scripts/reproduce_continuous_corridor.py --suite continuous_corridor_abc_algorithm_comparison

Run the noisy bottleneck transition-probed suite:

python scripts/reproduce_continuous_corridor.py --suite continuous_corridor_bottleneck_transition_probed

The included reference sweep is paper_results/continuous_corridor_bucket_qrmax_reference.csv.

Continuous-FrozenLake Bucket QR-MAX

The companion library also exposes a continuous-coordinate Frozen Lake check. The task keeps the map1 Frozen Lake layout and Reward Machine sequence A -> B -> C, but the agent state is continuous (x, y) and is mapped to finite buckets for the tabular algorithms.

Suites are defined in configs/continuous_frozen_lake_bucket_qrmax.json.

Suite Runs Purpose
continuous_frozen_lake_smoke 1 Fast QR-MAX sanity check.
continuous_frozen_lake_abc_algorithm_comparison 4 Compare Q-learning, QRM, R-MAX, and QR-MAX on deterministic A-B-C.
continuous_frozen_lake_light_noise_comparison 3 Compare Q-learning, QRM, and QR-MAX with light continuous transition noise.

Run the A-B-C algorithm comparison:

python scripts/reproduce_continuous_frozen_lake.py --suite continuous_frozen_lake_abc_algorithm_comparison

Run the light-noise comparison:

python scripts/reproduce_continuous_frozen_lake.py --suite continuous_frozen_lake_light_noise_comparison

The included reference sweep is paper_results/continuous_frozen_lake_bucket_qrmax_reference.csv.

Continuous-Corridor Diagnostics

The continuous-corridor reference CSV includes controlled stress cases. The hard reset/noise and A-B-C settings use event-aligned buckets, which add bucket edges at RM event boundaries. The noisy bottleneck setting uses transition-probed buckets: the runner probes one-step transition outcomes, refines recurring dynamics boundaries by bisection, and trains QR-MAX on the resulting finite abstraction.

These modes avoid relying on a fixed uniform grid. QRM is included as a tabular Reward Machine baseline; deep baselines such as DQN are not part of the pinned reproducibility package because they require a separate neural training stack and tuning protocol.

Outputs

The upstream OfficeWorld runner writes raw logs under results/ and summary text files named results_<map>_<experiment>.txt. These files are ignored by Git.

Generate a compact aggregate CSV:

python scripts/summarize_officeworld.py \
  --results-dir results \
  --output paper_results/officeworld_summary.csv

The reported Table 6 OfficeWorld values from the paper are included as paper_results/table6_officeworld_steps.csv for reference. The extended 15-configuration OfficeWorld breakdown is included as paper_results/table4_officeworld_15_configs.csv. Continuous-line Bucket QR-MAX reference sweeps are included as paper_results/continuous_line_bucket_qrmax_reference.csv. Continuous-corridor Bucket QR-MAX reference sweeps are included as paper_results/continuous_corridor_bucket_qrmax_reference.csv. Continuous-FrozenLake Bucket QR-MAX reference sweeps are included as paper_results/continuous_frozen_lake_bucket_qrmax_reference.csv.

Table 4: OfficeWorld 15-Configuration Breakdown

Training steps to reach the optimal policy compared with Value Iteration (VI). The last two columns show the performance of the optimal solutions computed through Value Iteration, highlighting the increasing difficulty of the task.

Map Exp Q-Learning R-MAX QR-MAX QRM R-MAXRM QR-MAXRM Avg. Length (VI) Success Rate (VI %)
Map1 exp1 128,973 59,691 30,376 66,693 21,777 10,964 21.08 66.76
Map1 exp2 201,835 77,797 27,619 129,804 26,232 9,339 39.31 39.40
Map1 exp3 355,070 155,347 30,325 122,827 28,682 6,087 40.94 38.68
Map1 exp4 469,313 103,211 29,950 135,119 26,953 6,062 42.23 34.37
Map1 exp5 1,267,449 287,066 38,312 231,870 30,238 4,107 77.86 15.47
Map2 exp1 334,894 213,942 83,593 225,491 74,250 28,729 52.46 54.69
Map2 exp2 326,306 88,636 56,445 220,334 49,577 19,214 53.10 37.92
Map2 exp3 723,051 316,672 62,583 280,763 63,066 12,415 63.04 37.25
Map2 exp4 889,842 196,334 61,775 249,912 47,873 12,445 73.80 28.18
Map2 exp5 2,613,669 723,441 83,076 438,213 65,523 8,396 123.39 14.15
Map3 exp1 501,559 201,731 98,663 297,975 73,832 33,815 55.83 33.32
Map3 exp2 553,517 170,926 99,952 353,658 66,949 33,853 63.76 32.57
Map3 exp3 1,105,417 388,562 110,808 406,122 111,349 21,856 68.69 32.42
Map3 exp4 1,466,896 495,878 112,165 449,594 107,115 21,869 81.91 21.27
Map3 exp5 5,471,046 1,581,301 159,597 891,160 128,950 14,717 156.46 6.55

Paper Figures

Main OfficeWorld configurations used in the paper:

Configuration 1: map0 exp0 Configuration 2: map1 exp5 Configuration 3: map4 exp6

Base OfficeWorld maps used for the 15-configuration Table 4 sweep:

OfficeWorld map1 OfficeWorld map2 OfficeWorld map3

See REPRODUCIBILITY.md for the full workflow.

Citation

Use CITATION.cff for repository metadata. The final IJCAI citation can be added once the camera-ready bibliographic metadata is available.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages