This repository contains the reproducibility package for:
Model-Based Reinforcement Learning in Discrete-Action Non-Markovian Reward Decision Processes
Paper: https://arxiv.org/abs/2512.14617
Reusable implementation code lives in the companion library:
https://github.com/Alee08/multiagent-rl-rm
qrmax keeps only the experiment-facing pieces: pinned dependency metadata,
OfficeWorld experiment matrices, launchers, reproducibility notes, and
post-processing scripts.
This repository is pinned to the companion-library commit:
- package: multiagent-rl-rm
- package version: 0.3.0
- OfficeWorld IJCAI tag: v0.3.0-ijcai2026
- pinned commit: c1c379ce57573f075de7c25e346d665b4475ded3
The default requirements.txt installs the companion library from this frozen
commit. The pinned commit includes the OfficeWorld IJCAI code, the
continuous-line checks, the continuous-corridor event-aligned and
transition-probed Bucket QR-MAX modes, and the continuous Frozen Lake
comparison suites.
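
After installation, one quick way to confirm that the environment picked up the pinned companion library is to compare the installed version with the pin. The sketch below assumes the pip distribution name is the package name listed above (`multiagent-rl-rm`); the helper `check_pin` is hypothetical and is not part of this repository. A matching version string does not prove that the exact pinned commit is installed.

```python
from importlib import metadata

# Values copied from the pin list in this README.
PINNED_DIST = "multiagent-rl-rm"
PINNED_VERSION = "0.3.0"


def check_pin(dist_name: str = PINNED_DIST, expected: str = PINNED_VERSION) -> str:
    """Return a one-line status for the installed distribution."""
    try:
        installed = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return f"{dist_name}: not installed"
    if installed == expected:
        return f"{dist_name}: {installed} (matches pin)"
    return f"{dist_name}: {installed} (expected {expected})"


if __name__ == "__main__":
    print(check_pin())
```

Running this inside the activated virtual environment prints a single status line for the pinned package.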

Install the pinned dependencies in a virtual environment:

    python -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt

Validate configured suite sizes:

    python scripts/validate_config.py

Dry-run the smoke suite:

    python scripts/reproduce_officeworld.py --suite smoke --dry-run

Run a short smoke experiment:

    python scripts/reproduce_officeworld.py --suite smoke

Validate the continuous-line Bucket QR-MAX suites:

    python scripts/validate_continuous_line_config.py

Validate the continuous-corridor Bucket QR-MAX suites:

    python scripts/validate_continuous_corridor_config.py

Validate the continuous-FrozenLake Bucket QR-MAX suites:

    python scripts/validate_continuous_frozen_lake_config.py

Run the continuous-line smoke experiment:

    python scripts/reproduce_continuous_line.py --suite continuous_line_smoke

Run the continuous-corridor smoke experiment:

    python scripts/reproduce_continuous_corridor.py --suite continuous_corridor_smoke

Run the continuous-FrozenLake smoke experiment:

    python scripts/reproduce_continuous_frozen_lake.py --suite continuous_frozen_lake_smoke

Suites are defined in configs/officeworld_discrete.json.

| Suite | Runs | Purpose |
|---|---|---|
| smoke | 1 | Fast local/CI sanity check. |
| paper_main | 300 | Three main OfficeWorld configurations. |
| paper_table6 | 500 | Five configurations summarized in the paper table. |
| paper_appendix_15 | 4500 | Appendix sweep over map1-map3, exp1-exp5, 30 seeds. |
| officeworld_discrete | 2100 | Full encoded OfficeWorld sweep. |

Run a suite:

    python scripts/reproduce_officeworld.py --suite paper_main

Filter a larger suite without editing JSON:

    python scripts/reproduce_officeworld.py \
        --suite officeworld_discrete \
        --algorithms QRMAX QRMAXRM \
        --maps map1 map2 \
        --seeds 0 1 2

The configured algorithm identifiers are the ones exposed by the frozen
OfficeWorld runner: QL, QRM, RMAX, RMAXRM, QRMAX, QRMAXRM,
UCBVI-sB, UCBVI-B, UCBVI-H, and OPSRL.
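
To make the relation between a suite's axes and its run count concrete, the sketch below enumerates a run matrix as a Cartesian product and then applies the same kind of filter as the command-line flags. The function names, the dictionary layout, and the chosen axes are assumptions for illustration; the actual JSON schema and the launcher's internals may differ. Under these assumed axes (ten algorithms, three maps, five experiments, 30 seeds) the product has 4500 entries, which equals the Runs value listed for paper_appendix_15, but this README does not confirm that the suite is built this way.

```python
from itertools import product

# Algorithm identifiers as listed in this README.
ALGORITHMS = ["QL", "QRM", "RMAX", "RMAXRM", "QRMAX", "QRMAXRM",
              "UCBVI-sB", "UCBVI-B", "UCBVI-H", "OPSRL"]


def expand_runs(algorithms, maps, experiments, seeds):
    """Enumerate one run per (algorithm, map, experiment, seed) combination."""
    return [
        {"algorithm": a, "map": m, "experiment": e, "seed": s}
        for a, m, e, s in product(algorithms, maps, experiments, seeds)
    ]


def filter_runs(runs, algorithms=None, maps=None, seeds=None):
    """Keep runs matching every filter that is given; None means no filter."""
    def keep(run):
        return ((algorithms is None or run["algorithm"] in algorithms)
                and (maps is None or run["map"] in maps)
                and (seeds is None or run["seed"] in seeds))
    return [r for r in runs if keep(r)]


# Hypothetical axes: 10 algorithms x 3 maps x 5 experiments x 30 seeds.
maps = ["map1", "map2", "map3"]
experiments = [f"exp{i}" for i in range(1, 6)]
runs = expand_runs(ALGORITHMS, maps, experiments, range(30))
print(len(runs))

# Same selection as the filtered command: two algorithms, two maps, three seeds.
subset = filter_runs(runs, algorithms={"QRMAX", "QRMAXRM"},
                     maps={"map1", "map2"}, seeds={0, 1, 2})
print(len(subset))
```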
The companion library also exposes a small continuous-state NMRDP sanity check
for Bucket QR-MAX. The task uses a one-dimensional continuous state, two
discrete actions, and a Reward Machine sequence A -> B.
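
As a rough illustration of what a Reward Machine sequence A -> B means, the sketch below tracks progress through an ordered list of events and pays a reward only when the last one is reached in order. The class name, the reward value of 1.0, and the rule that out-of-order events are ignored are assumptions for this example; the companion library's Reward Machine API is not shown here and may differ.

```python
from dataclasses import dataclass, field


@dataclass
class SequenceRewardMachine:
    """Reward Machine that pays out once the events occur in order."""
    sequence: tuple = ("A", "B")
    final_reward: float = 1.0      # assumed reward value
    state: int = field(default=0)  # index of the next event to observe

    def reset(self) -> None:
        self.state = 0

    @property
    def done(self) -> bool:
        return self.state == len(self.sequence)

    def step(self, event):
        """Advance on the expected event; return the reward for this step."""
        if self.done or event != self.sequence[self.state]:
            return 0.0  # irrelevant or out-of-order events leave the state unchanged
        self.state += 1
        return self.final_reward if self.done else 0.0


rm = SequenceRewardMachine()
rewards = [rm.step(e) for e in ["B", None, "A", "B"]]
print(rewards, rm.done)
```

The first visit to B earns nothing because A has not been observed yet, while the second visit completes the sequence. The reward therefore depends on the event history rather than on the current position alone, which is what makes the task non-Markovian in the environment state.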
Suites are defined in configs/continuous_line_bucket_qrmax.json.
| Suite | Runs | Purpose |
|---|---|---|
| continuous_line_smoke | 1 | Fast event-aware QR-MAX sanity check. |
| continuous_algorithm_comparison | 3 | Compare QR-MAX, Q-learning, and R-MAX. |
| continuous_bucket_sweep | 2 | Sweep successful bucket granularities for QR-MAX. |
| continuous_noise_sweep | 4 | Sweep transition noise for QR-MAX. |

Run the QR-MAX bucket sweep:

    python scripts/reproduce_continuous_line.py --suite continuous_bucket_sweep

The included reference sweep is
paper_results/continuous_line_bucket_qrmax_reference.csv.
The companion library also exposes a 2D continuous-state corridor check for
Bucket QR-MAX. The task uses state (x, y), four discrete actions, low
transition noise, and a Reward Machine sequence A -> B.
Suites are defined in configs/continuous_corridor_bucket_qrmax.json.
| Suite | Runs | Purpose |
|---|---|---|
| continuous_corridor_smoke | 1 | Fast event-aware QR-MAX sanity check. |
| continuous_corridor_algorithm_comparison | 4 | Compare Q-learning, QRM, R-MAX, and QR-MAX. |
| continuous_corridor_hard_algorithm_comparison | 4 | Hard reset/noise algorithm comparison. |
| continuous_corridor_bottleneck_smoke | 1 | Deterministic bottleneck sanity check. |
| continuous_corridor_bottleneck_transition_probed | 1 | Noisy bottleneck solved with transition-probed buckets. |
| continuous_corridor_abc_algorithm_comparison | 3 | A-B-C sequence comparison with event-aligned buckets. |

Run the QR-MAX corridor algorithm comparison:

    python scripts/reproduce_continuous_corridor.py --suite continuous_corridor_algorithm_comparison

Run the A-B-C stress comparison:

    python scripts/reproduce_continuous_corridor.py --suite continuous_corridor_abc_algorithm_comparison

Run the noisy bottleneck transition-probed suite:

    python scripts/reproduce_continuous_corridor.py --suite continuous_corridor_bottleneck_transition_probed

The included reference sweep is
paper_results/continuous_corridor_bucket_qrmax_reference.csv.
The companion library also exposes a continuous-coordinate Frozen Lake check.
The task keeps the map1 Frozen Lake layout and Reward Machine sequence
A -> B -> C, but the agent state is continuous (x, y) and is mapped to
finite buckets for the tabular algorithms.
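
The mapping from a continuous (x, y) position to a finite bucket can be sketched with a uniform grid. This is a minimal illustration, assuming a unit-square domain and a 4 x 4 grid; the function names and grid size are hypothetical and are not taken from the companion library.

```python
from bisect import bisect_right


def uniform_edges(low: float, high: float, n_buckets: int) -> list:
    """Interior edges that split [low, high] into n_buckets equal intervals."""
    width = (high - low) / n_buckets
    return [low + i * width for i in range(1, n_buckets)]


def bucket_index(value: float, edges: list) -> int:
    """Index of the interval containing value, given sorted interior edges."""
    return bisect_right(edges, value)


def bucket_state(x: float, y: float, x_edges: list, y_edges: list) -> tuple:
    """Map a continuous (x, y) position to a discrete (column, row) bucket."""
    return bucket_index(x, x_edges), bucket_index(y, y_edges)


# Hypothetical 4 x 4 grid over the unit square.
x_edges = uniform_edges(0.0, 1.0, 4)
y_edges = uniform_edges(0.0, 1.0, 4)
print(bucket_state(0.10, 0.60, x_edges, y_edges))
print(bucket_state(0.99, 0.25, x_edges, y_edges))
```

Because only interior edges are stored, positions outside the domain fall into the first or last bucket instead of raising an error. A tabular learner can then use the bucket pair, together with the Reward Machine state, as its discrete state.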
Suites are defined in configs/continuous_frozen_lake_bucket_qrmax.json.
| Suite | Runs | Purpose |
|---|---|---|
| continuous_frozen_lake_smoke | 1 | Fast QR-MAX sanity check. |
| continuous_frozen_lake_abc_algorithm_comparison | 4 | Compare Q-learning, QRM, R-MAX, and QR-MAX on deterministic A-B-C. |
| continuous_frozen_lake_light_noise_comparison | 3 | Compare Q-learning, QRM, and QR-MAX with light continuous transition noise. |

Run the A-B-C algorithm comparison:

    python scripts/reproduce_continuous_frozen_lake.py --suite continuous_frozen_lake_abc_algorithm_comparison

Run the light-noise comparison:

    python scripts/reproduce_continuous_frozen_lake.py --suite continuous_frozen_lake_light_noise_comparison

The included reference sweep is
paper_results/continuous_frozen_lake_bucket_qrmax_reference.csv.
The continuous-corridor reference CSV includes controlled stress cases. The hard reset/noise and A-B-C settings use event-aligned buckets, which add bucket edges at RM event boundaries. The noisy bottleneck setting uses transition-probed buckets: the runner probes one-step transition outcomes, refines recurring dynamics boundaries by bisection, and trains QR-MAX on the resulting finite abstraction.
These modes avoid relying on a fixed uniform grid. QRM is included as a tabular Reward Machine baseline; deep baselines such as DQN are not part of the pinned reproducibility package because they require a separate neural training stack and tuning protocol.
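
The two bucket modes described above can be sketched in one dimension. The first helper merges uniform edges with event-region boundaries, which is the idea behind event-aligned buckets. The second locates a change in one-step behaviour by bisection, which is the idea behind transition-probed buckets. Everything here is an assumption made for illustration: the function names, the deterministic `outcome` probe, the event region [0.30, 0.40], and the dynamics change at x = 0.62 are hypothetical, and the pinned runner probes noisy transitions, so its procedure is likely more involved.

```python
def event_aligned_edges(uniform, event_boundaries):
    """Merge uniform bucket edges with event-region boundaries, sorted and deduplicated."""
    return sorted(set(uniform) | set(event_boundaries))


def bisect_boundary(outcome, lo, hi, tol=1e-6):
    """Locate a point where outcome(x) changes between lo and hi by bisection.

    outcome is a hypothetical probe returning a hashable label for the
    one-step behaviour at x; outcome(lo) and outcome(hi) must differ.
    """
    lo_label = outcome(lo)
    if lo_label == outcome(hi):
        raise ValueError("outcome does not change on [lo, hi]")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if outcome(mid) == lo_label:
            lo = mid   # change lies to the right of mid
        else:
            hi = mid   # change lies at or to the left of mid
    return (lo + hi) / 2.0


# Hypothetical 1D example: event A occupies [0.30, 0.40], and movement is
# blocked from x = 0.62 onward.
edges = event_aligned_edges([0.25, 0.5, 0.75], [0.30, 0.40])
boundary = bisect_boundary(lambda x: "blocked" if x >= 0.62 else "free", 0.5, 0.75)
edges = sorted(edges + [round(boundary, 3)])
print(edges)
```

With the extra edges in place, no bucket straddles an event boundary or the located dynamics boundary, so a single bucket never mixes states that trigger different events or behave differently under the same action.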
The upstream OfficeWorld runner writes raw logs under results/ and summary
text files named results_<map>_<experiment>.txt. These files are ignored by
Git.
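
For custom post-processing, the summary files can be indexed by map and experiment from their names alone. The sketch below relies only on the results_<map>_<experiment>.txt naming scheme stated above; it assumes that neither name contains an underscore, leaves the directory to the caller, and does not parse file contents, whose format is not documented here. The helper `index_summaries` is hypothetical and is not one of the repository's scripts.

```python
import re
from pathlib import Path

# Pattern derived from the naming scheme results_<map>_<experiment>.txt.
# Assumes neither the map nor the experiment name contains an underscore.
SUMMARY_RE = re.compile(r"^results_(?P<map>[^_]+)_(?P<experiment>[^_]+)\.txt$")


def index_summaries(results_dir):
    """Return {(map, experiment): path} for summary files found in results_dir."""
    found = {}
    for path in sorted(Path(results_dir).glob("results_*.txt")):
        match = SUMMARY_RE.match(path.name)
        if match:
            found[(match["map"], match["experiment"])] = path
    return found
```

Calling `index_summaries(".")` from the directory that holds the summary files returns a dictionary keyed by pairs such as `("map1", "exp1")`.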
Generate a compact aggregate CSV:

    python scripts/summarize_officeworld.py \
        --results-dir results \
        --output paper_results/officeworld_summary.csv

The reported Table 6 OfficeWorld values from the paper are included as
paper_results/table6_officeworld_steps.csv for reference.
The extended 15-configuration OfficeWorld breakdown is included as
paper_results/table4_officeworld_15_configs.csv.
Continuous-line Bucket QR-MAX reference sweeps are included as
paper_results/continuous_line_bucket_qrmax_reference.csv.
Continuous-corridor Bucket QR-MAX reference sweeps are included as
paper_results/continuous_corridor_bucket_qrmax_reference.csv.
Continuous-FrozenLake Bucket QR-MAX reference sweeps are included as
paper_results/continuous_frozen_lake_bucket_qrmax_reference.csv.
Training steps to reach the optimal policy compared with Value Iteration (VI). The last two columns show the performance of the optimal solutions computed through Value Iteration, highlighting the increasing difficulty of the task.
| Map | Exp | Q-Learning | R-MAX | QR-MAX | QRM | R-MAXRM | QR-MAXRM | Avg. Length (VI) | Success Rate (VI %) |
|---|---|---|---|---|---|---|---|---|---|
| Map1 | exp1 | 128,973 | 59,691 | 30,376 | 66,693 | 21,777 | 10,964 | 21.08 | 66.76 |
| Map1 | exp2 | 201,835 | 77,797 | 27,619 | 129,804 | 26,232 | 9,339 | 39.31 | 39.40 |
| Map1 | exp3 | 355,070 | 155,347 | 30,325 | 122,827 | 28,682 | 6,087 | 40.94 | 38.68 |
| Map1 | exp4 | 469,313 | 103,211 | 29,950 | 135,119 | 26,953 | 6,062 | 42.23 | 34.37 |
| Map1 | exp5 | 1,267,449 | 287,066 | 38,312 | 231,870 | 30,238 | 4,107 | 77.86 | 15.47 |
| Map2 | exp1 | 334,894 | 213,942 | 83,593 | 225,491 | 74,250 | 28,729 | 52.46 | 54.69 |
| Map2 | exp2 | 326,306 | 88,636 | 56,445 | 220,334 | 49,577 | 19,214 | 53.10 | 37.92 |
| Map2 | exp3 | 723,051 | 316,672 | 62,583 | 280,763 | 63,066 | 12,415 | 63.04 | 37.25 |
| Map2 | exp4 | 889,842 | 196,334 | 61,775 | 249,912 | 47,873 | 12,445 | 73.80 | 28.18 |
| Map2 | exp5 | 2,613,669 | 723,441 | 83,076 | 438,213 | 65,523 | 8,396 | 123.39 | 14.15 |
| Map3 | exp1 | 501,559 | 201,731 | 98,663 | 297,975 | 73,832 | 33,815 | 55.83 | 33.32 |
| Map3 | exp2 | 553,517 | 170,926 | 99,952 | 353,658 | 66,949 | 33,853 | 63.76 | 32.57 |
| Map3 | exp3 | 1,105,417 | 388,562 | 110,808 | 406,122 | 111,349 | 21,856 | 68.69 | 32.42 |
| Map3 | exp4 | 1,466,896 | 495,878 | 112,165 | 449,594 | 107,115 | 21,869 | 81.91 | 21.27 |
| Map3 | exp5 | 5,471,046 | 1,581,301 | 159,597 | 891,160 | 128,950 | 14,717 | 156.46 | 6.55 |
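
One way to read the table is as a ratio of training steps between a baseline and a method. The sketch below does this for three rows copied from the table above, comparing Q-Learning with QR-MAXRM. The helper `step_ratio` and the choice of rows are illustrative only, and the values are hard-coded instead of loaded from the CSV files because their column names are not documented here.

```python
# Three rows copied from the table above: training steps to reach the
# optimal policy for Q-Learning and QR-MAXRM.
STEPS = {
    ("Map1", "exp1"): {"Q-Learning": 128_973, "QR-MAXRM": 10_964},
    ("Map2", "exp5"): {"Q-Learning": 2_613_669, "QR-MAXRM": 8_396},
    ("Map3", "exp5"): {"Q-Learning": 5_471_046, "QR-MAXRM": 14_717},
}


def step_ratio(row, baseline="Q-Learning", method="QR-MAXRM"):
    """How many times more steps the baseline needed than the method."""
    return row[baseline] / row[method]


for (map_name, exp), row in STEPS.items():
    print(f"{map_name} {exp}: {step_ratio(row):.1f}x")
```

In these rows the ratio grows from roughly one order of magnitude on Map1 exp1 to more than two orders of magnitude on the exp5 configurations.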

[Figure: Main OfficeWorld configurations used in the paper.]

[Figure: Base OfficeWorld maps used for the 15-configuration Table 4 sweep.]

See REPRODUCIBILITY.md for the full workflow.
Use CITATION.cff for repository metadata. The final IJCAI citation can be
added once the camera-ready bibliographic metadata is available.





