From-scratch experiments on iterative neural Sudoku solvers. See post
- SOTA model: `iters/exp_baseline_lr2e3.py`
- Benchmark: `sapientinc/sudoku-extreme` via `load_dataset(..., split="test")`
- Best result: 98.9% puzzle accuracy at 1024 test-time iterations
- Architecture: 4-layer shared-weight transformer, 2D RoPE, ~800K params
- Training setup: BS=2048, LR=2e-3, 16 training iterations, cosine decay, reverse curriculum
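For reference, cosine decay from the peak LR of 2e-3 can be sketched as below. This is a minimal sketch, not the repo's scheduler: `cosine_lr`, `total_steps`, and `min_lr` are hypothetical names, and it assumes no warmup and a decay to zero, neither of which is stated above.

```python
import math


def cosine_lr(step, total_steps, peak_lr=2e-3, min_lr=0.0):
    """Cosine decay from peak_lr to min_lr over total_steps.

    Assumption: no warmup phase and min_lr=0.0; the actual training
    script may differ on both.
    """
    # Clamp progress to [0, 1] so steps past the end stay at min_lr.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Learning rate at the start, midpoint, and end of a 1000-step run.
for step in (0, 500, 1000):
    print(step, cosine_lr(step, total_steps=1000))
```

The schedule starts at the peak, passes through half the peak at the midpoint, and reaches `min_lr` on the final step.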
The headline number is a best-run result from an unseeded training run. Evaluation from a fixed checkpoint uses deterministic test subsampling in iters/eval_more_iters.py.
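Deterministic subsampling means the same test puzzles are picked on every evaluation run. A minimal sketch of the idea, assuming a seeded local RNG; `subsample_indices` and its parameters are hypothetical and are not taken from `iters/eval_more_iters.py`:

```python
import random


def subsample_indices(n_total, n_keep, seed=0):
    """Return a reproducible, sorted subset of indices into the test set.

    Hypothetical helper: the seed value and the sampling method are
    assumptions, not the repo's actual choices.
    """
    # A local Random instance leaves the global random state untouched.
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_total), min(n_keep, n_total)))


first = subsample_indices(n_total=1000, n_keep=10)
second = subsample_indices(n_total=1000, n_keep=10)
print(first == second)  # same seed, same subset
```

Because the subset depends only on the seed and the dataset size, accuracy numbers from different checkpoints are computed on identical puzzles.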
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
# Train the current SOTA model
python iters/exp_baseline_lr2e3.py
# Evaluate the trained checkpoint at 1024 test-time iterations
python -c "from iters.eval_more_iters import evaluate; evaluate('model_baseline_lr2e3.pt', exp_module='iters.exp_baseline_lr2e3', iter_counts=[1024])"

If you want the published 98.9% result without retraining:
gh release download baseline-lr2e3-checkpoint --pattern model_baseline_lr2e3.pt
python -c "from iters.eval_more_iters import evaluate; evaluate('model_baseline_lr2e3.pt', exp_module='iters.exp_baseline_lr2e3', iter_counts=[1024])"

Expected result: 24728/25000 solved, or 98.9%, at 1024 test-time iterations.
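To sanity-check the reported figure, here is a small sketch of puzzle-level accuracy. It assumes a puzzle counts as solved only when every cell matches the solution; `puzzle_accuracy` is a hypothetical helper, not the repo's evaluation code.

```python
def puzzle_accuracy(predictions, solutions):
    """Count puzzles whose predicted grid matches the solution exactly.

    Assumption: grids are comparable sequences (e.g. 81-character strings),
    and partial credit for individual cells is not given.
    """
    solved = sum(1 for pred, sol in zip(predictions, solutions) if pred == sol)
    return solved, len(solutions)


# Toy example: the second grid differs in one cell, so only one puzzle counts.
print(puzzle_accuracy(["123", "456"], ["123", "457"]))  # (1, 2)

# Arithmetic check on the reported numbers.
solved, total = 24728, 25000
print(f"{solved}/{total} = {100 * solved / total:.1f}%")  # 24728/25000 = 98.9%
```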
The core training code is provider-agnostic. For Modal:
pip install modal
modal token new
# Train on Modal and keep the job alive if your client disconnects
modal run --detach modal_run.py --exp iters.exp_baseline_lr2e3
# Inspect outputs on the volume
modal volume ls sudoku-outputs
modal volume get sudoku-outputs model_baseline_lr2e3.pt .
# Evaluate the saved model
modal run modal_eval.py --exp iters.exp_baseline_lr2e3 --model model_baseline_lr2e3.pt --iters 1024

Experiments must expose `train(output_dir=".")`. Modal-specific deps are in `requirements-modal.txt`.
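The `train(output_dir=".")` contract is what lets a thin wrapper launch any experiment by its dotted module name. A minimal sketch of such a launcher, assuming nothing about how `modal_run.py` is actually written; `run_experiment` is a hypothetical name:

```python
import importlib


def run_experiment(exp_module, output_dir="."):
    """Import an experiment by dotted name and call its train(output_dir=...).

    Hypothetical launcher for illustration only; the real wrapper may
    handle arguments, volumes, and errors differently.
    """
    module = importlib.import_module(exp_module)
    train = getattr(module, "train", None)
    if not callable(train):
        raise AttributeError(f"{exp_module} does not expose a callable train()")
    # Pass output_dir by keyword so the experiment decides where files go.
    return train(output_dir=output_dir)
```

With the repo on the import path, `run_experiment("iters.exp_baseline_lr2e3", output_dir="outputs")` would then start training and write into `outputs`.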
Main figure entry points:
# Attention maps, confidence evolution, entropy, head specialization
python viz/visualize.py model_baseline_lr2e3.pt --exp iters.exp_baseline_lr2e3 --device cuda --n-iters 32
# Iteration-scaling summary plots from the recorded experiment tables
python viz/plot_iteration_scaling.py
# Collapse diagnostics comparing multiple checkpoints
python viz/plot_collapse_diagnostics.py \
model_baseline_lr2e3.pt model_baseline_lr3e3.pt model_baseline_lr1e3.pt \
--exps iters.exp_baseline_lr2e3 iters.exp_baseline_lr3e3 iters.exp_baseline_lr1e3 \
--output-dir viz/output

Outputs go to viz/output/.
For GPU-backed collapse diagnostics on Modal:
modal run --detach viz/modal_viz.py
modal volume get sudoku-outputs viz_diagnostics/ viz/output/

- `iters/exp_baseline_lr2e3.py` - current SOTA training script
- `iters/eval_more_iters.py` - canonical evaluation across test-time iteration counts
- `analyze_failures_new.py` - per-iteration failure analysis for current models
- `checkpoint_utils.py` - checkpoint discovery and config-checked resume
- `modal_run.py` - minimal Modal training wrapper
- `modal_eval.py` - Modal wrapper for `iters/eval_more_iters.py`
- `modal_analyze.py` - Modal wrapper for analysis utilities
- `viz/visualize.py` - attention/confidence/head-specialization figures for current models
- `viz/plot_collapse_diagnostics.py` - hidden-state and prediction-stability diagnostics
- `viz/plot_iteration_scaling.py` - static summary plots from the documented scaling tables
- `iters/EXPERIMENTS_ITERS.md` - current source of truth for iteration-scaling results

| Model | Params | Training Time | Accuracy |
|---|---|---|---|
| exp_baseline_lr2e3 (1024 test iters) | 800K | ~2h40m (H200) | 98.9% |
| exp_baseline_lr2e3 (16 test iters) | 800K | ~2h40m (H200) | 81.8% |
| TRM (reference) | 7M | ~18h (L40S) | ~87% |
The model is Sudoku-agnostic in the sense that it only assumes a 2D grid: there is no row/col/box constraint embedding, just 2D RoPE in attention. Running more test-time iterations than were used during training is the key result: a model trained with 16 iterations scales cleanly to 1024 evaluation iterations (81.8% at 16 test iterations, 98.9% at 1024). Full scaling tables, stability analysis, interventions, and ablations live in `iters/EXPERIMENTS_ITERS.md`.
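To make "just 2D RoPE" concrete, here is a sketch of one common axial formulation: the first half of a head vector is rotated by the row index and the second half by the column index. This is an illustration under that assumption, not the repo's implementation; `rope_1d`, `rope_2d`, and the `base` of 10000 are hypothetical choices.

```python
import math


def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive pairs of vec by angles pos * base**(-2k/d)."""
    d = len(vec)
    assert d % 2 == 0, "need an even number of dimensions"
    out = []
    for i in range(0, d, 2):  # i = 2k, so the exponent is -i/d
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def rope_2d(vec, row, col, base=10000.0):
    """Axial 2D RoPE: first half encodes the row, second half the column."""
    assert len(vec) % 4 == 0, "each half needs an even number of dimensions"
    half = len(vec) // 2
    return rope_1d(vec[:half], row, base) + rope_1d(vec[half:], col, base)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.5, -1.0, 2.0, 0.25, 1.5, -0.5, 0.0, 1.0]
k = [1.0, 0.5, -0.75, 2.0, -1.0, 0.25, 0.5, -2.0]

# Both pairs of cells are offset by (+3 rows, +4 cols), so the scores match.
score_a = dot(rope_2d(q, 2, 3), rope_2d(k, 5, 7))
score_b = dot(rope_2d(q, 0, 1), rope_2d(k, 3, 5))
print(abs(score_a - score_b) < 1e-9)  # True
```

The attention score depends only on the row and column offsets between two cells, which is why no explicit row/col/box embedding is needed to give the model a sense of grid geometry.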
- `test_data.py` - comparison helper for loading `test.csv` directly; not the canonical benchmark path
- `logs_to_tensorboard.py` - historical log conversion helper
- `tensorboard_utils.py` - lightweight TensorBoard logger used by a few older experiments
- `viz/` - plotting and visualization scripts for model behavior

Older Kaggle and pre-sudoku-extreme experiments are preserved for reference, but they are not the current public path:
- `STALE_EXPERIMENTS_DOC.md` - archived chronological experiment log
- `arch/`, `recur/`, `curriculum/`, `misc/` - older experiment families
- `pos_embedding/EXPERIMENTS_POS.md` - 2D RoPE introduction and positional-encoding ablations
- `muon/EXPERIMENTS_MUON.md` - Muon optimizer experiments
- `rrn/RRN_EXPERIMENTS.md` - RRN experiments
- root-level scripts such as `eval_extreme.py`, `eval_only.py`, and `eval_difficulties.py` - archival only