This repo hosts a general-purpose ML project template designed for running experiments with different methods and datasets. The codebase revolves around three configs — pipeline_config, eval_config, and management_config — where the first defines a method, the second defines an evaluation scheme, and the third defines output folder structure settings.
This repo is designed with three key principles in mind:
- Trackability and reproducibility: It is not uncommon to feel disconnected from the settings, code implementation, and results of an experiment run. This repo provides meticulous logging, saves carbon copies of its input configs, and even copies the key code pieces directly into the output folder, alongside the results, so that you always know what's being run.
- Just-right modularity: On components where self-containedness is often appreciated (e.g., methods), we offer slight code duplication for readability; yet, on components where strict fairness is needed (e.g., evaluations), we strive for high modularity.
- Rigorous folder structures as guardrails: We define clear separation between methods, evaluations, etc. This is particularly helpful if you a) have multiple contributors and/or b) are a fan of AI-assisted coding, where the AI service is often intelligent enough to realize your intended goal but might not implement it the way you envisioned. This structure provides a quick sanity check on whether the AI is on track. Similarly, this
`README.md` might also be worth adding to AI instruction docs like `CLAUDE.md` or be made into a Claude Skill.
```
.
├── configs/
│   ├── global_setting.py          # Global settings (SEED, timezone, etc.)
│   ├── eval_config/               # Evaluation configs
│   │   └── <dataset>/
│   │       └── <config>.json      # e.g., default.json
│   ├── pipeline_config/           # Pipeline configs (one per model)
│   │   └── <model>.json
│   └── management_config/         # Output folder structure settings
│       └── default.json
│
├── pipeline/                      # Method implementations
│   ├── pipeline_utils.py          # Arg parsing, config loading, shared utilities
│   └── <method>/                  # e.g., vanilla_huggingface
│       ├── main.py                # Entry point
│       ├── inference.py           # Model loading, batch generation
│       └── <dataset>/             # Dataset-specific eval logic
│           └── eval.py
│
├── eval/                          # Evaluation utilities
│   ├── eval_utils.py              # Generic metrics (exact_match, etc.)
│   └── <dataset>/                 # Dataset-specific utils
│       ├── main.py                # Prepare inputs
│       └── <dataset>_utils.py     # Format prompts, extract answers
│
├── data/                          # Raw datasets
│   └── <dataset>/
│       └── raw_data.json
│
├── scripts/                       # Bash scripts to run experiments
│   ├── <method>/                  # Full experiment scripts
│   │   └── <model>/
│   │       └── <dataset>.sh
│   └── quick_start/               # Quick verification scripts
│       └── <method>/
│           └── <model>/
│               └── <dataset>.sh
│
├── utils/                         # Shared utilities
│   ├── config_utils.py            # Result registration, output saving
│   ├── general_utils.py           # Seed locking, misc helpers
│   └── logger_utils.py            # Logger setup
│
├── requirements/                  # Python dependencies
│   └── basic_requirements.txt
│
├── example_output/                # Example experiment outputs (for reference)
│   └── <dataset>/<method>/<model>/
│       ├── input_configs/         # Carbon copies of input configs
│       ├── raw_results/           # Fine-grain results
│       ├── backup/                # Code snapshot at experiment time
│       ├── exp.log                # Real-time log file
│       └── output.json            # Main results with processed_results
│
└── output/                        # Experiment outputs (generated, gitignored)
    └── <dataset>/<method>/<model>/
        └── ...                    # Same structure as example_output
```
```bash
pip install -r requirements/basic_requirements.txt
```
Optionally, set up `configs/global_setting.py` with customized information like timezone, external storage dirs, access tokens, etc. Consider running `git update-index --skip-worktree configs/global_setting.py` so that Git no longer tracks this file, avoiding your locally stored tokens accidentally getting synced upstream.
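For reference, a hypothetical `configs/global_setting.py` could look like the sketch below; variable names beyond those mentioned elsewhere in this README (`SEED`, `dataset_dir`, `default_output_dir`, timezone) are placeholders, not the file's actual contents.

```python
# configs/global_setting.py -- hypothetical sketch; adapt to the actual file
SEED = 42                         # global random seed used by the seed-locking helpers
TIMEZONE = "America/New_York"     # timezone used for timestamps in logs/output.json
dataset_dir = "data"              # root folder holding <dataset>/raw_data.json
default_output_dir = "output"     # used when --output_folder_dir is not passed
HF_TOKEN = ""                     # access token kept local; see the skip-worktree tip above
```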
We supply all scripts in the `scripts` folder, with a folder structure that clearly indicates which script belongs to which experiment. E.g., to run the `vanilla_huggingface` method on the `dummy_pokemon_qa` dataset (a dummy, 10-question dataset I made out of this Quiiiz post) with the Qwen3-0.6B model:
```bash
bash scripts/quick_start/vanilla_huggingface/Qwen3-0.6B/dummy_pokemon_qa.sh <output_folder_root_dir>
```
Scripts under `scripts/quick_start` are designed to conclude quickly. They serve as proxies to confirm the code and environment are (likely) running fine before committing to bigger runs. For full experiments, add scripts following a similar structure, e.g., `scripts/<method>/<model>/<dataset>.sh`.
The pipeline entry points (pipeline/<method>/main.py) accept the following arguments:
| Argument | Required | Default | Description |
|---|---|---|---|
| `--exp_desc` | No | `experiment` | Experiment description for logging |
| `--pipeline_config_dir` | Yes | — | Path to pipeline config JSON |
| `--eval_config_dir` | Yes | — | Path to eval config JSON |
| `--management_config_dir` | Yes | — | Path to management config JSON |
| `--output_folder_dir` | No | `default_output_dir` from `global_setting.py` | Path to output folder |
| `--overwrite` | No | `allowed` | Overwrite behavior if output folder exists |
| `--job_post_via` | No | `terminal` | Job submission method |
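For orientation, here is a minimal `argparse` sketch matching the table above; the real parser in `pipeline/pipeline_utils.py` may differ in details.

```python
import argparse

def parse_args():
    parser = argparse.ArgumentParser(description="Pipeline entry-point arguments (sketch)")
    parser.add_argument("--exp_desc", default="experiment",
                        help="Experiment description for logging")
    parser.add_argument("--pipeline_config_dir", required=True,
                        help="Path to pipeline config JSON")
    parser.add_argument("--eval_config_dir", required=True,
                        help="Path to eval config JSON")
    parser.add_argument("--management_config_dir", required=True,
                        help="Path to management config JSON")
    parser.add_argument("--output_folder_dir", default=None,
                        help="Falls back to default_output_dir from global_setting.py")
    parser.add_argument("--overwrite", choices=["allowed", "disabled"], default="allowed",
                        help="Overwrite behavior if output folder exists")
    parser.add_argument("--job_post_via", choices=["terminal", "slurm_sbatch"], default="terminal",
                        help="Job submission method")
    return parser.parse_args()
```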
The `--overwrite` flag controls behavior when the output folder already exists with contents:

| Value | Behavior |
|---|---|
| `allowed` (default) | Logs a warning and proceeds, overwriting existing files |
| `disabled` | Fails with `FileExistsError` if `<output_folder_dir>` already has files |

If `<output_folder_dir>` exists but is empty, the pipeline proceeds regardless of this setting.
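A minimal sketch of the check this setting implies (the actual pipeline code may organize it differently):

```python
import logging
import os

def check_output_folder(output_folder_dir: str, overwrite: str = "allowed") -> None:
    """Enforce the --overwrite policy: missing or empty folders always pass."""
    if os.path.isdir(output_folder_dir) and os.listdir(output_folder_dir):
        if overwrite == "disabled":
            raise FileExistsError(f"{output_folder_dir} already has files")
        logging.warning("Output folder %s has files; proceeding and overwriting.", output_folder_dir)
```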
The `--job_post_via` flag indicates how the job was submitted, affecting metadata logging:

| Value | Behavior |
|---|---|
| `terminal` (default) | Standard terminal execution |
| `slurm_sbatch` | Registers SLURM job info (job ID, node, etc.) in config |
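As a sketch, registering SLURM info could be as simple as copying the standard SLURM environment variables into the config; the function name and config keys below are illustrative, not the repo's exact schema.

```python
import os

def register_job_info(config: dict, job_post_via: str = "terminal") -> dict:
    """Attach submission metadata to the config before it is saved alongside results."""
    config["job_post_via"] = job_post_via
    if job_post_via == "slurm_sbatch":
        config["slurm"] = {
            "job_id": os.environ.get("SLURM_JOB_ID"),
            "node_list": os.environ.get("SLURM_JOB_NODELIST"),
        }
    return config
```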
Once an experiment is running (assuming you use `configs/management_config/default.json` as the `management_config`), you can monitor real-time printouts in the terminal as well as in the `exp.log` file under `<output_folder_dir>`. Once the experiment concludes, the final results can be found in the same `<output_folder_dir>` folder.
Under `<output_folder_dir>`, one may expect the following components:
- `input_configs/` folder: Contains `input_pipeline_config.json`, `input_eval_config.json`, and `input_management_config.json`. These are carbon copies of the configs supplied to the script, copied here for easy replication since these configs essentially define an experiment.
- `backup/` folder: Contains all files defined under the `backup_scope` key of the passed `management_config`. This directly connects the experiment with the code implementation it ran on. The backup logic: only files/directories listed in `inclusion_list` are backed up. When backing up a directory, its contents are filtered by `exclusion_list`: any path matching an exclusion pattern (as a prefix like `configs/` or a path component like `__pycache__/`) is skipped. The one exception: a path listed exactly in `inclusion_list` is always included, even if it matches an exclusion pattern (see the sketch after this list).
- `output.json`: This file fuses the above input configs with some management information (e.g., start/end time of the job, arguments passed, etc.). Most importantly, it highlights the main reported metrics under the key `processed_results`. We purposely store the main results alongside the configs so that there is no mistake in connecting results with their experiment settings.
- `raw_results/raw_results.json`: This file registers the fine-grained results of the concluded experiment, including the individual score of each output and all newly generated tokens for each input, for monitoring/debugging purposes. The `raw_results/` folder is also the place to dump additional materials for storage/inspection, though not necessarily something you'd check for every experiment.
- `exp.log`: A carbon copy of the real-time printouts to the terminal, updated in real time during the experiment run.
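For clarity, the inclusion/exclusion rule above can be sketched as a small helper; the function name is hypothetical and the real backup code may be organized differently.

```python
def is_backed_up(path: str, inclusion_list: list[str], exclusion_list: list[str]) -> bool:
    """Decide whether a path discovered under backup_scope should be copied into backup/."""
    # A path listed exactly in inclusion_list always wins, even against exclusions.
    if path in inclusion_list:
        return True
    for pattern in exclusion_list:
        if path.startswith(pattern):                  # prefix match, e.g., "configs/"
            return False
        if pattern.rstrip("/") in path.split("/"):    # path-component match, e.g., "__pycache__/"
            return False
    # Otherwise, back it up only if it is (or sits under) an included file/directory.
    return any(path == inc or path.startswith(inc.rstrip("/") + "/") for inc in inclusion_list)
```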
Our codebase revolves around three configs, `pipeline_config`, `eval_config`, and `management_config`: the first defines a method, the second an evaluation scheme, and the third the output folder structure. Their implementations live under the `pipeline` and `eval` folders, respectively. We keep the eval implementation a "singleton" for each dataset to ensure it is applied fairly across different pipelines. For pipeline implementations, however, we intentionally repeat some code components: we want each `pipeline/<method>/` folder to stay relatively self-contained for better readability, and everything loosely coupled to support intervention requests coming from different angles.
Generally speaking:
- For modules that can be reused without hurting necessary self-containedness, they are abstracted into a `_utils.py` file, like `pipeline/pipeline_utils.py`, `eval/eval_utils.py`, etc.
- Every task should be executed by running the `main.py` file under a certain `pipeline/<method>/` folder.
(We provide an exemplary experiment output under the example_output folder for your reference.)
- Add raw data to `dataset_dir` under `configs/global_setting.py`.
- Create a new `configs/eval_config/<dataset>/default.json` or something similarly reflective. Define moving factors (like prompt templates) under the `prompts` key rather than hardcoding them in code.
- Create `eval/<dataset>/` with `main.py` (prepare inputs) and `<dataset>_utils.py` (format prompts using templates from config, extract answers, etc.); a sketch follows this list.
- Create `pipeline/<method>/<dataset>/eval.py` for each method you'd care to evaluate this dataset against.
- Register the dataset in `pipeline/<method>/main.py`.
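As an illustrative sketch (the dataset name and function names below are placeholders, not existing code), the dataset utilities might expose prompt formatting and answer extraction like this:

```python
# eval/my_dataset/my_dataset_utils.py -- illustrative sketch, not a file shipped with the repo
def format_prompt(question: str, prompt_template: str) -> str:
    """Fill a prompt template taken from the eval config's "prompts" key."""
    return prompt_template.format(question=question)

def extract_answer(model_output: str) -> str:
    """Pull the final answer out of a raw generation; real logic is dataset-specific."""
    lines = model_output.strip().splitlines()
    return lines[-1] if lines else ""
```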
- Create a `pipeline/<method>/` folder with `main.py` (entry point); a sketch of such an entry point follows this list. Note this `<method>/` folder can sit at different layers, e.g., we may have `pipeline/quantization/kv_cache/kivi/`. This allows for different levels of `_utils.py` modularization.
- Optionally, create `inference.py` (model loading, generation) under `pipeline/<method>/` if it is an inference-only task. If it is a training job, a `train.py` is recommended to host the main training loop, etc.
- Add `<method>/<dataset>/eval.py` files for each dataset you'd care to train/evaluate against.
- Create `configs/pipeline_config/<method>/<model>/<dataset>.json` for each task configuration.
- Add scripts to `scripts/`, typically following the `<method>/<model>/<dataset>.sh` structure.
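As a closing illustration, a new `pipeline/<method>/main.py` might wire these pieces together roughly as below; the `parse_args` import and the `dataset` config key are assumptions, not guaranteed parts of the existing code.

```python
# pipeline/my_method/main.py -- illustrative sketch of an entry point (names are placeholders)
import json

from pipeline.pipeline_utils import parse_args  # assumed shared arg-parsing helper

def main():
    args = parse_args()
    with open(args.pipeline_config_dir) as f:
        pipeline_config = json.load(f)
    with open(args.eval_config_dir) as f:
        eval_config = json.load(f)
    # pipeline_config/eval_config would then be handed to inference.py and to the
    # dataset-specific logic registered under pipeline/my_method/<dataset>/eval.py.
    dataset = eval_config.get("dataset", "<dataset>")  # assumed key; check the real config schema
    print(f"[{args.exp_desc}] running {dataset} with pipeline config {args.pipeline_config_dir}")

if __name__ == "__main__":
    main()
```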