A framework for comparing multi-agent PyTorch optimization systems, along with multiple optimization strategy implementations.
Collectively, these components are called PyTorch Inference Kernel Evolution (PIKE).
See the paper preprint here: https://arxiv.org/abs/2511.16964
This is a fork of KernelBench by Anne Ouyang, Simon Guo, and Azalia Mirhoseini. Benchmark additions and modifications are included from KernelBenchFiltered by METR.
This repository contains:
- a refined set of KernelBench benchmarks
- our evaluator setup
- PIKE-B, a multi-agent evolutionary branching strategy for PyTorch optimization
The implementation of PIKE-O, an OpenEvolve-based PyTorch optimization strategy, can be found in the pike-openevolve repository. It uses the evaluator in this repository.
The simplest PIKE setup involves two components:
- A containerized evaluator which runs PyTorch/kernel code on the target GPU
- A script running on the host which sends code for evaluation to the evaluator via the filesystem. This is either:
  - An LLM-driven search script
  - A baseline script to evaluate pre-existing baseline PyTorch/kernel code
Set up your host environment using uv (uv installation guide)
Clone this repository, then do the following:
```sh
uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install -r requirements.txt
uv pip install -e .
```

Save API key environment variables to `~/.bashrc`:
```sh
export OPENAI_API_KEY=<...>
export GEMINI_API_KEY=<...>
export OPENROUTER_API_KEY=<...>
```

Then source the changes via:

```sh
source ~/.bashrc
```

To run the Eval Worker, install Docker and the NVIDIA Container Toolkit (installation guide).
As noted above, running a PIKE search requires two components running simultaneously:
- Eval Worker — runs the evaluator in a container
- Search/Baseline process — The optimization search process, or baseline manager script
Start the Eval Worker first, then run the search in a second terminal.
Try a dry run first to test the host components (does not require the Eval Worker to be running):
```sh
./tools/dry_run.sh
```

If everything worked correctly, you should see figures with bogus data in `data/dry-run/pike-out/h100_level_3-pike/results/figs`.
Ensure Docker and NVIDIA Container Toolkit are installed, then start the containerized Eval Worker (this script will fetch the container from online if not yet installed):
```sh
python -u sandbox/tools/start_worker_container.py --engine docker --arch <Ampere/Hopper> --max-active-tasks 10 --pull-image
```

Increasing `--max-active-tasks` increases task throughput, but we do not recommend going beyond half the available CPU threads, as higher values can increase measurement noise.
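The half-the-CPU-threads rule of thumb can be computed directly. A small sketch (the helper name is ours, not part of the repo):

```python
import os

# Rule of thumb from above: cap --max-active-tasks at half the host's CPU
# threads, since more concurrent tasks can add timing noise to evaluations.
cpu_threads = os.cpu_count() or 1
suggested_max_active_tasks = max(1, cpu_threads // 2)
print(f"--max-active-tasks {suggested_max_active_tasks}")
```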
If you need to restart the worker in the middle of a run, clear any outstanding messages via `rm -rf worker_io`.
This step collects runtimes for the original PyTorch code, allowing calculation of speedups. This step can be run before or after the search, but both steps MUST happen before generating figures.
```sh
python scripts/eval_baselines.py --output-dir data/pike-data --level 3-pike
```

This will take some time (potentially multiple hours), as each task is evaluated sequentially to minimize noise.
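With baseline runtimes in hand, the speedup of a candidate is the baseline runtime divided by the optimized runtime. A minimal sketch; the geometric-mean aggregation is a common benchmarking convention, and whether `generate_figs.py` uses exactly this is an assumption:

```python
import math

def speedup(baseline_ms: float, optimized_ms: float) -> float:
    # Speedup of an optimized candidate over the original PyTorch runtime.
    return baseline_ms / optimized_ms

def geomean_speedup(pairs: list[tuple[float, float]]) -> float:
    # Geometric mean of per-task (baseline, optimized) runtime pairs.
    logs = [math.log(speedup(b, o)) for b, o in pairs]
    return math.exp(sum(logs) / len(logs))

print(speedup(12.0, 3.0))                          # 4.0 (4x faster)
print(geomean_speedup([(2.0, 1.0), (8.0, 1.0)]))   # ≈ 4.0
```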
Both this script and the `run_search.py` script below start an eval HTTP server that is for internal use only. The default port for this internal server is 8000, but it can be adjusted with the `--port` flag (any available port should work fine).
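If port 8000 is already taken, one quick way to pick a free port for `--port` is to bind to port 0 and let the OS choose. This is a hypothetical helper, not part of the repo:

```python
import socket

def find_free_port() -> int:
    # Bind to port 0: the OS assigns an unused ephemeral port, which we
    # read back and release for the eval server to use via --port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]

print(find_free_port())
```

Note the small race between releasing the port and the server binding it; for this internal single-host server that is usually acceptable.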
Keep the Eval Worker running for this. The search process submits PyTorch/kernel code to the evaluator, just like in the Evaluate Baselines step.
Important: use the same --output-dir here as you used for baseline evaluation, so that generate_figs.py can find both the search results and the baseline runtimes in one place.
```sh
python scripts/run_search.py --run-name <run_name> --output-dir data/pike-data --strategy pike-b --level 3-pike --server-type google --model-name gemini-2.5-pro --task-start 1 --task-end 50
```

With the default query budget of 300 per task, this can take 12+ hours and will hit your LLM provider heavily with requests, so use it with caution. Try conservative parameters first, such as a lower query budget (e.g. `--query-budget 50`) or a single task (e.g. `--task-start 13 --task-end 13`).
Set the desired server type (e.g. `google`, `openai`, `openrouter`) and model name (e.g. `gemini-2.5-pro`, `gpt-oss-120b`).
You can select any run name, passed in via `--run-name`. The output for the run will appear in `<output-dir>/full-pike-runs/level_<level>/<run_name>`. If a run fails or you kill it early, it is highly recommended to rename or remove that failed run's directory, or change the `--run-name` value, before restarting the run.
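The run-output layout above can be expressed as a small path helper (illustrative only; the function name is ours):

```python
from pathlib import Path

def run_output_dir(output_dir: str, level: str, run_name: str) -> Path:
    # Mirrors the layout described above:
    # <output-dir>/full-pike-runs/level_<level>/<run_name>
    return Path(output_dir) / "full-pike-runs" / f"level_{level}" / run_name

print(run_output_dir("data/pike-data", "3-pike", "my-run"))
# data/pike-data/full-pike-runs/level_3-pike/my-run
```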
For PIKE-O, pass --strategy pike-o. To use this strategy, you must first run the following pike-openevolve install script:
```sh
python scripts/install_pike_openevolve.py
```

After the search and the baseline evaluation complete, generate figures for the run:
```sh
python scripts/generate_figs.py --input-dir data/pike-data --output-dir data/pike-out
```

The original data from the paper is available here: https://huggingface.co/datasets/knagaitsev/pike-data-compressed
The main original figures from the paper can easily be generated by fetching this data, then running the figure generation script on the data:
```sh
# fetching will take some time and requires ~80 GB on disk
python scripts/fetch_paper_data.py
# the fetch script places the data in data/paper-data
python scripts/generate_figs.py --input-dir data/paper-data/pike-data --output-dir data/paper-data/pike-out --paper
```

The `--paper` option should only be used on the original paper data: it includes only a subset of results in some plots, adds additional markings to plots, and adds a money-budget plot.
Additional documentation is available in the docs/ directory, covering the eval worker, containers, HPC cluster setup, LLM API setup, profiling, and troubleshooting.
For advanced setups (running components separately, remote eval server), see docs/advanced_setup.md.
```bibtex
@misc{nagaitsev2025pike,
  title={Optimizing PyTorch Inference with LLM-Based Multi-Agent Systems},
  author={Kirill Nagaitsev and Luka Grbcic and Samuel Williams and Costin Iancu},
  year={2025},
  eprint={2511.16964},
  archivePrefix={arXiv},
  primaryClass={cs.MA},
  url={https://arxiv.org/abs/2511.16964},
}
```