A framework for comparing multi-agent PyTorch optimization systems, along with multiple optimization strategy implementations.
Collectively, these components are called PyTorch Inference Kernel Evolution (PIKE).
See the paper preprint here: https://arxiv.org/abs/2511.16964
This is a fork of KernelBench by Anne Ouyang, Simon Guo, and Azalia Mirhoseini. Benchmark additions and modifications are included from KernelBenchFiltered by METR.
This repository contains:
- a refined set of KernelBench benchmarks
- our evaluator setup
- PIKE-B, a multi-agent evolutionary branching strategy for PyTorch optimization
The implementation of PIKE-O, an OpenEvolve-based PyTorch optimization strategy, can be found in the pike-openevolve repository. It uses the evaluator in this repository.
The simplest PIKE setup involves two components:
- A containerized evaluator which runs PyTorch/kernel code on the target GPU
- A script running on the host which sends code for evaluation to the evaluator via the filesystem. This is either:
  - An LLM-driven search script
  - A baseline script to evaluate pre-existing baseline PyTorch/kernel code
Set up your host environment using uv (uv installation guide)
Clone this repository, then do the following:
```sh
uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install -r requirements.txt
uv pip install -e .
```

Save API key environment variables to `~/.bashrc`:
```sh
export OPENAI_API_KEY=<...>
export GEMINI_API_KEY=<...>
export OPENROUTER_API_KEY=<...>
```

Then source the changes via:

```sh
source ~/.bashrc
```

To run the Eval Worker, install Docker and the NVIDIA Container Toolkit (installation guide).
As noted above, running a PIKE search requires two components running simultaneously:
- Eval Worker — runs the evaluator in a container
- Search/Baseline process — The optimization search process, or baseline manager script
Start the Eval Worker first, then run the search in a second terminal.
Try a dry run first to test the host components (does not require the Eval Worker to be running):
```sh
./tools/dry_run.sh
```

If everything worked correctly, you should see figures with bogus data in `data/dry-run/pike-out/h100_level_3-pike/results/figs`.
Ensure Docker and NVIDIA Container Toolkit are installed, then start the containerized Eval Worker (this script will fetch the container from online if not yet installed):
```sh
python -u sandbox/tools/start_worker_container.py --engine docker --arch <Ampere/Hopper> --max-active-tasks 10 --pull-image
```

Increasing `--max-active-tasks` increases task throughput, but we do not recommend going beyond half the available CPU threads, as higher values can increase measurement noise.
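The half-the-CPU-threads rule of thumb can be computed directly. A small sketch (the helper name is ours, not part of the repo):

```python
import os

# Rule of thumb from above: cap --max-active-tasks at half the host's CPU
# threads, since more concurrent tasks can add timing noise to evaluations.
cpu_threads = os.cpu_count() or 1
suggested_max_active_tasks = max(1, cpu_threads // 2)
print(f"--max-active-tasks {suggested_max_active_tasks}")
```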
If you need to restart the worker in the middle of a run, clear any outstanding messages via `rm -rf worker_io`.
This step collects runtimes for the original PyTorch code, allowing calculation of speedups. This step can be run before or after the search, but both steps MUST happen before generating figures.
```sh
python scripts/eval_baselines.py --output-dir data/pike-data --level 3-pike
```

This will take some time (potentially multiple hours), as each task is evaluated sequentially to minimize noise.
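With baseline runtimes in hand, the speedup of a candidate is the baseline runtime divided by the optimized runtime. A minimal sketch; the geometric-mean aggregation is a common benchmarking convention, and whether `generate_figs.py` uses exactly this is an assumption:

```python
import math

def speedup(baseline_ms: float, optimized_ms: float) -> float:
    # Speedup of an optimized candidate over the original PyTorch runtime.
    return baseline_ms / optimized_ms

def geomean_speedup(pairs: list[tuple[float, float]]) -> float:
    # Geometric mean of per-task (baseline, optimized) runtime pairs.
    logs = [math.log(speedup(b, o)) for b, o in pairs]
    return math.exp(sum(logs) / len(logs))

print(speedup(12.0, 3.0))                          # 4.0 (4x faster)
print(geomean_speedup([(2.0, 1.0), (8.0, 1.0)]))   # ≈ 4.0
```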
Both this script and the `run_search.py` script below start an eval HTTP server that is for internal use only. The default port for this internal server is 8000, but it can be adjusted with the `--port` flag (any available port should work fine).
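If port 8000 is already taken, one quick way to pick a free port for `--port` is to bind to port 0 and let the OS choose. This is a hypothetical helper, not part of the repo:

```python
import socket

def find_free_port() -> int:
    # Bind to port 0: the OS assigns an unused ephemeral port, which we
    # read back and release for the eval server to use via --port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]

print(find_free_port())
```

Note the small race between releasing the port and the server binding it; for this internal single-host server that is usually acceptable.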
Keep the Eval Worker running for this. The search process submits PyTorch/kernel code to the evaluator, just like in the Evaluate Baselines step.
Important: use the same --output-dir here as you used for baseline evaluation, so that generate_figs.py can find both the search results and the baseline runtimes in one place.
```sh
python scripts/run_search.py --run-name <run_name> --output-dir data/pike-data --strategy pike-b --level 3-pike --server-type google --model-name gemini-2.5-pro --task-start 1 --task-end 50
```

With the default query budget of 300 per task, this can take 12+ hours and will hit your LLM provider heavily with requests, so use it with caution. Try conservative parameters first, such as a lower query budget (e.g. `--query-budget 50`) or a single task (e.g. `--task-start 13 --task-end 13`).
Set the desired server type (e.g. `google`, `openai`, `openrouter`) and model name (e.g. `gemini-2.5-pro`, `gpt-oss-120b`).
You can select any run name, passed in via `--run-name`. The output for the run will appear in `<output-dir>/full-pike-runs/level_<level>/<run_name>`. If a run fails or you kill it early, it is highly recommended to rename or remove that failed run's directory, or change the `--run-name` value, before restarting the run.
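The run-output layout above can be expressed as a small path helper (illustrative only; the function name is ours):

```python
from pathlib import Path

def run_output_dir(output_dir: str, level: str, run_name: str) -> Path:
    # Mirrors the layout described above:
    # <output-dir>/full-pike-runs/level_<level>/<run_name>
    return Path(output_dir) / "full-pike-runs" / f"level_{level}" / run_name

print(run_output_dir("data/pike-data", "3-pike", "my-run"))
# data/pike-data/full-pike-runs/level_3-pike/my-run
```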
For PIKE-O, pass --strategy pike-o. To use this strategy, you must first run the following pike-openevolve install script:
```sh
python scripts/install_pike_openevolve.py
```

After the search and the baseline evaluation complete, generate figures for the run:
```sh
python scripts/generate_figs.py --input-dir data/pike-data --output-dir data/pike-out
```

The original data from the paper is available here: https://huggingface.co/datasets/knagaitsev/pike-data-compressed
The main original figures from the paper can easily be generated by fetching this data, then running the figure generation script on the data:
```sh
# fetching will take some time and requires ~80 GB on disk
python scripts/fetch_paper_data.py
# the fetch script places the data in data/paper-data
python scripts/generate_figs.py --input-dir data/paper-data/pike-data --output-dir data/paper-data/pike-out --paper
```

The `--paper` option should only be used on the original paper data: it includes only a subset of results in some plots, adds additional markings to plots, and adds a money-budget plot.
Additional documentation is available in the docs/ directory, covering the eval worker, containers, HPC cluster setup, LLM API setup, profiling, and troubleshooting.
For advanced setups (running components separately, remote eval server), see docs/advanced_setup.md.
```bibtex
@misc{nagaitsev2025pike,
  title={Optimizing PyTorch Inference with LLM-Based Multi-Agent Systems},
  author={Kirill Nagaitsev and Luka Grbcic and Samuel Williams and Costin Iancu},
  year={2025},
  eprint={2511.16964},
  archivePrefix={arXiv},
  primaryClass={cs.MA},
  url={https://arxiv.org/abs/2511.16964},
}
```