SREGym is an AI-native platform to enable the design, development, and evaluation of AI agents for Site Reliability Engineering (SRE). The core idea is to create live system environments for SRE agents to solve real-world SRE problems. SREGym provides a comprehensive SRE benchmark suite with a wide variety of problems for evaluating SRE agents and also for training next-generation AI agents.
SREGym is inspired by our prior work on AIOpsLab and ITBench. It is architected with AI-native usability and extensibility as first-class principles. The SREGym benchmark suite contains 86 different SRE problems. It supports all the problems from AIOpsLab and ITBench, and adds new ones such as OS-level faults, metastable failures, and concurrent failures. See our problem set for a complete list of problems.
This README briefly explains how to run SREGym within the System Intelligence Framework.
For advanced use of System Intelligence and SREGym, please refer to the docs of System Intelligence and SREGym.
SREGym has a decoupled design that follows the System Intelligence philosophy. The components in System Intelligence and SREGym correspond as follows:
The Executor is the agent in SREGym, which is decoupled from the framework functionality. We have a baseline agent implementation in sregym_core/clients/stratus/stratus_agent/ and it is run by default. If you want to bring your own agent, please follow the Running Your Own Agent guide.
The Evaluator corresponds to the evaluation oracles in SREGym, which are decoupled from the agent implementation.
SREGym's Conductor serves as the Environment in System Intelligence.
- Environment Setup: The SREGym Conductor injects faults into the environment to cause failures.
- Diagnosis: The agent is asked to diagnose the root cause of the failure.
- Mitigation: The agent is asked to mitigate the failure.
- Evaluation: The root cause analysis (RCA) result is evaluated by an LLM-as-a-judge oracle, and the mitigation result is evaluated by specifically designed mitigation oracles.
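The four phases above can be sketched as a small control loop. This is a toy illustration only, not SREGym's actual Conductor or oracle code: `ToyEnvironment`, `run_problem`, `toy_diagnose`, and `toy_mitigate` are hypothetical names, an exact string match stands in for the LLM-as-a-judge oracle, and a simple health flag stands in for a mitigation oracle.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToyEnvironment:
    """Hypothetical stand-in for a live system managed by the Conductor."""
    fault: str            # ground-truth root cause, used only for evaluation
    healthy: bool = True

    def inject_fault(self) -> None:
        self.healthy = False

    def observe(self) -> str:
        # A real agent would inspect logs, metrics, and traces instead.
        return "ok" if self.healthy else f"errors caused by {self.fault}"

    def apply_fix(self, target: str) -> None:
        # The system recovers only if the fix targets the real fault.
        if target == self.fault:
            self.healthy = True


def run_problem(
    env: ToyEnvironment,
    diagnose: Callable[[str], str],
    mitigate: Callable[[ToyEnvironment, str], None],
) -> dict:
    env.inject_fault()                     # 1. environment setup
    diagnosis = diagnose(env.observe())    # 2. diagnosis
    mitigate(env, diagnosis)               # 3. mitigation
    return {                               # 4. evaluation
        # Exact match stands in for the LLM-as-a-judge RCA oracle.
        "diagnosis_correct": diagnosis == env.fault,
        # Health flag stands in for a problem-specific mitigation oracle.
        "mitigation_ok": env.healthy,
    }


def toy_diagnose(observation: str) -> str:
    # Take whatever follows "caused by " as the suspected root cause.
    return observation.rsplit("caused by ", 1)[-1]


def toy_mitigate(env: ToyEnvironment, diagnosis: str) -> None:
    env.apply_fix(diagnosis)


if __name__ == "__main__":
    print(run_problem(ToyEnvironment(fault="pod-crash"), toy_diagnose, toy_mitigate))
```

Keeping the agent callbacks separate from the evaluation step mirrors the decoupling described above: the agent never sees the ground-truth fault directly, and the oracles never depend on how the agent is implemented.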
- Prepare sregym_core/.env for the configurations. You can copy sregym_core/.env.example to sregym_core/.env and set the keys in the .env file. For System Intelligence, you need to set the API keys for all the models you want to test, like below:
GEMINI_API_KEY="XXXXXX"
OPENAI_API_KEY="XXXXXX"
ANTHROPIC_API_KEY="XXXXXX"
MOONSHOT_API_KEY="XXXXXX"
AZURE_API_KEY="XXXXXX"
AZURE_API_BASE="XXXXXX"
If you want more pre-defined model configurations, please refer to the sregym_core/llm_backend/configs.yaml file and add your own configurations there. Then you can select the backend with the CLI argument --model <model_id>.
For MS Azure and AWS Bedrock, you may need more configurations.
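Before launching a run, it can help to confirm that the .env file holds a real key for the model you plan to test. The sketch below is a minimal illustration, not part of SREGym: `parse_env`, `missing_keys`, and the `PROVIDER_KEYS` mapping are hypothetical, and the mapping from a model-name prefix to the key names listed above is an assumption.

```python
from typing import Dict, List

# Assumed mapping from the provider prefix of a model name to the .env keys
# shown above. This is an illustration, not SREGym's own lookup logic.
PROVIDER_KEYS: Dict[str, List[str]] = {
    "gemini": ["GEMINI_API_KEY"],
    "openai": ["OPENAI_API_KEY"],
    "anthropic": ["ANTHROPIC_API_KEY"],
    "moonshot": ["MOONSHOT_API_KEY"],
    "azure": ["AZURE_API_KEY", "AZURE_API_BASE"],
}

PLACEHOLDER = "XXXXXX"  # value used in the example keys above


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY="value" lines, skipping blanks and comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(model_name: str, env: Dict[str, str]) -> List[str]:
    """Return the keys that are unset or still hold the placeholder."""
    provider = model_name.split("/", 1)[0]
    # Unknown providers yield an empty list: nothing is known to check.
    required = PROVIDER_KEYS.get(provider, [])
    return [k for k in required if env.get(k, "") in ("", PLACEHOLDER)]


if __name__ == "__main__":
    sample = 'GEMINI_API_KEY="abc"\nOPENAI_API_KEY="XXXXXX"\n'
    env = parse_env(sample)
    print(missing_keys("openai/gpt-4o", env))
```

For a real check, the text would come from reading sregym_core/.env; the parser here handles only plain `KEY="value"` lines and ignores features such as multi-line values or variable expansion.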
- You need to create an inventory.yml file in the sregym_core/scripts/ansible directory. You can copy inventory.yml.example to inventory.yml and set the hosts in the inventory.yml file. You can follow the instructions here to get a cluster and set up the inventory file.
- Install the dependencies
cd benchmarks/sregym
./install.sh
- Run the benchmark
cd benchmarks/sregym
./run.sh <model_name> <agent_name>
Some tested model names are: "gemini/gemini-2.5-flash", "openai/gpt-4o", "anthropic/claude-sonnet-4-20250514", "moonshot/moonshot-v1-32k".
The wrapper executes python src/main.py --agent "stratus" --model_name "${MODEL_NAME}" to run the benchmark.
The results will be saved in the outputs/ directory.
outputs/sregym__<model>__<agent>__<timestamp>/
├── avg_score.json # Average score
└── result.jsonl # Detailed results
To orchestrate SREGym alongside other benchmarks:
cd cli
./run_all_local.sh <model_name> <agent_name>
To add new components, please refer to the Adding New Components guide in the SREGym documentation.
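Once a run finishes, the output directory described earlier (avg_score.json plus result.jsonl) can be inspected with a few lines of Python. This is a sketch under assumptions: the README does not document the fields inside either file, so the code only loads them as generic JSON, and `load_run` and `latest_run` are hypothetical helper names.

```python
import json
from pathlib import Path
from typing import Any, List, Optional, Tuple


def latest_run(outputs_dir: str = "outputs") -> Optional[Path]:
    """Return the most recently modified sregym__* run directory, if any."""
    runs = sorted(
        (p for p in Path(outputs_dir).glob("sregym__*") if p.is_dir()),
        key=lambda p: p.stat().st_mtime,
    )
    return runs[-1] if runs else None


def load_run(run_dir: Path) -> Tuple[Any, List[Any]]:
    """Load the average score and the per-problem records of one run."""
    avg_score = json.loads((run_dir / "avg_score.json").read_text())
    with (run_dir / "result.jsonl").open() as fh:
        # JSON Lines: one JSON document per non-empty line.
        results = [json.loads(line) for line in fh if line.strip()]
    return avg_score, results


if __name__ == "__main__":
    run = latest_run()
    if run is None:
        print("no runs found under outputs/")
    else:
        avg, records = load_run(run)
        print(run.name, avg, len(records))
```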
We strongly welcome contributions to SREGym.
You can report bugs, suggest features, or contribute code to SREGym in the upstream repository SREGym.
