Shell runner for running the SkillsBench task suite with different agents.
run.sh is meant to be executed from the SkillsBench repository root, because it uses relative paths such as tasks/, jobs/, and tasks/<task>/environment/skills/.
bash run.sh codex
bash run.sh openclaw
- Some tasks (EXCLUDE) are excluded from this sequential run because they may need more API tokens from external models, for example the audio transcription task.
- Some tasks are inherently defective for various reasons and failed the oracle sanity check, which is run with:
bench eval create -t tasks/${task} -a oracle -m oracle
bash run.sh codex
Runs every non-excluded task twice with bench eval create:
- with skills: -s tasks/<task>/environment/skills/
- without skills
Agent and model:
- agent harness: codex-acp
- model: gpt-5.3-codex
- jobs: jobs/gpt-5.3-codex__withskills__... and jobs/gpt-5.3-codex__withoutskills__...
This mode activates the skillsbench conda environment and configures the Azure OpenAI-compatible endpoint.
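The two-pass behaviour described above can be sketched as a dry run that prints the commands instead of executing them. This is only an illustration under assumptions: the contents of EXCLUDE, the task names, and the helper functions `is_excluded` and `plan_runs` are hypothetical, and the exact flag combination run.sh passes to bench eval create may differ.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints the bench commands rather than running them.
set -euo pipefail

# Hypothetical exclusion list; the real EXCLUDE contents live in run.sh.
EXCLUDE=("example-excluded-task")

# Returns 0 if the given task name appears in EXCLUDE.
is_excluded() {
  local task="$1" skip
  for skip in "${EXCLUDE[@]}"; do
    [[ "$task" == "$skip" ]] && return 0
  done
  return 1
}

# $1 = agent harness, $2 = model, remaining arguments = task names.
plan_runs() {
  local agent="$1" model="$2" task
  shift 2
  for task in "$@"; do
    is_excluded "$task" && continue
    # Pass 1: with skills (adds the -s flag).
    echo "bench eval create -t tasks/${task} -a ${agent} -m ${model} -s tasks/${task}/environment/skills/"
    # Pass 2: without skills.
    echo "bench eval create -t tasks/${task} -a ${agent} -m ${model}"
  done
}

# Hypothetical task names, used only to show the shape of the output.
plan_runs codex-acp gpt-5.3-codex task-a example-excluded-task task-b
```

With the sample arguments, the excluded task is skipped and two commands are printed for each of the remaining two tasks.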
Before running tasks, launch the backend LLM with vLLM in a Docker container:
sudo docker run -d --name qwen_gpus --runtime nvidia --gpus '"device=0,1"' \
--env "HUGGING_FACE_HUB_TOKEN=hf_****" \
-v /etc/localtime:/etc/localtime:ro \
-v /etc/timezone:/etc/timezone:ro \
-e TZ=America/Toronto \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-v /etc/ssl/certs:/etc/ssl/certs:ro \
-e SSL_CERT_FILE=/etc/ssl/certs/ca-certificates.crt \
-e REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt \
-p 1700:1700 \
--ipc=host \
vllm/vllm-openai:latest \
--model Qwen/Qwen3.6-35B-A3B \
--api-key "yyy" \
--port 1700 \
--trust-remote-code \
--gpu_memory_utilization 0.95 \
--tensor-parallel-size 2 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder
bash run.sh openclaw
Runs every non-excluded task twice with bench eval create:
- with skills
- without skills
Agent and model:
- agent harness: openclaw
- served vLLM model: Qwen/Qwen3.6-35B-A3B
- benchmark model name: localvllm/Qwen/Qwen3.6-35B-A3B
- jobs: jobs/openclaw__Qwen3.6-35B-A3B__withskills__... and jobs/openclaw__Qwen3.6-35B-A3B__withoutskills__...
This mode activates the skillsbench conda environment, points OpenAI-compatible variables at the local vLLM endpoint, and warns if /models does not list the expected model.
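The /models warning mentioned above could look roughly like the following. This is a sketch, not the check run.sh actually performs: the `OPENAI_BASE_URL` variable name, the `http://localhost:1700/v1` default (taken from the port in the docker command), and the helpers `lists_model` and `check_endpoint` are assumptions.

```shell
#!/usr/bin/env bash
# Assumed defaults: the port mirrors the docker command (-p 1700:1700).
BASE_URL="${OPENAI_BASE_URL:-http://localhost:1700/v1}"
EXPECTED_MODEL="Qwen/Qwen3.6-35B-A3B"

# Returns 0 if the JSON body contains an entry whose "id" is the model name.
# Handles both compact and space-separated JSON without requiring jq.
lists_model() {
  local body="$1" model="$2"
  printf '%s' "$body" | grep -Fq "\"id\":\"${model}\"" ||
    printf '%s' "$body" | grep -Fq "\"id\": \"${model}\""
}

# Queries the endpoint and warns on stderr if the model is not listed.
check_endpoint() {
  local body
  body="$(curl -fsS --max-time 5 \
    -H "Authorization: Bearer ${OPENAI_API_KEY:-}" \
    "${BASE_URL}/models" 2>/dev/null)" || body=""
  if ! lists_model "$body" "$EXPECTED_MODEL"; then
    echo "WARNING: ${BASE_URL}/models does not list ${EXPECTED_MODEL}" >&2
    return 1
  fi
}
```

Calling `check_endpoint || true` before the task loop keeps the check a warning rather than a hard failure, which matches the "warns" wording above.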
Before running:
uv sync --locked
Make sure these are available:
- bench command from the SkillsBench project
- uv
- curl for OpenClaw model checking
- OPENAI_API_KEY in ~/.bashrc or the current shell
- AZURE_OPENAI_API_KEY for the Codex agent
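A quick preflight for the prerequisites listed above might look like this. It is a minimal sketch: the helper names `missing_commands` and `missing_vars` are hypothetical, and the check only sees variables exported in the current shell, so a key that is defined only in ~/.bashrc will be reported as missing in a non-interactive shell.

```shell
#!/usr/bin/env bash
# Prints the name of each command that is not found on PATH.
missing_commands() {
  local cmd
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
  done
}

# Prints the name of each variable that is unset or empty.
missing_vars() {
  local name
  for name in "$@"; do
    [ -n "${!name:-}" ] || echo "$name"
  done
}

# Report problems on stderr without aborting.
for cmd in $(missing_commands bench uv curl); do
  echo "missing command: ${cmd}" >&2
done
for var in $(missing_vars OPENAI_API_KEY AZURE_OPENAI_API_KEY); do
  echo "missing environment variable: ${var}" >&2
done
```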