run_skillsbench

Shell runner for running the SkillsBench task suite with different agents.

run.sh is meant to be executed from the SkillsBench repository root, because it uses relative paths such as tasks/, jobs/, and tasks/<task>/environment/skills/.

bash run.sh codex
bash run.sh openclaw

somes of the tasks (EXCLUDE) are exclued from this sequential running, because they may need more api tokens from external models, such as the task of audio transcription.
some of the tasks are inherently defective due to different causes, and they failed the oracle sanity check via bench eval create -t tasks/${task} -a oracle -m oracle

Modes

Codex

bash run.sh codex

Runs every non-excluded task twice with bench eval create:

with skills: -s tasks/<task>/environment/skills/
without skills

Agent and model:

agent harness: codex-acp
model: gpt-5.3-codex
jobs: jobs/gpt-5.3-codex__withskills__... and jobs/gpt-5.3-codex__withoutskills__...

This mode activates the skillsbench conda environment and configures the Azure OpenAI-compatible endpoint.

OpenClaw

Before running tasks, lauch the backend LLM via vllm in docker container:

sudo docker run -d --name qwen_gpus --runtime nvidia --gpus '"device=0,1"' \
      --env "HUGGING_FACE_HUB_TOKEN=hf_****" \
    -v /etc/localtime:/etc/localtime:ro \
    -v /etc/timezone:/etc/timezone:ro \
    -e TZ=America/Toronto \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -v /etc/ssl/certs:/etc/ssl/certs:ro \
    -e SSL_CERT_FILE=/etc/ssl/certs/ca-certificates.crt \
    -e REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt \
    -p 1700:1700 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model Qwen/Qwen3.6-35B-A3B \
    --api-key "yyy" \
    --port 1700 \
    --trust-remote-code \
    --gpu_memory_utilization 0.95 \
    --tensor-parallel-size 2 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder

bash run.sh openclaw

Runs every non-excluded task twice with bench eval create:

with skills
without skills

Agent and model:

agent harness: openclaw
served vLLM model: Qwen/Qwen3.6-35B-A3B
benchmark model name: localvllm/Qwen/Qwen3.6-35B-A3B
jobs: jobs/openclaw__Qwen3.6-35B-A3B__withskills__... and jobs/openclaw__Qwen3.6-35B-A3B__withoutskills__...

This mode activates the skillsbench conda environment, points OpenAI-compatible variables at the local vLLM endpoint, and warns if /models does not list the expected model.

Requirements

Before running:

uv sync --locked

Make sure these are available:

bench command from the SkillsBench project
uv
curl for OpenClaw model checking
OPENAI_API_KEY in ~/.bashrc or the current shell
AZURE_OPENAI_API_KEY for Codex agent

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
README.md		README.md
run.sh		run.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

run_skillsbench

Modes

Codex

OpenClaw

Requirements

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

run_skillsbench

Modes

Codex

OpenClaw

Requirements

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages