This directory contains command-line interface tools and utility scripts for the zera-agent project.
scripts/run_prompt_tuning.py: command-line interface for running prompt tuning experiments.
Usage:
python scripts/run_prompt_tuning.py --dataset bbh --total_samples 20 --iteration_samples 5 --iterations 10 --model solar --evaluator solar --meta_model solar --output_dir ./results

No-API smoke run:
python3 scripts/run_prompt_tuning.py \
--dataset bbh \
--total_samples 5 \
--iteration_samples 2 \
--iterations 2 \
--mock \
--output_dir ./results/mock_smoke

Minimal API smoke run:
pip install -r requirements.txt
cp .env.example .env
# Fill UPSTAGE_API_KEY in .env, then run:
python3 scripts/run_prompt_tuning.py \
--dataset bbh \
--total_samples 5 \
--iteration_samples 1 \
--iterations 1 \
--model solar \
--evaluator solar \
--meta_model solar \
--output_dir ./results/api_smoke_solar

The default API smoke path uses Solar for generation, evaluation, and meta-prompting, so UPSTAGE_API_KEY is enough. If you select other remote models, set the matching key: OPENAI_API_KEY for gpt4o, ANTHROPIC_API_KEY for claude, and SOLAR_STRAWBERRY_API_KEY for solar_strawberry.
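The model-to-key mapping described above can be checked before launching a run. The following is a minimal sketch, not part of the project: the helper name `missing_api_keys` is hypothetical, and it inspects only the process environment, so it assumes the values from .env have already been exported or loaded.

```python
import os

# Mapping taken from the paragraph above: model name -> required environment variable.
REQUIRED_KEYS = {
    "solar": "UPSTAGE_API_KEY",
    "gpt4o": "OPENAI_API_KEY",
    "claude": "ANTHROPIC_API_KEY",
    "solar_strawberry": "SOLAR_STRAWBERRY_API_KEY",
}


def missing_api_keys(models, env=None):
    """Return sorted names of variables required by `models` that are unset or empty.

    Hypothetical helper for illustration; unknown model names are ignored.
    """
    env = os.environ if env is None else env
    needed = {REQUIRED_KEYS[m] for m in models if m in REQUIRED_KEYS}
    return sorted(key for key in needed if not env.get(key))


if __name__ == "__main__":
    # Values you would pass to --model, --evaluator, and --meta_model.
    selected = ["gpt4o", "claude", "solar"]
    missing = missing_api_keys(selected)
    if missing:
        print("Missing keys:", ", ".join(missing))
    else:
        print("All required keys are set.")
```

Passing an explicit `env` dictionary makes the check easy to exercise without touching real credentials.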
Options:
- --dataset: Dataset to use (bbh, mmlu, mmlu_pro, cnn, gsm8k, mbpp, xsum, truthfulqa, hellaswag, humaneval, samsum, meetingbank)
- --total_samples: Total number of samples to use
- --iteration_samples: Number of samples per iteration
- --iterations: Number of tuning iterations
- --model: Model to use for tuning
- --evaluator: Model to use for evaluation
- --meta_model: Model to use for meta prompt generation
- --output_dir: Directory to save results
- --mock: Run with built-in sample data and deterministic fake models; no API key or network call is required.
scripts/smoke_test_no_api.py: runs the unit tests and a mock prompt-tuning execution end to end without real API calls.
Usage:
python3 scripts/smoke_test_no_api.py

scripts/run_batch_experiments.py: script to run multiple prompt tuning experiments in batch.
Usage:
python scripts/run_batch_experiments.py --config experiments_config.json

Features:
- Batch execution of multiple experiments
- Configurable experiment parameters
- Progress tracking and logging
- Error handling and recovery
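To illustrate the idea behind batch execution with error handling, here is a rough sketch. It is not the implementation of run_batch_experiments.py, and the format of experiments_config.json is not documented here; the list of dictionaries, `build_command`, and `run_batch` are all hypothetical. Only the option names come from the options list for run_prompt_tuning.py.

```python
import subprocess
import sys


def build_command(params, script="scripts/run_prompt_tuning.py"):
    """Turn a dict of option names and values into an argument list.

    Hypothetical helper: True becomes a bare flag (e.g. --mock),
    None and False are skipped, everything else becomes "--name value".
    """
    cmd = [sys.executable, script]
    for name, value in params.items():
        if value is True:
            cmd.append(f"--{name}")
        elif value is not None and value is not False:
            cmd.extend([f"--{name}", str(value)])
    return cmd


def run_batch(experiments):
    """Run each experiment in turn and record failures instead of stopping."""
    results = []
    for params in experiments:
        label = params.get("output_dir")
        try:
            completed = subprocess.run(build_command(params), check=False)
            results.append((label, completed.returncode))
        except OSError as exc:
            # The interpreter or script could not be started; keep going.
            results.append((label, str(exc)))
    return results


if __name__ == "__main__":
    # Assumed experiment definitions, for illustration only.
    experiments = [
        {"dataset": "bbh", "total_samples": 5, "iteration_samples": 2,
         "iterations": 2, "mock": True, "output_dir": "./results/mock_a"},
        {"dataset": "gsm8k", "total_samples": 5, "iteration_samples": 2,
         "iterations": 2, "mock": True, "output_dir": "./results/mock_b"},
    ]
    # Dry run: print the commands rather than executing them.
    for params in experiments:
        print(" ".join(build_command(params)))
```

Recording a return code per experiment, rather than raising on the first failure, lets one broken configuration fail without aborting the rest of the batch.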
scripts/update_results.py: utility script to update evaluation results.
Usage:
python scripts/update_results.py

Purpose:
- Re-evaluate existing results with updated models
- Update correctness scores
- Generate new result files
scripts/run_background.sh: shell script for running experiments in the background.
Usage:
bash scripts/run_background.sh

Features:
- Background process management
- Log file generation
- Process monitoring
All scripts should be run from the project root directory:
cd /path/to/zera-agent
python scripts/script_name.py [options]

Make sure you have all required dependencies installed:
pip install -r requirements.txt

Set up your environment variables in a .env file:
cp .env.example .env
# Edit .env with your API keys and settings

# Run a simple BBH experiment
python scripts/run_prompt_tuning.py --dataset bbh --total_samples 10 --iterations 3 --model solar
# Run batch experiments
python scripts/run_batch_experiments.py --config experiments_config.json

# Custom model configuration
python scripts/run_prompt_tuning.py \
--dataset mmlu \
--total_samples 50 \
--iterations 5 \
--model gpt4o \
--evaluator claude \
--meta_model solar \
--output_dir ./custom_results

python scripts/run_prompt_tuning.py \
--dataset gsm8k \
--total_samples 100 \
--iterations 10 \
--model gpt4o \
--evaluator gpt4o \
--meta_model gpt4o \
--evaluation_threshold 0.9 \
--output_dir ./results/gsm8k_math_expert

python scripts/run_prompt_tuning.py \
--dataset mmlu \
--total_samples 200 \
--iterations 15 \
--model claude \
--evaluator claude \
--meta_model claude \
--evaluation_threshold 0.85 \
--output_dir ./results/mmlu_knowledge_expert

python scripts/run_prompt_tuning.py \
--dataset mbpp \
--total_samples 50 \
--iterations 8 \
--model solar \
--evaluator solar \
--meta_model solar \
--evaluation_threshold 0.8 \
--output_dir ./results/mbpp_code_expert

python scripts/run_prompt_tuning.py \
--dataset cnn \
--total_samples 80 \
--iterations 12 \
--model gpt4o \
--evaluator claude \
--meta_model solar \
--evaluation_threshold 0.75 \
--output_dir ./results/cnn_summary_expert

# Use smaller sample sizes to reduce costs
python scripts/run_prompt_tuning.py \
--dataset bbh \
--total_samples 20 \
--iterations 5 \
--model solar \
--evaluation_threshold 0.7 \
--output_dir ./results/bbh_budget_test

# Use larger samples for better results (higher cost)
python scripts/run_prompt_tuning.py \
--dataset mmlu \
--total_samples 500 \
--iterations 20 \
--model gpt4o \
--evaluation_threshold 0.95 \
--output_dir ./results/mmlu_high_performance