The repo supports two modes of testing:
- Local testing using vLLM
- Provider (OpenRouter) testing
```bash
# Start server with a small model
uv run python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2-1.5B-Instruct \
  --port 8000
```

### Successful Verification
```bash
# Generate verification data
uv run logprob-sample \
  --endpoint http://localhost:8000/v1 \
  --model Qwen/Qwen2-1.5B-Instruct \
  --prompt "Explain quantum computing in simple terms"

# Verify the response (should pass with same model)
uv run logprob-verify \
  -f .debug/verification_data_*.json \
  --verifier-endpoint http://localhost:8000/v1 \
  --verifier-model Qwen/Qwen2-1.5B-Instruct
```

### Failed Verification
```bash
# Generate data with one model
uv run logprob-sample \
  --endpoint http://localhost:8000/v1 \
  --model Qwen/Qwen2-1.5B-Instruct \
  --prompt "Explain quantum computing"

# Switch to a different model
uv run python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --port 8000

# Verify with the different model (should fail)
uv run logprob-verify \
  -f .debug/verification_data_*.json \
  --verifier-endpoint http://localhost:8000/v1 \
  --verifier-model meta-llama/Llama-3.1-8B-Instruct
```

### Provider (OpenRouter) Testing

Test verification with OpenRouter as the sample source and local vLLM as the verifier:
```bash
# Should pass - same model
uv run python tests/providers/test_openrouter.py \
  --openrouter-model "meta-llama/llama-3.1-8b-instruct" \
  --local-model "meta-llama/Llama-3.1-8B-Instruct" \
  --test-sample-size 15 \
  --provider "Fireworks"

# Should fail - different models
uv run python tests/providers/test_openrouter.py \
  --openrouter-model "meta-llama/llama-3.1-8b-instruct" \
  --local-model "mistralai/Mistral-7B-Instruct-v0.3" \
  --test-sample-size 15 \
  --provider "InferenceNet"
```

This test:
- Generates verification data using OpenRouter (without token IDs)
- Starts a local vLLM server with the specified model
- Verifies using text-only matching mode
- Reports pass/fail based on same_model_probability threshold
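The pass/fail decision described above can be sketched as a threshold on the verifier model's per-token log-probabilities. This is a minimal illustration of the idea, not the repo's actual implementation; the function and parameter names (`verify_logprobs`, `threshold`) are hypothetical:

```python
import math

def verify_logprobs(verifier_logprobs, threshold=0.9):
    """Decide pass/fail from the verifier model's per-token log-probabilities.

    verifier_logprobs: log-probabilities the verifier model assigns to each
    token of the sampled completion.
    threshold: minimum geometric-mean token probability required to accept
    the completion as coming from the same model.
    """
    mean_logprob = sum(verifier_logprobs) / len(verifier_logprobs)
    # exp of the mean log-probability = geometric mean of token probabilities
    same_model_probability = math.exp(mean_logprob)
    return same_model_probability >= threshold, same_model_probability

# Tokens the verifier finds likely -> pass; tokens it finds unlikely -> fail.
passed, score = verify_logprobs([-0.01, -0.02, -0.05], threshold=0.9)
failed, low_score = verify_logprobs([-2.0, -3.0], threshold=0.9)
```

When sampling and verifying use the same weights, each sampled token tends to receive a high probability from the verifier, so the score stays near 1; a different model assigns much lower probabilities to many tokens, pulling the geometric mean below the threshold.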
Troubleshooting:
- Port in use: kill the stale server with `pkill -f vllm.entrypoints.openai.api_server`
- Server timeout: try a smaller model or check GPU memory, and make sure you have access to download the model from Hugging Face
- Uncertain results: use more samples for evaluation (increase `--n-samples` or `--test-sample-size`)
- Text matching issues: enable strict mode with `--strict-text-matching` for higher confidence
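For the port and timeout issues above, it helps to confirm the local server is actually serving before re-running verification. A hedged sketch, assuming the default endpoint used throughout this guide; `server_ready` is an illustrative helper, not part of the repo:

```python
import json
import urllib.request

def server_ready(base_url="http://localhost:8000/v1"):
    """Return True if the vLLM OpenAI-compatible server responds to /models.

    vLLM exposes the standard OpenAI `GET /v1/models` route; a JSON response
    with a non-empty "data" list means the server is up and the model loaded.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=5) as resp:
            return bool(json.load(resp).get("data"))
    except (OSError, ValueError):
        # Connection refused / timeout / malformed response: not ready yet.
        return False
```

A `False` here while the server process is running usually means the model is still downloading or loading into GPU memory.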