Clean experiment artifacts for a narrow code-security distillation study.
This repo packages the subset of code, data, benchmark outputs, and notes from a larger research workspace, centered on the strongest result from the project: a narrow benchmark where a distilled 7B model slightly outperformed GPT-5.5 on balanced accuracy.
The main result here is simple:
- a `7B` open model slightly beat `GPT-5.5` on a real-world security benchmark
- updated `2026-05-03`: the same benchmark was rerun against `GPT-5.5`, and the best student still came out ahead on the primary metric
- the win came from task narrowing, public matched data, and frontier-model distillation
- the best student was still small enough to be a practical specialist, not a general replacement for frontier models
The benchmark task was:
- C/C++ numeric vulnerability triage
- only `CWE-190` and `CWE-191`
- strict structured output:
  - `vulnerable`
  - `subtype`
  - `location`
  - `reason`
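
A minimal sketch of how the four-field output could be validated before scoring. The README names the fields but not their encoding, so the JSON encoding, the field types, and the `NONE` label for negatives are assumptions here, and `TriageVerdict` and `parse_verdict` are hypothetical names, not the repo's API.

```python
import json
from dataclasses import dataclass

# Assumed label set: the two in-scope CWEs plus a "NONE" label for negatives.
ALLOWED_SUBTYPES = {"CWE-190", "CWE-191", "NONE"}
EXPECTED_FIELDS = {"vulnerable", "subtype", "location", "reason"}


@dataclass(frozen=True)
class TriageVerdict:
    vulnerable: bool
    subtype: str
    location: str
    reason: str


def parse_verdict(raw: str) -> TriageVerdict:
    """Parse one model response; raise ValueError on any schema violation."""
    obj = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(obj, dict):
        raise ValueError("expected a JSON object")
    if set(obj) != EXPECTED_FIELDS:
        raise ValueError(f"field mismatch: {sorted(set(obj) ^ EXPECTED_FIELDS)}")
    if not isinstance(obj["vulnerable"], bool):
        raise ValueError("vulnerable must be a boolean")
    for key in ("subtype", "location", "reason"):
        if not isinstance(obj[key], str):
            raise ValueError(f"{key} must be a string")
    if obj["subtype"] not in ALLOWED_SUBTYPES:
        raise ValueError(f"unknown subtype: {obj['subtype']!r}")
    # A positive verdict must name a CWE; a negative verdict must use NONE.
    if obj["vulnerable"] == (obj["subtype"] == "NONE"):
        raise ValueError("vulnerable flag and subtype disagree")
    return TriageVerdict(**obj)
```

Rejecting malformed responses outright, rather than repairing them, keeps the "strict" in strict structured output: a response that cannot be parsed is scored as a failure instead of being guessed at.
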
This is not a claim about general vulnerability detection or patch generation. It is a claim about a specific, repeated workflow where specialization mattered.
All models below were evaluated on the same frozen 140-example PrimeVul test set:
- `20` true numeric vulnerabilities
- `120` negatives / distractors
Because the benchmark is negative-heavy, the most useful metric is:
`balanced binary accuracy = (positive recall + negative accuracy) / 2`
| Model | Balanced Binary Acc | Positive Recall | Negative Accuracy | Read |
|---|---|---|---|---|
| `Qwen + Juliet -> PrimeVul distilled` | 73.8% | 95.0% | 52.5% | Best recall, most aggressive |
| `GPT-5.5` (Responses API, reasoning=none) | 70.8% | 85.0% | 56.7% | Current frontier comparison |
| `GPT-5.2` | 70.8% | 85.0% | 56.7% | Strong frontier baseline |
| `Qwen + PrimeVul distilled` | 70.0% | 85.0% | 55.0% | Roughly tied with the frontier baselines |
| `Qwen base` | 63.8% | 30.0% | 97.5% | Very conservative |
| `Qwen + Juliet stage 1` | 50.0% | 0.0% | 100.0% | Learned to always say NONE |
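
The balanced-accuracy column can be reproduced from raw counts. The sketch below is illustrative only: the per-model counts are not taken from the saved benchmark files but back-computed from the reported percentages on the 20-positive / 120-negative split (for example, 95.0% recall on 20 positives implies 19 true positives), and `balanced_binary_accuracy` is a hypothetical helper name.

```python
N_POS, N_NEG = 20, 120  # split sizes stated for the frozen PrimeVul test set


def balanced_binary_accuracy(true_pos: int, n_pos: int, true_neg: int, n_neg: int) -> float:
    """(positive recall + negative accuracy) / 2."""
    positive_recall = true_pos / n_pos
    negative_accuracy = true_neg / n_neg
    return (positive_recall + negative_accuracy) / 2


# (true positives, true negatives), inferred from the reported percentages.
inferred_counts = {
    "Qwen + Juliet -> PrimeVul distilled": (19, 63),
    "GPT-5.5 / GPT-5.2": (17, 68),
    "Qwen + PrimeVul distilled": (17, 66),
    "Qwen base": (6, 117),
    "Qwen + Juliet stage 1": (0, 120),
}

for name, (tp, tn) in inferred_counts.items():
    score = balanced_binary_accuracy(tp, N_POS, tn, N_NEG)
    print(f"{name}: {score:.1%}")
```

This also shows why plain accuracy would mislead on this split: a model that always answers NONE gets 120 of 140 examples right (about 85.7%) while scoring only 50% balanced accuracy.
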
Short version:
- the best `7B` student slightly beat `GPT-5.5` on balanced accuracy
- `GPT-5.5` matched the old `GPT-5.2` baseline on this benchmark when rerun with the current API
- `PrimeVul + GPT-5.2-distilled targets` created the real lift
- `Juliet` may provide a small warm-start benefit before the real-world distilled stage
This repo supports a result that is stronger than "small models can be decent" and narrower than "small models beat frontier models at security":
- a small open model can become competitive with, and slightly outperform, a frontier model on a narrow code-security workflow
- public data was enough to get there
- distillation plus task matching mattered more than raw dataset size
- the useful end state is a cheaper specialist, not a better general reasoner
The refreshed frontier comparison matters because the benchmark is no longer pinned to an older model generation. The same narrow specialist that beat GPT-5.2 also stayed ahead of GPT-5.5 on the benchmark's main metric.
The important pattern is:
- broader task scopes are harder to move with small fine-tunes
- narrow real-world distillation worked
- specialization is the lever
- `Juliet alone` was not enough to produce the strongest result
- `PrimeVul` real-world examples plus `GPT-5.2`-distilled structured targets did
- `Juliet -> PrimeVul distilled` was the best run, but the improvement over `PrimeVul distilled` was small
- the best student won mostly by increasing positive recall, not by becoming uniformly more accurate
So the core lesson is not "synthetic security data is enough." The core lesson is that a small model can inherit useful frontier behavior when the task, labels, and evaluation are all tightly aligned.
These caveats matter, but they do not erase the result.
- This is a narrow task, not general vulnerability detection.
- This is distillation, not independent general reasoning.
  - the best student models were trained on `GPT-5.2`-generated targets from the same task family
- The benchmark has no exact train/test overlap:
  - task ID overlap: `0`
  - commit overlap: `0`
  - exact prompt overlap: `0`
  - exact code overlap: `0`
- There are still weaker overlap risks:
  - same public corpus family (`PrimeVul`)
  - same-project overlap
  - some shared CVEs between train and eval
- The best-performing student is more aggressive than the base model.
  - it catches more real positives
  - it also produces more false positives
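
The exact-overlap checks listed above can be sketched as a set intersection per field. This is an assumption-laden illustration, not the repo's contamination check: the record keys `task_id`, `commit`, `prompt`, and `code` are hypothetical, and `exact_overlap` is a made-up helper name.

```python
from typing import Dict, Iterable, Mapping, Sequence

# Hypothetical record keys; the real dataset schema may differ.
OVERLAP_KEYS = ("task_id", "commit", "prompt", "code")


def exact_overlap(
    train: Iterable[Mapping[str, str]],
    evaluation: Iterable[Mapping[str, str]],
    keys: Sequence[str] = OVERLAP_KEYS,
) -> Dict[str, int]:
    """Count distinct values that appear verbatim in both splits, per key."""
    train_rows = list(train)
    eval_rows = list(evaluation)
    counts = {}
    for key in keys:
        train_values = {row[key] for row in train_rows if row.get(key) is not None}
        eval_values = {row[key] for row in eval_rows if row.get(key) is not None}
        counts[key] = len(train_values & eval_values)
    return counts


# Toy records, for illustration only.
train = [{"task_id": "t1", "commit": "aaa", "prompt": "p1", "code": "int a;"}]
held_out = [{"task_id": "e1", "commit": "bbb", "prompt": "p2", "code": "int b;"}]
print(exact_overlap(train, held_out))
```

A check like this only catches verbatim matches. The weaker risks in the list (same project, shared CVEs) would need their own keys, such as a project or CVE identifier, and would still miss near-duplicate code.
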
The right interpretation is:
- this is real evidence that a small open model can win on a narrow, structured workflow
- it is not evidence that a small model has better general reasoning than a frontier model
I also explored broader vulnerability-detection setups, but this repo is intentionally centered on the narrower benchmark where the result was clearest and most interesting:
- a fixed real-world eval set
- a tightly scoped task
- public matched data
- distilled structured targets
That is the setup that produced the most believable frontier-vs-small-model comparison in this project.
- `EXPERIMENT_NOTES_2026-03-26.md`
  Full writeup with methodology, result tables, contamination checks, and interpretation.
- `benchmarks/`
  Saved benchmark outputs for the cleaned broad benchmark and the numeric-triage runs.
- `data/`
  Frozen eval sets, manifests, and numeric training datasets.
- `scripts/`
  Dataset builders, distillation scripts, Modal train/eval entrypoints, and the numeric rebalancer.
- `src/rl_secdef/`
  The subset of package code needed to build and benchmark the cleaned experiments.
- `tests/`
  Focused tests for the numeric-triage pipeline.
- `benchmarks/qwen7b_primevul_numeric_base_eval.json`
- `benchmarks/qwen7b_juliet_numeric_stage1_eval.json`
- `benchmarks/gpt52_primevul_numeric_eval.json`
- `benchmarks/gpt55_primevul_numeric_eval_responses_none.json`
- `benchmarks/gpt55_primevul_numeric_eval_responses_low.json`
- `benchmarks/qwen7b_primevul_numeric_distilled_eval.json`
- `benchmarks/qwen7b_juliet_primevul_numeric_distilled_eval.json`

- `data/primevul_numeric_triage_train.jsonl`
- `data/primevul_numeric_triage_train_distilled.jsonl`
- `data/primevul_numeric_triage_eval.jsonl`
- `data/primevul_numeric_triage_train.manifest.json`
- `data/juliet_numeric_triage.jsonl`
- `data/juliet_numeric_triage.manifest.json`

- `scripts/build_primevul_numeric_triage.py`
- `scripts/build_juliet_numeric_triage.py`
- `scripts/distill_numeric_triage.py`
- `scripts/eval_numeric_openai.py`
- `scripts/modal_train_detect.py`
- `scripts/modal_eval_numeric.py`
- `scripts/rebalance_numeric_triage.py`
- `src/rl_secdef/data/primevul_numeric.py`
- `src/rl_secdef/runner/numeric_triage.py`
- `src/rl_secdef/benchmark_numeric.py`
Focused tests:

    python3 -m pytest tests/test_numeric_triage.py tests/test_primevul_numeric.py tests/test_rebalance_numeric.py -q

The extracted numeric experiment bundle is intentionally small. This repo is not meant to be a full mirror of the original workspace.
Read:

- `EXPERIMENT_NOTES_2026-03-26.md`

That file contains the full methodology, result tables, contamination checks, and interpretation.