[experiments] Daily Experiment Report — 2026-05-25 #34609

2026-05-25T09:26:51Z

github-actions[bot]
Bot May 25, 2026

🧪 Daily Experiment Report — 2026-05-25

5 experiments analysed across 4 workflows. 2 experiments reached sufficient sample size (both on smoke-copilot); both resulted in ABANDON (p ≥ 0.05, no detectable effect). The other 3 are still below min_samples: 1 continues collecting data (EXTEND) and 2 are recommended for ABANDON because of run failures, one of them a critical ⚠️ case in which every run failed.

⚡ Quick Stats

Metric	Value
Active experiments	5
Workflows with experiments	4
Ready for analysis (n ≥ min_samples)	2
Statistically significant (p < 0.05)	0
Recommendations	✅ PROMOTE: 0 · 🟡 EXTEND: 1 · ❌ ABANDON: 4

`output_format` · `daily-compiler-quality`

Status: 🟡 COLLECTING · ❌ GUARDRAIL FAILED
Variants: detailed vs concise · Window: last 10 runs · min_samples: 20 per variant
Tracking issue: #32390

H0: no change in discussion engagement. H1: concise variant achieves equal engagement with ≥30% fewer output tokens

📈 View Detailed Statistics

Sample Sizes & Progress

Variant	Runs	Progress
`detailed`	4	██░░░░░░░░ 4/20 (20%)
`concise`	6	███░░░░░░░ 6/20 (30%)

Experiment : output_format
Workflow   : daily-compiler-quality
Hypothesis : H0: no change in engagement. H1: concise achieves equal engagement with ≥30% fewer tokens
Window     : last 10 runs  |  Analysed: 10 runs
min_samples: 20 per variant

+------------------+------+----------+----------------+--------------------+-----------+------------------+
| Variant          |  n   | Succ %   | Mean dur (min) | 95% CI (min)       |  p-value  | min_samples      |
+------------------+------+----------+----------------+--------------------+-----------+------------------+
| detailed         |   4  |  75.0%   |    7.4         | [3.3 , 11.6]       |  (ref)    | ██░░░░░░░░ 20%   |
| concise          |   6  |  66.7%   |    8.5         | [5.3 , 11.6]       |  0.7782   | ███░░░░░░░ 30%   |
+------------------+------+----------+----------------+--------------------+-----------+------------------+
Significance: * p<0.05   ** p<0.01   *** p<0.001

Guardrails:
  run_success_rate >=0.85 : detailed=FAIL(0.75), concise=FAIL(0.67) ← GUARDRAIL FAILED
  empty_output_rate <=0.05: (insufficient data to evaluate)

Recommendation: ABANDON
Rationale     : Both variants fail the run_success_rate >=0.85 guardrail (0.75 and 0.67); investigate before continuing.

Recommendation: ❌ ABANDON — Both variants violate the run_success_rate >=0.85 guardrail; fix infrastructure reliability before resuming the experiment.
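
The guardrail check above amounts to a threshold comparison per variant. The following is a minimal sketch of that idea, not the report generator's actual code: the `check_guardrail` helper, its signature, and the set of supported operators are assumptions made for illustration. Only the metric, threshold, and observed rates come from the report.

```python
import operator

# Comparison operators a guardrail expression may use (assumed set).
OPS = {">=": operator.ge, "<=": operator.le, "==": operator.eq}


def check_guardrail(op, threshold, rates):
    """Return {variant: "PASS" | "FAIL"} for a single guardrail.

    `rates` maps a variant name to its observed metric value.
    Hypothetical helper, for illustration only.
    """
    compare = OPS[op]
    return {
        variant: "PASS" if compare(rate, threshold) else "FAIL"
        for variant, rate in rates.items()
    }


# run_success_rate >= 0.85, using the rates reported above
result = check_guardrail(">=", 0.85, {"detailed": 0.75, "concise": 0.67})
print(result)  # {'detailed': 'FAIL', 'concise': 'FAIL'}
```

Because a guardrail failure says the runs themselves are unreliable, it is reasonable to evaluate it before looking at any p-value.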

`output_format` · `daily-issues-report`

Status: ⚠️ CRITICAL · 🔴 ALL RUNS FAILED
Variants: collapsible vs inline · Window: 19 runs since 2026-05-07 · min_samples: 30 per variant
Tracking issue: #30573

H0: no change in discussion engagement. H1: inline format produces ≥20% higher reactions+replies

📈 View Detailed Statistics

Sample Sizes & Progress

Variant	Runs	Progress
`collapsible`	7	██░░░░░░░░ 7/30 (23%)
`inline`	12	████░░░░░░ 12/30 (40%)

Experiment : output_format
Workflow   : daily-issues-report
Hypothesis : H0: no change. H1: inline ≥20% higher reactions+replies
Window     : last 30 runs  |  Analysed: 19 runs
min_samples: 30 per variant

+------------------+------+----------+----------------+--------------------+-----------+------------------+
| Variant          |  n   | Succ %   | Mean dur (min) | 95% CI (min)       |  p-value  | min_samples      |
+------------------+------+----------+----------------+--------------------+-----------+------------------+
| collapsible      |   7  |   0.0%   |    5.4         | [4.0 , 6.8]        |  (ref)    | ██░░░░░░░░ 23%   |
| inline           |  12  |   0.0%   |    6.0         | [4.5 , 7.4]        |  N/A      | ████░░░░░░ 40%   |
+------------------+------+----------+----------------+--------------------+-----------+------------------+
p-value is N/A: with a 0% success rate in both variants, the pooled standard error is zero and the z-test is undefined.

Guardrails:
  empty_output_rate ==0 : both variants at 100% workflow failure rate — GUARDRAIL CONCERN ⚠️

Recommendation: ABANDON
Rationale     : 100% workflow failure across ALL 19 runs signals a broken environment, not a variant effect; investigate and fix before collecting valid experiment data.

Recommendation: ❌ ABANDON (infrastructure issue) — Every single run since the experiment started has failed. This is not a variant effect — the workflow environment must be investigated. Pause and fix before re-running.
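
The N/A p-value follows from how a two-proportion z-test is computed. The sketch below assumes a pooled, two-sided z-test on success counts; the report only says "z-test", so the exact variant is an assumption, and `two_proportion_p_value` is a hypothetical name.

```python
import math


def two_proportion_p_value(succ_a, n_a, succ_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test.

    Returns None when the pooled success rate is 0 or 1, because the
    standard error is then zero and the z statistic is undefined.
    """
    pooled = (succ_a + succ_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return None
    z = (succ_a / n_a - succ_b / n_b) / se
    # Two-sided normal tail: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2))
    return math.erfc(abs(z) / math.sqrt(2))


# collapsible 0/7 vs inline 0/12: every run failed, so the test is undefined
print(two_proportion_p_value(0, 7, 0, 12))  # None
```

As a cross-check, calling `two_proportion_p_value(3, 4, 4, 6)` with success counts inferred from the 75.0% and 66.7% rates of daily-compiler-quality gives about 0.778, in line with the 0.7782 reported for that experiment.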

`output_format` · `deep-report`

Status: 🟡 COLLECTING
Variants: full_briefing vs executive_brief vs annotated_brief · Window: 14 runs since 2026-05-06 · min_samples: 15 per variant

H0: no change in engagement/token cost. H1: executive_brief reduces token usage ≥20% without reducing engagement

📈 View Detailed Statistics

Sample Sizes & Progress

Variant	Runs	Progress
`full_briefing`	5	███░░░░░░░ 5/15 (33%)
`executive_brief`	4	███░░░░░░░ 4/15 (27%)
`annotated_brief`	5	███░░░░░░░ 5/15 (33%)

Experiment : output_format
Workflow   : deep-report
Hypothesis : H0: no change. H1: executive_brief reduces tokens ≥20% without reducing engagement
Window     : last 30 runs  |  Analysed: 14 runs
min_samples: 15 per variant

+--------------------+------+----------+----------------+--------------------+-----------+------------------+
| Variant            |  n   | Succ %   | Mean dur (min) | 95% CI (min)       |  p-value  | min_samples      |
+--------------------+------+----------+----------------+--------------------+-----------+------------------+
| full_briefing      |   5  | 100.0%   |   14.9         | [10.8, 19.1]       |  (ref)    | ███░░░░░░░ 33%   |
| executive_brief    |   4  | 100.0%   |   14.2         | [10.6, 17.9]       |  N/A      | ███░░░░░░░ 27%   |
| annotated_brief    |   5  |  80.0%   |   16.0         | [14.6, 17.4]       |  0.2918   | ███░░░░░░░ 33%   |
+--------------------+------+----------+----------------+--------------------+-----------+------------------+
Significance: * p<0.05   ** p<0.01   *** p<0.001
Note: executive_brief vs full_briefing p-value is N/A because both variants have a 100% success rate (0 failures), which makes the pooled standard error zero.
Bonferroni α = 0.05 / 2 = 0.025 (3 variants, 2 comparisons against the control)

Guardrails:
  empty_output_rate ==0         : all variants PASS
  issue_creation_success_rate >=0.8: full_briefing=PASS(1.0), executive_brief=PASS(1.0), annotated_brief=PASS(0.8)

Recommendation: EXTEND
Rationale     : All variants below min_samples (15); need more data before drawing conclusions.

Recommendation: 🟡 EXTEND — Guardrails pass. Need ~10+ more runs per variant (currently at ~4-5/15) to reach analysis threshold.
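
The Bonferroni threshold used for this three-variant experiment can be sketched as follows. This assumes each non-control variant is compared once against the control (2 comparisons for 3 variants), which is consistent with the reported α of 0.025; `bonferroni_alpha` is a hypothetical helper.

```python
def bonferroni_alpha(alpha, n_variants):
    """Per-comparison significance threshold under Bonferroni correction.

    Assumes one comparison per non-control variant against a single control.
    """
    n_comparisons = n_variants - 1
    return alpha / n_comparisons


threshold = bonferroni_alpha(0.05, 3)
print(threshold)           # 0.025
print(0.2918 < threshold)  # False: annotated_brief is not significant
```

The correction only tightens the threshold; it does not change the min_samples requirement, so the EXTEND recommendation stands either way.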

`caveman` · `smoke-copilot`

Status: 🟢 READY
Variants: no vs yes · Window: 56 runs since 2026-05-06 · min_samples: 20 per variant

📈 View Detailed Statistics

Sample Sizes & Progress

Variant	Runs	Progress
`no`	28	██████████ 28/20 (100%)
`yes`	28	██████████ 28/20 (100%)

Experiment : caveman
Workflow   : smoke-copilot
Hypothesis : (not specified)
Window     : last 56 runs  |  Analysed: 25 runs with API outcomes (31 older runs inferred)
min_samples: 20 per variant

+------------------+------+----------+----------------+--------------------+-----------+------------------+
| Variant          |  n   | Succ %   | Mean dur (min) | 95% CI (min)       |  p-value  | min_samples      |
+------------------+------+----------+----------------+--------------------+-----------+------------------+
| no (control)     |  28  |  25.0%   |   11.8         | [10.6, 13.1]       |  (ref)    | ██████████ 100%  |
| yes              |  28  |  21.4%   |   12.5         | [11.3, 13.7]       |  0.7516   | ██████████ 100%  |
+------------------+------+----------+----------------+--------------------+-----------+------------------+
Significance: * p<0.05   ** p<0.01   *** p<0.001

Guardrails: none declared

Recommendation: ABANDON
Rationale     : p = 0.75 >> 0.05 with both variants above min_samples — no significant effect of caveman mode.

Recommendation: ❌ ABANDON — Both variants reached min_samples. Caveman mode has no detectable effect on success rate (p=0.75). Remove or redesign this experiment.

`subagent_model` · `smoke-copilot`

Status: 🟢 READY
Variants: large vs small · Window: 36 runs with model assignment · min_samples: 18 per variant (equal split)

📈 View Detailed Statistics

Sample Sizes & Progress

Variant	Runs	Progress
`large`	18	██████████ 18/18 (100%)
`small`	18	██████████ 18/18 (100%)

Experiment : subagent_model
Workflow   : smoke-copilot
Hypothesis : (not specified)
Window     : last 36 runs with model assignment
min_samples: 18 per variant

+------------------+------+----------+----------------+--------------------+-----------+------------------+
| Variant          |  n   | Succ %   | Mean dur (min) | 95% CI (min)       |  p-value  | min_samples      |
+------------------+------+----------+----------------+--------------------+-----------+------------------+
| large (control)  |  18  |  33.3%   |   12.9         | [11.5, 14.3]       |  (ref)    | ██████████ 100%  |
| small            |  18  |  38.9%   |   11.9         | [10.8, 13.1]       |  0.7286   | ██████████ 100%  |
+------------------+------+----------+----------------+--------------------+-----------+------------------+
Significance: * p<0.05   ** p<0.01   *** p<0.001

Guardrails: none declared

Recommendation: ABANDON
Rationale     : p = 0.73 >> 0.05 with both variants at min_samples — no significant effect of subagent model size.

Recommendation: ❌ ABANDON — Both variants reached min_samples. Model size (large vs small) has no detectable effect on success rate (p=0.73). The slight 5.6pp difference in favour of small is not statistically significant.

🔬 Factorial Interaction: caveman × subagent_model (smoke-copilot)

36 runs have both caveman and subagent_model assignments.

Cell	n	Success Rate
caveman=yes × model=large	11	27.3%
caveman=yes × model=small	7	14.3%
caveman=no × model=large	7	42.9%
caveman=no × model=small	11	36.4%

Chi-square independence test: p ≈ 0.65 (no significant interaction). All cells have n ≥ 5 runs, but the expected success count in each cell is below 5, so the chi-square approximation is rough.
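
For reference, a chi-square test on the four cells can be sketched with the standard library alone. The report does not say how the contingency table was built, so this assumes a 4 × 2 table (cell × success/failure) with 3 degrees of freedom; the success counts are inferred from the reported rates and cell sizes.

```python
import math

# (runs, successes) per cell; successes are inferred from the reported rates
cells = {
    "caveman=yes x model=large": (11, 3),
    "caveman=yes x model=small": (7, 1),
    "caveman=no x model=large": (7, 3),
    "caveman=no x model=small": (11, 4),
}

total_n = sum(n for n, _ in cells.values())
total_succ = sum(s for _, s in cells.values())
pooled = total_succ / total_n  # overall success rate under H0

# Pearson chi-square statistic over the 4 x 2 table
chi2 = 0.0
for n, succ in cells.values():
    exp_succ = n * pooled
    exp_fail = n * (1 - pooled)
    chi2 += (succ - exp_succ) ** 2 / exp_succ
    chi2 += ((n - succ) - exp_fail) ** 2 / exp_fail

# Closed-form survival function of the chi-square distribution, df = 3
p_value = math.erfc(math.sqrt(chi2 / 2)) + math.sqrt(2 * chi2 / math.pi) * math.exp(-chi2 / 2)
print(round(chi2, 2), round(p_value, 2))
```

Under these assumptions the p-value lands in the same range as the reported p ≈ 0.65. With cells this small, an exact or permutation test would be a more cautious choice than the chi-square approximation.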

📊 Summary

View Full Experiments Table

Experiment	Workflow	Control	Variants	n (total)	min_samples	p-value	Guardrails	Recommendation
output_format	daily-compiler-quality	detailed	concise	10	20	0.7782	FAIL (success_rate)	❌ ABANDON
output_format	daily-issues-report	collapsible	inline	19	30	N/A	FAIL (100% failure)	❌ ABANDON
output_format	deep-report	full_briefing	executive_brief, annotated_brief	14	15	N/A / 0.29	PASS	🟡 EXTEND
caveman	smoke-copilot	no	yes	56	20	0.7516	N/A	❌ ABANDON
subagent_model	smoke-copilot	large	small	36	18	0.7286	N/A	❌ ABANDON

Analysis window: last 30 completed runs per workflow · Significance threshold: p < 0.05 (two-tailed)
Run: 26392702536

Warning

Firewall blocked 1 domain

The following domain was blocked by the firewall during workflow execution:

proxy.golang.org

To allow this domain, add it to the network.allowed list in your workflow frontmatter:

network:
  allowed:
    - defaults
    - "proxy.golang.org"

See Network Configuration for more information.

Generated by 🧪 daily-experiment-report · sonnet46 8.5M · ◷

expires on May 28, 2026, 9:26 AM UTC

2026-05-26T09:10:36Z

github-actions[bot]
Bot May 26, 2026
Author

This discussion has been marked as outdated by daily-experiment-report.

A newer discussion is available at Discussion #34903.
