[experiments] Daily Experiment Report — 2026-05-25 #34609
Closed
Replies: 1 comment
This discussion has been marked as outdated by daily-experiment-report. A newer discussion is available at Discussion #34903.
🧪 Daily Experiment Report — 2026-05-25
5 experiments analysed across 4 workflows. 2 experiments reached sufficient sample size (smoke-copilot); both resulted in ABANDON (p ≥ 0.05, no detectable effect). 3 experiments are still collecting data (EXTEND), one with a critical ⚠️ guardrail concern.
⚡ Quick Stats
output_format·daily-compiler-quality
H0: no change in discussion engagement. H1: concise variant achieves equal engagement with ≥30% fewer output tokens.
📈 View Detailed Statistics
Sample Sizes & Progress
Variants: detailed, concise
Recommendation: ❌ ABANDON. Both variants violate the run_success_rate >= 0.85 guardrail; fix infrastructure reliability before resuming the experiment.
output_format·daily-issues-report
H0: no change in discussion engagement. H1: inline format produces ≥20% higher reactions+replies.
📈 View Detailed Statistics
Sample Sizes & Progress
Variants: collapsible, inline
Recommendation: ❌ ABANDON (infrastructure issue). Every single run since the experiment started has failed. This is not a variant effect; the workflow environment must be investigated. Pause and fix before re-running.
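The guardrail logic behind these two ABANDON calls can be sketched as follows. This is a minimal illustration, assuming the guardrail is a simple per-variant success-rate floor of 0.85 as quoted in the report; the `VariantRuns` type, the `guardrail_violations` helper, and the run counts are hypothetical and not taken from the workflow's actual implementation or data.

```python
from dataclasses import dataclass

GUARDRAIL_MIN_SUCCESS_RATE = 0.85  # threshold quoted in the report


@dataclass
class VariantRuns:
    """Hypothetical container for one variant's run outcomes."""
    name: str
    succeeded: int
    total: int

    @property
    def success_rate(self) -> float:
        # Treat a variant with no runs as 0.0 rather than dividing by zero.
        return self.succeeded / self.total if self.total else 0.0


def guardrail_violations(variants, threshold=GUARDRAIL_MIN_SUCCESS_RATE):
    """Return the names of variants whose success rate is below the threshold."""
    return [v.name for v in variants if v.success_rate < threshold]


# Hypothetical counts, for illustration only (not taken from the report).
runs = [VariantRuns("collapsible", 0, 6), VariantRuns("inline", 0, 5)]
print(guardrail_violations(runs))
```

When every variant appears in the returned list, as in the two experiments above, the failure is more plausibly environmental than a variant effect, which matches the report's advice to fix infrastructure first.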
output_format·deep-report
H0: no change in engagement/token cost. H1: executive_brief reduces token usage ≥20% without reducing engagement.
📈 View Detailed Statistics
Sample Sizes & Progress
Variants: full_briefing, executive_brief, annotated_brief
Recommendation: 🟡 EXTEND. Guardrails pass. Need ~10+ more runs per variant (currently at ~4-5/15) to reach analysis threshold.
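The "~10+ more runs" figure follows from subtracting the current per-variant counts from the threshold of 15. A small sketch of that arithmetic, assuming a fixed per-variant threshold; the helper names and the exact counts below are hypothetical, chosen only to fall in the "~4-5" range the report gives.

```python
MIN_SAMPLES = 15  # per-variant analysis threshold quoted in the report


def runs_remaining(current_runs, min_samples=MIN_SAMPLES):
    """Map each variant to the number of runs it still needs (never negative)."""
    return {name: max(0, min_samples - n) for name, n in current_runs.items()}


def needs_extension(current_runs, min_samples=MIN_SAMPLES):
    """True while at least one variant is below the threshold."""
    return any(r > 0 for r in runs_remaining(current_runs, min_samples).values())


# Hypothetical per-variant counts in the "~4-5" range the report mentions.
current = {"full_briefing": 5, "executive_brief": 4, "annotated_brief": 5}
print(runs_remaining(current))
print(needs_extension(current))
```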
caveman·smoke-copilot
📈 View Detailed Statistics
Sample Sizes & Progress
Variants: no, yes
Recommendation: ❌ ABANDON. Both variants reached min_samples. Caveman mode has no detectable effect on success rate (p=0.75). Remove or redesign this experiment.
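The report does not say which test produced p=0.75. One common way to compare success rates between two variants is a pooled two-proportion z-test, sketched below under that assumption. The function name and the counts are hypothetical; with samples this small, an exact test such as Fisher's may be the better choice.

```python
import math


def two_proportion_p_value(success_a, n_a, success_b, n_b):
    """Two-sided p-value for H0: both variants share one success rate.

    Uses the pooled two-proportion z-test (normal approximation).
    """
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # All runs succeeded or all failed in both arms: no evidence of a difference.
        return 1.0
    z = (p_a - p_b) / se
    # Two-sided tail of the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical counts, for illustration only (not taken from the report).
p = two_proportion_p_value(15, 18, 14, 18)
print(f"p = {p:.2f}; significant at 0.05: {p < 0.05}")
```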
subagent_model·smoke-copilot
📈 View Detailed Statistics
Sample Sizes & Progress
Variants: large, small
Recommendation: ❌ ABANDON. Both variants reached min_samples. Model size (large vs small) has no detectable effect on success rate (p=0.73). The slight 5.6pp difference in favour of small is not statistically significant.
🔬 Factorial Interaction: caveman × subagent_model (smoke-copilot)
36 runs have both caveman and subagent_model assignments.
Chi-square independence test: p ≈ 0.65 (no significant interaction). All cells have n ≥ 5; no sparse cell risk.
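A chi-square independence test on a 2×2 table can be sketched as below. The report does not show its contingency table, so the cell contents are an assumption: the example tabulates hypothetical run counts by caveman × subagent_model assignment, summing to 36 with every cell ≥ 5. No continuity correction is applied; whether the report's test used one is unknown.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table of counts.

    Returns (chi2, p). With 1 degree of freedom, the chi-square survival
    function reduces to erfc(sqrt(chi2 / 2)).
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    chi2 = 0.0
    for i, row_total in enumerate(row_totals):
        for j, col_total in enumerate(col_totals):
            expected = row_total * col_total / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2, math.erfc(math.sqrt(chi2 / 2))


# Hypothetical cell counts (rows: caveman no/yes; columns: large/small).
cells = [[10, 8], [9, 9]]
chi2, p = chi_square_2x2(cells)
print(f"chi2 = {chi2:.3f}, p = {p:.2f}")
```

The "n ≥ 5 per cell" remark in the report matters because the chi-square approximation becomes unreliable when expected cell counts are small.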
📊 Summary
View Full Experiments Table
Warning
Firewall blocked 1 domain
The following domain was blocked by the firewall during workflow execution:
proxy.golang.org
See Network Configuration for more information.