fix(evals): capture tool calls in eval runner and improve canhelp evals (#134)
The eval runner now uses stream-json to capture tool calls during
execution, giving the judge visibility into which scripts were actually
run. Also parses allowed-tools from skill frontmatter so skills that
require Bash scripts (like canhelp) can execute them during evals.
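
A minimal sketch of the two parsing steps described above, for illustration only. It assumes the stream-json output is newline-delimited JSON in which assistant events carry `tool_use` content blocks and a final `result` event carries the answer text, and that `allowed-tools` appears as a single comma-separated line in the skill frontmatter. The function names `extractToolCalls` and `parseAllowedTools` are hypothetical and are not taken from the runner itself.

```javascript
// Hypothetical helpers: names and event shapes are assumptions,
// not the eval runner's actual implementation.

// Collect tool_use blocks and the final text from newline-delimited JSON events.
function extractToolCalls(streamText) {
  const toolCalls = [];
  let text = "";
  for (const line of streamText.split("\n")) {
    const trimmed = line.trim();
    if (!trimmed) continue;
    let event;
    try {
      event = JSON.parse(trimmed);
    } catch {
      continue; // skip lines that are not valid JSON
    }
    if (event.type === "assistant" && Array.isArray(event.message?.content)) {
      for (const block of event.message.content) {
        if (block.type === "tool_use") {
          toolCalls.push({ name: block.name, input: block.input });
        }
      }
    } else if (event.type === "result" && typeof event.result === "string") {
      text = event.result; // assumed location of the final answer text
    }
  }
  return { text, toolCalls };
}

// Read a single-line, comma-separated `allowed-tools` entry from frontmatter.
// Returns an empty list when there is no frontmatter or no such entry.
function parseAllowedTools(skillMarkdown) {
  const frontmatter = skillMarkdown.match(/^---\r?\n([\s\S]*?)\r?\n---/);
  if (!frontmatter) return [];
  const entry = frontmatter[1].match(/^allowed-tools:[ \t]*(.*)$/m);
  if (!entry) return [];
  return entry[1]
    .split(",")
    .map((tool) => tool.trim())
    .filter(Boolean);
}
```

Returning `{ text, toolCalls }` as one object keeps the judge-facing data together, so the caller can tell a structured result from a plain string. YAML list syntax and tool patterns that themselves contain commas are not handled by this sketch.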
Canhelp eval improvements:
- Use obscure canisters (Neutrinite) instead of well-known ones
  (ICP Ledger, NNS Governance) to prevent Claude from answering
  from training data instead of running the scripts
- Use a canister with wasm but no candid:service metadata (OpenChat
  SNS canister r2pvs-tyaaa-aaaar-ajcwq-cai) for the missing metadata
  eval instead of one with no wasm installed
- Fix local canister eval to match skill behavior (mainnet-only
  guidance) instead of expecting a fetch attempt
- Remove redundant Large interface summarization eval that duplicated
  the Lookup by name and Output format evals
+The assistant made the following tool calls during execution:
+${toolList}
+</tool_calls>
+
+<output>
+${output.text}
+</output>`;
+  } else {
+    outputSection = `<output>
+${isStructured ? output.text : output}
+</output>`;
+  }
+
   const judgePrompt = `You are an evaluation judge. A coding assistant was given this task:
 
 <task>
 ${evalCase.prompt}
 </task>
 
-The assistant produced this output:
-
-<output>
-${output}
-</output>
+${outputSection}
 
Score each expected behavior as PASS or FAIL. Be strict — the behavior must be clearly present, not just vaguely implied. Return ONLY a JSON array of objects with "behavior", "pass" (boolean), and "reason" (one sentence).
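
The diff above shows only the tail of the branch that builds `outputSection`, so the following is a hedged sketch of how the whole step might fit together, not the committed code. The function name `buildOutputSection`, the `toolCalls` field, the numbered list format, and the `isStructured` check are assumptions made for illustration. Only the template text and the `text` field come from the diff.

```javascript
// Hypothetical reconstruction: the opening of the first branch is not
// visible in the diff, so its condition and list format are assumed.
function buildOutputSection(output) {
  // Assumed: a structured result is an object with `text` and `toolCalls`;
  // a legacy result is a plain string.
  const isStructured = typeof output === "object" && output !== null;
  const toolCalls = isStructured ? output.toolCalls ?? [] : [];

  if (toolCalls.length > 0) {
    // One numbered line per tool call so the judge can see what was run.
    const toolList = toolCalls
      .map((call, i) => `${i + 1}. ${call.name}: ${JSON.stringify(call.input)}`)
      .join("\n");
    return `<tool_calls>
The assistant made the following tool calls during execution:
${toolList}
</tool_calls>

<output>
${output.text}
</output>`;
  }

  // No tool calls captured: fall back to the output-only section.
  return `<output>
${isStructured ? output.text : output}
</output>`;
}
```

For example, `buildOutputSection("hello")` yields an output-only section, while a structured result with at least one tool call also gets a `<tool_calls>` block ahead of `<output>`.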