vcoderun · fswair · Mar 19, 2026 · Mar 15, 2026 · Mar 15, 2026 · Mar 17, 2026
diff --git a/.env.sample b/.env.sample
@@ -11,4 +11,16 @@ LOGFIRE_ENABLED=false
 JUDGE_MODEL=openrouter:google/gemini-3-flash-preview
 
 # Default model used by Agents
-MODEL_NAME=openrouter:google/gemini-3-flash-preview
+MODEL_NAME=openrouter:google/gemini-3-flash-preview
+
+# Default spec & exploration models used by CodeMode pipeline
+# Spec agent generates tests
+# Exploration agent generates snippets to discover behaviors (code-execution)
+SPEC_MODEL=openrouter:anthropic/claude-opus-4.6
+EXPLORATION_MODEL=openrouter:anthropic/claude-sonnet-4.6
+
+# Default spec & exploration models used by CodeMode benchmark pipeline
+# NOTE: Models must be comma-separated, and the number of spec models must equal the number of exploration models
+# spec[i] will be mapped to exploration[i] (Case N)
+BENCHMARK_SPEC_MODELS=openrouter:anthropic/claude-opus-4.6
+BENCHMARK_EXPLORATION_MODELS=openrouter:anthropic/claude-sonnet-4.6
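Review note on the `.env.sample` hunk above: the pairing rule in the comment (spec[i] maps to exploration[i], equal lengths required) could be enforced roughly as follows. This is a minimal sketch only. `pair_benchmark_models` is a hypothetical helper name and is not taken from the vowel codebase; the actual parsing in the benchmark pipeline may differ.

```python
def pair_benchmark_models(spec_raw: str, exploration_raw: str) -> list[tuple[str, str]]:
    """Pair spec[i] with exploration[i] from two comma-separated strings.

    Hypothetical helper for illustration; not part of vowel's API.
    """
    # Split on commas and drop surrounding whitespace and empty entries.
    spec = [m.strip() for m in spec_raw.split(",") if m.strip()]
    exploration = [m.strip() for m in exploration_raw.split(",") if m.strip()]

    # The .env.sample comment requires both lists to have the same length.
    if len(spec) != len(exploration):
        raise ValueError(
            f"{len(spec)} spec models but {len(exploration)} exploration models"
        )

    # Each tuple corresponds to one benchmark case ("Case N" in the comment).
    return list(zip(spec, exploration))


# Example with the default values from the sample file.
cases = pair_benchmark_models(
    "openrouter:anthropic/claude-opus-4.6",
    "openrouter:anthropic/claude-sonnet-4.6",
)
```

Failing fast on a length mismatch avoids silently dropping a model, which is what a bare `zip` would do.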
diff --git a/.github/workflows/tests.yml b/.github/workflows/tests.yml
@@ -11,7 +11,7 @@ jobs:
     runs-on: ubuntu-latest
     strategy:
       matrix:
-        python-version: ["3.10", "3.11", "3.12", "3.13", "3.14"]
+        python-version: ["3.11", "3.12", "3.13", "3.14"]
 
     steps:
       - uses: actions/checkout@v4
@@ -30,7 +30,7 @@ jobs:
       - name: Install dependencies
         run: |
           source venv/bin/activate
-          uv pip install -e ".[dev]"
+          uv pip install -e ".[all]"
 
       - name: Run tests
         run: |

diff --git a/.gitignore b/.gitignore
@@ -69,3 +69,7 @@ evaluations/
 # !!
 TODO
 docs/FIXTURE_GENERATION_RFC.md
+
+# Benchmarks
+benchmark*
+important-links.md
diff --git a/.gitmodules b/.gitmodules
@@ -4,3 +4,6 @@
 [submodule "skills/vowel-core"]
 	path = skills/vowel-core
 	url = https://github.com/fswair/vowel-core.git
+[submodule "codemode-benchmark"]
+	path = codemode-benchmark
+	url = https://github.com/fswair/codemode-benchmark
diff --git a/AGENTS.md b/AGENTS.md
@@ -30,5 +30,14 @@ This document contains concise rules for how agents should inspect and use this
 - If you have questions or uncertainty, consult `README.md` and the relevant docs pages.
 - Check `TODO` for pending tasks or known issues.
 
+## Critical Thinking & Intellectual Honesty
+
+- **Never defer to the user's idea just because they said it.** Evaluate every proposal — yours or the user's — on its own merits: trade-offs, costs, complexity, correctness.
+- **If the user's idea has flaws, say so.** Explain why with concrete reasoning (performance, token cost, latency, maintainability, correctness risk). Do not soften criticism to be agreeable.
+- **If your own idea has flaws, admit it first.** Don't wait for the user to find the holes. Present disadvantages upfront.
+- **When comparing approaches, use structured analysis:** list pros/cons for each, identify the real trade-offs, and state which you'd pick and why — before asking for input.
+- **"You're right" must be earned.** If you catch yourself agreeing immediately, stop and ask: "Did I actually evaluate this, or am I just being agreeable?" If the latter, go back and do the analysis.
+- **The user is a collaborator, not an authority.** Good ideas win regardless of who proposed them. Bad ideas lose regardless of who proposed them.
+
 These rules help agents use the project consistently and safely.
 
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -0,0 +1,198 @@
+# CHANGELOG
+
+## codemode_driven_generation
+
+This document summarizes the main features added or improved on this branch.
+
+## 1) Executor and ExecutionSession protocols
+
+- The code execution interface was formalized using Protocols.
+- The Executor async/sync API was standardized:
+  - execute(...)
+  - execute_sync(...)
+  - create_session(...)
+- ExecutionSession now compiles/executes setup code once and supports multi-snippet feed execution.
+- This reduces repeated parse/compile overhead while exploring the same function.
+- The run_sync helper was hardened for running-loop environments via nest-asyncio.
+
+## 2) MontyExecutor, DefaultExecutor, MontySession, FallbackSession structures
+
+- MontyExecutor was added:
+  - sandboxed execution via pydantic-monty,
+  - ResourceLimits support (timeout/memory),
+  - stdout capture and normalized error typing/messages.
+- DefaultExecutor was added/improved:
+  - pure Python exec-based fallback execution,
+  - last-expression capture (__result__) and stdout capture.
+- MontyReplSession (MontySession role) was added:
+  - one-time setup load, reusable feed-run model.
+- FallbackSession was added:
+  - Session-level fallback: if Monty session initialization fails, switch entirely to DefaultSession.
+  - Snippet-level fallback: if Monty returns ModuleNotFoundError for a snippet, rerun that snippet via fallback executor.
+- Executor/fallback wiring was simplified through resolve_executors.
+
+## 3) Main implementation: CodeModeGenerator
+
+- Two-phase exploration-guided generation flow:
+  - Phase 1: behavior exploration (exploration snippets + error snippets)
+  - Phase 2: spec generation from verified observations
+- Lazy Agent architecture:
+  - explorer_agent (ExplorationPlan)
+  - spec_agent (EvalsSource or EvalsBundle)
+- Prompt layers were clearly separated:
+  - exploration prompt: coverage, diversity, duplicate prevention
+  - spec prompt: expected values from verified outputs only
+- A refinement loop was added:
+  - generate -> run -> failure_context -> regenerate
+- Optional duration injection and a final summary run were added at the end.
+
+## 4) Runtime hierarchy and utility usage
+
+CodeMode hierarchy:
+
+1. explore()
+2. generate_spec()
+3. validate_and_fix_spec()
+4. validate_expected_values()
+5. inject_missing_error_cases()
+6. inject_durations() (optional)
+7. validation/refinement with RunEvals
+
+Utilities used:
+
+- build_call_code
+- build_failure_context
+- validate_and_fix_spec
+- validate_expected_values
+- inject_missing_error_cases
+- inject_durations
+
+## 5) Cost Manager
+
+- Generation/run cost tracking was added for CodeMode.
+- Features:
+  - generation_id and run_id lifecycle management,
+  - step-level usage/cost recording,
+  - model price resolution (genai-prices or costs.yml),
+  - atomic/locked JSON persistence,
+  - generation-level and run-level totals,
+  - status tracking: running/completed/failed.
+- The CLI costs command now supports list/by-generation/by-run views.
+
+## 6) Serializer syntax and YAML-native serializer registry
+
+- Top-level serializers registry support was added at EvalsFile level.
+- Per-eval serializer references are now supported via serializer:.
+- SerializerSpec was clarified with one-of behavior:
+  - schema (string or dict)
+  - serializer (callable import path)
+  - not both at the same time.
+- Runtime resolver additions:
+  - import-path resolution,
+  - cached imports (_import_path_cached),
+  - per-eval resolution (_resolve_yaml_serializer_entry).
+- Precedence between programmatic serializer maps and YAML serializer registry was defined.
+
+## 7) Spec model / Exploration model separation
+
+- Model separation in CodeModeGenerator constructor was formalized:
+  - spec_model
+  - exploration_model
+- use_model_spec output mode was clarified:
+  - use_model_spec=True: structured output mode (schema/model output via EvalsBundle)
+  - use_model_spec=False: YAML string output mode (via EvalsSource.yaml_spec)
+- Keeping use_model_spec=False is highly recommended.
+- Model resolution order and env fallback logic were added.
+- Cost tracking now supports separate model usage across separate steps.
+
+## 8) Adding executor/fallback executor to utilities
+
+- Utility flows were updated to accept executor and fallback executor parameters.
+- Monty -> Default fallback behavior was generalized in execution-aware paths.
+- Executor behavior was centralized across run_evals and validation stages.
+
+## 9) YAML schema generator
+
+- Runtime-model-driven schema generation was improved:
+  - supports top-level fixtures + serializers,
+  - preserves function-level EvalsMapValue behavior.
+- Schema cache strategy was updated:
+  - content-hash-based filename (reduces stale editor cache issues).
+- File header updates are handled safely via materialize_yaml_with_schema_header.
+
+## 10) CLI commands: schema, costs
+
+- vowel schema <file>:
+  - update schema header after YAML + pydantic validation
+- vowel schema --create [path]:
+  - direct schema JSON generation
+- vowel costs:
+  - --list
+  - --by-generation
+  - --by-run
+  - --generation <id>
+  - --run <id>
+
+## 11) module.function -> function alias support
+
+- Alias support was added for programmatic mapping resolution:
+  - function map
+  - serializer schema map
+  - serializer function map
+- Behavior:
+  - exact match first,
+  - short-name fallback,
+  - explicit error for ambiguous reverse short-name mapping.
+
+## 12) Feedback-guided exploration
+
+- A targeted Round-2 exploration flow was added:
+  - build cluster summaries from Round-1 results,
+  - generate snippets focused on uncovered behavior classes.
+- Duplicate/semantic repetition minimization was reinforced at prompt level.
+- Distinct failure-mode coverage was improved for error snippets.
+- Additional rounds now measure value via new-behavior counting.
+
+## 13) Assertion + serializer integration
+
+- AssertionEvaluator input context is now serializer-aware.
+- Assertions now see serialized input for schema, serial_fn, and nested/dict schema modes.
+- This behavior is covered by regression tests.
+
+## 14) LLM Judge env-ref improvements
+
+- create_llm_judge now supports $ENV_VAR resolution for rubric/model fields.
+- Missing env refs now produce clearer errors.
+
+## 15) Examples, documentation, and test coverage
+
+- A runnable native serializer + fixture example was added.
+- README and serializer docs were updated with serializer/assertion context notes.
+- Meaningful id fields were added to eval cases under examples.
+- New/updated tests include:
+  - test_schema
+  - test_llm_judge_env_refs
+  - serializer assertion regressions
+  - YAML/native serializer parsing tests
+
+## 16) Fixture scope alias support
+
+- Fixture scopes now support clearer canonical names:
+  - case
+  - eval
+  - file
+- Backward-compatible aliases are still accepted:
+  - function (alias of case)
+  - module (alias of eval)
+  - session (alias of file)
+- At parse time, canonical names are normalized to legacy internal runtime values:
+  - case -> function
+  - eval -> module
+  - file -> session
+- This keeps existing runtime lifecycle behavior unchanged while allowing more descriptive scope names in YAML.
+
+Note: The old names will be deprecated after v1.0.0.
+
+## Note
+
+This changelog is based on features observed and validated in code on this branch, without using git history.
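Review note on section 16 of the CHANGELOG above: the parse-time normalization it describes (canonical `case`/`eval`/`file` mapped to the legacy `function`/`module`/`session`) might look like the sketch below. The function name `normalize_scope` and the error message are assumptions made for illustration; only the three mappings come from the changelog.

```python
# Canonical scope names and the legacy runtime values they normalize to,
# as listed in the changelog.
CANONICAL_TO_LEGACY = {"case": "function", "eval": "module", "file": "session"}
LEGACY_SCOPES = set(CANONICAL_TO_LEGACY.values())


def normalize_scope(scope: str) -> str:
    """Return the legacy runtime value for a fixture scope name.

    Hypothetical helper; vowel's real parser may be structured differently.
    """
    if scope in CANONICAL_TO_LEGACY:
        # New, more descriptive name: map it to the internal value.
        return CANONICAL_TO_LEGACY[scope]
    if scope in LEGACY_SCOPES:
        # Backward-compatible alias: already the internal value.
        return scope
    raise ValueError(f"unknown fixture scope: {scope!r}")
```

Normalizing once at parse time means the runtime lifecycle code only ever sees the three legacy values, which matches the changelog's claim that runtime behavior is unchanged.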
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -30,5 +30,14 @@ Claude-type agents working with this repository should follow these steps:
 - If you have questions or uncertainty, consult `README.md` and the relevant docs pages.
 - Check `TODO` for pending tasks or known issues.
 
+## Critical Thinking & Intellectual Honesty
+
+- **Never defer to the user's idea just because they said it.** Evaluate every proposal — yours or the user's — on its own merits: trade-offs, costs, complexity, correctness.
+- **If the user's idea has flaws, say so.** Explain why with concrete reasoning (performance, token cost, latency, maintainability, correctness risk). Do not soften criticism to be agreeable.
+- **If your own idea has flaws, admit it first.** Don't wait for the user to find the holes. Present disadvantages upfront.
+- **When comparing approaches, use structured analysis:** list pros/cons for each, identify the real trade-offs, and state which you'd pick and why — before asking for input.
+- **"You're right" must be earned.** If you catch yourself agreeing immediately, stop and ask: "Did I actually evaluate this, or am I just being agreeable?" If the latter, go back and do the analysis.
+- **The user is a collaborator, not an authority.** Good ideas win regardless of who proposed them. Bad ideas lose regardless of who proposed them.
+
 These guidelines are intended to help Claude agents use the repository consistently.
 
diff --git a/README.md b/README.md
@@ -46,7 +46,7 @@ pip install -e ".[all]"
 ## Quick Start
 
 > **Note:**  
-> For a deeper understanding of how vowel handles fixtures, see the examples in [`db_fixture.yml`](./db_fixture.yml) and [`db.py`](./db.py). These files demonstrate the underlying mechanics of fixture setup and usage.
+> For a deeper understanding of how vowel handles fixtures, see the examples in [`examples/db_fixtures`](./examples/db_fixtures/). These examples demonstrate the underlying mechanics of fixture setup and usage.
 
 > **Tip:**  
 > To enable YAML schema validation in your editor, place `vowel-schema.json` in your project directory.  
@@ -122,6 +122,8 @@ summary = (
 summary.print()
 ```
 
+> **Name matching note:** If your YAML uses `module.function`, programmatic mappings can use either the exact key (`module.function`) or the short function name (`function`) in `.with_functions(...)`.
+
 ---
 
 ## Features
@@ -181,6 +183,29 @@ def query_user(user_id: int, *, db: dict) -> dict | None:
     return db["users"].get(user_id)
 ```
 
+Fixture scope aliases:
+- Preferred scope names: `case`, `eval`, `file`
+- Backward-compatible aliases: `function`, `module`, `session`
+- Normalization mapping: `case -> function`, `eval -> module`, `file -> session`
+
+Example:
+
+```yaml
+fixtures:
+  temp_data:
+    setup: myapp.make_temp_data
+    scope: case
+
+  db:
+    setup: myapp.setup_db
+    teardown: myapp.close_db
+    scope: eval
+
+  cache:
+    setup: myapp.setup_cache
+    scope: file
+```
+
 > **Full reference:** [docs/FIXTURES.md](https://github.com/fswair/vowel/blob/main/docs/FIXTURES.md)
 
 ### Input Serializers
@@ -196,6 +221,26 @@ summary = (
 )
 ```
 
+> **Serializer key matching:** Serializer mappings follow the same rule as `.with_functions(...)` — both `module.function` and short `function` keys are accepted.
+
+> **Assertion context and serializers:** When a serializer is configured, assertion evaluators use the serialized value for `input` (not raw YAML). This applies to schema mode, `serial_fn`, and nested/dict schemas.
+
+Runnable example (YAML-native serializers + fixtures):
+
+```bash
+vowel examples/serializers/db_query_evals.yml
+```
+
+This example demonstrates:
+- top-level `serializers:` registry with both `schema` and `serializer` entries,
+- per-eval `serializer:` references,
+- fixture class lifecycle wiring with `cls` + `teardown`,
+- assertion checks that read serialized `input` values.
+
+See:
+- `examples/serializers/db_query_evals.yml`
+- `examples/serializers/util.py`
+
 > **Full reference:** [docs/SERIALIZERS.md](https://github.com/fswair/vowel/blob/main/docs/SERIALIZERS.md)
 
 ### AI-Powered Generation
@@ -259,6 +304,9 @@ vowel evals.yml --dry-run                # Show plan without running
 vowel evals.yml --export-json out.json   # Export results
 vowel evals.yml -v                       # Verbose summary
 vowel evals.yml -v --hide-report         # Verbose, hide pydantic_evals report
+vowel schema examples/serializers/db_query_evals.yml   # Validate + update schema header
+vowel schema --create                                   # Generate vowel-schema.json
+vowel costs --list                                      # List tracked generation/run costs
 ```
 
 > **Full reference:** [docs/CLI.md](https://github.com/fswair/vowel/blob/main/docs/CLI.md)
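Review note on the README name-matching notes in this diff: the rule "exact match first, short-name fallback, explicit error for ambiguous reverse short-name mapping" can be sketched as below. `resolve_function_key` is a hypothetical name, and the reading of "reverse" mapping (a short YAML name matched against dotted keys) is an assumption, not something confirmed by the diff.

```python
def resolve_function_key(name: str, mapping: dict[str, object]) -> str:
    """Find the key in `mapping` that corresponds to a YAML function name.

    Illustrative sketch only; not vowel's actual resolver.
    """
    # 1) Exact match always wins.
    if name in mapping:
        return name

    # 2) Short-name fallback: YAML says `module.function`,
    #    the mapping was registered under `function`.
    short = name.rsplit(".", 1)[-1]
    if short != name and short in mapping:
        return short

    # 3) Reverse case (assumed): YAML says `function`,
    #    the mapping holds one or more dotted keys ending in that name.
    candidates = [key for key in mapping if key.rsplit(".", 1)[-1] == name]
    if len(candidates) == 1:
        return candidates[0]
    if len(candidates) > 1:
        raise ValueError(f"ambiguous short name {name!r}: {sorted(candidates)}")

    raise KeyError(name)
```

Raising on ambiguity instead of picking the first candidate keeps two same-named functions in different modules from being swapped silently.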

diff --git a/VERSION b/VERSION
@@ -1 +1 @@
-0.3.5
+0.4.0