mverab
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
Lines changed: 26 additions & 0 deletions
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 29 additions & 0 deletions
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
Lines changed: 13 additions & 9 deletions
diff --git a/README.md b/README.md
Lines changed: 20 additions & 9 deletions
diff --git a/cases/CORPUS.md b/cases/CORPUS.md
Lines changed: 43 additions & 0 deletions
diff --git a/cases/diagnose/diagnose-002/case.yaml b/cases/diagnose/diagnose-002/case.yaml
Lines changed: 18 additions & 0 deletions
diff --git a/cases/diagnose/diagnose-002/hints.yaml b/cases/diagnose/diagnose-002/hints.yaml
Lines changed: 13 additions & 0 deletions
diff --git a/cases/diagnose/diagnose-002/repo/README.md b/cases/diagnose/diagnose-002/repo/README.md
Lines changed: 13 additions & 0 deletions
diff --git a/cases/diagnose/diagnose-002/repo/requirements.txt b/cases/diagnose/diagnose-002/repo/requirements.txt
Lines changed: 1 addition & 0 deletions
diff --git a/cases/diagnose/diagnose-002/repo/src/cart.py b/cases/diagnose/diagnose-002/repo/src/cart.py
Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,26 @@
+name: CI
+
+on:
+  pull_request:
+    branches: [main]
+  push:
+    branches: [main]
+
+jobs:
+  test:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.10"
+
+      - name: Install dependencies
+        run: pip install -e .
+
+      - name: Run tests
+        run: python -m pytest tests/ -v
+
+      - name: Validate case packs
+        run: reposcale validate cases/*/*/
@@ -0,0 +1,29 @@
+# Changelog
+
+## v0.1.0-alpha — Alpha Release
+
+### Added
+- **CI Pipeline**: GitHub Actions workflow running tests + case validation on pull requests and pushes to `main`
+- **Batch command**: `reposcale batch` discovers and runs all valid case packs
+- **Judge stability**: `--repeat N` on score command; mean, stddev, and unstable dimension detection
+- **Expanded corpus**: 12 cases across 3 tracks (diagnose, intent, plan) and 3 difficulty levels
+- **Corpus manifest**: `cases/CORPUS.md` with inventory, coverage matrix, and case type reference
+- **Case authoring guide**: `docs/case-authoring.md` with templates and difficulty calibration
+- **Multi-run comparison**: `reposcale compare` command with side-by-side tables
+- **Baseline script**: `scripts/run-baselines.sh` for GPT-4o and Claude baselines
+- **Schema extensions**: evaluation schema now supports stability metadata, hallucinations, strengths, weaknesses
+
+### Changed
+- `case_type` enum extended with 6 new types for expanded corpus
+- README updated to reflect alpha status and new commands
+- CONTRIBUTING.md updated with case authoring guide reference
+
+## v0.0.1 — MVP Pipeline
+
+### Added
+- Core package: validate, run, score, summary pipeline
+- CLI: `reposcale validate`, `run`, `score`, `summary`
+- 3 seed cases (diagnose-001, intent-001, plan-001)
+- 3-layer scoring: structural, heuristic, LLM judge
+- JSON schemas for case, response, and evaluation
+- 31 tests
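The judge-stability entry in the changelog above mentions a mean, a standard deviation, and unstable dimension detection for `--repeat N`. The following is a minimal sketch of that kind of aggregation, not the RepoScale implementation: the function name `summarize_stability`, the dimension names, the one-dict-per-run input shape, and the 0.5 threshold are all assumptions made for illustration.

```python
from statistics import mean, stdev


def summarize_stability(runs, threshold=0.5):
    """Aggregate repeated judge scores per dimension.

    runs: list of dicts mapping dimension name -> numeric score, one dict per judge run.
    threshold: hypothetical cutoff; a dimension is flagged when its sample stddev exceeds it.
    """
    dimensions = sorted({dim for run in runs for dim in run})
    summary = {}
    for dim in dimensions:
        scores = [run[dim] for run in runs if dim in run]
        # Sample standard deviation needs at least two scores.
        spread = stdev(scores) if len(scores) > 1 else 0.0
        summary[dim] = {
            "mean": mean(scores),
            "stddev": spread,
            "unstable": spread > threshold,
        }
    return summary


# Three hypothetical judge runs over two hypothetical dimensions.
runs = [
    {"accuracy": 4, "completeness": 3},
    {"accuracy": 4, "completeness": 5},
    {"accuracy": 4, "completeness": 2},
]
report = summarize_stability(runs)
unstable = [dim for dim, stats in report.items() if stats["unstable"]]
print(unstable)
```

Here `accuracy` is identical across runs and stays stable, while `completeness` varies enough to be flagged. A real threshold would need calibrating against the scoring scale in use.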
@@ -5,7 +5,7 @@ Thank you for your interest in contributing to RepoScale.
 ## Ways to contribute
 
 ### Case curation
-The most impactful contribution right now is **curating benchmark cases** — real repositories that test specific aspects of repo continuation intelligence. See `docs/dataset_format.md` for the case pack format and `cases/example/` for a reference.
+The most impactful contribution right now is **curating benchmark cases** — real repositories that test specific aspects of repo continuation intelligence. See `docs/case-authoring.md` for the complete guide, `docs/dataset_format.md` for the case pack format, and `cases/CORPUS.md` for the current inventory.
 
 ### Rubric and scoring
 Help define and refine evaluation criteria. See `docs/scoring.md` for the current scoring model.
@@ -14,7 +14,7 @@ Help define and refine evaluation criteria. See `docs/scoring.md` for the curren
 Improve task prompts in `prompts/` or the LLM judge protocol in `prompts/judge.md`.
 
 ### Tooling
-Improve `scripts/validate_case_pack.py` and `scripts/run_eval.py`, or build new runners and scorers.
+Improve the CLI (`src/reposcale/cli.py`), scoring layers, or build new runners and scorers.
 
 ## Development setup
 
@@ -24,10 +24,13 @@ git clone https://github.com/YOUR_ORG/reposcale.git
 cd reposcale
 
 # Install in development mode
-pip install -e ".[dev]"
+pip install -e .
 
-# Validate an example case
-python scripts/validate_case_pack.py cases/example/
+# Run tests
+python -m pytest tests/ -v
+
+# Validate case packs (accepts one or more case directories)
+reposcale validate cases/diagnose/diagnose-001/ cases/intent/intent-001/
 ```
 
 ## Conventions
@@ -39,10 +42,11 @@ python scripts/validate_case_pack.py cases/example/
 ## Submitting a case
 
 1. Fork the repo
-2. Create a new directory under `cases/<track>/` (e.g., `cases/diagnose/my-case/`)
-3. Include all required fields from the case schema
-4. Run `python scripts/validate_case_pack.py cases/<track>/my-case/`
-5. Submit a PR with a brief description of what the case tests
+2. Read `docs/case-authoring.md` for the complete guide
+3. Create a new directory under `cases/<track>/` (e.g., `cases/diagnose/diagnose-005/`)
+4. Include `case.yaml`, `hints.yaml`, `tree.txt`, and `repo/` with real code
+5. Run `reposcale validate` on the new directory (e.g., `reposcale validate cases/diagnose/diagnose-005/`)
+6. Submit a PR with a brief description of what the case tests
 
 ## Code of conduct
 
 
@@ -88,37 +88,48 @@ results/        evaluation outputs (gitignored)
 
 ## Project status
 
-RepoScale is in **Phase 1 — MVP pipeline**.
+RepoScale is in **Alpha** — core pipeline complete, corpus of 12 cases ready for baseline evaluation.
 
 Current goals:
 - [x] Define the capability model
 - [x] Publish the task taxonomy
 - [x] Define base schemas (case, response, evaluation)
 - [x] Ship validation, runner, scoring, and summary pipeline
 - [x] Seed cases for Diagnose, Intent, and Plan tracks
-- [ ] Release the first 10–15 curated cases
-- [ ] Establish baseline results across models
+- [x] Expand corpus to 12 cases across 3 tracks and 3 difficulty levels
+- [x] CI pipeline with GitHub Actions
+- [x] Judge stability measurement with repeat scoring
+- [x] Batch command for running all cases
+- [x] Multi-run comparison command
+- [ ] Establish baseline results across models (GPT-4o, Claude)
+- [ ] Add Extend, Implement, and Agent tracks
 
 ## Quickstart
 
 ```bash
 # Install
 pip install -e .
 
-# Validate case packs
-reposcale validate cases/diagnose/diagnose-001/
+# Validate case packs (accepts one or more case directories)
+reposcale validate cases/diagnose/diagnose-001/ cases/intent/intent-001/
 
-# Run evaluation (dry-run — prints the assembled prompt)
+# Run a single case (dry-run — prints the assembled prompt)
 reposcale run cases/diagnose/diagnose-001/ --model gpt-4o --dry-run
 
-# Run evaluation (requires LLM API key, e.g. OPENAI_API_KEY)
-reposcale run cases/diagnose/diagnose-001/ --model gpt-4o
+# Run all cases in batch (requires LLM API key, e.g. OPENAI_API_KEY)
+reposcale batch cases/ --model gpt-4o
 
 # Score responses (structural + heuristic; add --judge-model for LLM judge)
-reposcale score results/<run-id>/
+reposcale score results/<run-id>/ --judge-model gpt-4o
+
+# Score with stability measurement (runs judge 3 times)
+reposcale score results/<run-id>/ --judge-model gpt-4o --repeat 3
 
 # View summary
 reposcale summary results/<run-id>/
+
+# Compare multiple runs
+reposcale compare results/<run-a>/ results/<run-b>/
 ```
 
 ## Contributing
 
@@ -0,0 +1,43 @@
+# RepoScale Corpus v1 — 12 Cases
+
+## Inventory
+
+| ID | Track | Type | Difficulty | Description |
+|----|-------|------|-----------|-------------|
+| diagnose-001 | diagnose | mvp_incomplete | easy | Incomplete task CLI with unwired auth middleware |
+| diagnose-002 | diagnose | mvp_incomplete | medium | E-commerce cart with missing integration tests |
+| diagnose-003 | diagnose | single_file_buggy | easy | CSV processor with off-by-one and encoding bugs |
+| diagnose-004 | diagnose | multi_module_tangled | hard | Job system with conflicting error handling patterns |
+| intent-001 | intent | divergent_beta | medium | Analytics dashboard pivoted from real-time to batch |
+| intent-002 | intent | divergent_beta | medium | REST API pivoting from session-based to JWT auth |
+| intent-003 | intent | abandoned_rewrite | hard | CMS with abandoned v1→v2 plugin rewrite |
+| intent-004 | intent | readme_code_divergence | easy | Weather CLI where README overpromises features |
+| plan-001 | plan | functional_not_scalable | medium | URL shortener that works but won't scale |
+| plan-002 | plan | functional_not_scalable | medium | Monolith needing service extraction for scaling |
+| plan-003 | plan | multi_language_unscalable | hard | Multi-language platform with manual deploy, no DevOps |
+| plan-004 | plan | needs_ops_layer | easy | Clean Flask app needing CI/CD and monitoring |
+
+## Coverage
+
+| Track | Easy | Medium | Hard | Total |
+|-------|------|--------|------|-------|
+| diagnose | 2 | 1 | 1 | 4 |
+| intent | 1 | 2 | 1 | 4 |
+| plan | 1 | 2 | 1 | 4 |
+| **Total** | **4** | **5** | **3** | **12** |
+
+## Case types (9 unique)
+
+- `mvp_incomplete` — Project with missing features or unwired components
+- `single_file_buggy` — Single-file script with obvious bugs
+- `multi_module_tangled` — Multi-module project with inconsistent patterns
+- `divergent_beta` — Project that pivoted direction mid-development
+- `abandoned_rewrite` — Codebase with abandoned rewrite attempt
+- `readme_code_divergence` — README describes features that don't exist
+- `functional_not_scalable` — Works but has fundamental scaling limitations
+- `multi_language_unscalable` — Multi-language project lacking operational infrastructure
+- `needs_ops_layer` — Good code quality but missing operational tooling
+
+## Validation
+
+All 12 cases pass `reposcale validate` with zero errors.
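The coverage matrix in `cases/CORPUS.md` can be re-derived from the inventory rows, which is a cheap way to keep the two tables in sync. A minimal sketch, assuming the inventory is transcribed by hand into a list of tuples (the `INVENTORY` name and tuple layout are illustrative, not part of RepoScale):

```python
from collections import Counter

# (id, track, case_type, difficulty), transcribed from the inventory table.
INVENTORY = [
    ("diagnose-001", "diagnose", "mvp_incomplete", "easy"),
    ("diagnose-002", "diagnose", "mvp_incomplete", "medium"),
    ("diagnose-003", "diagnose", "single_file_buggy", "easy"),
    ("diagnose-004", "diagnose", "multi_module_tangled", "hard"),
    ("intent-001", "intent", "divergent_beta", "medium"),
    ("intent-002", "intent", "divergent_beta", "medium"),
    ("intent-003", "intent", "abandoned_rewrite", "hard"),
    ("intent-004", "intent", "readme_code_divergence", "easy"),
    ("plan-001", "plan", "functional_not_scalable", "medium"),
    ("plan-002", "plan", "functional_not_scalable", "medium"),
    ("plan-003", "plan", "multi_language_unscalable", "hard"),
    ("plan-004", "plan", "needs_ops_layer", "easy"),
]

LEVELS = ("easy", "medium", "hard")

# Tally cases per (track, difficulty) cell and per difficulty column.
coverage = Counter((track, difficulty) for _, track, _, difficulty in INVENTORY)
by_difficulty = Counter(difficulty for *_, difficulty in INVENTORY)
case_types = {case_type for _, _, case_type, _ in INVENTORY}

for track in ("diagnose", "intent", "plan"):
    row = [coverage[(track, level)] for level in LEVELS]
    print(track, row, sum(row))
print("total", [by_difficulty[level] for level in LEVELS], len(INVENTORY))
print("distinct case types:", len(case_types))
```

The same tally also counts the distinct `case_type` values, so it can double as a check on the case-type list.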
@@ -0,0 +1,18 @@
+id: diagnose-002
+track: diagnose
+title: E-commerce cart service with missing integration tests
+description: >
+  A multi-module e-commerce backend with cart, inventory, and payment modules.
+  Unit tests exist per module but no integration tests verify cross-module
+  interactions. Payment webhook handler silently drops errors.
+case_type: mvp_incomplete
+difficulty: medium
+repo_source: snapshot
+supported_modes:
+  - prompt_only
+expected_sections:
+  - project_summary
+  - what_works
+  - what_is_broken
+  - recommendations
+version: "1.0"
@@ -0,0 +1,13 @@
+known_gaps:
+  - No integration tests for cart-to-payment flow
+  - Payment webhook handler catches and discards all exceptions
+  - Inventory decrement is not atomic (race condition under concurrent orders)
+  - No retry logic for failed payment callbacks
+original_intent: >
+  Full e-commerce cart service handling cart management, inventory checks,
+  and payment processing with Stripe-like webhook integration.
+key_files:
+  - src/cart.py
+  - src/payment.py
+  - src/inventory.py
+  - tests/test_cart.py
@@ -0,0 +1,13 @@
+# E-Commerce Cart Service
+
+Simple cart + inventory + payment backend.
+
+## Modules
+- `src/cart.py` — cart management
+- `src/inventory.py` — stock tracking
+- `src/payment.py` — payment processing + webhooks
+
+## Running tests
+```
+pytest tests/
+```
@@ -0,0 +1 @@
+pytest==7.4.0
@@ -0,0 +1,40 @@
+"""Shopping cart management."""
+
+from dataclasses import dataclass, field
+
+
+@dataclass
+class CartItem:
+    product_id: str
+    quantity: int
+    unit_price: float
+
+
+@dataclass
+class Cart:
+    user_id: str
+    items: list[CartItem] = field(default_factory=list)
+
+    def add_item(self, product_id: str, quantity: int, unit_price: float):
+        for item in self.items:
+            if item.product_id == product_id:
+                item.quantity += quantity
+                return
+        self.items.append(CartItem(product_id, quantity, unit_price))
+
+    def remove_item(self, product_id: str):
+        self.items = [i for i in self.items if i.product_id != product_id]
+
+    def total(self) -> float:
+        return sum(i.quantity * i.unit_price for i in self.items)
+
+    def checkout(self, inventory, payment_client):
+        """Validate inventory and initiate payment."""
+        for item in self.items:
+            if not inventory.check_stock(item.product_id, item.quantity):
+                raise ValueError(f"Insufficient stock for {item.product_id}")
+
+        for item in self.items:
+            inventory.decrement(item.product_id, item.quantity)
+
+        return payment_client.create_charge(self.user_id, self.total())
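`hints.yaml` lists the absence of integration tests for the cart-to-payment flow as a known gap. The sketch below shows what one such test could exercise. It is an illustration under assumptions: `FakeInventory` and `FailingPaymentClient` are hypothetical stand-ins (the real `src/inventory.py` and `src/payment.py` are not shown here), and the `Cart` class is a trimmed copy of the one in `src/cart.py` so the block runs on its own.

```python
from dataclasses import dataclass, field


@dataclass
class CartItem:
    product_id: str
    quantity: int
    unit_price: float


@dataclass
class Cart:
    # Trimmed copy of the Cart in src/cart.py, kept only to make the sketch runnable.
    user_id: str
    items: list[CartItem] = field(default_factory=list)

    def add_item(self, product_id: str, quantity: int, unit_price: float):
        self.items.append(CartItem(product_id, quantity, unit_price))

    def total(self) -> float:
        return sum(i.quantity * i.unit_price for i in self.items)

    def checkout(self, inventory, payment_client):
        for item in self.items:
            if not inventory.check_stock(item.product_id, item.quantity):
                raise ValueError(f"Insufficient stock for {item.product_id}")
        for item in self.items:
            inventory.decrement(item.product_id, item.quantity)
        return payment_client.create_charge(self.user_id, self.total())


class FakeInventory:
    """Hypothetical in-memory inventory exposing the two methods checkout() calls."""

    def __init__(self, stock):
        self.stock = dict(stock)

    def check_stock(self, product_id, quantity):
        return self.stock.get(product_id, 0) >= quantity

    def decrement(self, product_id, quantity):
        self.stock[product_id] -= quantity


class FailingPaymentClient:
    """Hypothetical payment client whose charge always fails."""

    def create_charge(self, user_id, amount):
        raise RuntimeError("payment declined")


def stock_after_failed_payment():
    inventory = FakeInventory({"sku-1": 5})
    cart = Cart(user_id="u1")
    cart.add_item("sku-1", 2, 10.0)
    try:
        cart.checkout(inventory, FailingPaymentClient())
    except RuntimeError:
        pass  # The charge failed; inspect what checkout() left behind.
    return inventory.stock["sku-1"]


print(stock_after_failed_payment())
```

Because `checkout()` decrements stock before calling `create_charge`, a failed charge leaves the inventory reduced. A cross-module test of this shape would surface that interaction, which per-module unit tests cannot.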