Commit 4535b84

Merge pull request #2 from mverab/feat/ship-alpha-release
Feat: Ship Alpha Release — v0.1.0-alpha
2 parents 55af12f + cab9c1f commit 4535b84

107 files changed

Lines changed: 2289 additions & 24 deletions

.github/workflows/ci.yml

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
+name: CI
+
+on:
+  pull_request:
+    branches: [main]
+  push:
+    branches: [main]
+
+jobs:
+  test:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.10"
+
+      - name: Install dependencies
+        run: pip install -e .
+
+      - name: Run tests
+        run: python -m pytest tests/ -v
+
+      - name: Validate case packs
+        run: reposcale validate cases/*/

CHANGELOG.md

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
+# Changelog
+
+## v0.1.0-alpha — Alpha Release
+
+### Added
+- **CI Pipeline**: GitHub Actions workflow running tests + case validation on PRs
+- **Batch command**: `reposcale batch` discovers and runs all valid case packs
+- **Judge stability**: `--repeat N` on score command; mean, stddev, and unstable dimension detection
+- **Expanded corpus**: 12 cases across 3 tracks (diagnose, intent, plan) and 3 difficulty levels
+- **Corpus manifest**: `cases/CORPUS.md` with inventory, coverage matrix, and case type reference
+- **Case authoring guide**: `docs/case-authoring.md` with templates and difficulty calibration
+- **Multi-run comparison**: `reposcale compare` command with side-by-side tables
+- **Baseline script**: `scripts/run-baselines.sh` for GPT-4o and Claude baselines
+- **Schema extensions**: evaluation schema now supports stability metadata, hallucinations, strengths, weaknesses
+
+### Changed
+- `case_type` enum extended with 6 new types for expanded corpus
+- README updated to reflect alpha status and new commands
+- CONTRIBUTING.md updated with case authoring guide reference
+
+## v0.0.1 — MVP Pipeline
+
+### Added
+- Core package: validate, run, score, summary pipeline
+- CLI: `reposcale validate`, `run`, `score`, `summary`
+- 3 seed cases (diagnose-001, intent-001, plan-001)
+- 3-layer scoring: structural, heuristic, LLM judge
+- JSON schemas for case, response, and evaluation
+- 31 tests
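
Note on the **Judge stability** entry: the changelog mentions mean, stddev, and unstable dimension detection for `--repeat N`. The following is a minimal sketch of how such statistics could be computed, not RepoScale's actual implementation. The function name `stability_report`, the `{dimension: score}` input shape, the use of population standard deviation, and the 0.5 threshold are all assumptions made for illustration.

```python
from statistics import mean, pstdev

# Hypothetical threshold: a dimension is flagged as unstable when the
# standard deviation of its repeated judge scores exceeds this value.
UNSTABLE_STDDEV = 0.5


def stability_report(runs: list[dict[str, float]],
                     threshold: float = UNSTABLE_STDDEV) -> dict[str, dict]:
    """Aggregate repeated judge scores per dimension.

    `runs` holds one {dimension: score} mapping per judge repetition.
    """
    dimensions = sorted({name for run in runs for name in run})
    report = {}
    for name in dimensions:
        scores = [run[name] for run in runs if name in run]
        # A single score has no spread, so treat it as perfectly stable.
        spread = pstdev(scores) if len(scores) > 1 else 0.0
        report[name] = {
            "mean": mean(scores),
            "stddev": spread,
            "unstable": spread > threshold,
        }
    return report


if __name__ == "__main__":
    # Three made-up judge repetitions, as if scored with --repeat 3.
    example_runs = [
        {"accuracy": 4.0, "completeness": 3.0},
        {"accuracy": 4.0, "completeness": 5.0},
        {"accuracy": 4.0, "completeness": 1.0},
    ]
    for name, stats in stability_report(example_runs).items():
        print(name, stats)
```

In this made-up data the judge agrees with itself on `accuracy` but not on `completeness`, so only the latter would be flagged. A real implementation might prefer sample standard deviation or a relative threshold; the commit page does not say which.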

CONTRIBUTING.md

Lines changed: 13 additions & 9 deletions
@@ -5,7 +5,7 @@ Thank you for your interest in contributing to RepoScale.
 ## Ways to contribute
 
 ### Case curation
-The most impactful contribution right now is **curating benchmark cases** — real repositories that test specific aspects of repo continuation intelligence. See `docs/dataset_format.md` for the case pack format and `cases/example/` for a reference.
+The most impactful contribution right now is **curating benchmark cases** — real repositories that test specific aspects of repo continuation intelligence. See `docs/case-authoring.md` for the complete guide, `docs/dataset_format.md` for the case pack format, and `cases/CORPUS.md` for the current inventory.
 
 ### Rubric and scoring
 Help define and refine evaluation criteria. See `docs/scoring.md` for the current scoring model.
@@ -14,7 +14,7 @@ Help define and refine evaluation criteria. See `docs/scoring.md` for the curren
 Improve task prompts in `prompts/` or the LLM judge protocol in `prompts/judge.md`.
 
 ### Tooling
-Improve `scripts/validate_case_pack.py` and `scripts/run_eval.py`, or build new runners and scorers.
+Improve the CLI (`src/reposcale/cli.py`), scoring layers, or build new runners and scorers.
 
 ## Development setup
 
@@ -24,10 +24,13 @@ git clone https://github.com/YOUR_ORG/reposcale.git
 cd reposcale
 
 # Install in development mode
-pip install -e ".[dev]"
+pip install -e .
 
-# Validate an example case
-python scripts/validate_case_pack.py cases/example/
+# Run tests
+python -m pytest tests/ -v
+
+# Validate all case packs
+reposcale validate cases/diagnose/diagnose-001/ cases/intent/intent-001/
 ```
 
 ## Conventions
@@ -39,10 +42,11 @@ python scripts/validate_case_pack.py cases/example/
 ## Submitting a case
 
 1. Fork the repo
-2. Create a new directory under `cases/<track>/` (e.g., `cases/diagnose/my-case/`)
-3. Include all required fields from the case schema
-4. Run `python scripts/validate_case_pack.py cases/<track>/my-case/`
-5. Submit a PR with a brief description of what the case tests
+2. Read `docs/case-authoring.md` for the complete guide
+3. Create a new directory under `cases/<track>/` (e.g., `cases/diagnose/diagnose-005/`)
+4. Include `case.yaml`, `hints.yaml`, `tree.txt`, and `repo/` with real code
+5. Run `reposcale validate cases/<track>/diagnose-005/`
+6. Submit a PR with a brief description of what the case tests
 
 ## Code of conduct
 
README.md

Lines changed: 20 additions & 9 deletions
@@ -88,37 +88,48 @@ results/ evaluation outputs (gitignored)
 
 ## Project status
 
-RepoScale is in **Phase 1 — MVP pipeline**.
+RepoScale is in **Alpha** — core pipeline complete, corpus of 12 cases ready for baseline evaluation.
 
 Current goals:
 - [x] Define the capability model
 - [x] Publish the task taxonomy
 - [x] Define base schemas (case, response, evaluation)
 - [x] Ship validation, runner, scoring, and summary pipeline
 - [x] Seed cases for Diagnose, Intent, and Plan tracks
-- [ ] Release the first 10–15 curated cases
-- [ ] Establish baseline results across models
+- [x] Expand corpus to 12 cases across 3 tracks and 3 difficulty levels
+- [x] CI pipeline with GitHub Actions
+- [x] Judge stability measurement with repeat scoring
+- [x] Batch command for running all cases
+- [x] Multi-run comparison command
+- [ ] Establish baseline results across models (GPT-4o, Claude)
+- [ ] Add Extend, Implement, and Agent tracks
 
 ## Quickstart
 
 ```bash
 # Install
 pip install -e .
 
-# Validate case packs
-reposcale validate cases/diagnose/diagnose-001/
+# Validate all case packs
+reposcale validate cases/diagnose/diagnose-001/ cases/intent/intent-001/
 
-# Run evaluation (dry-run — prints the assembled prompt)
+# Run a single case (dry-run — prints the assembled prompt)
 reposcale run cases/diagnose/diagnose-001/ --model gpt-4o --dry-run
 
-# Run evaluation (requires LLM API key, e.g. OPENAI_API_KEY)
-reposcale run cases/diagnose/diagnose-001/ --model gpt-4o
+# Run all cases in batch (requires LLM API key, e.g. OPENAI_API_KEY)
+reposcale batch cases/ --model gpt-4o
 
 # Score responses (structural + heuristic; add --judge-model for LLM judge)
-reposcale score results/<run-id>/
+reposcale score results/<run-id>/ --judge-model gpt-4o
+
+# Score with stability measurement (runs judge 3 times)
+reposcale score results/<run-id>/ --judge-model gpt-4o --repeat 3
 
 # View summary
 reposcale summary results/<run-id>/
+
+# Compare multiple runs
+reposcale compare results/<run-a>/ results/<run-b>/
 ```
 
 ## Contributing

cases/CORPUS.md

Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
+# RepoScale Corpus v1 — 12 Cases
+
+## Inventory
+
+| ID | Track | Type | Difficulty | Description |
+|----|-------|------|-----------|-------------|
+| diagnose-001 | diagnose | mvp_incomplete | easy | Incomplete task CLI with unwired auth middleware |
+| diagnose-002 | diagnose | mvp_incomplete | medium | E-commerce cart with missing integration tests |
+| diagnose-003 | diagnose | single_file_buggy | easy | CSV processor with off-by-one and encoding bugs |
+| diagnose-004 | diagnose | multi_module_tangled | hard | Job system with conflicting error handling patterns |
+| intent-001 | intent | divergent_beta | medium | Analytics dashboard pivoted from real-time to batch |
+| intent-002 | intent | divergent_beta | medium | REST API pivoting from session-based to JWT auth |
+| intent-003 | intent | abandoned_rewrite | hard | CMS with abandoned v1→v2 plugin rewrite |
+| intent-004 | intent | readme_code_divergence | easy | Weather CLI where README overpromises features |
+| plan-001 | plan | functional_not_scalable | medium | URL shortener that works but won't scale |
+| plan-002 | plan | functional_not_scalable | medium | Monolith needing service extraction for scaling |
+| plan-003 | plan | multi_language_unscalable | hard | Multi-language platform with manual deploy, no DevOps |
+| plan-004 | plan | needs_ops_layer | easy | Clean Flask app needing CI/CD and monitoring |
+
+## Coverage
+
+| Track | Easy | Medium | Hard | Total |
+|-------|------|--------|------|-------|
+| diagnose | 2 | 1 | 1 | 4 |
+| intent | 1 | 2 | 1 | 4 |
+| plan | 1 | 2 | 1 | 4 |
+| **Total** | **4** | **5** | **3** | **12** |
+
+## Case types (9 unique)
+
+- `mvp_incomplete` — Project with missing features or unwired components
+- `single_file_buggy` — Single-file script with obvious bugs
+- `multi_module_tangled` — Multi-module project with inconsistent patterns
+- `divergent_beta` — Project that pivoted direction mid-development
+- `abandoned_rewrite` — Codebase with abandoned rewrite attempt
+- `readme_code_divergence` — README describes features that don't exist
+- `functional_not_scalable` — Works but has fundamental scaling limitations
+- `multi_language_unscalable` — Multi-language project lacking operational infrastructure
+- `needs_ops_layer` — Good code quality but missing operational tooling
+
+## Validation
+
+All 12 cases pass `reposcale validate` with zero errors.
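
Note on the coverage matrix: its counts can be rechecked against the inventory table. The sketch below is illustrative only. `INVENTORY` and `coverage` are hypothetical names, and the triples are transcribed by hand from the table rather than read from the case packs on disk.

```python
from collections import Counter

# (id, track, difficulty) triples transcribed from the inventory table.
INVENTORY = [
    ("diagnose-001", "diagnose", "easy"),
    ("diagnose-002", "diagnose", "medium"),
    ("diagnose-003", "diagnose", "easy"),
    ("diagnose-004", "diagnose", "hard"),
    ("intent-001", "intent", "medium"),
    ("intent-002", "intent", "medium"),
    ("intent-003", "intent", "hard"),
    ("intent-004", "intent", "easy"),
    ("plan-001", "plan", "medium"),
    ("plan-002", "plan", "medium"),
    ("plan-003", "plan", "hard"),
    ("plan-004", "plan", "easy"),
]

TRACKS = ("diagnose", "intent", "plan")
LEVELS = ("easy", "medium", "hard")


def coverage(inventory):
    """Count cases per (track, difficulty) pair."""
    return Counter((track, level) for _, track, level in inventory)


if __name__ == "__main__":
    counts = coverage(INVENTORY)
    for track in TRACKS:
        row = [counts[(track, level)] for level in LEVELS]
        print(track, row, sum(row))
    # Column totals across all tracks.
    print("total", [sum(counts[(t, level)] for t in TRACKS) for level in LEVELS])
```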
Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
+id: diagnose-002
+track: diagnose
+title: E-commerce cart service with missing integration tests
+description: >
+  A multi-module e-commerce backend with cart, inventory, and payment modules.
+  Unit tests exist per module but no integration tests verify cross-module
+  interactions. Payment webhook handler silently drops errors.
+case_type: mvp_incomplete
+difficulty: medium
+repo_source: snapshot
+supported_modes:
+  - prompt_only
+expected_sections:
+  - project_summary
+  - what_works
+  - what_is_broken
+  - recommendations
+version: "1.0"
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+known_gaps:
+  - No integration tests for cart-to-payment flow
+  - Payment webhook handler catches and discards all exceptions
+  - Inventory decrement is not atomic (race condition under concurrent orders)
+  - No retry logic for failed payment callbacks
+original_intent: >
+  Full e-commerce cart service handling cart management, inventory checks,
+  and payment processing with Stripe-like webhook integration.
+key_files:
+  - src/cart.py
+  - src/payment.py
+  - src/inventory.py
+  - tests/test_cart.py
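
Note on the "not atomic" gap listed above: a race like this typically arises when the stock check and the decrement are separate calls, so two concurrent orders can both pass the check before either decrements. The sketch below shows one common remedy, a single check-and-decrement under a lock. It is an assumption-laden illustration, not code from the case pack: `LockedInventory`, `reserve`, `available`, and `reserve_concurrently` are hypothetical names, and the case's actual `src/inventory.py` is not shown on this page.

```python
import threading


class LockedInventory:
    """Hypothetical in-memory inventory with an atomic check-and-decrement."""

    def __init__(self, stock: dict[str, int]):
        self._stock = dict(stock)
        self._lock = threading.Lock()

    def reserve(self, product_id: str, quantity: int) -> bool:
        """Decrement stock only if enough is available, all under one lock."""
        with self._lock:
            if self._stock.get(product_id, 0) < quantity:
                return False
            self._stock[product_id] -= quantity
            return True

    def available(self, product_id: str) -> int:
        with self._lock:
            return self._stock.get(product_id, 0)


def reserve_concurrently(inventory, product_id: str, attempts: int) -> int:
    """Try to reserve one unit from `attempts` threads; return the successes."""
    results = []

    def worker():
        results.append(inventory.reserve(product_id, 1))

    threads = [threading.Thread(target=worker) for _ in range(attempts)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results.count(True)


if __name__ == "__main__":
    inventory = LockedInventory({"sku-1": 5})
    succeeded = reserve_concurrently(inventory, "sku-1", attempts=20)
    print(succeeded, inventory.available("sku-1"))
```

Because the check and the decrement happen inside the same critical section, the number of successful reservations can never exceed the starting stock. A database-backed service would more likely use a conditional update or a transaction instead of an in-process lock.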
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+# E-Commerce Cart Service
+
+Simple cart + inventory + payment backend.
+
+## Modules
+- `src/cart.py` — cart management
+- `src/inventory.py` — stock tracking
+- `src/payment.py` — payment processing + webhooks
+
+## Running tests
+```
+pytest tests/
+```
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+pytest==7.4.0
Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
+"""Shopping cart management."""
+
+from dataclasses import dataclass, field
+
+
+@dataclass
+class CartItem:
+    product_id: str
+    quantity: int
+    unit_price: float
+
+
+@dataclass
+class Cart:
+    user_id: str
+    items: list[CartItem] = field(default_factory=list)
+
+    def add_item(self, product_id: str, quantity: int, unit_price: float):
+        for item in self.items:
+            if item.product_id == product_id:
+                item.quantity += quantity
+                return
+        self.items.append(CartItem(product_id, quantity, unit_price))
+
+    def remove_item(self, product_id: str):
+        self.items = [i for i in self.items if i.product_id != product_id]
+
+    def total(self) -> float:
+        return sum(i.quantity * i.unit_price for i in self.items)
+
+    def checkout(self, inventory, payment_client):
+        """Validate inventory and initiate payment."""
+        for item in self.items:
+            if not inventory.check_stock(item.product_id, item.quantity):
+                raise ValueError(f"Insufficient stock for {item.product_id}")
+
+        for item in self.items:
+            inventory.decrement(item.product_id, item.quantity)
+
+        return payment_client.create_charge(self.user_id, self.total())
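
Note on `checkout`: it decrements inventory before calling `create_charge`, so a failed charge leaves stock reduced with no rollback. The case metadata also lists missing cart-to-payment integration tests as a known gap. The sketch below shows what such a test might look like. It is illustrative only: `Cart` is a trimmed copy of the class from the diff so the example runs standalone, and `FakeInventory`, `RecordingPayment`, and `FailingPayment` are hypothetical stand-ins, since the real `inventory` and `payment` modules are not shown on this page.

```python
from dataclasses import dataclass, field


@dataclass
class CartItem:
    product_id: str
    quantity: int
    unit_price: float


@dataclass
class Cart:
    """Trimmed copy of the Cart from the diff, so the sketch runs standalone."""
    user_id: str
    items: list[CartItem] = field(default_factory=list)

    def add_item(self, product_id, quantity, unit_price):
        self.items.append(CartItem(product_id, quantity, unit_price))

    def total(self):
        return sum(i.quantity * i.unit_price for i in self.items)

    def checkout(self, inventory, payment_client):
        for item in self.items:
            if not inventory.check_stock(item.product_id, item.quantity):
                raise ValueError(f"Insufficient stock for {item.product_id}")
        for item in self.items:
            inventory.decrement(item.product_id, item.quantity)
        return payment_client.create_charge(self.user_id, self.total())


class FakeInventory:
    """Hypothetical stand-in exposing the two methods checkout calls."""

    def __init__(self, stock):
        self.stock = dict(stock)

    def check_stock(self, product_id, quantity):
        return self.stock.get(product_id, 0) >= quantity

    def decrement(self, product_id, quantity):
        self.stock[product_id] -= quantity


class RecordingPayment:
    """Hypothetical payment client that records each charge it receives."""

    def __init__(self):
        self.charges = []

    def create_charge(self, user_id, amount):
        self.charges.append((user_id, amount))
        return {"status": "ok", "amount": amount}


class FailingPayment:
    """Hypothetical payment client whose charge always fails."""

    def create_charge(self, user_id, amount):
        raise RuntimeError("card declined")


def test_checkout_decrements_stock_and_charges_total():
    inventory, payment = FakeInventory({"sku-1": 3}), RecordingPayment()
    cart = Cart("user-1")
    cart.add_item("sku-1", 2, 10.0)
    result = cart.checkout(inventory, payment)
    assert result["amount"] == 20.0
    assert inventory.stock["sku-1"] == 1
    assert payment.charges == [("user-1", 20.0)]


def test_failed_charge_leaves_stock_decremented():
    inventory = FakeInventory({"sku-1": 3})
    cart = Cart("user-1")
    cart.add_item("sku-1", 2, 10.0)
    try:
        cart.checkout(inventory, FailingPayment())
    except RuntimeError:
        pass
    # Documents current behavior: stock is reduced before the charge and never restored.
    assert inventory.stock["sku-1"] == 1
```

The second test pins down the existing behavior rather than asserting it is desirable. A fix would likely restore stock when the charge fails, at which point the assertion would change to expect 3.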
