
Commit 67ae47e

Merge pull request #1 from fswair/test_durations
v0.2.5 - improvements & fixes
2 parents 4dcde53 + 03080c6 commit 67ae47e

11 files changed

Lines changed: 546 additions & 108 deletions

File tree

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -31,7 +31,9 @@ env/
 src/vowel/ai.py
 AGENT_*.md
 tests/
+functions/
 eval_gen.py
+demo.txt
 
 # IDE
 .vscode/

README.md

Lines changed: 144 additions & 1 deletion
@@ -582,7 +582,150 @@ pattern: "^prefix"
 pattern: "suffix$"
 ```
 
-### 8. LLM Judge Evaluator
+### 8. Raises Evaluator (Exception Testing)
+
+Tests that a function raises a specific exception. Similar to pytest's `pytest.raises`, this evaluator verifies both the exception type and optionally the exception message pattern. This is a **case-level only** evaluator.
+
+```yaml
+function_name:
+  dataset:
+    - case:
+        input: invalid_value
+        raises: ExceptionType # Required: exception type name
+        match: "pattern" # Optional: regex pattern for exception message
+```
+
+**Important Notes:**
+
+- `raises` is **case-level only** - cannot be used as a global evaluator
+- `match` can only be used together with `raises`
+- When `raises` is specified, the test expects an exception and will fail if the function returns normally
+- Global evaluators (type checks, assertions, etc.) are automatically skipped for exception cases
+
+**Examples:**
607+
```yaml
608+
# Basic exception testing
609+
calculate_discount:
610+
evals:
611+
IsFloat:
612+
type: float
613+
dataset:
614+
- case:
615+
id: "valid_calculation"
616+
inputs: [100.0, 20.0]
617+
expected: 80.0
618+
619+
- case:
620+
id: "negative_price"
621+
inputs: [-100.0, 20.0]
622+
raises: ValueError
623+
match: "must be positive" # Checks exception message
624+
625+
- case:
626+
id: "invalid_discount"
627+
inputs: [100.0, 150.0]
628+
raises: ValueError # Just checks type, not message
629+
630+
# Division by zero
631+
divide:
632+
evals:
633+
IsNumber:
634+
type: "int | float"
635+
dataset:
636+
- case:
637+
inputs: [10, 2]
638+
expected: 5.0
639+
640+
- case:
641+
inputs: [10, 0]
642+
raises: ZeroDivisionError
643+
644+
# Type validation
645+
parse_age:
646+
dataset:
647+
- case:
648+
input: "25"
649+
expected: 25
650+
651+
- case:
652+
input: "invalid"
653+
raises: ValueError
654+
match: "invalid literal"
655+
656+
- case:
657+
input: -5
658+
raises: ValueError
659+
match: "age must be positive"
660+
661+
# Key errors
662+
get_config_value:
663+
dataset:
664+
- case:
665+
input: "api_key"
666+
expected: "secret_key_123"
667+
668+
- case:
669+
input: "nonexistent_key"
670+
raises: KeyError
671+
match: "nonexistent_key"
672+
673+
# Multiple exception types
674+
process_data:
675+
dataset:
676+
- case:
677+
input: { "valid": "data" }
678+
expected: "processed"
679+
680+
- case:
681+
input: null
682+
raises: TypeError
683+
match: "NoneType"
684+
685+
- case:
686+
input: []
687+
raises: ValueError
688+
match: "empty"
689+
690+
- case:
691+
input: { "invalid": "format" }
692+
raises: KeyError
693+
694+
# Index errors
695+
get_element:
696+
dataset:
697+
- case:
698+
inputs: [[1, 2, 3], 1]
699+
expected: 2
700+
701+
- case:
702+
inputs: [[1, 2, 3], 10]
703+
raises: IndexError
704+
match: "out of range"
705+
```
706+
+**How it works:**
+
+1. When `raises` is present in a case, the framework wraps the function execution in a `try`/`except` block
+2. If an exception is raised:
+   - Checks if exception type matches `raises`
+   - If `match` is provided, validates exception message against the regex pattern
+   - Global evaluators are skipped (they would fail on exception dict)
+3. If no exception is raised when `raises` is specified, the test fails
+4. If exception type doesn't match, the test fails and shows actual vs expected
+
+**Common Exception Types:**
+
+- `ValueError`: Invalid value/argument
+- `TypeError`: Wrong type
+- `KeyError`: Missing dictionary key
+- `IndexError`: List/array index out of range
+- `ZeroDivisionError`: Division by zero
+- `AttributeError`: Missing attribute
+- `FileNotFoundError`: File doesn't exist
+- `RuntimeError`: Generic runtime error
+
+### 9. LLM Judge Evaluator
 
 Uses a Language Model to evaluate outputs based on a custom rubric. Ideal for semantic evaluation, quality assessment, and cases where rule-based checking is insufficient.
 

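The "How it works" steps in the README hunk can be sketched in plain Python. This is a minimal illustration, not vowel's actual implementation: the `check_raises` helper and its `(passed, reason)` return shape are hypothetical, and it assumes the exception type is compared by class name and that `match` is applied with `re.search`, as `pytest.raises` does.

```python
import re
from typing import Any, Callable, Optional


def check_raises(
    func: Callable[..., Any],
    args: tuple,
    raises: str,
    match: Optional[str] = None,
) -> tuple[bool, str]:
    """Hypothetical helper: return (passed, reason) for one exception case."""
    try:
        func(*args)
    except Exception as exc:
        actual = type(exc).__name__
        # Assumption: the expected type is compared by exact class name.
        if actual != raises:
            return False, f"expected {raises}, got {actual}"
        # Assumption: the pattern may match anywhere in the message.
        if match is not None and re.search(match, str(exc)) is None:
            return False, f"message {str(exc)!r} does not match {match!r}"
        return True, "ok"
    # The function returned normally, so the case fails.
    return False, f"expected {raises}, but no exception was raised"


def divide(a: float, b: float) -> float:
    return a / b


print(check_raises(divide, (10, 0), "ZeroDivisionError"))
print(check_raises(divide, (10, 2), "ZeroDivisionError"))
```

Comparing by class name keeps the sketch short, but it would not accept a subclass of the expected exception. Whether vowel accepts subclasses is not visible in this diff.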
examples/functions.py

Lines changed: 18 additions & 1 deletion
@@ -21,7 +21,9 @@ def validate_email(email: str) -> bool:
     return bool(re.match(pattern, email))
 
 
-def calculate_discount(price: float, discount_percent: float) -> float:
+async def calculate_discount(price: float, discount_percent: float) -> float:
+    if price < 0:
+        raise ValueError("Price must be positive")
     if discount_percent < 0 or discount_percent > 100:
         raise ValueError("Discount must be between 0 and 100")
     return round(price * (1 - discount_percent / 100), 2)
@@ -152,3 +154,18 @@ async def _run():
         return result.output
 
     return asyncio.run(_run())
+
+
+def binary_search(sorted_list: list[int], target: int) -> int:
+    """Binary search for target in sorted_list (ascending). Returns index or -1 if not found."""
+    low = 0
+    high = len(sorted_list) - 1
+    while low <= high:
+        mid = low + (high - low) // 2
+        if sorted_list[mid] == target:
+            return mid
+        elif sorted_list[mid] < target:
+            low = mid + 1
+        else:
+            high = mid - 1
+    return -1

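Because this change turns `calculate_discount` into a coroutine function, calling it no longer raises immediately: the `ValueError` only surfaces once the returned coroutine is awaited. The diff does not show how the runner handles this, so the `call_maybe_async` helper below is a hypothetical sketch of one way to invoke sync and async targets uniformly. It assumes no event loop is already running in the calling thread.

```python
import asyncio
import inspect
from typing import Any, Callable


async def calculate_discount(price: float, discount_percent: float) -> float:
    # Body copied from examples/functions.py as changed in this commit.
    if price < 0:
        raise ValueError("Price must be positive")
    if discount_percent < 0 or discount_percent > 100:
        raise ValueError("Discount must be between 0 and 100")
    return round(price * (1 - discount_percent / 100), 2)


def call_maybe_async(func: Callable[..., Any], *args: Any) -> Any:
    """Hypothetical helper: call func and drive the result if it is a coroutine."""
    result = func(*args)
    if inspect.iscoroutine(result):
        # Exceptions raised inside the coroutine propagate out of asyncio.run.
        return asyncio.run(result)
    return result


print(call_maybe_async(calculate_discount, 100.0, 20.0))

try:
    call_maybe_async(calculate_discount, -100.0, 20.0)
except ValueError as exc:
    print(f"raised: {exc}")
```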
examples/test_raises.yml

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
+examples.functions.calculate_discount:
+  evals:
+    IsFloat:
+      type: float
+  dataset:
+    - case:
+        id: "valid_discount"
+        inputs: [100.0, 20.0]
+        expected: 80.0
+
+    - case:
+        id: "negative_price_raises"
+        inputs: [-100.0, 20.0]
+        raises: ValueError
+        match: "must be positive"
+
+    - case:
+        id: "invalid_discount_raises"
+        inputs: [100.0, 150.0]
+        raises: ValueError
+        match: "between 0 and 100"

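For readers more familiar with the standard library, the exception cases in this spec correspond roughly to `assertRaisesRegex` checks. The sketch below is an analogy, not what vowel generates. It redefines `calculate_discount` from `examples/functions.py` so that it runs standalone, and it exercises both `ValueError` branches of that function. The `run_suite` helper is a hypothetical convenience.

```python
import asyncio
import unittest


async def calculate_discount(price: float, discount_percent: float) -> float:
    # Body copied from examples/functions.py as changed in this commit.
    if price < 0:
        raise ValueError("Price must be positive")
    if discount_percent < 0 or discount_percent > 100:
        raise ValueError("Discount must be between 0 and 100")
    return round(price * (1 - discount_percent / 100), 2)


class TestCalculateDiscount(unittest.TestCase):
    def test_valid_discount(self):
        self.assertEqual(asyncio.run(calculate_discount(100.0, 20.0)), 80.0)

    def test_negative_price_raises(self):
        # assertRaisesRegex searches the message, like `match` in the spec.
        with self.assertRaisesRegex(ValueError, "must be positive"):
            asyncio.run(calculate_discount(-100.0, 20.0))

    def test_discount_out_of_range_raises(self):
        with self.assertRaisesRegex(ValueError, "between 0 and 100"):
            asyncio.run(calculate_discount(100.0, 150.0))


def run_suite() -> unittest.TestResult:
    """Hypothetical convenience: run the cases above and return the result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestCalculateDiscount)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    run_suite()
```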
pyproject.toml

Lines changed: 6 additions & 4 deletions
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 
 [project]
 name = "vowel"
-version = "0.2.4"
+version = "0.2.5"
 description = "A modular evaluation framework for testing functions with YAML-based specifications"
 readme = "README.md"
 requires-python = ">=3.10"
@@ -25,10 +25,12 @@ classifiers = [
 ]
 
 dependencies = [
+    "click",
+    "python-dotenv",
+    "pyyaml",
     "pydantic>=2.0.0",
-    "pydantic-evals>=0.2.0",
-    "pyyaml>=6.0.0",
-    "click>=8.0.0",
+    "pydantic-ai>=1.0.0",
+    "pydantic-evals>=1.0.0",
 ]
 
 [project.optional-dependencies]

src/vowel/__init__.py

Lines changed: 13 additions & 4 deletions
@@ -1,14 +1,23 @@
 """
 vowel - A modular evaluation framework for testing functions with YAML-based specifications.
 """
+import importlib.metadata
 
-__version__ = "0.1.0"
+__version__ = importlib.metadata.version("vowel")
 
 from .eval_types import EvalsFile
 from .runner import RunEvals
-from .utils import (EvalResult, EvalSummary, load_evals, load_evals_file,
-                    load_evals_from_dict, load_evals_from_object,
-                    load_evals_from_yaml_string, run_evals, to_dataset)
+from .utils import (
+    EvalResult,
+    EvalSummary,
+    load_evals,
+    load_evals_file,
+    load_evals_from_dict,
+    load_evals_from_object,
+    load_evals_from_yaml_string,
+    run_evals,
+    to_dataset,
+)
 
 __all__ = [
     "load_evals_file",

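`importlib.metadata.version("vowel")` reads the version from installed package metadata, so it raises `PackageNotFoundError` when the module is imported from a source checkout that has not been installed. A guarded lookup is one way to handle that case. The `resolve_version` helper and the `"0.0.0+unknown"` fallback string below are illustrative assumptions and are not part of this commit.

```python
from importlib.metadata import PackageNotFoundError, version


def resolve_version(dist_name: str, fallback: str = "0.0.0+unknown") -> str:
    """Hypothetical helper: installed version of dist_name, or a fallback."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # The distribution is not installed (e.g. running from a bare checkout).
        return fallback


# Prints the installed version if vowel is installed, otherwise the fallback.
print(resolve_version("vowel"))
```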
src/vowel/cli.py

Lines changed: 2 additions & 18 deletions
@@ -1,25 +1,9 @@
-import importlib.util
-import os
-import sys
-
+import click
 import dotenv
 
-dotenv.load_dotenv()
-
-if "--debug" in sys.argv:
-    if importlib.util.find_spec("logfire"):
-        import logfire
-
-        logfire.configure()
-        logfire.instrument_pydantic_ai()
-    else:
-        raise ImportError(
-            "Debug mode enabled but logfire is not installed. Please install logfire or disable debug mode."
-        )
-
 from pathlib import Path
 
-import click
+dotenv.load_dotenv()
 
 
 @click.command()

src/vowel/eval_types.py

Lines changed: 36 additions & 1 deletion
@@ -110,6 +110,20 @@ class PatternMatchCase(BaseModel):
     )
 
 
+class RaisesCase(BaseModel):
+    """Exception raising evaluation case. Validates that the function raises a specific exception."""
+
+    raises: str = Field(
+        description="Expected exception type as string (e.g., 'ValueError', 'KeyError', 'TypeError').",
+        examples=["ValueError", "TypeError", "KeyError", "ZeroDivisionError", "IndexError"],
+    )
+    match: Optional[str] = Field(
+        default=None,
+        description="Optional regex pattern to match against the exception message.",
+        examples=["invalid input", "must be positive", "not found"],
+    )
+
+
 class LLMJudgeCase(BaseModel):
     """LLM Judge evaluation case. Uses an LLM to evaluate the output based on a rubric."""
 
@@ -183,7 +197,7 @@ class MatchCase(BaseModel):
         ),
         examples=[5, "hello", [1, 2, 3], {"x": 10, "y": 20}, {"name": "test", "value": 42}],
     )
-    inputs: Optional[list[Any]] = Field(
+    inputs: Optional[list | dict] = Field(
         default=None,
         description=(
             "Multiple input values to pass to the function as separate arguments (*args). "
@@ -226,6 +240,23 @@ class MatchCase(BaseModel):
         default=True,
         description="Whether the regex pattern matching should be case-sensitive (only used if pattern is specified).",
     )
+    raises: Optional[str] = Field(
+        default=None,
+        description="Expected exception type for this case. If specified, the test expects the function to raise this exception.",
+        examples=["ValueError", "TypeError", "KeyError", "ZeroDivisionError"],
+    )
+    match: Optional[str] = Field(
+        default=None,
+        description="Optional regex pattern to match against the exception message (only used if raises is specified).",
+        examples=["invalid input", "must be positive", "not found"],
+    )
+
+    @field_validator("match")
+    @classmethod
+    def validate_match_requires_raises(cls, v, info):
+        if v is not None and info.data.get("raises") is None:
+            raise ValueError("'match' can only be used together with 'raises'")
+        return v
 
     @field_validator("inputs")
     @classmethod
@@ -260,6 +291,10 @@ def has_assertion(self) -> bool:
     def has_pattern(self) -> bool:
         return self.pattern is not None
 
+    @property
+    def has_raises(self) -> bool:
+        return self.raises is not None
+
 
 class EvalCase(BaseModel):
     """Internal representation of an evaluation case with its data."""

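The `match`-requires-`raises` validator relies on pydantic validating fields in declaration order, so `info.data` already holds `raises` when `match` is checked. That is why `raises` is declared before `match` in `MatchCase`. The same invariant can be expressed without pydantic. The `RaisesSpec` dataclass below is a hypothetical standard-library analogue, and its extra check that `match` compiles as a regex is an addition that does not appear in this diff.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RaisesSpec:
    """Hypothetical stand-in for the raises/match pair on a case."""

    raises: Optional[str] = None
    match: Optional[str] = None

    def __post_init__(self) -> None:
        # Same rule as validate_match_requires_raises in the diff.
        if self.match is not None and self.raises is None:
            raise ValueError("'match' can only be used together with 'raises'")
        # Assumption (not in the diff): reject patterns that do not compile.
        if self.match is not None:
            re.compile(self.match)

    @property
    def has_raises(self) -> bool:
        return self.raises is not None


print(RaisesSpec(raises="ValueError", match="must be positive").has_raises)

try:
    RaisesSpec(match="must be positive")
except ValueError as exc:
    print(f"rejected: {exc}")
```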