Cost estimation v1 #2489
Merged
Changes from 48 commits
52 commits
08a15de
Messing with a better estimation system
onmyraedar 2e310d3
Fix type hints
onmyraedar d0be13f
Improved branch weighting algorithm
onmyraedar c59da6e
Clean up job cost estimator
onmyraedar 672f015
Small file renaming + import updates
onmyraedar d240892
Allow characters per token overrides; unify logic
onmyraedar 426cc6a
Smarter question estimation based on type
onmyraedar d4c453c
Token dataclass + QTE tests
onmyraedar a3d6cd7
Rename input_tokens to prompt_tokens
onmyraedar e9f368c
Question estimator tests
onmyraedar e53eabc
Reach probability & job cost estimate tests
onmyraedar 496887d
Fix test
onmyraedar 6c03358
Fix functional test
onmyraedar 9209f7a
Add compute test
onmyraedar 7d71e92
Add .md method to JobCostEstimator
onmyraedar b1f10ad
Add credits + model summary to job cost estimate
onmyraedar 2f42550
Add .describe() methods to estimators for use in generating Markdown …
onmyraedar 4977f3d
QE describe tests + better descriptions for manual overrides
onmyraedar 27fdb79
Take out assumptions section now that we have .describe() methods
onmyraedar 6486c4f
Description should reflect that overrides are merged with base estimate
onmyraedar 24aa0d5
Don't show skip logic warning if the survey has no skip rules
onmyraedar 30d5cc8
Estimate clarifications
onmyraedar 65ea28a
Accurate estimator description for offloaded files
onmyraedar 67163a1
Fall back to 1,000 tokens for offloaded files by default
onmyraedar 8ed0010
Get image dimensions for estimates, when we can
onmyraedar 59fce5c
Use proper OpenAI image estimation
onmyraedar 3d13dfb
Refactor FileStoreEstimator to use type-based classes
onmyraedar e1ce17f
Add AnthropicImageEstimator
onmyraedar b3ade3e
Add GoogleImageEstimator
onmyraedar 1683b40
Separate file for service-based image estimators
onmyraedar c8c1aba
Override refactor; calibrate_from_results
onmyraedar eb3b3fe
Update tests
onmyraedar 99b1a5a
OpenAI PDF estimator v1
onmyraedar 372463c
Anthropic PDF estimator v1
onmyraedar 629dcf5
Add Google PDF estimator v1
onmyraedar 3815961
More PDF algorithm calibration
onmyraedar 33229e7
Improve file estimates: general improvements, PDFs, images
onmyraedar 51575d0
Fix tests
onmyraedar 9ff19f6
Model calibration should be True by default
onmyraedar 3aa1c5b
Calibrate thinking tokens
onmyraedar 8012785
Better skill
onmyraedar e4962e6
Delete skill (moved to ep-agent)
onmyraedar 10bdb65
Merge remote-tracking branch 'origin/main' into humanize_file_upload
onmyraedar 2b533db
Update estimate_remote_job_cost docstring & types
onmyraedar f642094
Greptile fixes
onmyraedar 294b482
More small fixes
onmyraedar 235bc2a
Fix reach -> cost impact; add regression test
onmyraedar ada6742
Fix EOS reach double-count
onmyraedar e67270e
Add image estimator tests; ensure minimum of one patch
onmyraedar f9d01a1
Skip calibration if all values are None
onmyraedar b33a2fc
Fix chars_per_token inconsistency with PDF estimator
onmyraedar 46204a7
Include reach probabilities in summary; add regression test
onmyraedar
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,41 @@ | ||
| from .question_token_estimate import QuestionTokenEstimate | ||
| from .cost_estimation_constants import ( | ||
| EDSL_DEFAULT_CHARS_PER_TOKEN, | ||
| TokenAmount, | ||
| TokenRatio, | ||
| ) | ||
| from .job_cost_estimate import JobCostEstimate | ||
| from .question_estimators import ( | ||
| QuestionEstimator, | ||
| ZeroCostEstimator, | ||
| FreeTextStyleEstimator, | ||
| StructuredAnswerEstimator, | ||
| DemandEstimator, | ||
| MatrixEstimator, | ||
| DefaultEstimator, | ||
| DEFAULT_ESTIMATORS, | ||
| ) | ||
| from .file_store_estimator import FileStoreEstimator | ||
| from .job_cost_estimator import JobCostEstimator | ||
| from .token_override import TokenOverride | ||
| from .cost_estimate_calibration import calibrate_from_results | ||
|
|
||
| __all__ = [ | ||
| "QuestionTokenEstimate", | ||
| "JobCostEstimate", | ||
| "EDSL_DEFAULT_CHARS_PER_TOKEN", | ||
| "TokenAmount", | ||
| "TokenRatio", | ||
| "QuestionEstimator", | ||
| "ZeroCostEstimator", | ||
| "FreeTextStyleEstimator", | ||
| "StructuredAnswerEstimator", | ||
| "DemandEstimator", | ||
| "MatrixEstimator", | ||
| "DefaultEstimator", | ||
| "DEFAULT_ESTIMATORS", | ||
| "FileStoreEstimator", | ||
| "JobCostEstimator", | ||
| "TokenOverride", | ||
| "calibrate_from_results", | ||
| ] |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,134 @@ | ||
| from __future__ import annotations | ||
| from typing import TYPE_CHECKING | ||
|
|
||
| from .token_override import TokenOverride | ||
|
|
||
| if TYPE_CHECKING: | ||
| from ...results import Results | ||
|
|
||
|
|
||
| def calibrate_from_results( | ||
| results: "Results", | ||
| percentile: int = 75, | ||
| by_model: bool = True, | ||
| ) -> dict[str, TokenOverride | list[TokenOverride]]: | ||
| """Derive token overrides from a pilot Results object. | ||
|
|
||
| Computes the given percentile of actual output tokens per question (and | ||
| optionally per service/model), returning a dict ready to pass as | ||
| token_overrides to JobCostEstimator.estimate_cost(). | ||
|
|
||
| Calibrates both answer_tokens (from raw_model_response.{q}_output_tokens) | ||
| and thinking_tokens (from raw_model_response.{q}_thinking_tokens) when | ||
| thinking token data is present. | ||
|
|
||
| Args: | ||
| results: a completed Results object from a pilot run | ||
| percentile: which percentile of observed output tokens to use (default 75). | ||
| Use 50 for median or a higher value (75-90) for a conservative | ||
| budget estimate. | ||
| by_model: if True (default), return per-(service, model) overrides so each | ||
| model gets its own calibrated estimate; if False, pool all models | ||
| into one global override per question | ||
|
|
||
| Returns: | ||
| dict[str, TokenOverride | list[TokenOverride]] ready for token_overrides= | ||
| """ | ||
| prefix = "raw_model_response." | ||
| output_suffix = "_output_tokens" | ||
| thinking_suffix = "_thinking_tokens" | ||
|
|
||
| output_cols = { | ||
| c[len(prefix) : -len(output_suffix)]: c | ||
| for c in results.columns | ||
| if c.startswith(prefix) and c.endswith(output_suffix) | ||
| } | ||
| thinking_cols = { | ||
| c[len(prefix) : -len(thinking_suffix)]: c | ||
| for c in results.columns | ||
| if c.startswith(prefix) and c.endswith(thinking_suffix) | ||
| } | ||
|
|
||
| overrides: dict[str, TokenOverride | list[TokenOverride]] = {} | ||
|
|
||
| for q, output_col in output_cols.items(): | ||
| thinking_col = thinking_cols.get(q) | ||
|
|
||
| if by_model: | ||
| select_cols = [output_col, "model.inference_service", "model.model"] | ||
| if thinking_col: | ||
| select_cols.insert(1, thinking_col) | ||
| df = results.select(*select_cols).to_pandas() | ||
| df = df.dropna(subset=[output_col]) | ||
| entries: list[TokenOverride] = [] | ||
| for (svc, mdl), grp in df.groupby( | ||
| ["model.inference_service", "model.model"] | ||
| ): | ||
| output_vals = grp[output_col].tolist() | ||
| thinking_tokens = None | ||
| if thinking_col: | ||
| thinking_vals = grp[thinking_col].dropna().tolist() | ||
| if thinking_vals: | ||
| thinking_tokens = _percentile(thinking_vals, percentile) | ||
| entries.append( | ||
| TokenOverride( | ||
| answer_tokens=_percentile(output_vals, percentile), | ||
| thinking_tokens=thinking_tokens, | ||
| service=svc, | ||
| model=mdl, | ||
| note=f"calibrated from pilot (n={len(output_vals)}, p{percentile})", | ||
| ) | ||
| ) | ||
| overrides[q] = entries | ||
| else: | ||
| df = results.select(output_col).to_pandas().dropna(subset=[output_col]) | ||
| output_vals = df[output_col].tolist() | ||
| thinking_tokens = None | ||
| if thinking_col: | ||
| thinking_df = ( | ||
| results.select(thinking_col) | ||
| .to_pandas() | ||
| .dropna(subset=[thinking_col]) | ||
| ) | ||
| thinking_vals = thinking_df[thinking_col].tolist() | ||
| if thinking_vals: | ||
| thinking_tokens = _percentile(thinking_vals, percentile) | ||
| overrides[q] = TokenOverride( | ||
| answer_tokens=_percentile(output_vals, percentile), | ||
| thinking_tokens=thinking_tokens, | ||
| note=f"calibrated from pilot (n={len(output_vals)}, p{percentile})", | ||
| ) | ||
|
|
||
| return overrides | ||
|
|
||
|
|
||
| def _percentile(values: list[float], p: int) -> int: | ||
| """Return the p-th percentile of values using linear interpolation. | ||
|
|
||
| Computes a float index into the sorted list, then interpolates between | ||
| the two surrounding values. This matches numpy.percentile(method='linear') | ||
| and correctly returns the average of the two middle elements for even-length | ||
| lists at p=50 (e.g. [10, 20, 30, 40] -> 25, not 30). | ||
|
|
||
| Args: | ||
| values: list of numeric values | ||
| p: percentile to compute, 0-100 inclusive | ||
|
|
||
| Returns: | ||
| Interpolated percentile value truncated to int, or 0 for an empty list. | ||
| """ | ||
| if not values: | ||
| return 0 | ||
| sorted_values = sorted(values) | ||
| count = len(sorted_values) | ||
| # A float index in [0, count-1] that maps p=0 to the first element | ||
| # and p=100 to the last, with fractional positions in between. | ||
| float_index = (count - 1) * p / 100 | ||
| lower_idx = int(float_index) | ||
| upper_idx = min(lower_idx + 1, count - 1) | ||
| # How far float_index sits between lower_idx and upper_idx (0.0 to 1.0). | ||
| fraction = float_index - lower_idx | ||
| interpolated = sorted_values[lower_idx] + fraction * ( | ||
| sorted_values[upper_idx] - sorted_values[lower_idx] | ||
| ) | ||
| return int(interpolated) | ||
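
To make the calibration logic above easier to review, here is a minimal, dependency-free sketch of the two ideas it combines: the linear-interpolation percentile and the per-(service, model) grouping. It is not the code from this PR. `percentile` mirrors `_percentile` from the diff, while `calibrate_by_model`, `pilot_rows`, and the service and model names are hypothetical stand-ins for the pandas `groupby` over a real `Results` object.

```python
from collections import defaultdict


def percentile(values, p):
    """Linear-interpolation percentile, truncated to int (0 for empty input)."""
    if not values:
        return 0
    ordered = sorted(values)
    # Fractional position in [0, len - 1]: p=0 is the first element, p=100 the last.
    position = (len(ordered) - 1) * p / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return int(ordered[lower] + fraction * (ordered[upper] - ordered[lower]))


# Hypothetical pilot rows: (service, model, output_tokens). None marks a missing value.
pilot_rows = [
    ("openai", "model-a", 10),
    ("openai", "model-a", 20),
    ("openai", "model-a", 30),
    ("openai", "model-a", 40),
    ("anthropic", "model-b", 100),
    ("anthropic", "model-b", None),
    ("anthropic", "model-b", 200),
]


def calibrate_by_model(rows, p):
    """Group observed token counts by (service, model), then take the p-th percentile."""
    groups = defaultdict(list)
    for service, model, tokens in rows:
        if tokens is not None:  # plays the role of dropna(subset=[output_col])
            groups[(service, model)].append(tokens)
    return {key: percentile(vals, p) for key, vals in groups.items()}


median = calibrate_by_model(pilot_rows, 50)
conservative = calibrate_by_model(pilot_rows, 75)
print(median)        # {('openai', 'model-a'): 25, ('anthropic', 'model-b'): 150}
print(conservative)  # {('openai', 'model-a'): 32, ('anthropic', 'model-b'): 175}
```

Note the truncation at p=75 for `model-a`: the interpolated value is 32.5, and `int()` rounds it down to 32, so the calibrated estimate can sit slightly below the true percentile.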
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,22 @@ | ||
| from __future__ import annotations | ||
| from dataclasses import dataclass | ||
|
|
||
| EDSL_DEFAULT_CHARS_PER_TOKEN = 4 | ||
|
|
||
|
|
||
| @dataclass(frozen=True) | ||
| class TokenAmount: | ||
| """Fixed token count, independent of input length.""" | ||
| value: int | ||
|
|
||
|
|
||
| @dataclass(frozen=True) | ||
| class TokenRatio: | ||
| """Token count as a fraction of input tokens.""" | ||
| value: float | ||
|
|
||
|
|
||
| def _resolve_token_spec(spec: TokenAmount | TokenRatio, input_tokens: int) -> int: | ||
| if isinstance(spec, TokenRatio): | ||
| return int(input_tokens * spec.value) | ||
| return spec.value |
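
As a worked example of the two token specs above, the following self-contained sketch re-declares the dataclasses and resolves each against the same input. The `resolve_token_spec` name, the sample prompt, and the use of characters divided by `CHARS_PER_TOKEN` to obtain input tokens are assumptions for illustration. The diff only shows the constant and the resolver, not how the input token count is derived.

```python
from __future__ import annotations

from dataclasses import dataclass

CHARS_PER_TOKEN = 4  # same value as EDSL_DEFAULT_CHARS_PER_TOKEN in the diff


@dataclass(frozen=True)
class TokenAmount:
    """Fixed token count, independent of input length."""
    value: int


@dataclass(frozen=True)
class TokenRatio:
    """Token count as a fraction of input tokens."""
    value: float


def resolve_token_spec(spec: TokenAmount | TokenRatio, input_tokens: int) -> int:
    # A ratio scales with the input and is truncated to int; an amount is returned as-is.
    if isinstance(spec, TokenRatio):
        return int(input_tokens * spec.value)
    return spec.value


# Assumption: input tokens are approximated as characters // CHARS_PER_TOKEN.
prompt = "x" * 1000
input_tokens = len(prompt) // CHARS_PER_TOKEN

fixed = resolve_token_spec(TokenAmount(50), input_tokens)
scaled = resolve_token_spec(TokenRatio(0.5), input_tokens)
print(input_tokens, fixed, scaled)  # 250 50 125
```

A `TokenAmount` suits questions whose answers have a roughly constant size, while a `TokenRatio` suits answers that grow with the prompt.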