feat: inline Google Sheets data source integration#830
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
WalkthroughAdds Google Sheets as an inline data source for Breeze Buddy templates. A new ChangesGoogle Sheets Data Sources Feature
Sequence Diagram(s)sequenceDiagram
rect rgba(135, 206, 235, 0.5)
Note over DispatchWorker,Redis: Background prefetch at dispatch time
DispatchWorker->>prefetch_data_sources: asyncio.create_task (template.data_sources)
prefetch_data_sources->>fetch_formatted: per DataSourceRef (timeout 5s, concurrent)
fetch_formatted-->>prefetch_data_sources: formatted string
prefetch_data_sources->>Redis: SET cache_key → content (TTL 60s)
end
rect rgba(144, 238, 144, 0.5)
Note over Agent,FlowConfigLoader: Template loading at call time
Agent->>load_template_config: self.lead
load_template_config->>FlowConfigLoader: load_template
FlowConfigLoader->>_fetch_one_data_source: per DataSourceRef
_fetch_one_data_source->>Redis: GET cache_key
Redis-->>_fetch_one_data_source: cached content (or miss → fetch_formatted)
_fetch_one_data_source-->>FlowConfigLoader: content string
FlowConfigLoader-->>load_template_config: template, config, vars, ds_messages
load_template_config-->>Agent: template, config, vars, ds_messages
Agent->>build_flow_config: ds_messages=self.ds_messages
build_flow_config-->>Agent: flow_config with _data_source_messages
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
app/api/routers/breeze_buddy/templates/handlers.py (1)
130-145:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winNormalize
data_sourcesbefore persistence and preserve existing values when PUT omits the field.
template_data.data_sourcesis a list of Pydantic models; forwarding it directly to the accessor risks serialization failure onjson.dumps. Also, in replace flow, passingNonewhen the field is omitted can unintentionally wipe existingdata_sourcesfor older clients doing unrelated edits.Suggested fix
+def _dump_data_sources(data_sources): + if data_sources is None: + return None + return [ + ds.model_dump(exclude_none=True) if hasattr(ds, "model_dump") else ds + for ds in data_sources + ] ... - template = await create_template( + create_data_sources = _dump_data_sources(template_data.data_sources) + template = await create_template( ... - data_sources=template_data.data_sources, + data_sources=create_data_sources, now=now, ) ... - updated_template = await replace_template( + effective_data_sources = ( + template_data.data_sources + if template_data.data_sources is not None + else existing_template.data_sources + ) + updated_template = await replace_template( ... - data_sources=template_data.data_sources, + data_sources=_dump_data_sources(effective_data_sources), now=now, )Also applies to: 526-540
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@app/api/routers/breeze_buddy/templates/handlers.py` around lines 130 - 145, The `template_data.data_sources` parameter contains Pydantic models which cannot be directly serialized by json.dumps during persistence. Normalize the data_sources by converting each Pydantic model to a dictionary (using .dict() or .model_dump() method) before passing it to the create_template function call. Additionally, when the data_sources field is omitted or None in a PUT/replace request, preserve the existing data_sources value from the current template instead of passing None to create_template, to avoid unintentionally wiping out the stored value for clients making unrelated edits. Apply these fixes to both the POST handler (create_template call around lines 130-145) and the PUT handler (update_template call around lines 526-540).
🧹 Nitpick comments (4)
app/services/google/sheets.py (1)
42-67: ⚡ Quick winAdd explicit helper return types and concrete dict typing.
Line 42 and Line 66 are missing return annotations, and Line 138 uses
List[dict]instead of a concreteDict[str, Any]shape. This weakens static checks on the service boundary.Suggested patch
-from typing import List, Optional +from typing import Any, Dict, List, Optional ... -def _get_sheets_session(): +def _get_sheets_session() -> Optional[AuthorizedSession]: ... -def _get_json(session: AuthorizedSession, url: str, params: Optional[dict] = None): +def _get_json( + session: AuthorizedSession, + url: str, + params: Optional[Dict[str, str]] = None, +) -> Dict[str, Any]: ... -) -> List[dict]: +) -> List[Dict[str, Any]]:As per coding guidelines: “Add type hints on all function signatures” and “Use Optional[T], List[T], Dict[str, Any], Union for type hints.”
Also applies to: 133-139
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@app/services/google/sheets.py` around lines 42 - 67, Add explicit return type annotations to the functions _get_sheets_session() and _get_json() to strengthen static type checking. The _get_sheets_session() function should have a return type of Optional[AuthorizedSession] since it returns either an AuthorizedSession instance or None. The _get_json() function needs an appropriate return type annotation based on what it returns. Additionally, update the imprecise typing at lines 133-139 where List[dict] is used and replace it with List[Dict[str, Any]] to provide concrete dictionary structure typing and align with the coding guidelines for explicit type hints.Source: Coding guidelines
app/api/routers/breeze_buddy/data_sources/__init__.py (1)
31-37: ⚡ Quick winAdd explicit return annotations on route handlers.
Line 31, Line 41, and Line 54 should declare concrete return types to keep API contracts type-checkable.
Suggested patch
async def get_sheet_tabs( spreadsheet_url: str = Query(..., description="Full Google Sheets URL"), current_user: UserInfo = Depends(get_current_user_with_rbac), -): +) -> TabsResponse: ... async def get_sheet_columns( ... -): +) -> ColumnsResponse: ... async def preview_sheet( ... -): +) -> PreviewResponse:As per coding guidelines: “Add type hints on all function signatures.”
Also applies to: 41-50, 54-65
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@app/api/routers/breeze_buddy/data_sources/__init__.py` around lines 31 - 37, The route handlers get_sheet_tabs (lines 31-37), the function at lines 41-50, and the function at lines 54-65 are missing explicit return type annotations on their function signatures. Add a concrete return type annotation to each of these async functions by specifying what the handler actually returns. For each function, determine the appropriate return type based on what the handler function (such as list_tabs_handler) returns and add it to the function signature using the arrow syntax after the closing parenthesis.Source: Coding guidelines
app/ai/voice/agents/breeze_buddy/agent/flow.py (1)
30-30: ⚡ Quick winUse explicit dict key/value typing on
ds_messagessignatures.Line 30 and Line 130 use
List[Dict], which is too broad for this cross-module contract and reduces type-check value.Proposed fix
async def load_template_config( lead: LeadCallTracker, -) -> tuple[TemplateModel, Optional[ConfigurationModel], Dict[str, str], List[Dict]]: +) -> tuple[ + TemplateModel, + Optional[ConfigurationModel], + Dict[str, str], + List[Dict[str, Any]], +]: @@ def build_flow_config( flow_builder: FlowConfigBuilder, template: TemplateModel, - ds_messages: Optional[List[Dict]] = None, + ds_messages: Optional[List[Dict[str, Any]]] = None, ) -> tuple[Dict[str, Any], List, Any]:As per coding guidelines, "Use Optional[T], List[T], Dict[str, Any], Union for type hints."
Also applies to: 130-130
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@app/ai/voice/agents/breeze_buddy/agent/flow.py` at line 30, Replace the overly broad List[Dict] type hint with List[Dict[str, Any]] to provide explicit key and value typing for the ds_messages parameter in the function signatures. This needs to be updated in the return type annotation at line 30 and also at line 130 where this same pattern appears, ensuring the cross-module contract has proper type clarity as per coding guidelines.Source: Coding guidelines
app/database/queries/breeze_buddy/template.py (1)
65-67: ⚡ Quick winAdd explicit timestamp type hints on new signature params.
Line 66, Line 67, and Line 345 introduce new defaulted parameters without type annotations, which weakens static validation for query contracts.
Proposed fix
+from datetime import datetime from typing import Any, Dict, List, Optional, Tuple @@ def create_template_query( @@ - created_at=None, - updated_at=None, + created_at: Optional[datetime] = None, + updated_at: Optional[datetime] = None, ) -> Tuple[str, List[Any]]: @@ def replace_template_query( @@ - updated_at=None, + updated_at: Optional[datetime] = None, ) -> Tuple[str, List[Any]]:As per coding guidelines, "Add type hints on all function signatures."
Also applies to: 344-345
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@app/database/queries/breeze_buddy/template.py` around lines 65 - 67, Add explicit type hints to the parameters introduced without type annotations. In app/database/queries/breeze_buddy/template.py at the anchor site (lines 66-67), add appropriate type annotations to the created_at and updated_at parameters that currently only have default values of None. Similarly, apply the same type hint fixes at the sibling site (lines 344-345) where the same parameters appear. The parameters should follow the pattern of other timestamp parameters in the codebase and use Optional typing since they default to None. This ensures consistency with the coding guideline requiring type hints on all function signatures and strengthens static validation for query contracts.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@app/ai/voice/agents/breeze_buddy/managers/data_source_prefetch.py`:
- Around line 51-53: The redis.setex call in the data_source_prefetch.py file is
missing the required namespace parameter, which violates the coding guidelines
and can cause key collisions across services sharing the same Redis instance.
Add the namespace parameter to the setex call alongside the existing cache_key,
content, and ttl_seconds parameters to ensure proper key isolation across
services.
In `@app/ai/voice/agents/breeze_buddy/template/loader.py`:
- Around line 317-333: Replace the direct redis.get() call with the namespaced
redis_get() helper that includes a namespace parameter to prevent key collisions
across services, passing cache_key and an appropriate namespace identifier.
Additionally, after the fetch_formatted() function successfully returns the
sheet content, immediately persist that content back to cache using the
corresponding namespaced redis_set() helper with the same namespace parameter
and cache_key to ensure the fetched data is available for subsequent cache hits.
- Around line 223-225: The issue is that malformed spreadsheet URLs can raise an
exception during the `spreadsheet_url.split("/d/")[1]` operation before the
protected fetch block, and this exception is currently skipped with `continue`
in the condition checking `isinstance(result, Exception)`. Instead of silently
skipping malformed URLs, wrap the URL parsing logic in a try-except block and
set the result to `DATA_SOURCE_UNAVAILABLE` when an exception occurs during the
split operation, ensuring that the fallback value is used rather than allowing
the variable to disappear entirely.
- Around line 303-304: The combined inline import statement `import hashlib,
json` in the `_fetch_one_data_source` function does not comply with isort
formatting standards. Split the combined import into separate import statements
on individual lines, so `hashlib` and `json` are each imported on their own
line. This will resolve the isort CI check failure.
In `@app/services/google/sheets.py`:
- Around line 75-90: The _fetch() function in the Sheets module obtains a
session from _get_sheets_session() but never closes it, causing resource leaks.
Ensure the session is properly closed after use by wrapping the session usage in
a try-finally block or using a context manager if available. Close the session
in the finally block or after the try block completes, ensuring cleanup happens
regardless of whether an exception occurs. This same fix needs to be applied to
all occurrences of _fetch() functions that follow this pattern (at the locations
mentioned in the review) to prevent socket and file descriptor leaks.
---
Outside diff comments:
In `@app/api/routers/breeze_buddy/templates/handlers.py`:
- Around line 130-145: The `template_data.data_sources` parameter contains
Pydantic models which cannot be directly serialized by json.dumps during
persistence. Normalize the data_sources by converting each Pydantic model to a
dictionary (using .dict() or .model_dump() method) before passing it to the
create_template function call. Additionally, when the data_sources field is
omitted or None in a PUT/replace request, preserve the existing data_sources
value from the current template instead of passing None to create_template, to
avoid unintentionally wiping out the stored value for clients making unrelated
edits. Apply these fixes to both the POST handler (create_template call around
lines 130-145) and the PUT handler (update_template call around lines 526-540).
---
Nitpick comments:
In `@app/ai/voice/agents/breeze_buddy/agent/flow.py`:
- Line 30: Replace the overly broad List[Dict] type hint with List[Dict[str,
Any]] to provide explicit key and value typing for the ds_messages parameter in
the function signatures. This needs to be updated in the return type annotation
at line 30 and also at line 130 where this same pattern appears, ensuring the
cross-module contract has proper type clarity as per coding guidelines.
In `@app/api/routers/breeze_buddy/data_sources/__init__.py`:
- Around line 31-37: The route handlers get_sheet_tabs (lines 31-37), the
function at lines 41-50, and the function at lines 54-65 are missing explicit
return type annotations on their function signatures. Add a concrete return type
annotation to each of these async functions by specifying what the handler
actually returns. For each function, determine the appropriate return type based
on what the handler function (such as list_tabs_handler) returns and add it to
the function signature using the arrow syntax after the closing parenthesis.
In `@app/database/queries/breeze_buddy/template.py`:
- Around line 65-67: Add explicit type hints to the parameters introduced
without type annotations. In app/database/queries/breeze_buddy/template.py at
the anchor site (lines 66-67), add appropriate type annotations to the
created_at and updated_at parameters that currently only have default values of
None. Similarly, apply the same type hint fixes at the sibling site (lines
344-345) where the same parameters appear. The parameters should follow the
pattern of other timestamp parameters in the codebase and use Optional typing
since they default to None. This ensures consistency with the coding guideline
requiring type hints on all function signatures and strengthens static
validation for query contracts.
In `@app/services/google/sheets.py`:
- Around line 42-67: Add explicit return type annotations to the functions
_get_sheets_session() and _get_json() to strengthen static type checking. The
_get_sheets_session() function should have a return type of
Optional[AuthorizedSession] since it returns either an AuthorizedSession
instance or None. The _get_json() function needs an appropriate return type
annotation based on what it returns. Additionally, update the imprecise typing
at lines 133-139 where List[dict] is used and replace it with List[Dict[str,
Any]] to provide concrete dictionary structure typing and align with the coding
guidelines for explicit type hints.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 92c67258-3944-477e-8919-32411565e2e8
⛔ Files ignored due to path filters (1)
uv.lockis excluded by!**/*.lock
📒 Files selected for processing (19)
app/ai/voice/agents/breeze_buddy/agent/__init__.pyapp/ai/voice/agents/breeze_buddy/agent/flow.pyapp/ai/voice/agents/breeze_buddy/dispatch/worker.pyapp/ai/voice/agents/breeze_buddy/managers/data_source_prefetch.pyapp/ai/voice/agents/breeze_buddy/template/loader.pyapp/ai/voice/agents/breeze_buddy/template/types.pyapp/api/routers/breeze_buddy/__init__.pyapp/api/routers/breeze_buddy/data_sources/__init__.pyapp/api/routers/breeze_buddy/data_sources/handlers.pyapp/api/routers/breeze_buddy/templates/handlers.pyapp/database/accessor/breeze_buddy/template.pyapp/database/decoder/breeze_buddy/template.pyapp/database/migrations/034_add_data_sources_to_template.sqlapp/database/queries/breeze_buddy/template.pyapp/schemas/breeze_buddy/data_source.pyapp/services/data_sources.pyapp/services/google/__init__.pyapp/services/google/sheets.pypyproject.toml
| redis = await get_redis_service() | ||
| await redis.setex(cache_key, content, ttl_seconds=_CACHE_TTL) | ||
| logger.info( |
There was a problem hiding this comment.
Add Redis namespace on cache write.
Line 52 writes with setex(...) but omits namespace, which can collide keys across services sharing Redis.
Suggested patch
- await redis.setex(cache_key, content, ttl_seconds=_CACHE_TTL)
+ await redis.setex(
+ cache_key,
+ content,
+ ttl_seconds=_CACHE_TTL,
+ namespace="breeze_buddy",
+ )As per coding guidelines: “Always use namespace parameter in redis_get/redis_set calls to prevent key collisions across services.”
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| redis = await get_redis_service() | |
| await redis.setex(cache_key, content, ttl_seconds=_CACHE_TTL) | |
| logger.info( | |
| redis = await get_redis_service() | |
| await redis.setex( | |
| cache_key, | |
| content, | |
| ttl_seconds=_CACHE_TTL, | |
| namespace="breeze_buddy", | |
| ) | |
| logger.info( |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@app/ai/voice/agents/breeze_buddy/managers/data_source_prefetch.py` around
lines 51 - 53, The redis.setex call in the data_source_prefetch.py file is
missing the required namespace parameter, which violates the coding guidelines
and can cause key collisions across services sharing the same Redis instance.
Add the namespace parameter to the setex call alongside the existing cache_key,
content, and ttl_seconds parameters to ensure proper key isolation across
services.
Source: Coding guidelines
| for ref, result in zip(template_obj.data_sources, contents): | ||
| if isinstance(result, Exception) or result is None: | ||
| continue |
There was a problem hiding this comment.
Malformed spreadsheet URLs can silently skip injection instead of falling back.
spreadsheet_url.split("/d/")[1] can raise before the protected fetch block. In load_template, that exception is currently skipped (continue), so the variable/message may disappear entirely rather than using DATA_SOURCE_UNAVAILABLE.
Suggested fix
- spreadsheet_id = ref.spreadsheet_url.split("/d/")[1].split("/")[0]
+ try:
+ spreadsheet_id = ref.spreadsheet_url.split("/d/")[1].split("/")[0]
+ except Exception:
+ return DATA_SOURCE_UNAVAILABLE
...
- if isinstance(result, Exception) or result is None:
- continue
+ if isinstance(result, Exception) or result is None:
+ result = DATA_SOURCE_UNAVAILABLEAlso applies to: 308-309
🧰 Tools
🪛 Ruff (0.15.15)
[warning] 223-223: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@app/ai/voice/agents/breeze_buddy/template/loader.py` around lines 223 - 225,
The issue is that malformed spreadsheet URLs can raise an exception during the
`spreadsheet_url.split("/d/")[1]` operation before the protected fetch block,
and this exception is currently skipped with `continue` in the condition
checking `isinstance(result, Exception)`. Instead of silently skipping malformed
URLs, wrap the URL parsing logic in a try-except block and set the result to
`DATA_SOURCE_UNAVAILABLE` when an exception occurs during the split operation,
ensuring that the fallback value is used rather than allowing the variable to
disappear entirely.
| import hashlib, json | ||
|
|
There was a problem hiding this comment.
Fix CI-blocking isort import formatting in _fetch_one_data_source.
The combined inline import fails current isort checks.
Suggested fix
- import hashlib, json
+ import hashlib
+ import json🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@app/ai/voice/agents/breeze_buddy/template/loader.py` around lines 303 - 304,
The combined inline import statement `import hashlib, json` in the
`_fetch_one_data_source` function does not comply with isort formatting
standards. Split the combined import into separate import statements on
individual lines, so `hashlib` and `json` are each imported on their own line.
This will resolve the isort CI check failure.
Source: Pipeline failures
| redis = await get_redis_service() | ||
| cached = await redis.get(cache_key) | ||
| if cached: | ||
| return cached | ||
| except Exception: | ||
| pass | ||
|
|
||
| try: | ||
| return await asyncio.wait_for( | ||
| fetch_formatted( | ||
| spreadsheet_id=spreadsheet_id, | ||
| sheet_name=ref.sheet_name, | ||
| columns=ref.columns, | ||
| format=ref.format.value, | ||
| ), | ||
| timeout=0.8, | ||
| ) |
There was a problem hiding this comment.
Use namespaced Redis helpers and persist fetched content on cache miss.
The current path reads Redis directly and never writes the freshly fetched sheet content back, so runtime caching can be bypassed and keys can collide across services.
As per coding guidelines: "Always use namespace parameter in redis_get/redis_set calls to prevent key collisions across services."
🧰 Tools
🪛 Ruff (0.15.15)
[error] 321-322: try-except-pass detected, consider logging the exception
(S110)
[warning] 321-321: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@app/ai/voice/agents/breeze_buddy/template/loader.py` around lines 317 - 333,
Replace the direct redis.get() call with the namespaced redis_get() helper that
includes a namespace parameter to prevent key collisions across services,
passing cache_key and an appropriate namespace identifier. Additionally, after
the fetch_formatted() function successfully returns the sheet content,
immediately persist that content back to cache using the corresponding
namespaced redis_set() helper with the same namespace parameter and cache_key to
ensure the fetched data is available for subsequent cache hits.
Source: Coding guidelines
| def _fetch(): | ||
| session = _get_sheets_session() | ||
| if not session: | ||
| return [] | ||
| try: | ||
| result = _get_json( | ||
| session, | ||
| f"https://sheets.googleapis.com/v4/spreadsheets/{spreadsheet_id}", | ||
| params={"fields": "sheets.properties.title"}, | ||
| ) | ||
| return [sheet["properties"]["title"] for sheet in result.get("sheets", [])] | ||
| except Exception as e: | ||
| logger.error(f"Error listing tabs for {spreadsheet_id}: {e}") | ||
| return [] | ||
|
|
||
| return await asyncio.get_running_loop().run_in_executor(None, _fetch) |
There was a problem hiding this comment.
Close per-request Sheets sessions after use.
On Line 76, Line 100, and Line 142, each _fetch() allocates an authenticated HTTP session but never closes it. Over time this can leak sockets/file descriptors and degrade worker availability.
Suggested patch
def _fetch():
session = _get_sheets_session()
if not session:
return []
try:
...
except Exception as e:
logger.error(...)
return []
+ finally:
+ session.close()Also applies to: 99-130, 141-185
🧰 Tools
🪛 Ruff (0.15.15)
[warning] 86-86: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@app/services/google/sheets.py` around lines 75 - 90, The _fetch() function in
the Sheets module obtains a session from _get_sheets_session() but never closes
it, causing resource leaks. Ensure the session is properly closed after use by
wrapping the session usage in a try-finally block or using a context manager if
available. Close the session in the finally block or after the try block
completes, ensuring cleanup happens regardless of whether an exception occurs.
This same fix needs to be applied to all occurrences of _fetch() functions that
follow this pattern (at the locations mentioned in the review) to prevent socket
and file descriptor leaks.
974542d to
8ca5cd3
Compare
| is_active: bool = True | ||
| data_sources: Optional[List[DataSourceRef]] = Field( | ||
| default=None, | ||
| description="Inline data source configs attached to this template", |
There was a problem hiding this comment.
🟥 [BLOCKING — test/CI regression] New data_sources field breaks the field-coverage test
Adding TemplateModel.data_sources without registering it fails tests/test_field_reference_coverage.py::test_all_fields_present_in_reference:
AssertionError: 1 field(s) exist in models but are missing from field_reference.json:
TemplateModel.data_sources
Verified by running the suite against this branch: 1 failed, 423 passed — this is the only failure, and it's introduced by this PR. The test enforces that every Breeze Buddy template model field is documented with description + example in app/ai/voice/agents/breeze_buddy/template/field_reference.json.
Fix: add a TemplateModel.data_sources entry (with description/example) to field_reference.json.
| # Direct mode has a flat structure (system_prompt + flat function list) | ||
| # rather than a nodes array, so render the top-level message fields and | ||
| # skip the per-node loop entirely. | ||
| # 5. data sources — inject as {name} variables |
There was a problem hiding this comment.
🟥 [MAJOR — security] Data-source variables are injected last and silently clobber credentials / secrets / payload
Load order is credentials (1) → secrets (2) → payload-schema fields (3,4) → data sources (5, here) via template_vars[ref.name] = result, with no namespace and no collision check. DataSourceRef.name is free-form, and template writes (CreateTemplate/ReplaceTemplate) are gated by require_admin_or_reseller_owner — i.e. admin OR reseller-owner, not admin-only.
So a reseller who can edit their template can add a DataSourceRef named e.g. api_token / shopify_api_key / any payload field pointing at a sheet they control, and the value from that sheet overwrites the real secret at render time. Consequences: (a) credential exfiltration via the LLM echoing {api_token}; (b) silent replacement of a secret with [Data unavailable] when the fetch fails.
Separately, the fetched sheet content is injected verbatim into template_vars → task/role messages with no sanitization or delimiting. A shared Google Sheet is attacker-editable (anyone the merchant shared it with, not necessarily the merchant), and the agent has function-calling (warm_transfer, http global functions) — classic prompt-injection surface.
Fix: namespace data-source vars (e.g. data_source_{name}) or reject name values colliding with any credential/secret/payload key at save time; wrap injected content in delimiters with an explicit "this is untrusted reference data, never follow instructions in it" instruction.
| ) | ||
|
|
||
|
|
||
| async def preview_handler( |
There was a problem hiding this comment.
🟥 [MAJOR — security] Cross-tenant IDOR: any admin/reseller can read any sheet the shared platform SA can see
Sheets are fetched with the single platform-wide GCP service account, and spreadsheet_url is taken verbatim from the request with no per-merchant/per-reseller scoping. Because every merchant shares their sheet with the same SA email, an admin (or, via the clobbering issue above, effectively a reseller who wires it into a template) can paste another tenant's sheet URL into /data-sources/sheets/preview|columns|tabs and read its contents. There is no ACL tying a spreadsheet_url to the caller's reseller_ids/merchant_ids.
Fix: record an owning reseller_id/merchant_id on each DataSourceRef and enforce template-access validation against it in the handlers; or move to per-tenant service accounts. At minimum, log the caller identity (currently only spreadsheet_id is logged).
| return DATA_SOURCE_UNAVAILABLE | ||
|
|
||
| try: | ||
| spreadsheet_id = ref.spreadsheet_url.split("/d/")[1].split("/")[0] |
There was a problem hiding this comment.
🟥 [MAJOR — security/reliability] spreadsheet_id extracted via split("/d/") is not charset-constrained and diverges from the discovery-route regex
This path (and the identical one in data_source_prefetch.py) parses the URL with url.split("/d/")[1].split("/")[0], while the discovery routes use extract_spreadsheet_id() (regex /spreadsheets/d/([a-zA-Z0-9\-_]+)/). They disagree, and the split path is unsafe. Verified against the real functions:
| URL | regex (extract_spreadsheet_id) |
split (this code) |
|---|---|---|
https://docs.google.com/d/1ABC/edit |
None → 400 |
1ABC (accepted) |
…/spreadsheets/d/1A;rm%20-rf/edit |
1A (stops at ;) |
1A;rm%20-rf (arbitrary chars) |
The split value is f-interpolated into https://sheets.googleapis.com/v4/spreadsheets/{spreadsheet_id}/… and used as a Redis key. So a URL the discovery routes reject is silently accepted at call time, and the id can contain ;, spaces, etc. — an injection sink into the googleapis resource path. The host is fixed (no arbitrary-host SSRF), but it can retarget the resource path / inject ?/#.
Fix: use extract_spreadsheet_id(url) everywhere and reject None; the same hardening should validate the id against the regex before any URL build in sheets.py.
| columns=ref.columns, | ||
| format=ref.format.value, | ||
| ), | ||
| timeout=0.8, |
There was a problem hiding this comment.
🟥 [MAJOR — reliability] wait_for(..., timeout=0.8) cannot cancel the blocking executor call; thread + session leak, and 0.8s is unrealistic on a cache miss
fetch_formatted → fetch_sheet_data runs its blocking HTTP via run_in_executor(None, _fetch). When wait_for times out it cancels the awaiting coroutine, but a run_in_executor task cannot be cancelled — the thread keeps running the Google HTTP call (up to the 10s socket timeout) and AuthorizedSession.close() in the _fetch finally never fires until the thread finishes, so sockets linger. Under repeated timeouts (very likely — see below) on the shared default executor, this competes with and can starve other run_in_executor users.
Also: 0.8s is far below realistic latency for a cache miss. The cold path does token exchange (often 200–500ms) plus the values request, and two round-trips when sheet_name is None (metadata then values). The prefetch uses 5.0s; the loader using 0.8s means the data source will silently fall back to [Data unavailable] for most cold fetches — the whole feature is effectively dependent on the fire-and-forget prefetch (see the GC comment on worker.py) having already succeeded.
Fix: raise to ~3–5s (match prefetch) or drop the per-call timeout in favor of a shared deadline; move _fetch to a bounded dedicated ThreadPoolExecutor; consider a single get/batchGet for metadata+values.
| ) | ||
| cache_key = _cache_key(ref) | ||
| redis = await get_redis_service() | ||
| await redis.setex(cache_key, content, ttl_seconds=_CACHE_TTL) |
There was a problem hiding this comment.
🟥 [MAJOR — reliability] The error sentinel [Data unavailable] is written to Redis with a 60s TTL
fetch_formatted returns DATA_SOURCE_UNAVAILABLE on any error or empty sheet, and _prefetch_one stores it unconditionally here. The loader's own write path guards with if content != DATA_SOURCE_UNAVAILABLE — but the prefetch path does not. So a single transient blip at dispatch time (Google 5xx, a slow token exchange, a 0.8s timeout upstream) poisons the cache for the full 60s, and every lead dialed in that window renders [Data unavailable] into its prompt even if the sheet recovers seconds later.
Fix: skip setex when content == DATA_SOURCE_UNAVAILABLE; optionally cache a short (~5s) negative marker to avoid hammering Google, but never the 60s positive TTL.
| if template: | ||
| template = apply_playground_overrides(locked, template) | ||
| if template.data_sources: | ||
| asyncio.create_task(prefetch_data_sources(template=template)) |
There was a problem hiding this comment.
🟧 [MAJOR — reliability] Fire-and-forget task with no reference held → silent GC
asyncio.create_task(...) returns a Task that is assigned to nothing. CPython's asyncio keeps only a weak reference to tasks, so the GC may collect it mid-flight (especially under memory pressure), emitting RuntimeWarning: coroutine 'prefetch_data_sources' was never awaited and silently aborting the prefetch. I grepped — no reference is retained anywhere. This is the entire point of the dispatch-time warming; losing it silently degrades to the 0.8s loader-time fetch (which itself usually times out). The two bugs compound: prefetch silently dropped → loader 0.8s timeout → [Data unavailable].
Fix: retain a reference, e.g. a module-level set with a done_callback that discards, or store on the Worker instance (self._prefetch_tasks.add(task); task.add_done_callback(self._prefetch_tasks.discard)).
|
|
||
| return records | ||
| except Exception as e: | ||
| logger.error(f"Error fetching data for {spreadsheet_id}/{sheet_name}: {e}") |
There was a problem hiding this comment.
🟥 [MAJOR — operability] Empty sheet vs. every error class are indistinguishable
fetch_sheet_data catches all exceptions and returns []: 403 (sheet not shared with the SA), 404 (deleted), network error, JSON error — and a legitimately empty sheet all collapse to [], then fetch_formatted maps that to [Data unavailable]. The merchant can't tell "I forgot to share the sheet with the SA" from "the sheet is empty" from "Google is down", and only a generic exception string is logged (no HTTP status). The sharing step is manual, so 403 will be the single most common real-world failure — and it presents as an opaque "Data unavailable" with no actionable signal.
Fix: raise/return distinct signals for auth (403), not-found (404), and genuinely-empty; surface a structured, status-coded reason in logs; for the 403-not-shared case consider a merchant-facing hint.
| return match.group(1) if match else None | ||
|
|
||
|
|
||
| def _get_sheets_session(): |
There was a problem hiding this comment.
🟧 [MAJOR — performance/quota] A new AuthorizedSession (+ JSON credential parse + token exchange) is built on every fetch, with no token caching
Every cache miss does json.loads(GOOGLE_CREDENTIALS_JSON) + service_account.Credentials.from_service_account_info(...) + constructs a fresh AuthorizedSession, then .close()s it in finally. AuthorizedSession lazily mints an OAuth2 token on first request, so each cold fetch = (token round-trip + values round-trip), and back-to-back fetches re-mint tokens. Under the unbounded asyncio.gather concurrency (no Semaphore), many threads exchange tokens simultaneously — wasteful and capable of tripping the SA's per-minute auth quota.
Fix: build one module-level AuthorizedSession lazily and reuse it — google.auth refresh is thread-safe and the session caches the token until expiry.
| return await asyncio.get_running_loop().run_in_executor(None, _fetch) | ||
|
|
||
|
|
||
| def _rows_to_markdown_table(headers: List[str], rows: List[dict]) -> str: |
There was a problem hiding this comment.
🟧 [MAJOR — correctness] Markdown table cells/headers aren't escaped — a | or newline produces a malformed table
Verified against the real _rows_to_markdown_table: a cell value pick A | B renders as | pick A | B | yes | — a phantom third column — and a newline in a cell splits the row across lines. The LLM sees a broken table. (The CSV path handles this correctly via csv.writer.)
| q | a |
| --- | --- |
| pick A | B | yes | <- one cell became two columns
Fix: escape | → \| and \n → space (or <br>) in cell/header values in _rows_to_markdown_table.
| ALTER TABLE template ADD COLUMN data_sources JSONB DEFAULT NULL; | ||
|
|
||
| COMMENT ON COLUMN template.data_sources IS | ||
| 'Array of inline DataSourceRef: [{name, inject_as, spreadsheet_url, sheet_name?, columns?, format?, is_active}]'; |
There was a problem hiding this comment.
🟧 [MAJOR — docs/contract] Migration comment references a non-existent inject_as field; PR description advertises features that aren't implemented
The column COMMENT documents [{name, inject_as, spreadsheet_url, ...}], but DataSourceRef has no inject_as field (fields are name, spreadsheet_url, sheet_name, columns, format, is_active). grep confirms inject_as appears only in this SQL comment.
This matches the PR description, which claims:
inject_as="message"→ prepended system messageModified: flow.py — _ds_messages propagation- "
_ds_messagesreturned separately fromload_template()— survives playground overrides"
None of that exists in the diff. The actual implementation only ever injects as template_vars[{name}] (the inject_as="var" path), flow.py's change is only deleted comments/docstrings, and there is no _ds_messages anywhere. So the shipped behavior is a subset of what's described.
Fix: either implement the message injection mode + _ds_messages, or correct the PR description and drop inject_as from this COMMENT so the persisted schema docs match the model.
PR #830 review — inline Google Sheets data sourcesVerdict: Request changes. The design (inline ✅ Verified by actually running it
🟥 Critical / blocking
🟥 Major — security
🟥 Major — reliability
🟧 Major — operability / perf / correctness
🟨 Minor (not posted inline)
👍 What looks good
Suggested pre-merge order
Review generated by checking out the PR, running the suite + type/format gates, and booting the server against live Postgres/Redis. |
55301fe to
46ebb66
Compare
46ebb66 to
e60b854
Compare
Summary
DataSourceRefconfig stored intemplate.data_sourcesJSONBdata_sourcetable, CRUD endpoints, or accessor layersinject_as="var"→template_vars[{name}]variable substitutioninject_as="message"→ prepended system message/data-sources/sheets/*) for sheet exploration (admin-only)_ds_messagesreturned separately fromload_template()— survives playground overridesChanged Files (20 files, +723/-25)
services/google/sheets.py— Google Sheets API service (markdown/csv/json)managers/data_source_prefetch.py— background cache warmingapi/routers/data_sources/— discovery routes (admin-only)migrations/034—ALTER TABLE template ADD data_sources JSONBtypes.py—DataSourceRefwith inline config +data_sourcesonTemplateModelloader.py— Layer 5 content injectionflow.py—_ds_messagespropagationworker.py— prefetch at dispatchpyproject.toml— addgoogle-api-python-clientSummary by CodeRabbit
Release Notes
New Features