Handle add-on filesystem errors gracefully and reduce Sentry noise#6707

Open
agners wants to merge 1 commit into main from improve-add-on-file-system-error-message

agners commented Apr 7, 2026

Proposed change

Handle OSError (e.g. errno 74 / EBADMSG) in add-on metadata reads (long_description, refresh_path_cache) gracefully instead of letting them bubble up as unhandled exceptions. A new translatable AddonFileReadError is raised after calling check_oserror() to mark the system unhealthy, giving API consumers a proper error response.
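
The pattern described above can be sketched roughly as follows. This is not the code from this PR: the `Resolution` class, the `"oserror_bad_message"` reason string, the exception's constructor, and the `read_long_description` helper are hypothetical stand-ins, used only to show the order of operations (mark unhealthy first, then raise the API-facing error).

```python
import errno
from pathlib import Path


class AddonFileReadError(Exception):
    """Stand-in for the translatable API error (hypothetical shape)."""

    def __init__(self, addon: str, path: object) -> None:
        super().__init__(f"Could not read {path} for add-on {addon}")
        self.addon = addon
        self.path = path


class Resolution:
    """Minimal stand-in for the resolution system (assumption)."""

    def __init__(self) -> None:
        self.unhealthy: list[str] = []

    def check_oserror(self, err: OSError) -> None:
        # Assumption: only EBADMSG (errno 74) marks the system unhealthy here.
        if err.errno == errno.EBADMSG and "oserror_bad_message" not in self.unhealthy:
            self.unhealthy.append("oserror_bad_message")


def read_long_description(resolution: Resolution, addon: str, path: Path) -> str:
    """Read add-on metadata, converting OSError into an API-friendly error."""
    try:
        return path.read_text(encoding="utf-8")
    except OSError as err:
        # Let the resolution system record the problem first ...
        resolution.check_oserror(err)
        # ... then raise an error that API consumers can render properly.
        raise AddonFileReadError(addon, path) from err
```

Chaining with `from err` keeps the original `OSError` available as `__cause__` for debugging while callers only need to handle the one exception type.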

Additionally, in core.py setup(), skip Sentry reporting when the resolution system has already handled the error (detected by checking if a new unhealthy reason was added during task execution). This avoids flooding Sentry with filesystem corruption errors that aren't actionable for developers -- the user is already notified via the resolution system. The log level is also lowered from critical (which triggers Sentry via LoggingIntegration) to error without stack trace in that case.
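
As a rough illustration of the detection described above (a sketch under assumptions, not the actual `core.py` code): the unhealthy reasons are snapshotted before the task runs and compared afterwards. `Resolution`, `run_setup_task`, and the injected `capture_exception` callable are hypothetical names chosen for this example.

```python
import asyncio
import logging
from collections.abc import Awaitable, Callable

_LOGGER = logging.getLogger(__name__)


class Resolution:
    """Minimal stand-in for the resolution system (assumption)."""

    def __init__(self) -> None:
        self.unhealthy: list[str] = []


async def run_setup_task(
    resolution: Resolution,
    task: Callable[[], Awaitable[None]],
    capture_exception: Callable[[BaseException], None],
) -> None:
    """Run one setup task; skip Sentry if the resolution system handled the error."""
    before = set(resolution.unhealthy)
    try:
        await task()
    except Exception as err:
        if set(resolution.unhealthy) - before:
            # A new unhealthy reason appeared while the task ran, so the user
            # is already notified: log at error level without a stack trace.
            _LOGGER.error("Setup task failed with an already handled error: %s", err)
            return
        # Not handled elsewhere: log loudly and report explicitly.
        _LOGGER.critical("Fatal error during setup: %s", err, exc_info=True)
        capture_exception(err)
```

One limitation of this comparison is that it sees nothing when the same unhealthy reason was already set before the task started, so a repeated failure of the same kind would still be reported.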

SUPERVISOR-BC6 alone accounts for 548K Sentry events from a single user with a corrupt filesystem.

Type of change

  • Dependency upgrade
  • Bugfix (non-breaking change which fixes an issue)
  • New feature (which adds functionality to the supervisor)
  • Breaking change (fix/feature causing existing functionality to break)
  • Code quality improvements to existing code or addition of tests

Additional information

  • This PR fixes or closes issue: fixes SUPERVISOR-BC6, SUPERVISOR-BZJ
  • This PR is related to issue:
  • Link to documentation pull request:
  • Link to cli pull request:
  • Link to client library pull request:

Checklist

  • The code change is tested and works locally.
  • Local tests pass. Your PR cannot be merged unless tests pass
  • There is no commented out code in this PR.
  • I have followed the development checklist
  • The code has been formatted using Ruff (ruff format supervisor tests)
  • Tests have been added to verify that the new code works.

If API endpoints or add-on configuration are added/changed:

Add AddonFileReadError for add-on metadata read failures (long_description,
refresh_path_cache) caused by filesystem errors like EBADMSG (errno 74).
The new exception calls check_oserror() to mark the system unhealthy via
the resolution system, then raises a translatable API error so callers
get a proper error response instead of an unhandled OSError.

Fixes SUPERVISOR-BC6 (548K events from the API path) and
SUPERVISOR-BZJ (from the startup/load path).

In core.py setup(), skip reporting exceptions to Sentry when the error
has already been handled by the resolution system. This is detected by
checking if a new unhealthy reason was added during the task execution
(e.g. via check_oserror). In that case the user is already notified, so
we log at error level (no stack trace) instead of critical (which would
also send to Sentry via the LoggingIntegration) and skip the explicit
capture_exception call.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

agners requested a review from mdegat01 April 7, 2026 18:16
agners added the bugfix label Apr 7, 2026

agners commented Apr 7, 2026

This PR got a bit out of hand: I simply wanted to introduce an API error for those read cases. But since the same code is also used during setup(), I also had to come up with a solution for this code path.

I was considering simply handling the new AddonFileReadError exception in setup(), but this didn't seem to scale well (we'd have to keep a list of all the exceptions which we consider handled during setup()) 🤔.

The solution I chose now is also a bit hacky, so not sure if it's really better.

Simply considering all HassioError exceptions as "handled" doesn't work either, since we have quite a few Sentry reports where we may need to inform the user in one form or another:

● OK, here's the consolidated list of Sentry issues coming from core.py:setup():

  From _adjust_system_datetime:

  ┌─────────────────┬───────────────────────────────────────────────────────────────────┬────────┬────────────┐
  │      Issue      │                               Error                               │ Events │   Status   │
  ├─────────────────┼───────────────────────────────────────────────────────────────────┼────────┼────────────┤
  │ SUPERVISOR-92R  │ DBusFatalError: Automatic time synchronization is enabled         │ 14,285 │ unresolved │
  ├─────────────────┼───────────────────────────────────────────────────────────────────┼────────┼────────────┤
  │ SUPERVISOR-W4W  │ DBusFatalError: Failed to set time zone: Permission denied        │ 333    │ unresolved │
  ├─────────────────┼───────────────────────────────────────────────────────────────────┼────────┼────────────┤
  │ SUPERVISOR-15YG │ DBusInterfaceMethodError: Invalid or not installed time zone      │ 77     │ unresolved │
  ├─────────────────┼───────────────────────────────────────────────────────────────────┼────────┼────────────┤
  │ SUPERVISOR-1GF9 │ DBusSystemdNoSuchUnit: Unit systemd-timesyncd.service not loaded. │ 165    │ unresolved │
  ├─────────────────┼───────────────────────────────────────────────────────────────────┼────────┼────────────┤
  │ SUPERVISOR-WD6  │ HostNotSupportedError: No timedate D-Bus connection available     │ 2,003  │ unresolved │
  └─────────────────┴───────────────────────────────────────────────────────────────────┴────────┴────────────┘

  From OSManager.load → data_disk:

  ┌────────────────┬────────────────────────────────────────────┬────────┬────────────┐
  │     Issue      │                   Error                    │ Events │   Status   │
  ├────────────────┼────────────────────────────────────────────┼────────┼────────────┤
  │ SUPERVISOR-S5W │ DBusNotConnectedError (via wrap_dbus)      │ 1,081  │ unresolved │
  ├────────────────┼────────────────────────────────────────────┼────────┼────────────┤
  │ SUPERVISOR-CPN │ DBusNotConnectedError (via connected_dbus) │ 18,155 │ unresolved │
  └────────────────┴────────────────────────────────────────────┴────────┴────────────┘

  From AddonManager.load → asyncio.gather:

  ┌──────────────────────────────┬──────────────────────────────────────────────────────────┬────────┬────────────┐
  │            Issue             │                          Error                           │ Events │   Status   │
  ├──────────────────────────────┼──────────────────────────────────────────────────────────┼────────┼────────────┤
  │ SUPERVISOR-BZJ               │ OSError: [Errno 74] Bad message (icon.png .exists())     │ 57     │ resolved   │
  ├──────────────────────────────┼──────────────────────────────────────────────────────────┼────────┼────────────┤
  │ SUPERVISOR-VAX               │ OSError: [Errno 74] Bad message (DOCS.md .exists())      │ 12     │ resolved   │
  ├──────────────────────────────┼──────────────────────────────────────────────────────────┼────────┼────────────┤
  │ SUPERVISOR-1BA8              │ OSError: [Errno 74] Bad message (translations .exists()) │ 1      │ resolved   │
  ├──────────────────────────────┼──────────────────────────────────────────────────────────┼────────┼────────────┤
  │ Multiple JobException issues │ Docker attach/install failures during addon.load()       │ varies │ unresolved │
  └──────────────────────────────┴──────────────────────────────────────────────────────────┴────────┴────────────┘

  From StoreManager.load → store/data.py:

  ┌──────────────────────────────┬─────────────────────────────────────────────────────────────┬────────┬────────┐
  │            Issue             │                            Error                            │ Events │ Status │
  ├──────────────────────────────┼─────────────────────────────────────────────────────────────┼────────┼────────┤
  │ SUPERVISOR-9X5               │ TypeError: expected string or bytes-like object, got 'dict' │ recent │ -      │
  └──────────────────────────────┴─────────────────────────────────────────────────────────────┴────────┴────────┘

Thoughts?

mdegat01 commented Apr 8, 2026

I mean my first thought reading the list is that we probably shouldn't be relying on setup to report the HassioError type exceptions we want to know about. Take those errors from _adjust_system_datetime: setup is not the only thing that calls that. If those errors are something we (the Supervisor dev team) and the user need to know about, then they should be logged and reported from that method. Otherwise, when they occur while making changes on a running Supervisor from the API, neither of us will be properly informed.

So yeah, my take would be that using setup for reporting should be a last resort. If HassioErrors are not being handled and reported properly in the places they are being raised, then let's fix that. setup should just make sure the ones that weren't handled (non-HassioError type exceptions) have some last-resort logging and capturing.

We could also require each of these load methods to be jobs with the annotation, and then even that handling can be dropped since the Job decorator takes care of that already.
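
One possible reading of the "last resort" role for setup described in this comment, as a minimal sketch: `HassioError` here is a local stand-in for Supervisor's base exception, and `run_setup_task` plus the injected `capture_exception` callable are hypothetical names. It assumes every `HassioError` has already been logged and reported where it was raised.

```python
import asyncio
import logging
from collections.abc import Awaitable, Callable

_LOGGER = logging.getLogger(__name__)


class HassioError(Exception):
    """Stand-in for Supervisor's base exception class (assumption)."""


async def run_setup_task(
    task: Callable[[], Awaitable[None]],
    capture_exception: Callable[[BaseException], None],
) -> None:
    """Run one setup task; only unexpected exceptions are captured here."""
    try:
        await task()
    except HassioError as err:
        # Assumed to be handled and reported at the place it was raised.
        _LOGGER.error("Setup task failed: %s", err)
    except Exception as err:
        # Last resort: anything that is not a HassioError was not anticipated.
        _LOGGER.critical("Fatal error during setup: %s", err, exc_info=True)
        capture_exception(err)
```

With this split, the Sentry issues listed earlier in the thread would have to be reported from the raising code itself (for example from `_adjust_system_datetime`) to remain visible.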
