Reliability expectations and practices for this project.
GET /healthverifies B2 connectivity and returnshealthyordegraded- Health endpoint is always available, even when B2 is down
- HTTP handlers return structured error responses with appropriate status codes
- External service failures (B2) are caught and surfaced as 500/503 responses
- No unhandled exceptions leak stack traces to clients
- Uncaught exceptions are converted to a typed JSON 500 (
{"detail": "Internal server error"}, modeled byapp.types.ErrorResponse) by the catch-all intiming_middleware - Error responses carry CORS headers.
CORSMiddlewareis registered LAST inmain.pyso it is the outermost middleware and wraps every response — including uncaught-exception 500s produced by the inner catch-all. This is intentional and load-bearing: if a 500 shipped withoutAccess-Control-Allow-Origin, the browser would block it and the frontend would surface only an opaque "network error", hiding the real server bug. Regression-guarded bytests/test_error_handling.py::test_unhandled_exception_500_carries_cors_headers.
- Structured JSON logging via Python stdlib
- Every request gets a
request_idfor tracing - Log levels: ERROR for failures, WARNING for degraded state, INFO for requests
- Request timing middleware logs duration for every request
/metricsendpoint exposes basic Prometheus-format counters- Upload success/failure counts tracked
- File listing returns empty list (not error) when B2 has no objects
- Metadata extraction failures don't block upload (return partial metadata)
- Frontend shows skeleton states while loading, error states on failure
- Railway health checks on
/health - Zero-downtime deploys via rolling updates
- Environment-specific configuration via env vars (no config files in prod)