Open
Changes from 47 commits
Commits
50 commits
bb0349a
feat(gateway): add route fields for CORS, compression, JWT, mTLS
poyrazK May 18, 2026
82af532
feat(gateway): wire JWT validation, mTLS, compression, dry-run, metri…
poyrazK May 18, 2026
251845e
Add gateway handler tests for CORS, compression, and dry-run
poyrazK May 19, 2026
9e2df55
Fix PR #596 code review findings
poyrazK May 19, 2026
86b736c
Fix remaining review gaps
poyrazK May 19, 2026
894c348
docs: add ADR-028 for gateway per-route rate limiting
poyrazK May 19, 2026
2841029
Add JWKS circuit breaker and empty-key validation
poyrazK May 19, 2026
fa1d4f9
Address JWKS review observations
poyrazK May 19, 2026
63248e7
Add JWT handler tests: valid token, invalid token error, claims propa…
poyrazK May 19, 2026
893f9f5
test: add TraceContext preservation and gzip flush tests, fix GetProx…
poyrazK May 19, 2026
2e47bdc
docs: update FEATURES.md with gateway capabilities and add ADR-029
poyrazK May 19, 2026
69dcc25
docs: expand cloud-gateway.md with full route config, JWT, mTLS, CORS…
poyrazK May 19, 2026
e77d2a2
fix(gateway): use net.Error interface in isRetryableError, drain body…
poyrazK May 19, 2026
34d6037
merge: resolve conflicts keeping our gateway features
poyrazK May 19, 2026
e7e86d8
fix: resolve merge conflicts and update to main router signatures
poyrazK May 19, 2026
6442565
chore: regenerate swagger docs
poyrazK May 19, 2026
27f39ba
style: run gofmt on gateway files
poyrazK May 19, 2026
5ab7282
fix(gateway): address lint findings - crypto rand, errcheck, bodyclose
poyrazK May 19, 2026
0660fdc
style: fix formatting on gateway.go
poyrazK May 19, 2026
ab91877
fix(gateway): address remaining lint findings - bodyclose, errcheck, …
poyrazK May 19, 2026
d7f6cef
fix(gateway): add errcheck ignore for io.Copy
poyrazK May 19, 2026
a888a41
refactor(gateway): extract JWT, body size, rate limit into helper met…
poyrazK May 19, 2026
1826713
fix retry transport to detect "timeout" in error strings
poyrazK May 19, 2026
aad99dd
fix gateway GetProxy nil proxy handling to prevent panics
poyrazK May 19, 2026
42b14f2
fix gateway: return last error when retries exhausted instead of nil …
poyrazK May 19, 2026
d6461e2
fix(gateway): add fast-fail circuit breaker to prevent retry storms
poyrazK May 22, 2026
49e6270
fix(gateway): address PR #596 review findings
poyrazK May 22, 2026
773b4cb
fix(gateway): sanitize JWT error to prevent timing attacks
poyrazK May 22, 2026
4710b97
docs(gateway): document sync.Pool trade-off for gzip writers
poyrazK May 22, 2026
2777dbd
style: fix gofmt formatting on gateway.go
poyrazK May 22, 2026
792e00a
fix(gateway): address CodeRabbit findings
poyrazK May 22, 2026
0d61bda
style: fix gofmt alignment in retryTransport struct
poyrazK May 22, 2026
41e3c5e
fix(gateway): address remaining lint issues
poyrazK May 22, 2026
ebe61d7
fix(gateway): suppress G115 warning with explicit lint ignore
poyrazK May 22, 2026
9279465
fix(gateway): use int32 for fastFailThreshold to eliminate G115 warning
poyrazK May 22, 2026
832ecd1
fix(gateway): fix nolint directive - use G115 not gosec/G115
poyrazK May 22, 2026
02175a7
fix(gateway): use gosec nolint for G115 suppression
poyrazK May 22, 2026
6c85ae4
style: gofmt formatting fix
poyrazK May 22, 2026
39ba8c1
fix(storage): increase sleep time for flaky repair stream test
poyrazK May 22, 2026
edb6420
Merge origin/main into release/gateway-features
poyrazK May 22, 2026
14b3c64
fix(gateway): flush gzip writer before returning to pool
poyrazK May 22, 2026
7479197
fix(gateway): safe Hijack implementation with type assertion
poyrazK May 22, 2026
cf83504
temp(discovery): disable gzip compression in gateway
poyrazK May 24, 2026
b773df6
fix(gateway): flush gzip before returning to pool
poyrazK May 24, 2026
8ad8836
fix(gateway): remove sync.Pool for gzip writers
poyrazK May 24, 2026
dd9c718
debug: disable gzip to isolate E2E failures
poyrazK May 24, 2026
9de3441
fix(gateway): re-enable gzip with simpler implementation
poyrazK May 24, 2026
590d6ef
style: run gofmt on gateway_handler
poyrazK May 24, 2026
1e35fad
fix(gateway): preserve response body on retryable status for caller
poyrazK May 24, 2026
d97bb53
fix(gateway): do not drain/close body on successful non-retryable res…
poyrazK May 24, 2026
14 changes: 11 additions & 3 deletions docs/FEATURES.md
@@ -241,14 +241,22 @@ This document provides a comprehensive overview of every feature currently imple
- **VPC Scoped**: Zones are scoped to VPCs for private network resolution.

### 13. API Gateway 🆕
**What it is**: Managed entry point for microservices with advanced routing, pattern matching, and rate limiting.
**Tech Stack**: Go `httputil.ReverseProxy`, Redis, Regex Matcher.
**What it is**: Managed entry point for microservices with advanced routing, pattern matching, rate limiting, JWT authentication, mTLS, and observability.
**Tech Stack**: Go `httputil.ReverseProxy`, Redis, Regex Matcher, `golang.org/x/time/rate`, Prometheus metrics, `golang.org/x/sync/singleflight`.

**Implementation**:
- **Advanced Pattern Matching**: Support for RESTful patterns like `/users/{id}`, regex-constrained parameters `/id/{id:[0-9]+}`, and wildcards `/static/*`.
- **HTTP Method Routing**: Route requests to different backends based on the HTTP verb (GET, POST, etc.) for the same path.
- **Dynamic Specificity Scoring**: Automatic route selection based on prefix specificity, exact match bonuses, and explicit user-defined priority.
- **Prefix Stripping**: Intelligent stripping of path patterns before forwarding to downstream services.
- **Rate Limiting**: Integrated distributed rate limiting per route.
- **Rate Limiting**: Integrated distributed rate limiting per route using a token-bucket algorithm. Clients are identified by a per-client key (API key prefix or IP address).
- **Circuit Breaker & Retry**: Per-route circuit breaker with configurable threshold and reset timeout. Automatic retry with exponential backoff on `502`, `503`, `504`, `429` responses. Only idempotent methods (GET, HEAD, PUT, DELETE, OPTIONS) are retried.
- **JWT Authentication**: JWKS-backed JWT validation with issuer and audience verification. RSA public keys parsed from JWK `n`/`e` parameters. JWKS fetches are deduplicated via singleflight and protected by a circuit breaker. Claims are propagated to upstream services via `X-JWT-Claim-*` headers.
- **mTLS Support**: Configurable client certificates (PEM) and CA certificates for backend TLS verification.
- **CORS**: Per-route CORS configuration with `allowed_origins`, `allowed_methods`, `allowed_headers`, `expose_headers`, and `max_age`.
- **Compression**: Gzip response compression when the client advertises support and the route has compression enabled.
- **Dry-Run Validation**: Route creation supports `?dry_run=true` to validate CIDR blocks, TLS conflicts, and mTLS certificate pairing without persisting.
- **Observability**: Prometheus metrics for upstream latency (`thecloud_gateway_upstream_latency_seconds`), retry totals (`thecloud_gateway_retry_total`), circuit breaker state (`thecloud_gateway_circuit_breaker_state`), JWKS fetch totals and breaker state (`thecloud_jwks_fetch_total`, `thecloud_jwks_breaker_state`). W3C TraceContext headers (`traceparent`, `tracestate`) are preserved or generated for upstream traceability.
- **Audit Logging**: Comprehensive tracking of all route changes and gateway operations.
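
The pattern forms listed above (`/users/{id}`, `/id/{id:[0-9]+}`, `/static/*`) can be illustrated by compiling each pattern into an anchored regular expression. This is a minimal sketch, not the gateway's matcher: `compilePattern` and `matchParams` are hypothetical names, and the semantics (an unconstrained parameter matches one path segment, a trailing `/*` matches any remainder, constraints may not contain `}`) are assumptions.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// compilePattern turns a route pattern into an anchored regular expression.
// Hypothetical helper: the matching semantics are assumptions for illustration.
func compilePattern(pattern string) (*regexp.Regexp, error) {
	// Assumption: a trailing "/*" means "match any remainder of the path".
	wildcard := strings.HasSuffix(pattern, "/*")
	rest := pattern
	if wildcard {
		rest = strings.TrimSuffix(pattern, "*")
	}

	var b strings.Builder
	b.WriteString("^")
	for {
		open := strings.IndexByte(rest, '{')
		if open < 0 {
			break
		}
		end := strings.IndexByte(rest[open:], '}')
		if end < 0 {
			return nil, fmt.Errorf("unclosed parameter in %q", pattern)
		}
		end += open
		// Literal text before the parameter is matched verbatim.
		b.WriteString(regexp.QuoteMeta(rest[:open]))
		name, constraint, ok := strings.Cut(rest[open+1:end], ":")
		if !ok {
			constraint = "[^/]+" // unconstrained parameters match one segment
		}
		fmt.Fprintf(&b, "(?P<%s>%s)", name, constraint)
		rest = rest[end+1:]
	}
	b.WriteString(regexp.QuoteMeta(rest))
	if wildcard {
		b.WriteString(".*")
	}
	b.WriteString("$")
	return regexp.Compile(b.String())
}

// matchParams reports whether path matches and returns the named parameters.
func matchParams(re *regexp.Regexp, path string) (map[string]string, bool) {
	m := re.FindStringSubmatch(path)
	if m == nil {
		return nil, false
	}
	params := make(map[string]string)
	for i, name := range re.SubexpNames() {
		if name != "" {
			params[name] = m[i]
		}
	}
	return params, true
}

func main() {
	for _, p := range []string{"/users/{id}", "/id/{id:[0-9]+}", "/static/*"} {
		re, err := compilePattern(p)
		if err != nil {
			fmt.Println("bad pattern:", err)
			continue
		}
		params, ok := matchParams(re, "/id/42")
		fmt.Printf("%-18s matches /id/42: %v %v\n", p, ok, params)
	}
}
```

Specificity scoring and priority ordering are left out; the sketch only covers how a single pattern could be matched.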

### 14. CloudStacks (Native IaC) 🆕
2 changes: 1 addition & 1 deletion docs/adr/ADR-028-gateway-per-route-rate-limiting.md
@@ -47,4 +47,4 @@ We introduced per-route rate limiting that supplements the existing global rate
**Why rejected:** Introduces new dependency; the existing `golang.org/x/time/rate` meets requirements with minimal added code.

### Alternative 3: Global limiter with route-specific token deductions
**Why rejected:** Complex to reason about; one route's burst could starve another. Independent limiters per route provide cleaner semantics.
**Why rejected:** Complex to reason about; one route's burst could starve another. Independent limiters per route provide cleaner semantics.
63 changes: 63 additions & 0 deletions docs/adr/ADR-029-gateway-security-resilience.md
@@ -0,0 +1,63 @@
# ADR 029: Gateway Security and Resilience

## Status
Accepted

## Date
2026-05-19

## Context

PR #596 extends the API gateway with security (JWT, mTLS), resilience (circuit breaker, retry), and observability features. These require non-trivial decisions about how external trust is established, how failures are handled, and how observability data is exposed.

## Decision

### JWT Authentication with JWKS

The gateway validates JWTs by fetching public keys from a JWKS endpoint. Routes may share a JWKS URL or each use their own, so we keep a **per-route JWKS cache** with a 5-minute TTL. JWKS fetches are deduplicated with `singleflight.Group`, so concurrent requests for the same keyset result in a single HTTP call.

**JWKS circuit breaker**: When the JWKS endpoint returns errors, a circuit breaker (`Threshold: 3`, `ResetTimeout: 30s`) blocks further fetch attempts while it is open; fetching resumes once a half-open probe succeeds. Metrics exported: `JWKSBreakerState` (gauge 0/1/2) and `JWKSFetchTotal` (counter with labels: `success`, `error`, `circuit_open`).

**RSA key parsing**: Keys are parsed from JWK `n` (modulus) and `e` (exponent) using `math/big.Int`. An exponent `>= 1<<30` is rejected as a defensive measure against overflow.

**Claim propagation**: Validated JWT claims are forwarded to the upstream service as `X-JWT-Claim-{key}` headers.

### mTLS Configuration

Routes can be configured with `client_cert` + `client_key` (PEM) for client certificates and `ca_cert` for backend certificate verification. Both client cert and key must be provided together; CA cert is optional. `buildTLSConfig` returns descriptive errors on malformed certs/keys rather than silently ignoring them.
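
The pairing rule and the error behavior can be sketched as follows. Only the function name `buildTLSConfig` comes from the text above; the signature, the error wording, and the use of `RootCAs` for backend verification are assumptions.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
)

// buildTLSConfig assembles a client-side TLS config for talking to a backend.
// Sketch only: the real function's signature is not shown in this ADR.
func buildTLSConfig(clientCertPEM, clientKeyPEM, caCertPEM []byte) (*tls.Config, error) {
	cfg := &tls.Config{}

	// client_cert and client_key must be provided together.
	if (len(clientCertPEM) == 0) != (len(clientKeyPEM) == 0) {
		return nil, errors.New("client_cert and client_key must be provided together")
	}
	if len(clientCertPEM) > 0 {
		pair, err := tls.X509KeyPair(clientCertPEM, clientKeyPEM)
		if err != nil {
			return nil, fmt.Errorf("parse client certificate/key: %w", err)
		}
		cfg.Certificates = []tls.Certificate{pair}
	}

	// ca_cert is optional; when present it replaces the system roots.
	if len(caCertPEM) > 0 {
		pool := x509.NewCertPool()
		if !pool.AppendCertsFromPEM(caCertPEM) {
			return nil, errors.New("ca_cert contains no valid PEM certificate")
		}
		cfg.RootCAs = pool
	}
	return cfg, nil
}

func main() {
	_, err := buildTLSConfig([]byte("-----BEGIN CERTIFICATE-----"), nil, nil)
	fmt.Println("cert without key:", err)

	_, err = buildTLSConfig(nil, nil, []byte("not a pem block"))
	fmt.Println("malformed CA:", err)

	cfg, err := buildTLSConfig(nil, nil, nil)
	fmt.Println("no TLS material:", cfg != nil, err)
}
```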

### Circuit Breaker and Retry

Each route can set `circuit_breaker_threshold` (the number of consecutive failures that trips the breaker) and `circuit_breaker_timeout` (milliseconds the breaker stays open before moving to half-open). Retry behavior is controlled by `max_retries` and `retry_timeout`. Retries are attempted only for idempotent methods (GET, HEAD, PUT, DELETE, OPTIONS), and only on retryable status codes (502, 503, 504, 429) or retryable errors (connection refused, timeout, reset by peer, broken pipe).

### CORS

Per-route CORS uses `allowed_origins`, `allowed_methods`, `allowed_headers`, `expose_headers`, and `max_age`. When `allowed_origins` includes `"*"` with credentials, the response sets `Access-Control-Allow-Credentials: true`.

## Consequences

### Positive
- JWT authentication enables stateless auth with upstream services trusting gateway-validated tokens
- JWKS singleflight deduplication prevents thundering herd on shared JWKS endpoints
- Circuit breaker protects gateway from cascading failures when backends are unhealthy
- mTLS enables secure service-to-service communication within the cluster

### Negative
- JWKS caching introduces staleness window (up to 5 minutes for rotated keys)
- Circuit breaker state is in-memory only — restarts reset all circuits
- Retry behavior increases latency on failure and may amplify load on unhealthy backends

### Neutral
- CORS headers are only injected when `allowed_origins` is non-empty
- TLS settings (skip verify, require TLS, client certs) are independent concerns

## Alternatives Considered

### Alternative 1: Opaque JWT validation without JWKS caching
**Why rejected:** Would require fetching JWKS on every request, defeating the purpose of JWT's stateless design and creating unnecessary latency.

### Alternative 2: Global circuit breaker for all backends
**Why rejected:** Per-route circuit breakers provide finer-grained isolation — one unhealthy backend shouldn't trip the breaker for unrelated routes.

### Alternative 3: Retry on any HTTP error
**Why rejected:** Non-idempotent methods (POST, PATCH) must not be retried automatically as that could cause duplicate operations. The current implementation limits retries to idempotent methods and specific error codes.