feat(sync): [DC-216839]: use HTTP/2 with automatic HTTP/1.1 fallback on framing errors#4114
feat(sync): [DC-216839]: use HTTP/2 with automatic HTTP/1.1 fallback on framing errors#4114rchincha wants to merge 25 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
Adds a sync-specific HTTP transport that prefers HTTP/2 but automatically retries over HTTP/1.1 when upstreams emit HTTP/2 framing/protocol errors (e.g., the “malformed HTTP response” class of failures), and documents/validates the behavior via tests.
Changes:
- Introduces
http2FallbackTransportand wires it into sync’s regclient HTTP client. - Adds unit coverage for framing-error classification + retry/sticky-host behavior, plus a new blackbox smoke test and CI inclusion.
- Updates documentation and module dependencies to support the new HTTP/2 transport logic.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| test/ports.json | Allocates port ranges for the new blackbox test. |
| test/blackbox/sync_http2_fallback.bats | New smoke test for sync behavior with the new transport (happy-path). |
| test/blackbox/ci.sh | Adds the new blackbox test to CI execution list. |
| pkg/extensions/sync/sync_test.go | Makes TLS sync test more robust by retrying until the manifest is available. |
| pkg/extensions/sync/service.go | Switches sync regclient HTTP client transport to the new HTTP/2-with-fallback transport. |
| pkg/extensions/sync/http2_fallback.go | Implements HTTP/2-first RoundTripper with sticky HTTP/1.1 fallback on framing errors. |
| pkg/extensions/sync/http2_fallback_internal_test.go | Adds focused unit tests for fallback, stickiness, and real-transport boundary behavior. |
| go.mod | Promotes golang.org/x/net to a direct dependency (for http2). |
| examples/README.md | Documents the new HTTP/2 transport and automatic fallback behavior for operators. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
177100e to
491ab77
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4114 +/- ##
==========================================
- Coverage 91.55% 91.44% -0.11%
==========================================
Files 200 201 +1
Lines 29306 29479 +173
==========================================
+ Hits 26830 26958 +128
- Misses 1594 1625 +31
- Partials 882 896 +14 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
…on framing errors Default to HTTP/2 for upstream sync; on the first INTERNAL_ERROR / malformed framing response (Mode B), automatically retry with HTTP/1.1 and remember the choice for subsequent calls. Includes blob streaming path. Replaces the prior unconditional "force HTTP/1.1" workaround so we keep HTTP/2 perf for healthy upstreams. Squashed from c02c0d8a, 26d13489, 0e04f5e4. Signed-off-by: Marga Rudno-Rudzińska <marga.rudno@gmail.com>
Move http2FallbackTransport, isHTTP2FramingError, and the transport construction out of service.go into pkg/extensions/sync/http2_fallback.go. service.go now calls a single newHTTP2FallbackTransport() helper. Same anti-pattern rchincha flagged on project-zot#4063 ("too much going in pkg/api/routes.go wrt storage etc. We should move all that code out to pkg/api/blobstream.go"). Pure refactor — no behavior change. Signed-off-by: Marga Rudno-Rudzińska <marga.rudno@gmail.com>
…rrors Replace substring matches on err.Error() with errors.As against http2.StreamError and *http2.GoAwayError; keep the "malformed HTTP response" substring check with a comment linking the Go stdlib gap (no typed error is exposed for that path). Add unit tests in http2_fallback_test.go covering the classifier and the RoundTrip primary/fallback paths, including the body-rewind branch. Signed-off-by: Marga Rudno-Rudzińska <marga.rudno@gmail.com>
Extract the shared DefaultTransport.Clone + ResponseHeaderTimeout setup into clonedTransport so the primary and HTTP/1.1 fallback transports are built from the same code path. The fallback now only differs by disabling HTTP/2 negotiation, which keeps the divergence between the two paths visible at a glance and prevents the timeout setting from silently drifting between them. Restore the rationale comments around the timeout configuration and the purpose of the fallback overrides. Signed-off-by: Marga Rudno-Rudzińska <marga.rudno@gmail.com>
After a host framed-errors once, route every subsequent request for that host straight to the HTTP/1.1 fallback for 15 minutes. The upstream LB that triggers the fallback (Docker Hub, CDNs sitting in front of a registry) keeps the misbehavior across many TCP connections, so paying the HTTP/2 attempt + retry round-trip on every sync request adds latency without changing the outcome. State lives in a sync.Map keyed by URL host; entries expire on read so no background reaper is needed. A pluggable clock keeps the TTL test deterministic. After the TTL the host gets another HTTP/2 attempt, so a temporary upstream issue does not permanently downgrade the host. Signed-off-by: Marga Rudno-Rudzińska <marga.rudno@gmail.com>
…back Add an operator-facing section under examples/README.md > Sync that explains the new default (HTTP/2 for upstream sync), the framing-error fallback path, and the sticky-per-host behavior with its 15-minute TTL. Include upgrade notes so operators coming from the previous HTTP/1.1-only sync transport know no configuration change is required and what log line indicates a host has been downgraded. Signed-off-by: Marga Rudno-Rudzińska <marga.rudno@gmail.com>
Adds test/blackbox/sync_http2_fallback.bats covering the on-demand sync happy path through the new HTTP/2 fallback transport (the fallback path itself is unit tested in pkg/extensions/sync/http2_fallback_test.go). Allocates port range 11540-11549 (upstream) and 11560-11569 (downstream) in test/ports.json and registers the test in test/blackbox/ci.sh#L19. Addresses rchincha's "Pls also consider adding a BATS test under test/blackbox" pattern from project-zot#4063. Signed-off-by: Marga Rudno-Rudzińska <marga.rudno@gmail.com>
…ureTransport The cloned http.DefaultTransport used Go's bundled http2, whose error types differ from golang.org/x/net/http2, so errors.As never matched and typed-error fallback never fired. ConfigureTransport rewires the primary to the external package. Replaces the bats smoke test with a real-server integration test and corrects the upgrade note. Signed-off-by: Marga Rudno-Rudzińska <marga.rudno@gmail.com>
The transport refactor dropped the upstream comments explaining the DefaultTransport clone, ResponseHeaderTimeout, and SyncTimeout. Restore them verbatim where the code now lives: clone/header rationale in clonedTransport, the SyncTimeout note above the http.Client. Signed-off-by: Marga Rudno-Rudzińska <marga.rudno@gmail.com>
…o newClient Signed-off-by: Marga Rudno-Rudzińska <marga.rudno@gmail.com>
- rename file to *_internal_test.go so testpackage allows internal package - replace dynamic errors.New with package-level sentinels (err113) - switch to http.NewRequestWithContext using t.Context() (noctx) - close stub response bodies via defer (bodyclose) - modernize for-loop to range-over-int (modernize) - rename short var tr to transport (varnamelen) - wrap long log message in http2_fallback.go (lll) Signed-off-by: Marga Rudno-Rudzińska <marga.rudno@gmail.com>
…ack retries On an HTTP/2 framing error the fallback path now closes any non-nil response body from the primary attempt before retrying, so a connection is not leaked when a RoundTripper returns both a response and an error. The retry also no longer replays a request body it cannot rewind: a real body (not http.NoBody) with no GetBody returns the primary error instead of sending a possibly-truncated payload over HTTP/1.1. Signed-off-by: Marga Rudno-Rudzińska <marga.rudno@gmail.com>
Adds back test/blackbox/sync_http2_fallback.bats covering the on-demand sync happy path through the HTTP/2 fallback transport, wires it into test/blackbox/ci.sh, and reserves port ranges 11540-11549 (upstream) and 11560-11569 (downstream) in test/ports.json. Adds jq to the test prerequisites since the catalog assertions use it. Signed-off-by: Marga Rudno-Rudzińska <marga.rudno@gmail.com>
Signed-off-by: Ramkumar Chinchani <rchincha.dev@gmail.com>
Signed-off-by: Ramkumar Chinchani <rchincha.dev@gmail.com>
Signed-off-by: Ramkumar Chinchani <rchincha.dev@gmail.com>
…zation in TestConfigReloader and TestTLS Signed-off-by: Ramkumar Chinchani <rchincha.dev@gmail.com>
Signed-off-by: Ramkumar Chinchani <rchincha.dev@gmail.com>
… readability Signed-off-by: Ramkumar Chinchani <rchincha.dev@gmail.com>
… stability Signed-off-by: Ramkumar Chinchani <rchincha.dev@gmail.com>
… and handle non-existence errors Signed-off-by: Ramkumar Chinchani <rchincha.dev@gmail.com>
… TestTLS for certificate verification Signed-off-by: Ramkumar Chinchani <rchincha.dev@gmail.com>
Signed-off-by: Ramkumar Chinchani <rchincha.dev@gmail.com>
Signed-off-by: Ramkumar Chinchani <rchincha.dev@gmail.com>
bbb1993 to
072bb58
Compare
Signed-off-by: Ramkumar Chinchani <rchincha.dev@gmail.com>
695120b to
ca5e796
Compare
| } | ||
|
|
||
| hostDir := filepath.Join(certDir, host) | ||
| files, err := os.ReadDir(hostDir) |
| continue | ||
| } | ||
|
|
||
| certPEM, err := os.ReadFile(filepath.Join(hostDir, file.Name())) //nolint:gosec // hostDir is derived from configured cert directories |
| # Smoke test for the HTTP/2 → HTTP/1.1 fallback transport used by sync. | ||
| # The fallback only kicks in when the upstream emits malformed HTTP/2 framing | ||
| # (Docker Hub's LB does this occasionally); a normal zot upstream speaks HTTP/2 | ||
| # cleanly and the primary path is exercised. This test verifies the new | ||
| # transport does not break the happy path. Fallback-specific behavior is unit |
| run curl http://127.0.0.1:${zot_upstream_port}/v2/_catalog | ||
| [ "$status" -eq 0 ] | ||
| [ $(echo "${lines[-1]}" | jq '.repositories[]') = '"golang"' ] | ||
|
|
||
| run curl http://127.0.0.1:${zot_downstream_port}/v2/golang/manifests/1.20 |
| run curl http://127.0.0.1:${zot_downstream_port}/v2/_catalog | ||
| [ "$status" -eq 0 ] | ||
| [ $(echo "${lines[-1]}" | jq '.repositories[]') = '"golang"' ] |
|
|
||
| ### Sync's HTTP/2 transport with HTTP/1.1 fallback | ||
|
|
||
| Sync requests now use HTTP/2 by default. Some upstream load balancers (notably Docker Hub's) occasionally send raw HTTP/2 SETTINGS frames on a connection that Go's `net/http` opened as HTTP/1.1, surfacing as `malformed HTTP response` or HTTP/2 `INTERNAL_ERROR` / `PROTOCOL_ERROR`. The sync transport catches those framing errors at the RoundTrip layer and transparently retries the request over HTTP/1.1, so neither the regclient pipeline nor the on-demand caller sees the failure. |
| @@ -0,0 +1,132 @@ | |||
| # Smoke test for the HTTP/2 → HTTP/1.1 fallback transport used by sync. | |||
| # The fallback only kicks in when the upstream emits malformed HTTP/2 framing | |||
There was a problem hiding this comment.
sync_http2_fallback.bats is largely redundant with existing CI tests for sync. It doesn’t test fallback, and the smoke path it does run is already exercised by sync.bats (and friends).
Can we drop this?
| t.stickyHost.Store(host, t.now().Add(t.stickyTTL)) | ||
| } | ||
|
|
||
| func isHTTP2FramingError(err error) bool { |
There was a problem hiding this comment.
This matches all StreamError / GoAwayError codes, not just protocol/framing mismatches (e.g. REFUSED_STREAM, ENHANCE_YOUR_CALM). For Docker Hub that’s probably fine, but it could mask genuine HTTP/2 issues on other registries by silently downgrading to HTTP/1.1.
I suggest this is refactored in 3 separate functions
func shouldRetryOnHTTP11Fallback(err error) bool {
return isHTTP2SessionError(err) || isHTTP1HTTP2Mismatch(err)
}
func isHTTP2SessionError(err error) bool { /* StreamError, GoAwayError */ }
func isHTTP1HTTP2Mismatch(err error) bool {
return strings.Contains(err.Error(), "malformed HTTP response")
}
What type of PR is this?
Which issue does this PR fix:
What does this PR do / Why do we need it:
If an issue # is not available please add repro steps and logs showing the issue:
Testing done on this change:
Automation added to e2e:
Will this break upgrades or downgrades?
Does this PR introduce any user-facing change?:
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.