hf_xet downloads stall #2359
Description
Title: hf_xet downloads stall due to massive out-of-order packet delivery from Xet CDN
Environment
- OS: Fedora 43 KDE Plasma
- Python: 3.14.3
- huggingface_hub: 1.9.0
- hf_xet: 1.4.3
Hello. I was struggling with broken downloads, so I asked Sonnet and Gemini to do some digging, and this is what they found. Thanks.
Behaviour
Xet downloads start at 80-400 MB/s (local buffer reads), then stall completely after 16-200 MB transferred. Plain HTTP (HF_HUB_DISABLE_XET=1) downloads at stable line speed throughout.
Diagnosis
ss -tinp captured during a stall shows:
- rcv_ooopack in the thousands on every Xet connection (e.g. 11,671 / 17,548 / 10,697)
- TCP congestion window (cwnd) stuck at 10, never expands
- One connection with rtt:793ms, jitter 598ms, rto:3185 (gone into timeout)
- All connections to AWS endpoints (18.244.140.x, 50.17.153.x)
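To compare these fields across connections without reading raw ss output by eye, a small parser can help. The following is a minimal sketch, assuming the usual `name:value` layout that `ss -ti` prints for TCP info; the `SAMPLE` line is assembled from the figures quoted in this report and is not a verbatim capture, and the function and key names are illustrative.

```python
import re

# Patterns for the `name:value` fields printed by `ss -ti`.
# The leading \b keeps "rtt:" from matching inside "minrtt:" or "rcv_rtt:".
_FIELDS = {
    "rtt_ms": re.compile(r"\brtt:([\d.]+)/[\d.]+"),
    "rttvar_ms": re.compile(r"\brtt:[\d.]+/([\d.]+)"),
    "rto_ms": re.compile(r"\brto:([\d.]+)"),
    "cwnd": re.compile(r"\bcwnd:(\d+)"),
    "rcv_ooopack": re.compile(r"\brcv_ooopack:(\d+)"),
}


def parse_ss_info(line):
    """Extract selected TCP metrics from one info line of `ss -ti` output.

    Fields that are absent from the line are simply left out of the result.
    """
    metrics = {}
    for name, pattern in _FIELDS.items():
        match = pattern.search(line)
        if match:
            metrics[name] = float(match.group(1))
    return metrics


# Illustrative line built from the numbers reported above (not a real capture).
SAMPLE = "cubic rto:3185 rtt:793.869/597.661 cwnd:10 rcv_ooopack:11671"
metrics = parse_ss_info(SAMPLE)
print(metrics)
```

Feeding each info line of a captured `ss -tinp` dump through `parse_ss_info` makes it easy to sort connections by `rcv_ooopack` or spot the one with the inflated `rtt_ms`.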
TCP is spending all resources reordering packets. Congestion window never opens. Eventually one connection times out and the transfer stalls permanently.
Workaround
HF_HUB_DISABLE_XET=1 hf download <repo> <file>
Suspected cause
Xet CDN delivering chunks heavily out of order at scale, causing TCP congestion collapse.
Reproduction
No code required — this is a network/protocol issue reproducible with the standard CLI:
# Ensure hf_xet is active
hf env | grep -i xet
# Start download of any large Xet-enabled repo (example used in diagnosis):
hf download unsloth/gemma-4-26B-A4B-it-GGUF gemma-4-26B-A4B-it-MXFP4_MOE.gguf --local-dir ~/Downloads
# In a second terminal, monitor TCP connections while download is active:
watch -n 1 'ss -tinp | grep hf'
What to observe:
- Download progress stalls after 16-200 MB transferred
- ss output shows rcv_ooopack climbing into thousands on all connections
- cwnd remains stuck at 10 throughout
- Eventually one connection shows rtt in the hundreds of ms with high jitter
Confirm Xet is active by checking the download shows an initial burst speed
(80-400 MB/s reported) before stalling — this distinguishes Xet buffer reads
from plain HTTP which delivers steady line speed throughout.
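The burst-then-stall pattern can also be checked from outside the CLI by sampling how many bytes are on disk over time. This is a minimal sketch under assumptions: the samples would come from polling the size of the partially downloaded file (for example with `os.path.getsize`), the 30-second threshold is arbitrary, and the numbers in the example are made up to resemble the reported behaviour.

```python
def find_stall(samples, min_stall_seconds=30.0):
    """Return the (time, bytes) point where progress last advanced before a stall.

    `samples` is a time-ordered list of (seconds_since_start, bytes_on_disk).
    A stall is reported once the byte count has not grown for at least
    `min_stall_seconds`; None is returned if no such gap is found.
    """
    if not samples:
        return None
    last_growth = samples[0]
    for t, size in samples[1:]:
        if size > last_growth[1]:
            last_growth = (t, size)          # progress was made, reset the clock
        elif t - last_growth[0] >= min_stall_seconds:
            return last_growth               # no growth for long enough
    return None


# Hypothetical samples: fast initial burst, then no growth after 150 MB.
stalled = [(0, 0), (1, 100_000_000), (2, 150_000_000),
           (12, 150_000_000), (40, 150_000_000)]
# Hypothetical samples: slower but steady progress.
steady = [(0, 0), (10, 100_000_000), (20, 200_000_000), (50, 300_000_000)]

print(find_stall(stalled))
print(find_stall(steady))
```

The byte count at the returned point can be compared against the 16-200 MB range reported above.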
Confirm issue is Xet-specific:
# Stalls:
hf download unsloth/gemma-4-26B-A4B-it-GGUF gemma-4-26B-A4B-it-MXFP4_MOE.gguf --local-dir ~/Downloads
# Works fine at line speed:
HF_HUB_DISABLE_XET=1 hf download unsloth/gemma-4-26B-A4B-it-GGUF gemma-4-26B-A4B-it-MXFP4_MOE.gguf --local-dir ~/Downloads
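For a repeatable A/B comparison, the two commands above can be wrapped in a script that runs each with a time limit. This is a sketch, not a tested harness: it assumes the `hf` CLI is on `PATH`, relies only on the `HF_HUB_DISABLE_XET` variable already used in this report, and the helper names and the 600-second timeout are illustrative choices.

```python
import os
import subprocess
import time


def build_download_command(repo, filename, local_dir):
    """Build the same `hf download` invocation used in this report."""
    return ["hf", "download", repo, filename,
            "--local-dir", os.path.expanduser(local_dir)]


def env_with_xet(disabled, base=None):
    """Copy an environment, setting or clearing HF_HUB_DISABLE_XET."""
    env = dict(os.environ if base is None else base)
    if disabled:
        env["HF_HUB_DISABLE_XET"] = "1"
    else:
        env.pop("HF_HUB_DISABLE_XET", None)
    return env


def timed_download(repo, filename, local_dir, disable_xet, timeout_s=600):
    """Run one download and return (elapsed_seconds, finished).

    A run that exceeds `timeout_s` is killed and reported as not finished.
    A non-zero exit status raises subprocess.CalledProcessError.
    """
    start = time.monotonic()
    try:
        subprocess.run(build_download_command(repo, filename, local_dir),
                       env=env_with_xet(disable_xet),
                       check=True, timeout=timeout_s)
        finished = True
    except subprocess.TimeoutExpired:
        finished = False
    return time.monotonic() - start, finished
```

Calling `timed_download(...)` once with `disable_xet=False` and once with `disable_xet=True` gives the comparison. Each run should use its own empty `local_dir`, otherwise the second run may find the file already present and finish immediately.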
The smoking gun appears to be:
rtt:793.869/597.661 rto:3185
One connection to 18.244.140.27 has an RTT of 794ms with huge variance (598ms jitter). That's catastrophic. Normal connections to the same host are showing 105ms. That one connection has gone sick and is dragging everything down.
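The three numbers in that line are related: the retransmission timeout follows from the smoothed RTT and its variance. As a rough check, the sketch below applies the standard formula from RFC 6298, RTO = SRTT + 4 × RTTVAR, to the values ss printed. Linux's actual computation differs in detail (for example, it enforces a minimum RTO), so this should be read as an approximation rather than the kernel's exact arithmetic.

```python
import math


def rto_estimate_ms(srtt_ms, rttvar_ms, k=4):
    """Approximate RTO per RFC 6298: smoothed RTT plus k times the RTT variance."""
    return srtt_ms + k * rttvar_ms


# Values from the ss output quoted above: rtt:793.869/597.661
estimate = rto_estimate_ms(793.869, 597.661)
print(f"estimated RTO: {estimate:.3f} ms, rounded up: {math.ceil(estimate)} ms")
```

The estimate rounds up to the reported `rto:3185`, which suggests the large RTO is a direct consequence of the RTT variance on that connection rather than a separate symptom.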
Also notable:
rcv_ooopack (out-of-order packets) is enormous on every connection — thousands of them. 18.244.140.27 fd=11 shows 11,671 out-of-order packets received.
cwnd is stuck at 10 on every connection — TCP congestion window never grows, meaning TCP thinks the network is congested and won't open up.
Conclusion: The Xet CDN is sending chunks out of order at scale, TCP is spending all its time reordering, the congestion window never expands, and one connection eventually goes into timeout hell dragging the whole transfer to a halt.
This is a Xet/CDN-side problem, not your system. Your TCP stack is behaving correctly — it's the server hammering you with out-of-order packets.