hf_xet downloads stall #2359

@TidyWeb

Title: hf_xet downloads stall due to massive out-of-order packet delivery from Xet CDN

Environment

  • OS: Fedora 43 KDE Plasma
  • Python: 3.14.3
  • huggingface_hub: 1.9.0
  • hf_xet: 1.4.3

Hello. I was struggling with broken downloads, so I asked Sonnet and Gemini to do some digging, and this is what they found. Thanks.

Behaviour
Xet downloads start at a reported 80-400 MB/s (local buffer reads), then stall completely after 16-200 MB transferred. With Xet disabled (HF_HUB_DISABLE_XET=1), plain HTTP downloads run at stable line speed throughout.

Diagnosis
ss -tinp captured during a stall shows:

  • rcv_ooopack in the thousands on every Xet connection (e.g. 11,671 / 17,548 / 10,697)
  • TCP congestion window (cwnd) stuck at 10 — never expands
  • One connection with rtt:793ms, jitter 598ms, rto:3185 — gone into timeout
  • All connections to AWS endpoints (18.244.140.x, 50.17.153.x)

TCP is spending all resources reordering packets. Congestion window never opens. Eventually one connection times out and the transfer stalls permanently.
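
The counters listed above can be extracted from `ss -tin` output mechanically rather than read by eye. The following is a minimal Python sketch, assuming the `key:value` token layout that `ss` prints on each socket's info line. The sample string is assembled by hand from values quoted in this report (it is not a verbatim capture), and `parse_tcp_info` is a hypothetical helper name.

```python
import re

# Patterns for the fields discussed in this report. The leading \b keeps
# "rtt:" from matching inside longer tokens such as "minrtt:".
RTO = re.compile(r"\brto:(\d+(?:\.\d+)?)")
RTT = re.compile(r"\brtt:(\d+(?:\.\d+)?)/(\d+(?:\.\d+)?)")
CWND = re.compile(r"\bcwnd:(\d+)")
OOO = re.compile(r"\brcv_ooopack:(\d+)")


def parse_tcp_info(line: str) -> dict:
    """Return the fields of interest found in one `ss -ti` info line."""
    info = {}
    m = RTO.search(line)
    if m:
        info["rto_ms"] = float(m.group(1))
    m = RTT.search(line)
    if m:
        # ss prints rtt as "<smoothed rtt>/<rtt variance>", both in ms.
        info["rtt_ms"] = float(m.group(1))
        info["rttvar_ms"] = float(m.group(2))
    m = CWND.search(line)
    if m:
        info["cwnd"] = int(m.group(1))
    m = OOO.search(line)
    if m:
        info["rcv_ooopack"] = int(m.group(1))
    return info


# Hand-assembled sample using values quoted in this issue.
sample = "rto:3185 rtt:793.869/597.661 cwnd:10 rcv_ooopack:11671"
info = parse_tcp_info(sample)
print(info)
```

Fields that are absent from a line are simply omitted from the result, so the same function can be run over every line of the `ss` output.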

Workaround
HF_HUB_DISABLE_XET=1 hf download <repo> <file>
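
For scripts that call the library instead of the CLI, the same workaround might look like the sketch below. It assumes that `huggingface_hub` reads `HF_HUB_DISABLE_XET` when the package is first imported, which is why the variable is set before the import. `download_without_xet` is a hypothetical wrapper name, not part of the library.

```python
import os

# Assumption: huggingface_hub reads HF_HUB_DISABLE_XET at import time,
# so it must be set before the package is imported anywhere in the process.
os.environ["HF_HUB_DISABLE_XET"] = "1"


def download_without_xet(repo_id: str, filename: str, local_dir: str) -> str:
    """Download a single file with Xet disabled and return its local path."""
    # Imported here so the import happens after the variable is set.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=repo_id, filename=filename, local_dir=local_dir)


# Example call (starts a real, large download, so it is left commented out):
# download_without_xet(
#     "unsloth/gemma-4-26B-A4B-it-GGUF",
#     "gemma-4-26B-A4B-it-MXFP4_MOE.gguf",
#     os.path.expanduser("~/Downloads"),
# )
```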

Suspected cause
Xet CDN delivering chunks heavily out of order at scale, causing TCP congestion collapse.

Reproduction

No code required — this is a network/protocol issue reproducible with the standard CLI:

# Ensure hf_xet is active
hf env | grep -i xet

# Start download of any large Xet-enabled repo (example used in diagnosis):
hf download unsloth/gemma-4-26B-A4B-it-GGUF gemma-4-26B-A4B-it-MXFP4_MOE.gguf --local-dir ~/Downloads

# In a second terminal, monitor TCP connections while download is active:
watch -n 1 'ss -tinp | grep hf'

What to observe:

  • Download progress stalls after 16-200 MB transferred
  • ss output shows rcv_ooopack climbing into thousands on all connections
  • cwnd remains stuck at 10 throughout
  • Eventually one connection shows rtt in the hundreds of ms with high jitter
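
To make "stalls" less subjective when comparing runs, progress samples can be checked for a period with no growth. This is a rough sketch only: the 30-second window is an arbitrary assumption, `find_stall` is a hypothetical helper, and how the byte counts are collected (for example by polling the size of the partially written file) is left out.

```python
def find_stall(samples, window_s=30.0):
    """Return the time at which progress stopped, or None if it never did.

    samples: list of (seconds_elapsed, bytes_downloaded), in time order.
    A stall is reported once the byte count has not grown for window_s seconds.
    """
    if not samples:
        return None
    last_growth_t, last_bytes = samples[0]
    for t, b in samples[1:]:
        if b > last_bytes:
            # Progress was made; remember when and how far.
            last_growth_t, last_bytes = t, b
        elif t - last_growth_t >= window_s:
            return last_growth_t
    return None


# Illustrative numbers only: growth to 200 MB, then no further progress.
stalled = [(0, 0), (10, 100_000_000), (20, 200_000_000),
           (30, 200_000_000), (40, 200_000_000), (50, 200_000_000)]
steady = [(0, 0), (10, 100_000_000), (20, 200_000_000), (30, 300_000_000)]

print(find_stall(stalled))
print(find_stall(steady))
```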

Confirm Xet is active by checking the download shows an initial burst speed
(80-400 MB/s reported) before stalling — this distinguishes Xet buffer reads
from plain HTTP which delivers steady line speed throughout.

Confirm issue is Xet-specific:

# Stalls:
hf download unsloth/gemma-4-26B-A4B-it-GGUF gemma-4-26B-A4B-it-MXFP4_MOE.gguf --local-dir ~/Downloads

# Works fine at line speed:
HF_HUB_DISABLE_XET=1 hf download unsloth/gemma-4-26B-A4B-it-GGUF gemma-4-26B-A4B-it-MXFP4_MOE.gguf --local-dir ~/Downloads

The smoking gun appears to be:

rtt:793.869/597.661 rto:3185

One connection to 18.244.140.27 has an RTT of 794ms with huge variance (598ms jitter). That's catastrophic. Normal connections to the same host are showing 105ms. That one connection has gone sick and is dragging everything down.

Also notable:

  • rcv_ooopack (out-of-order packets) is enormous on every connection, in the thousands. 18.244.140.27 fd=11 shows 11,671 out-of-order packets received.
  • cwnd is stuck at 10 on every connection: the TCP congestion window never grows, meaning TCP thinks the network is congested and won't open up.

Conclusion: The Xet CDN is sending chunks out of order at scale, TCP is spending all its time reordering, the congestion window never expands, and one connection eventually goes into timeout hell dragging the whole transfer to a halt.
This is a Xet/CDN-side problem, not your system. Your TCP stack is behaving correctly — it's the server hammering you with out-of-order packets.
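
The "one sick connection" observation can be checked by comparing each connection's RTT against the median across all of them. The sketch below is illustrative only: the connection labels are made up, the number of healthy connections is assumed, and the 3x threshold is an arbitrary choice. Only the 105 ms and 793.869 ms figures come from this report.

```python
from statistics import median


def flag_slow_connections(rtt_ms_by_conn, factor=3.0):
    """Return the connections whose RTT exceeds factor times the median RTT."""
    if not rtt_ms_by_conn:
        return {}
    baseline = median(rtt_ms_by_conn.values())
    return {
        conn: rtt
        for conn, rtt in rtt_ms_by_conn.items()
        if rtt > factor * baseline
    }


# Hypothetical labels; RTT values taken from the figures quoted above.
rtts = {"conn-a": 105.0, "conn-b": 105.0, "conn-c": 793.869}
slow = flag_slow_connections(rtts)
print(slow)
```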
