hf_xet downloads stall

**Title:** hf_xet downloads stall due to massive out-of-order packet delivery from Xet CDN

**Environment**
- OS: Fedora 43 KDE Plasma
- Python: 3.14.3
- huggingface_hub: 1.9.0
- hf_xet: 1.4.3

Hello. I was struggling with broken downloads, so I asked Sonnet and Gemini to do some digging, and this is what they found. Thanks.

**Behaviour**
Xet downloads start at 80-400 MB/s (local buffer reads), then stall completely after 16-200 MB transferred. Plain HTTP (`HF_HUB_DISABLE_XET=1`) downloads at stable line speed throughout.

**Diagnosis**
`ss -tinp` captured during a stall shows:
- `rcv_ooopack` in the thousands on every Xet connection (e.g. 11,671 / 17,548 / 10,697)
- TCP congestion window (`cwnd`) stuck at 10 on every connection; it never expands
- One connection with `rtt:793.869/597.661` (RTT 794 ms, variation 598 ms) and `rto:3185` (a retransmission timeout of about 3.2 s)
- All connections to AWS endpoints (18.244.140.x, 50.17.153.x)

TCP appears to be spending its effort reordering packets, and the congestion window never opens. Eventually one connection times out and the transfer stalls permanently.
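
To pull these figures out of a capture, a small filter along the following lines may help. This is a minimal sketch and not part of `ss` or `hf_xet`: the function name `ss_metrics` is made up for illustration, and it assumes the default `ss -tin` layout, where each connection line starts with the state and has the peer address in the fifth column, followed by an indented info line of `key:value` tokens.

```shell
# Summarise `ss -tin` output read from stdin, one line per connection.
# Assumes the default column layout: State Recv-Q Send-Q Local Peer [Process].
ss_metrics() {
  awk '
    # Connection line: starts with the TCP state; remember the peer address.
    /^[A-Z]/ { peer = $5; next }
    # Info line: scan the key:value tokens for the fields of interest.
    /rtt:/ {
      rtt = rttvar = rto = cwnd = "-"; ooo = 0
      for (i = 1; i <= NF; i++) {
        split($i, kv, ":")
        if (kv[1] == "rtt") { split(kv[2], r, "/"); rtt = r[1]; rttvar = r[2] }
        else if (kv[1] == "rto") rto = kv[2]
        else if (kv[1] == "cwnd") cwnd = kv[2]
        else if (kv[1] == "rcv_ooopack") ooo = kv[2]
      }
      print peer, "rtt=" rtt, "rttvar=" rttvar, "rto=" rto, "cwnd=" cwnd, "ooo=" ooo
    }
  '
}
```

Running `ss -tin | ss_metrics` during a stall should then print one summary line per connection, which is easier to paste into a report than the full `ss` output.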

**Workaround**
`HF_HUB_DISABLE_XET=1 hf download <repo> <file>`

**Suspected cause**
Xet CDN delivering chunks heavily out of order at scale, causing TCP congestion collapse.

**Reproduction**

No code required — this is a network/protocol issue reproducible with the standard CLI:

    # Ensure hf_xet is active
    hf env | grep -i xet

    # Start download of any large Xet-enabled repo (example used in diagnosis):
    hf download unsloth/gemma-4-26B-A4B-it-GGUF gemma-4-26B-A4B-it-MXFP4_MOE.gguf --local-dir ~/Downloads

    # In a second terminal, monitor TCP connections while download is active
    # (-A1 keeps the TCP info line that ss prints below each matching connection):
    watch -n 1 'ss -tinp | grep -A1 hf'

**What to observe:**
- Download progress stalls after 16-200 MB transferred
- `ss` output shows `rcv_ooopack` climbing into thousands on all connections
- `cwnd` remains stuck at 10 throughout
- Eventually one connection shows `rtt` in the hundreds of ms with high jitter
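
Because these symptoms develop over time, it can help to keep a timestamped log of the `ss` output instead of watching it live. The following is a rough sketch under stated assumptions: the helper name `stamp` and the log file name `ss-stall.log` are hypothetical, and the loop in the comment runs until interrupted.

```shell
# Prefix every line read from stdin with a UTC timestamp, so that ss snapshots
# can later be lined up against the download progress.
stamp() {
  while IFS= read -r line; do
    printf '%s %s\n' "$(date -u +%H:%M:%S)" "$line"
  done
}

# Usage (stop with Ctrl-C once the download has stalled):
#   while sleep 1; do ss -tinp | grep -A1 hf | stamp; done >> ss-stall.log
```

The resulting file shows when `rcv_ooopack` starts climbing and when the RTT on the affected connection begins to rise.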

**Confirm Xet is active** by checking the download shows an initial burst speed
(80-400 MB/s reported) before stalling — this distinguishes Xet buffer reads
from plain HTTP which delivers steady line speed throughout.

**Confirm issue is Xet-specific:**

    # Stalls:
    hf download unsloth/gemma-4-26B-A4B-it-GGUF gemma-4-26B-A4B-it-MXFP4_MOE.gguf --local-dir ~/Downloads

    # Works fine at line speed:
    HF_HUB_DISABLE_XET=1 hf download unsloth/gemma-4-26B-A4B-it-GGUF gemma-4-26B-A4B-it-MXFP4_MOE.gguf --local-dir ~/Downloads

The smoking gun appears to be this line from the `ss` output:

    rtt:793.869/597.661  rto:3185

One connection to 18.244.140.27 has an RTT of 794 ms with a variation of 598 ms, while other connections to the same host show about 105 ms. That one connection has degraded badly and appears to be dragging everything else down.

Also notable:

- `rcv_ooopack` (out-of-order packets received) is enormous on every connection, in the thousands. The connection to 18.244.140.27 on fd=11 shows 11,671.
- `cwnd` is stuck at 10 on every connection. The congestion window never grows, which suggests TCP treats the path as congested and will not open up.

Conclusion: the Xet CDN appears to be sending chunks out of order at scale, TCP spends its time reordering, the congestion window never expands, and one connection eventually falls into repeated timeouts, dragging the whole transfer to a halt.
This points to a Xet/CDN-side problem rather than the client system, whose TCP stack seems to be behaving correctly.
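
The reported `rto` is consistent with the RTT figures on the same line. RFC 6298 defines the retransmission timeout as the smoothed RTT plus four times the RTT variation, and `ss` prints `rtt` as those two values separated by a slash. A quick check, as a sketch only: the helper name `rto_estimate` is hypothetical, and Linux applies its own rounding and bounds, so only approximate agreement should be expected.

```shell
# RFC 6298 estimate: RTO = SRTT + 4 * RTTVAR, all values in milliseconds.
# Linux adds its own clamping and rounding, so treat the result as approximate.
rto_estimate() {
  awk -v srtt="$1" -v rttvar="$2" 'BEGIN { printf "%.1f\n", srtt + 4 * rttvar }'
}
```

With the values quoted above, `rto_estimate 793.869 597.661` gives 3184.5, close to the reported `rto:3185`. The multi-second timeout therefore follows from the high RTT variation on that one connection.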


hf_xet downloads stall #2359
