18 changes: 18 additions & 0 deletions docs/xet/chunking.md
Original file line number Diff line number Diff line change
@@ -48,6 +48,24 @@ When a boundary found or taken:

At end-of-file, if `start_offset < len(data)`, emit the final chunk `[start_offset, len(data))`.

### Decision Flowchart

```mermaid
flowchart TD
A["Read next byte b"] --> B["h = (h << 1) + TABLE[b]"]
B --> C["size = offset - start + 1"]
C --> D{"size < MIN_CHUNK_SIZE\n(8 KiB)?"}
D -->|Yes| A
D -->|No| E{"size >= MAX_CHUNK_SIZE\n(128 KiB)?"}
E -->|Yes| G["Emit chunk, reset h = 0"]
E -->|No| F{"(h & MASK) == 0?"}
F -->|Yes| G
F -->|No| A
G --> H{"End of file?"}
H -->|No| A
H -->|Yes| I["Emit final chunk if data remains"]
```
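The decision loop in the flowchart can be sketched in Python as follows. This is a minimal illustration, not the reference implementation: the `TABLE` values below are a hypothetical placeholder generated from a fixed seed (the real table is defined by the specification), the `MASK` constant is an assumed value, and the function name `chunk_boundaries` is made up for this example. Only the 8 KiB and 128 KiB limits and the hash update rule come from the diagram.

```python
import random

MIN_CHUNK_SIZE = 8 * 1024    # from the flowchart
MAX_CHUNK_SIZE = 128 * 1024  # from the flowchart
U64 = (1 << 64) - 1          # the hash is kept as a wrapping 64-bit value

# ASSUMPTION: placeholder table for illustration only; a real client must
# use the exact 256-entry table given by the specification.
_rng = random.Random(0)
TABLE = [_rng.getrandbits(64) for _ in range(256)]

# ASSUMPTION: illustrative mask value; take the real one from the spec.
MASK = 0xFFFF_0000_0000_0000


def chunk_boundaries(data, min_size=MIN_CHUNK_SIZE, max_size=MAX_CHUNK_SIZE,
                     mask=MASK, table=TABLE):
    """Return a list of [start, end) byte ranges, one per chunk."""
    chunks = []
    start = 0
    h = 0
    for offset, b in enumerate(data):
        h = ((h << 1) + table[b]) & U64      # h = (h << 1) + TABLE[b]
        size = offset - start + 1
        if size < min_size:                   # too small: never cut here
            continue
        if size >= max_size or (h & mask) == 0:
            chunks.append((start, offset + 1))
            start = offset + 1
            h = 0                             # reset hash after each cut
    if start < len(data):                     # final chunk at end-of-file
        chunks.append((start, len(data)))
    return chunks
```

Passing small `min_size` and `max_size` values makes the two forced branches (minimum size and maximum size) easy to exercise on short inputs.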

### Pseudocode

```text
8 changes: 4 additions & 4 deletions docs/xet/deduplication.md
@@ -56,10 +56,10 @@ When a file is processed for upload, it undergoes the following steps:

```mermaid
graph TD
A[File Input] --> B[Content-Defined Chunking]
B --> C[Hash Computation]
C --> D[Chunk Creation]
D --> E[Deduplication Query]
A["File Input"] --> B["Content-Defined Chunking"]
B --> C["Hash Computation"]
C --> D["Chunk Creation"]
D --> E["Deduplication Query"]
```

1. **Chunking**: Content-defined chunking using the GearHash algorithm creates variable-sized chunks of file data
11 changes: 11 additions & 0 deletions docs/xet/file-id.md
@@ -31,3 +31,14 @@ This is the string representation of the hash and can be used directly in the fi
> [!NOTE]
> The resolve URL returns a 302 redirect HTTP status code. Following the redirect downloads the content via the old LFS-compatible route rather than through the Xet protocol.
> To use the Xet protocol, you MUST NOT follow this redirect.

```mermaid
sequenceDiagram
autonumber
actor C as Client
participant Hub as Hugging Face Hub
C->>Hub: GET /namespace/repo/resolve/branch/filepath<br/>Authorization: Bearer <hf_token>
Hub-->>C: 302 Redirect + X-Xet-Hash header
Note over C: Extract X-Xet-Hash value = Xet File ID<br/>Do NOT follow the 302 redirect
C->>C: Use File ID with CAS Reconstruction API
```
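The exchange in the sequence diagram can be sketched with Python's standard `http.client`, which does not follow redirects on its own. The helper names `extract_file_id` and `fetch_file_id` are hypothetical, and the host, path, and token are supplied by the caller; only the `X-Xet-Hash` header, the 302 status, and the `Authorization: Bearer` header come from the diagram.

```python
import http.client


def extract_file_id(status, headers):
    """Return the Xet file ID from a 302 response, or None if absent."""
    if status != 302:
        return None
    for name, value in headers.items():
        # HTTP header names are case-insensitive
        if name.lower() == "x-xet-hash":
            return value
    return None


def fetch_file_id(host, path, token):
    """Request the resolve URL once and read the header, without redirecting.

    ASSUMPTION: `host`, `path`, and `token` are caller-provided; `path` has
    the form /namespace/repo/resolve/branch/filepath.
    """
    conn = http.client.HTTPSConnection(host)
    try:
        conn.request("GET", path, headers={"Authorization": f"Bearer {token}"})
        resp = conn.getresponse()  # the 302 is returned as-is
        return extract_file_id(resp.status, dict(resp.getheaders()))
    finally:
        conn.close()
```

Separating the header extraction from the network call keeps the redirect rule easy to check in isolation.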
13 changes: 13 additions & 0 deletions docs/xet/hashing.md
@@ -9,6 +9,19 @@ The Xet protocol utilizes a few different hashing types.

All hashes referenced are 32 bytes (256 bits) long.

```mermaid
flowchart LR
subgraph Input
CD["Chunk Data"]
CH["Chunk Hashes"]
end
CD -->|"blake3(data, DATA_KEY)"| ChunkHash["Chunk Hash"]
ChunkHash --> CH
CH -->|"Merkle Tree\n+ INTERNAL_NODE_KEY"| XorbHash["Xorb Hash"]
CH -->|"Merkle Tree\n+ INTERNAL_NODE_KEY\nthen blake3(root, zeros)"| FileHash["File Hash"]
CH -->|"blake3(concat hashes,\nVERIFICATION_KEY)"| VerifHash["Term Verification Hash"]
```

## Chunk Hashes

After cutting a chunk of data, the chunk hash is computed via a blake3 keyed hash with the following key (DATA_KEY):
37 changes: 37 additions & 0 deletions docs/xet/index.md
@@ -19,6 +19,43 @@ Implementors can create their own clients, SDKs, and tools that speak the Xet pr

## Overall Xet Architecture

```mermaid
block
columns 3
File["📄 File"]
space
space
CDC["Chunking (CDC)"]
space
space
block:chunks
columns 5
C0["Chunk 0"] C1["Chunk 1"] C2["Chunk 2"] C3["..."] C4["Chunk N"]
end
space
space
space
block:xorbs
columns 2
X0["Xorb A\n(chunks 0–1023)"]
X1["Xorb B\n(chunks 1024–N)"]
end
space
Shard["Shard\n(file reconstructions\n+ xorb metadata)"]
space
space
space
CAS["CAS Server\n(Content Addressable Storage)"]
space
space
File --> CDC
CDC --> chunks
chunks --> xorbs
xorbs --> Shard
xorbs --> CAS
Shard --> CAS
```

- [Content-Defined Chunking](./chunking): Gearhash-based CDC with parameters, boundary rules, and performance optimizations.
- [Hashing Methods](./hashing): Descriptions and definitions of the different hashing functions used for chunks, xorbs and term verification entries.
- [File Reconstruction](./file-reconstruction): Defining "term"-based representation of files using xorb hash + chunk ranges.
138 changes: 74 additions & 64 deletions docs/xet/shard.md
@@ -116,12 +116,14 @@ struct MDBShardFileHeader {

**Memory Layout**:

```txt
┌────────────────────────────────────────────────────────────────┬───────────┬───────────┐
│ tag (32 bytes) │ version │ footer_sz │
│ Magic Number Identifier │ (8 bytes) │ (8 bytes) │
└────────────────────────────────────────────────────────────────┴───────────┴───────────┘
0 32 40 48
```mermaid
---
title: "MDBShardFileHeader (48 bytes)"
---
packet
0-31: "tag (32 bytes) — Magic Number Identifier"
32-39: "version (u64)"
40-47: "footer_size (u64)"
```
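The 48-byte layout above maps directly onto a fixed-size unpack. The sketch below uses Python's `struct` module and assumes little-endian integers, which this hunk does not state; the names `ShardHeader` and `parse_shard_header` are hypothetical.

```python
import struct
from typing import NamedTuple

# ASSUMPTION: little-endian ("<"); 32-byte tag, then two u64 fields.
HEADER_FORMAT = "<32sQQ"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 48 bytes


class ShardHeader(NamedTuple):
    tag: bytes
    version: int
    footer_size: int


def parse_shard_header(buf: bytes) -> ShardHeader:
    """Decode the header from the first 48 bytes of a shard."""
    if len(buf) < HEADER_SIZE:
        raise ValueError("shard header needs at least 48 bytes")
    return ShardHeader(*struct.unpack_from(HEADER_FORMAT, buf, 0))
```

A real parser would also compare `tag` against the expected magic number and check `version` before reading further.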

**Deserialization steps**:
@@ -220,12 +222,15 @@ Given the `file_data_sequence_header.file_flags & MASK` (bitwise AND) operations

**Memory Layout**:

```txt
┌────────────────────────────────────────────────────────────────┬──────────┬───────────┬────────────┐
│ file_hash (32 bytes) │file_flags│num_entries│ _unused │
│ File Hash Value │(4 bytes) │(4 bytes) │ (8 bytes) │
└────────────────────────────────────────────────────────────────┴──────────┴───────────┴────────────┘
0 32 36 40 48
```mermaid
---
title: "FileDataSequenceHeader (48 bytes)"
---
packet
0-31: "file_hash (32 bytes)"
32-35: "file_flags (u32)"
36-39: "num_entries (u32)"
40-47: "_unused (8 bytes)"
```

### FileDataSequenceEntry
@@ -247,13 +252,16 @@ struct FileDataSequenceEntry {

**Memory Layout**:

```txt
┌────────────────────────────────────────────────────────────────┬─────────┬─────────┬─────────┬─────────┐
│ cas_hash (32 bytes) │cas_flags│unpacked │chunk_idx│chunk_idx│
│ CAS Block Hash │(4 bytes)│seg_bytes│start │end │
│ │ │(4 bytes)│(4 bytes)│(4 bytes)│
└────────────────────────────────────────────────────────────────┴─────────┴─────────┴─────────┴─────────┘
0 32 36 40 44 48
```mermaid
---
title: "FileDataSequenceEntry (48 bytes)"
---
packet
0-31: "cas_hash (32 bytes) — Xorb Hash"
32-35: "cas_flags (u32)"
36-39: "unpacked_segment_bytes (u32)"
40-43: "chunk_index_start (u32)"
44-47: "chunk_index_end (u32)"
```

### FileVerificationEntry (OPTIONAL)
@@ -271,12 +279,13 @@ struct FileVerificationEntry {

**Memory Layout**:

```txt
┌────────────────────────────────────────────────────────────────┬────────────────────────────────┐
│ range_hash (32 bytes) │ _unused (16 bytes) │
│ Verification Hash │ Reserved Space │
└────────────────────────────────────────────────────────────────┴────────────────────────────────┘
0 32 48
```mermaid
---
title: "FileVerificationEntry (48 bytes)"
---
packet
0-31: "range_hash (32 bytes) — Verification Hash"
32-47: "_unused (16 bytes)"
```

When a shard has verification entries, all file info sections MUST have verification entries.
@@ -302,12 +311,13 @@ struct FileMetadataExt {

**Memory Layout**:

```txt
┌────────────────────────────────────────────────────────────────┬────────────────────────────────┐
│ sha256 (32 bytes) │ _unused (16 bytes) │
│ SHA256 Hash │ Reserved Space │
└────────────────────────────────────────────────────────────────┴────────────────────────────────┘
0 32 48
```mermaid
---
title: "FileMetadataExt (48 bytes)"
---
packet
0-31: "sha256 (32 bytes) — SHA256 Hash"
32-47: "_unused (16 bytes)"
```

### File Info Bookend
@@ -381,13 +391,16 @@ struct CASChunkSequenceHeader {

**Memory Layout**:

```txt
┌────────────────────────────────────────────────────────────────┬─────────┬─────────┬─────────┬─────────┐
│ cas_hash (32 bytes) │cas_flags│num_ │num_bytes│num_bytes│
│ CAS Block Hash │(4 bytes)│entries │in_cas │on_disk │
│ │ │(4 bytes)│(4 bytes)│(4 bytes)│
└────────────────────────────────────────────────────────────────┴─────────┴─────────┴─────────┴─────────┘
0 32 36 40 44 48
```mermaid
---
title: "CASChunkSequenceHeader (48 bytes)"
---
packet
0-31: "cas_hash (32 bytes) — Xorb Hash"
32-35: "cas_flags (u32)"
36-39: "num_entries (u32)"
40-43: "num_bytes_in_cas (u32)"
44-47: "num_bytes_on_disk (u32)"
```

### CASChunkSequenceEntry
@@ -406,15 +419,15 @@ struct CASChunkSequenceEntry {

**Memory Layout**:

```txt
┌────────────────────────────────────────────────────────────────┬─────────┬─────────┬─────────────────┐
│ chunk_hash (32 bytes) │chunk_ │unpacked │ _unused │
│ Chunk Hash │byte_ │segment_ │ (8 bytes) │
│ │range_ │bytes │ │
│ │start │(4 bytes)│ │
│ │(4 bytes)│ │ │
└────────────────────────────────────────────────────────────────┴─────────┴─────────┴─────────────────┘
0 32 36 40 48
```mermaid
---
title: "CASChunkSequenceEntry (48 bytes)"
---
packet
0-31: "chunk_hash (32 bytes)"
32-35: "chunk_byte_range_start (u32)"
36-39: "unpacked_segment_bytes (u32)"
40-47: "_unused (8 bytes)"
```

### CAS Info Bookend
@@ -451,23 +464,20 @@ struct MDBShardFileFooter {

**Memory Layout**:

> [!NOTE]
> Fields are not exactly to scale

```txt
┌─────────┬─────────┬─────────┬─────────────────────────────────────────────────────────────┬─────────────────────────────────────┐
│ version │file_info│cas_info │ _buffer (reserved) │ chunk_hash_hmac_key │
│(8 bytes)│offset │offset │ (48 bytes) │ (32 bytes) │
│ │(8 bytes)│(8 bytes)│ │ │
└─────────┴─────────┴─────────┴─────────────────────────────────────────────────────────────┴─────────────────────────────────────┘
0 8 16 24 72 104

┌─────────┬──────────┬─────────────────────────────────────────────────────────────────────────────┬─────────┐
│creation │shard_ │ _buffer (reserved) │footer_ │
│timestamp│key_expiry│ (72 bytes) │offset │
│(8 bytes)│ (8 bytes)│ │(8 bytes)│
└─────────┴──────────┴─────────────────────────────────────────────────────────────────────────────┴─────────┘
104 112 120 192 200
```mermaid
---
title: "MDBShardFileFooter (200 bytes)"
---
packet
0-7: "version (u64)"
8-15: "file_info_offset (u64)"
16-23: "cas_info_offset (u64)"
24-71: "_buffer (48 bytes reserved)"
72-103: "chunk_hash_hmac_key (32 bytes)"
104-111: "shard_creation_timestamp (u64)"
112-119: "shard_key_expiry (u64)"
120-191: "_buffer2 (72 bytes reserved)"
192-199: "footer_offset (u64)"
```
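As a cross-check on the offsets, the footer fields sum to exactly 200 bytes. The sketch below decodes them with Python's `struct` module, assuming little-endian integers (not stated in this hunk) and dropping the two reserved buffers; `ShardFooter` and `parse_shard_footer` are hypothetical names.

```python
import struct
from typing import NamedTuple

# ASSUMPTION: little-endian. Field order follows the layout above:
# 3 x u64, 48 reserved bytes, 32-byte key, 2 x u64, 72 reserved bytes, u64.
FOOTER_FORMAT = "<QQQ48s32sQQ72sQ"
FOOTER_SIZE = struct.calcsize(FOOTER_FORMAT)  # 200 bytes


class ShardFooter(NamedTuple):
    version: int
    file_info_offset: int
    cas_info_offset: int
    chunk_hash_hmac_key: bytes
    shard_creation_timestamp: int
    shard_key_expiry: int
    footer_offset: int


def parse_shard_footer(buf: bytes) -> ShardFooter:
    """Decode a footer from a buffer that starts at the footer's first byte."""
    if len(buf) < FOOTER_SIZE:
        raise ValueError("shard footer needs at least 200 bytes")
    (version, file_info_offset, cas_info_offset, _buffer, hmac_key,
     created, expiry, _buffer2, footer_offset) = struct.unpack_from(
        FOOTER_FORMAT, buf, 0)
    # The reserved buffers are read but not exposed.
    return ShardFooter(version, file_info_offset, cas_info_offset,
                       hmac_key, created, expiry, footer_offset)
```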

**Deserialization steps**:
16 changes: 9 additions & 7 deletions docs/xet/xorb.md
@@ -58,13 +58,15 @@ the uncompressed size also being at a maximum of 128KiB.

#### Chunk Header Layout

```txt
┌─────────┬─────────────────────────────────┬──────────────┬─────────────────────────────────┐
│ Version │ Compressed Size │ Compression │ Uncompressed Size │
│ 1 byte │ 3 bytes │ Type │ 3 bytes │
│ │ (little-endian) │ 1 byte │ (little-endian) │
└─────────┴─────────────────────────────────┴──────────────┴─────────────────────────────────┘
0 1 4 5 8
```mermaid
---
title: "Chunk Header (8 bytes)"
---
packet
0-7: "Version (1 byte)"
8-31: "Compressed Size (3 bytes, LE)"
32-39: "Compression Type (1 byte)"
40-63: "Uncompressed Size (3 bytes, LE)"
```
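Note that the packet diagram counts bits (0 to 63), while the earlier text layout counts bytes (0, 1, 4, 5, 8); both describe the same 8-byte header. A minimal Python sketch of decoding it, with the two 3-byte sizes read as little-endian integers, might look like this. The names `ChunkHeader` and `parse_chunk_header` are hypothetical.

```python
from typing import NamedTuple


class ChunkHeader(NamedTuple):
    version: int
    compressed_size: int
    compression_type: int
    uncompressed_size: int


def parse_chunk_header(buf: bytes) -> ChunkHeader:
    """Decode the 8-byte chunk header at the start of `buf`."""
    if len(buf) < 8:
        raise ValueError("chunk header needs at least 8 bytes")
    return ChunkHeader(
        version=buf[0],                                      # byte 0
        compressed_size=int.from_bytes(buf[1:4], "little"),   # bytes 1-3
        compression_type=buf[4],                              # byte 4
        uncompressed_size=int.from_bytes(buf[5:8], "little"), # bytes 5-7
    )
```

Three bytes can hold values up to 16 MiB minus one, so a reader may still want to reject sizes above the 128 KiB limit mentioned in the surrounding text.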

### Chunk Compression Schemes