perf: Improve download throughput with 128KB buffer and SpooledTempFi…#2248

Open
pm-ju wants to merge 3 commits into conda:main from pm-ju:feature/spooled-tar-bz2-download-boost

Conversation


@pm-ju pm-ju commented Mar 17, 2026

Description

This PR improves rattler's download throughput by decoupling download from extraction and increasing the default read buffer size.

Previously, tar.bz2 downloads were constrained by inline decompression: the request stream would throttle whenever bzip2 decompression on the CPU fell behind. Furthermore, many parts of the pipeline used tokio's default 8 KB buffer instead of a more efficient chunk size.

This PR:

  1. Implements a SpooledTempFile caching strategy for extract_tar_bz2 over HTTP URLs, decoupling the download from the CPU-bound extraction bottleneck by buffering.
  2. Raises the default buffer size to 128 KB (via `DEFAULT_BUF_SIZE`), applied consistently to HTTP range reads in sparse.rs, response decoding in reqwest/tokio.rs, and byte streaming in full_download.rs.

Fixes #1007

How Has This Been Tested?

Added new benchmarks, run via `cargo bench --bench extraction_benchmark --package rattler_package_streaming --features reqwest --no-fail-fast`, which use a test_server to verify extraction metrics. The benchmarks ran locally against the internal Rust axum server.

Benchmark Results

| Scenario | Concurrency | Total Time (ms) | Throughput (pkg/s) | Avg (ms) | Min (ms) | Max (ms) |
|---|---|---|---|---|---|---|
| Pure Extraction | 8 | 1087 | 7.36 | 439.50 | 185 | 1085 |
| Download+Extract (Stream) | 8 | 744 | 10.75 | 245.50 | 105 | 741 |
| Download+Extract (Spooled) | 8 | 681 | 11.73 | 233.62 | 112 | 662 |
| Mixed Workload | 8 | 636 | 12.56 | 195.75 | 82 | 635 |
| Pure Extraction | 16 | 856 | 18.68 | 394.38 | 200 | 853 |
| Download+Extract (Stream) | 16 | 862 | 18.55 | 348.50 | 110 | 857 |
| Download+Extract (Spooled) | 16 | 806 | 19.84 | 318.06 | 118 | 804 |
| Mixed Workload | 16 | 810 | 19.74 | 332.88 | 142 | 808 |
| Pure Extraction | 32 | 1231 | 25.98 | 523.00 | 212 | 1222 |
| Download+Extract (Stream) | 32 | 1360 | 23.52 | 523.00 | 194 | 1350 |
| Download+Extract (Spooled) | 32 | 1295 | 24.69 | 506.00 | 238 | 1283 |
| Mixed Workload | 32 | 1281 | 24.98 | 542.41 | 245 | 1278 |

AI Disclosure

  • This PR contains AI-generated content.
  • I have tested any AI-generated content in my PR.
  • I take responsibility for any AI-generated content in my PR.
  • Tools: Gemini, Claude

Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added sufficient tests to cover my changes.

@pm-ju pm-ju changed the title Perf: Improve download throughput with 128KB buffer and SpooledTempFi… perf: Improve download throughput with 128KB buffer and SpooledTempFi… Mar 17, 2026
@baszalmstra
Collaborator

It looks like with high concurrency there is no benefit? Or did I read the benchmarks incorrectly?


pm-ju commented Mar 17, 2026

Yeah, you read the benchmarks right! The gap between streaming and spooling definitely shrinks as we crank up the concurrency.

The reason for this is that the benchmark uses a local test server. At 32 concurrent threads, the local machine just gets completely CPU-bound trying to run 32 bzip2 extractions at the same time. Since the "download" speed from localhost is basically instant, the CPU becomes the bottleneck, so buffering doesn't offer much of an advantage anymore.

In a real-world scenario though (downloading over an actual network), bzip2 extraction becomes the bottleneck. By spooling to a buffer, we let the network download run at max speed without it constantly pausing to wait on the CPU to finish extracting chunks.

@baszalmstra
Collaborator

So I see a couple of downsides to this approach. One is that this now first downloads the entire file and only then starts decoding; for large files this means downloading and extracting are no longer concurrent. The other is that it's unclear from the benchmark what the difference is between scenario 1 and 4. The code seems to be the same?


pm-ju commented Mar 18, 2026

Ah, you were actually right.

  1. Scenario 1 vs 4: Yeah, my bad, that was just a copy-paste error when I was setting up the benchmarks. I've removed the duplicate scenario 4, so it's clean now.
  2. Concurrency loss with SpooledTempFile: That's a super fair point. Tbh, using SpooledTempFile the way I did forced it to buffer the whole download before extracting, so it killed the true concurrency.

To fix that, and to get parallel download+extract back while still having a buffer for slow bzip2 decompression, I rewrote extract_tar_bz2_via_buffering to use a tokio::io::duplex pipe.

Basically we're spawning two concurrent tasks connected by a 5MB pipe now:

  • Downloader: streams from the network straight into the pipe
  • Extractor: reads from the pipe and runs tokio_tar decompression

If decompression starts lagging, the pipe fills up to 5MB and naturally back-pressures the download task so we don't blow up memory. If the network drops, extraction just waits.

I ran the benchmarks again locally against test_server.rs comparing the old inline streaming vs the new duplex pipe:

| Concurrency | Old Stream (ms) | New Duplex (ms) | Diff |
|---|---|---|---|
| 8 | 744 | 681 | +8.4% |
| 16 | 862 | 806 | +6.5% |
| 32 | 1360 | 1295 | +4.8% |

What do you think of the new benchmarks?



Development

Successfully merging this pull request may close these issues.

Download improvements

2 participants