
studio: show HF model download progress in training start overlay#4894

Open
danielhanchen wants to merge 1 commit into main from studio/training-overlay-download-progress

Conversation

@danielhanchen
Contributor

Summary

The training start overlay used to show a static "Loading model..." line while model weights were being pulled from Hugging Face. On slow connections this looked like Studio had frozen, with no indication that anything was happening.

This adds a small progress block inside TrainingStartOverlay that polls the existing GET /api/models/download-progress endpoint and shows bytes downloaded, total bytes, and percent complete with a Progress bar.

Single file change. No backend, worker, SSE, or runtime store edits.

Before / After

Before:

> unsloth training starts...
==((====))==
   \\   /|
O^O/ \_/ \
\        /
 "-____-"
> Preparing model and dataset...
> We are getting everything ready for your run...
> Loading model... | waiting for first step... (0)

After (mid download):

> unsloth training starts...
==((====))==
   \\   /|
O^O/ \_/ \
\        /
 "-____-"
> Preparing model and dataset...
> We are getting everything ready for your run...
> Loading model... | waiting for first step... (0)
  Downloading model weights...           1.2 GB / 4.5 GB - 27%
  [=========>                                            ]

Implementation

All changes are in studio/frontend/src/features/studio/training-start-overlay.tsx:

  • New useModelDownloadProgress(modelName) hook, kept local to this file since there is only one consumer.
  • Polls getDownloadProgress(modelName) every 1500 ms while the overlay is mounted and the runtime is in a starting or preparing phase (configuring, downloading_model, downloading_dataset, loading_model, loading_dataset).
  • Gated on the HF repo regex ^[A-Za-z0-9._-]+/[A-Za-z0-9._-]+$, the same regex the backend already uses in _VALID_REPO_ID. Local paths and empty form state never hit the endpoint.
  • Polling stops once progress >= 1.0 so the bar can stay at 100% until the overlay hides on the first training step.
  • Network errors are silently swallowed, matching the chat side flow in use-chat-model-runtime.ts. The bar simply freezes at the last value rather than disappearing.
  • Cleanup runs on unmount and on modelName change so a new run with a different model starts a fresh poll.
  • New formatBytes helper for B / KB / MB / GB output.
  • selectedModel is read directly from useTrainingConfigStore inside the overlay. live-training-view.tsx is unchanged and there is no prop drilling.
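The gating and stop conditions described above can be sketched as a pure helper. The regex and the phase names come from the PR description; the function and set names here are hypothetical, extracted from the hook for illustration:

```typescript
// Sketch of the polling gate described above. HF_REPO_REGEX and the phase
// strings are from the PR; shouldPollDownloadProgress is a hypothetical name.
const HF_REPO_REGEX = /^[A-Za-z0-9._-]+\/[A-Za-z0-9._-]+$/;

const POLLABLE_PHASES = new Set([
  "configuring",
  "downloading_model",
  "downloading_dataset",
  "loading_model",
  "loading_dataset",
]);

export function shouldPollDownloadProgress(
  modelName: string,
  phase: string,
  progress: number,
): boolean {
  // Local paths and empty form state never hit the endpoint.
  if (!modelName || !HF_REPO_REGEX.test(modelName)) return false;
  // Only poll while the runtime is in a starting or preparing phase.
  if (!POLLABLE_PHASES.has(phase)) return false;
  // Stop once complete so the bar can stay at 100%.
  if (progress >= 1.0) return false;
  return true;
}
```

In the actual hook this predicate would sit at the top of the polling effect, returning early before any interval is scheduled.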

Reused, not added

  • getDownloadProgress from studio/frontend/src/features/chat/api/chat-api.ts
  • Progress from studio/frontend/src/components/ui/progress.tsx
  • useTrainingConfigStore and useTrainingRuntimeStore from @/features/training
  • GET /api/models/download-progress in studio/backend/routes/models.py (auth gated, scans the HF blob cache for completed and .incomplete files, returns {downloaded_bytes, expected_bytes, progress})
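The endpoint's response shape, as described above, and one way the hook might normalize it for rendering. The snake_case field names are from the PR description; the TypeScript interfaces and the normalizer are assumptions sketched for illustration:

```typescript
// Shape of the GET /api/models/download-progress payload per the PR
// description; the interface and helper names are hypothetical.
interface DownloadProgressResponse {
  downloaded_bytes: number;
  expected_bytes: number;
  progress: number; // 0.0 .. 1.0
}

interface DownloadState {
  downloadedBytes: number;
  expectedBytes: number;
  percent: number | null; // null when the total size is unknown
}

export function toDownloadState(r: DownloadProgressResponse): DownloadState {
  return {
    downloadedBytes: r.downloaded_bytes,
    expectedBytes: r.expected_bytes,
    // expected_bytes: 0 means the HF API could not determine the size,
    // in which case the UI falls back to a bytes-only label with no bar.
    percent: r.expected_bytes > 0 ? Math.round(r.progress * 100) : null,
  };
}
```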

No new endpoints, no new dependencies, no backend restart required. Studio serves studio/frontend/dist/ as static files, so a fresh bun run build is picked up on the next page load.

Edge cases handled

  • Model already cached: Bar is hidden entirely (downloadedBytes === 0 from the endpoint), overlay transitions straight to training. No flicker.
  • Fresh download: Bar appears within ~1.5 s, climbs from 0 to 100 percent.
  • Local path (/models/foo): Regex rejects it, no polling, no bar. Existing behavior unchanged.
  • Empty model name: Regex rejects, no polling.
  • Network error from polling endpoint: try/catch swallows, bar freezes at last value.
  • HF API cannot determine size (private model with no token, etc.): Endpoint returns expected_bytes: 0, UI falls back to "X.X GB downloaded" with no percent and no bar.
  • User cancels mid download: Overlay unmounts, React cleanup clears the interval.
  • User starts a second run with a different model: modelName dependency changes, cleanup fires, fresh poll starts.
  • Sharded model (multiple safetensors files): Endpoint sums all blobs, so the total is accurate and the bar advances smoothly.
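A minimal sketch of what the formatBytes helper mentioned in the implementation notes might look like. The exact unit thresholds and rounding are assumptions; the PR does not show the implementation:

```typescript
// Hypothetical sketch of the formatBytes helper for B / KB / MB / GB output.
export function formatBytes(bytes: number): string {
  const units = ["B", "KB", "MB", "GB"];
  let value = bytes;
  let i = 0;
  while (value >= 1024 && i < units.length - 1) {
    value /= 1024;
    i++;
  }
  // Whole bytes need no decimal; larger units get one decimal place.
  return i === 0 ? `${value} B` : `${value.toFixed(1)} ${units[i]}`;
}
```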

What is explicitly not changed

  • No backend Python code
  • No worker subprocess code
  • No pump thread or SSE generator changes
  • No new endpoints
  • No new dependencies
  • No backend restart
  • live-training-view.tsx is unchanged
  • The runtime store schema is unchanged
  • All existing training functionality is untouched

Test plan

  • cd studio/frontend && bun run build (which runs tsc -b && vite build) completes cleanly with zero TypeScript errors
  • Cached model: start a training run with a model already in ~/.cache/huggingface/hub/. Confirm the overlay transitions straight to training with no progress block flash
  • Fresh download: clear ~/.cache/huggingface/hub/models--unsloth--<small-model> and start a training run with that model. Confirm the bar appears within ~1.5 s and advances smoothly to 100 percent. Confirm GET /api/models/download-progress?repo_id=... fires every ~1.5 s in the Network tab
  • Local path: start a training run with a model loaded from a local directory. Confirm no download-progress requests are made and no bar appears
  • Cancel mid download: click the X on the overlay during a download. Confirm the polling stops in the Network tab
  • Backend stability: confirm logs/studio_backend.log shows no new errors after the change

During the training setup phase, the overlay only displayed a static
"Loading model..." line while model weights were being downloaded from
Hugging Face. On slow connections this looked like the app had frozen.

This adds a small self-contained progress block inside the existing
TrainingStartOverlay that polls the existing
GET /api/models/download-progress endpoint and renders a Progress bar
with bytes downloaded, total bytes, and percent complete.

Notes:

- Frontend only change. No backend, worker, SSE, or runtime store edits.
- Reuses the existing getDownloadProgress client wrapper and the
  existing /api/models/download-progress endpoint that already scans
  the HF blob cache for completed and .incomplete files.
- selectedModel is read directly from useTrainingConfigStore inside the
  overlay, so no prop drilling and live-training-view.tsx is unchanged.
- Polling runs at 1500 ms and is gated on the HF repo regex
  (^[A-Za-z0-9._-]+/[A-Za-z0-9._-]+$), the same regex the backend uses,
  so local paths and empty form state never hit the endpoint.
- Polling stops once progress reaches 1.0 so the bar can stay at 100%
  until the overlay hides on the first training step.
- Network errors are silently swallowed, matching the chat side flow
  (the bar simply freezes at the last value).
- When downloadedBytes is 0 the block is hidden entirely, so cached
  models do not flash a progress bar.
- When the HF API cannot determine the total size, the block falls
  back to "X downloaded" with no percent and no bar.
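The label behavior described in the last two notes could be sketched as follows. The helper name is hypothetical and the byte formatter is injected purely so the logic is self-contained; the actual component composes these inline:

```typescript
// Hypothetical sketch of the overlay's progress label: "1.2 GB / 4.5 GB - 27%"
// when the total is known, "X downloaded" when expected_bytes is 0.
export function downloadLabel(
  downloadedBytes: number,
  expectedBytes: number,
  formatBytes: (b: number) => string,
): string {
  if (expectedBytes > 0) {
    const pct = Math.round((downloadedBytes / expectedBytes) * 100);
    return `${formatBytes(downloadedBytes)} / ${formatBytes(expectedBytes)} - ${pct}%`;
  }
  // Total size unknown (e.g. private model with no token): no percent, no bar.
  return `${formatBytes(downloadedBytes)} downloaded`;
}
```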

Verified with bun run build (tsc -b plus vite build, no TypeScript
errors).

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a model download progress indicator to the training start overlay. It adds a custom hook, useModelDownloadProgress, which polls the backend for download status and updates the UI with a progress bar and formatted byte counts. A review comment suggests resetting the download state when the model name changes to prevent the UI from displaying stale progress data from previous runs.

if (!modelName || !HF_REPO_REGEX.test(modelName) || !shouldPoll) {
return;
}



Severity: medium

When modelName changes or a new training run starts, the state should be reset to avoid showing stale progress data from a previous run while waiting for the first poll of the new model to complete. This prevents the progress bar from flickering with old values, ensuring a consistent user experience during transient states.

Suggested change
setState(EMPTY_DOWNLOAD_STATE);
References
  1. When a UI element depends on data from the backend, provide a reasonable fallback or reset state to handle transient states like waiting for a backend response to avoid poor user experience.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f13e709ac2


<AnimatedSpan className="mt-2 text-muted-foreground">
{`> ${message || "starting training..."} | waiting for first step... (${currentStep})`}
</AnimatedSpan>
{download.downloadedBytes > 0 ? (


P2: Avoid showing download banner for fully cached models

The new render guard download.downloadedBytes > 0 also matches models that are already fully cached, because /api/models/download-progress reports nonzero downloaded_bytes for completed blobs as well. In that case the overlay now shows “Downloading model weights...” with 100% even though no download is happening, which is misleading for every cached-model start and can look like unnecessary startup work.
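One possible shape of a stricter guard addressing this finding, not part of the PR: treat the model as actively downloading only while the downloaded byte count is nonzero and strictly below the expected total:

```typescript
// Hypothetical guard for the Codex finding: a fully cached model reports
// downloaded_bytes === expected_bytes, so strict inequality hides the banner.
export function isActivelyDownloading(
  downloadedBytes: number,
  expectedBytes: number,
): boolean {
  return (
    downloadedBytes > 0 &&
    expectedBytes > 0 &&
    downloadedBytes < expectedBytes
  );
}
```

A complete fix would likely also remember whether a download was actually observed during the current run, so the bar can still sit at 100% after a fresh download finishes, as the PR intends.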

