
studio: show HF model download progress in training start overlay#4894

Open
danielhanchen wants to merge 1 commit into main from studio/training-overlay-download-progress

Conversation

@danielhanchen
Contributor

Summary

The training start overlay used to show a static "Loading model..." line while model weights were being pulled from Hugging Face. On slow connections this looked like Studio had frozen, with no indication that anything was happening.

This adds a small progress block inside TrainingStartOverlay that polls the existing GET /api/models/download-progress endpoint and shows bytes downloaded, total bytes, and percent complete with a Progress bar.

Single file change. No backend, worker, SSE, or runtime store edits.

Before / After

Before:

> unsloth training starts...
==((====))==
   \\   /|
O^O/ \_/ \
\        /
 "-____-"
> Preparing model and dataset...
> We are getting everything ready for your run...
> Loading model... | waiting for first step... (0)

After (mid download):

> unsloth training starts...
==((====))==
   \\   /|
O^O/ \_/ \
\        /
 "-____-"
> Preparing model and dataset...
> We are getting everything ready for your run...
> Loading model... | waiting for first step... (0)
  Downloading model weights...           1.2 GB / 4.5 GB - 27%
  [=========>                                            ]

Implementation

All changes are in studio/frontend/src/features/studio/training-start-overlay.tsx:

  • New useModelDownloadProgress(modelName) hook, kept local to this file since there is only one consumer.
  • Polls getDownloadProgress(modelName) every 1500 ms while the overlay is mounted and the runtime is in a starting or preparing phase (configuring, downloading_model, downloading_dataset, loading_model, loading_dataset).
  • Gated on the HF repo regex ^[A-Za-z0-9._-]+/[A-Za-z0-9._-]+$, the same regex the backend already uses in _VALID_REPO_ID. Local paths and empty form state never hit the endpoint.
  • Polling stops once progress >= 1.0 so the bar can stay at 100% until the overlay hides on the first training step.
  • Network errors are silently swallowed, matching the chat side flow in use-chat-model-runtime.ts. The bar simply freezes at the last value rather than disappearing.
  • Cleanup runs on unmount and on modelName change so a new run with a different model starts a fresh poll.
  • New formatBytes helper for B / KB / MB / GB output.
  • selectedModel is read directly from useTrainingConfigStore inside the overlay. live-training-view.tsx is unchanged and there is no prop drilling.
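The gating and stop conditions described above can be sketched as a pure helper. The regex and the phase names come from the PR description; the function and set names here are hypothetical, extracted from the hook for illustration:

```typescript
// Sketch of the polling gate described above. HF_REPO_REGEX and the phase
// strings are from the PR; shouldPollDownloadProgress is a hypothetical name.
const HF_REPO_REGEX = /^[A-Za-z0-9._-]+\/[A-Za-z0-9._-]+$/;

const POLLABLE_PHASES = new Set([
  "configuring",
  "downloading_model",
  "downloading_dataset",
  "loading_model",
  "loading_dataset",
]);

export function shouldPollDownloadProgress(
  modelName: string,
  phase: string,
  progress: number,
): boolean {
  // Local paths and empty form state never hit the endpoint.
  if (!modelName || !HF_REPO_REGEX.test(modelName)) return false;
  // Only poll while the runtime is in a starting or preparing phase.
  if (!POLLABLE_PHASES.has(phase)) return false;
  // Stop once complete so the bar can stay at 100%.
  if (progress >= 1.0) return false;
  return true;
}
```

In the actual hook this predicate would sit at the top of the polling effect, returning early before any interval is scheduled.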

Reused, not added

  • getDownloadProgress from studio/frontend/src/features/chat/api/chat-api.ts
  • Progress from studio/frontend/src/components/ui/progress.tsx
  • useTrainingConfigStore and useTrainingRuntimeStore from @/features/training
  • GET /api/models/download-progress in studio/backend/routes/models.py (auth gated, scans the HF blob cache for completed and .incomplete files, returns {downloaded_bytes, expected_bytes, progress})
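The endpoint's response shape, as described above, and one way the hook might normalize it for rendering. The snake_case field names are from the PR description; the TypeScript interfaces and the normalizer are assumptions sketched for illustration:

```typescript
// Shape of the GET /api/models/download-progress payload per the PR
// description; the interface and helper names are hypothetical.
interface DownloadProgressResponse {
  downloaded_bytes: number;
  expected_bytes: number;
  progress: number; // 0.0 .. 1.0
}

interface DownloadState {
  downloadedBytes: number;
  expectedBytes: number;
  percent: number | null; // null when the total size is unknown
}

export function toDownloadState(r: DownloadProgressResponse): DownloadState {
  return {
    downloadedBytes: r.downloaded_bytes,
    expectedBytes: r.expected_bytes,
    // expected_bytes: 0 means the HF API could not determine the size,
    // in which case the UI falls back to a bytes-only label with no bar.
    percent: r.expected_bytes > 0 ? Math.round(r.progress * 100) : null,
  };
}
```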

No new endpoints, no new dependencies, no backend restart required. Studio serves studio/frontend/dist/ as static files, so a fresh bun run build is picked up on the next page load.

Edge cases handled

  • Model already cached: Bar is hidden entirely (downloadedBytes === 0 from the endpoint), overlay transitions straight to training. No flicker.
  • Fresh download: Bar appears within ~1.5 s, climbs from 0 to 100 percent.
  • Local path (/models/foo): Regex rejects it, no polling, no bar. Existing behavior unchanged.
  • Empty model name: Regex rejects, no polling.
  • Network error from polling endpoint: try/catch swallows, bar freezes at last value.
  • HF API cannot determine size (private model with no token, etc.): Endpoint returns expected_bytes: 0, UI falls back to "X.X GB downloaded" with no percent and no bar.
  • User cancels mid download: Overlay unmounts, React cleanup clears the interval.
  • User starts a second run with a different model: modelName dependency changes, cleanup fires, fresh poll starts.
  • Sharded model (multiple safetensors files): Endpoint sums all blobs, so the total is accurate and the bar advances smoothly.
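A minimal sketch of what the formatBytes helper mentioned in the implementation notes might look like. The exact unit thresholds and rounding are assumptions; the PR does not show the implementation:

```typescript
// Hypothetical sketch of the formatBytes helper for B / KB / MB / GB output.
export function formatBytes(bytes: number): string {
  const units = ["B", "KB", "MB", "GB"];
  let value = bytes;
  let i = 0;
  while (value >= 1024 && i < units.length - 1) {
    value /= 1024;
    i++;
  }
  // Whole bytes need no decimal; larger units get one decimal place.
  return i === 0 ? `${value} B` : `${value.toFixed(1)} ${units[i]}`;
}
```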

What is explicitly not changed

  • No backend Python code
  • No worker subprocess code
  • No pump thread or SSE generator changes
  • No new endpoints
  • No new dependencies
  • No backend restart
  • live-training-view.tsx is unchanged
  • The runtime store schema is unchanged
  • All existing training functionality is untouched

Test plan

  • cd studio/frontend && bun run build (which runs tsc -b && vite build) completes cleanly with zero TypeScript errors
  • Cached model: start a training run with a model already in ~/.cache/huggingface/hub/. Confirm the overlay transitions straight to training with no progress block flash
  • Fresh download: clear ~/.cache/huggingface/hub/models--unsloth--<small-model> and start a training run with that model. Confirm the bar appears within ~1.5 s and advances smoothly to 100 percent. Confirm GET /api/models/download-progress?repo_id=... fires every ~1.5 s in the Network tab
  • Local path: start a training run with a model loaded from a local directory. Confirm no download-progress requests are made and no bar appears
  • Cancel mid download: click the X on the overlay during a download. Confirm the polling stops in the Network tab
  • Backend stability: confirm logs/studio_backend.log shows no new errors after the change

During the training setup phase, the overlay only displayed a static
"Loading model..." line while model weights were being downloaded from
Hugging Face. On slow connections this looked like the app had frozen.

This adds a small self-contained progress block inside the existing
TrainingStartOverlay that polls the existing
GET /api/models/download-progress endpoint and renders a Progress bar
with bytes downloaded, total bytes, and percent complete.

Notes:

- Frontend only change. No backend, worker, SSE, or runtime store edits.
- Reuses the existing getDownloadProgress client wrapper and the
  existing /api/models/download-progress endpoint that already scans
  the HF blob cache for completed and .incomplete files.
- selectedModel is read directly from useTrainingConfigStore inside the
  overlay, so no prop drilling and live-training-view.tsx is unchanged.
- Polling runs at 1500 ms and is gated on the HF repo regex
  (^[A-Za-z0-9._-]+/[A-Za-z0-9._-]+$), the same regex the backend uses,
  so local paths and empty form state never hit the endpoint.
- Polling stops once progress reaches 1.0 so the bar can stay at 100%
  until the overlay hides on the first training step.
- Network errors are silently swallowed, matching the chat side flow
  (the bar simply freezes at the last value).
- When downloadedBytes is 0 the block is hidden entirely, so cached
  models do not flash a progress bar.
- When the HF API cannot determine the total size, the block falls
  back to "X downloaded" with no percent and no bar.
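The label behavior described in the last two notes could be sketched as follows. The helper name is hypothetical and the byte formatter is injected purely so the logic is self-contained; the actual component composes these inline:

```typescript
// Hypothetical sketch of the overlay's progress label: "1.2 GB / 4.5 GB - 27%"
// when the total is known, "X downloaded" when expected_bytes is 0.
export function downloadLabel(
  downloadedBytes: number,
  expectedBytes: number,
  formatBytes: (b: number) => string,
): string {
  if (expectedBytes > 0) {
    const pct = Math.round((downloadedBytes / expectedBytes) * 100);
    return `${formatBytes(downloadedBytes)} / ${formatBytes(expectedBytes)} - ${pct}%`;
  }
  // Total size unknown (e.g. private model with no token): no percent, no bar.
  return `${formatBytes(downloadedBytes)} downloaded`;
}
```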

Verified with bun run build (tsc -b plus vite build, no TypeScript
errors).

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a model download progress indicator to the training start overlay. It adds a custom hook, useModelDownloadProgress, which polls the backend for download status and updates the UI with a progress bar and formatted byte counts. A review comment suggests resetting the download state when the model name changes to prevent the UI from displaying stale progress data from previous runs.

if (!modelName || !HF_REPO_REGEX.test(modelName) || !shouldPoll) {
return;
}



Severity: medium

When modelName changes or a new training run starts, the state should be reset to avoid showing stale progress data from a previous run while waiting for the first poll of the new model to complete. This prevents the progress bar from flickering with old values, ensuring a consistent user experience during transient states.

Suggested change
setState(EMPTY_DOWNLOAD_STATE);
References
  1. When a UI element depends on data from the backend, provide a reasonable fallback or reset state to handle transient states like waiting for a backend response to avoid poor user experience.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f13e709ac2


<AnimatedSpan className="mt-2 text-muted-foreground">
{`> ${message || "starting training..."} | waiting for first step... (${currentStep})`}
</AnimatedSpan>
{download.downloadedBytes > 0 ? (


P2: Avoid showing download banner for fully cached models

The new render guard download.downloadedBytes > 0 also matches models that are already fully cached, because /api/models/download-progress reports nonzero downloaded_bytes for completed blobs as well. In that case the overlay now shows “Downloading model weights...” with 100% even though no download is happening, which is misleading for every cached-model start and can look like unnecessary startup work.
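One possible shape of a stricter guard addressing this finding, not part of the PR: treat the model as actively downloading only while the downloaded byte count is nonzero and strictly below the expected total:

```typescript
// Hypothetical guard for the Codex finding: a fully cached model reports
// downloaded_bytes === expected_bytes, so strict inequality hides the banner.
export function isActivelyDownloading(
  downloadedBytes: number,
  expectedBytes: number,
): boolean {
  return (
    downloadedBytes > 0 &&
    expectedBytes > 0 &&
    downloadedBytes < expectedBytes
  );
}
```

A complete fix would likely also remember whether a download was actually observed during the current run, so the bar can still sit at 100% after a fresh download finishes, as the PR intends.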

