
[CLI] [API] Add HfApi.copy_files method to copy files remotely and update 'hf buckets cp' #3874

Open
Wauplin wants to merge 23 commits into main from feat/hfapi-copy-files

Conversation

Contributor

@Wauplin Wauplin commented Mar 2, 2026

Note: requires https://github.com/huggingface-internal/moon-landing/pull/17593 to be merged first


This PR adds a new HfApi.copy_files API and extends hf buckets cp to support remote HF-handle copy workflows.

  • Copy from bucket to bucket (same bucket or different bucket)
  • Copy from repo (model/dataset/space) to bucket
  • Reject bucket->repo and repo->repo destinations (not supported yet)

If the source is a file, it is copied directly; if it is a directory, all files under it are copied recursively.

  • Repo source file with xet_hash: copied directly by hash
  • Repo source file without xet_hash (regular small file): download then re-upload
  • Bucket to bucket: always copied by hash

See the https://github.com/huggingface-internal/moon-landing/pull/17593#issue-4201288199 PR description for a working test.
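The routing rules above can be sketched as a small dispatch function. This is an illustrative sketch only, not the actual implementation: `is_bucket`, `CopyPlan`, and `plan_copy` are hypothetical names, and the real `copy_files` decides hash-copy vs. download+reupload per file based on `xet_hash`.

```python
# Hypothetical sketch of the routing rules described in the PR description.
# Names (`is_bucket`, `CopyPlan`, `plan_copy`) are illustrative, not real API.
from dataclasses import dataclass


def is_bucket(handle: str) -> bool:
    # Assumption: bucket handles look like hf://buckets/<namespace>/<bucket>/...
    return handle.startswith("hf://buckets/")


@dataclass
class CopyPlan:
    source: str
    destination: str
    strategy: str


def plan_copy(source: str, destination: str) -> CopyPlan:
    if not is_bucket(destination):
        # bucket->repo and repo->repo are rejected: destination must be a bucket
        raise ValueError("Copying to a repo is not supported yet. Destination must be a bucket.")
    if is_bucket(source):
        # bucket->bucket: always copied by hash
        return CopyPlan(source, destination, "hash-copy")
    # repo->bucket: hash-copy when the file is Xet-backed, otherwise
    # download + re-upload (decided per file in the real implementation)
    return CopyPlan(source, destination, "hash-copy-or-download-reupload")
```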


Note

Medium Risk
Introduces new bucket mutation paths (copyFile operations) and expands CLI behavior to allow remote-to-remote copies, which could affect data placement/overwrites if path resolution or handle parsing is wrong. Changes are contained to bucket tooling but touch upload/batch logic and revision parsing edge-cases.

Overview
Adds a new public copy_files API to copy files/folders between Hub sources and bucket destinations using hf://... handles, performing server-side hash copies when Xet-backed and falling back to download+reupload for non-Xet repo files.

Extends batch_bucket_files and the internal /batch payload to support copy operations (copyFile) alongside add/delete, adjusts batching order (copy → add → delete), and tweaks add-file mtime handling.

Updates hf buckets cp to support remote hf:// → remote hf:// copy (including repo→bucket and bucket→bucket), exports copy_files from huggingface_hub, centralizes SPECIAL_REFS_REVISION_REGEX, and updates docs/tests to cover the new workflows and constraints (destination must be a bucket).

Reviewed by Cursor Bugbot for commit a996a0f. Bugbot is set up for automated code reviews on this repo.


bot-ci-comment bot commented Mar 2, 2026

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Contributor Author

Wauplin commented Mar 3, 2026

Current status: cross-bucket copies do not work. An extra call to CAS is needed to register the xet hash in the destination bucket.

@Wauplin Wauplin changed the title [API] Add HfApi.copy_files method to copy files remotely [CLI] [API] Add HfApi.copy_files method to copy files remotely and update 'hf buckets cp' Apr 7, 2026
@Wauplin Wauplin requested a review from hanouticelina April 7, 2026 14:44
@Wauplin Wauplin marked this pull request as ready for review April 7, 2026 14:44
Comment on lines +12604 to +12605
else:
all_adds.append((_download_from_repo(file.path), target_path))
Contributor Author

sub-optimal: this could be parallelized, but that's not something we want to optimize for now



def _parse_hf_copy_handle(hf_handle: str) -> _BucketCopyHandle | _RepoCopyHandle:
# TODO: Harmonize hf:// parsing. See https://github.com/huggingface/huggingface_hub/issues/3971
Contributor Author

yes, #3971 is getting high in my priorities 🙈

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 3 potential issues.


Reviewed by Cursor Bugbot for commit b46f914.

try:
self.mtime = int(os.path.getmtime(self.source) * 1000)
except FileNotFoundError:
pass

Path objects lose file mtime due to isinstance check

Medium Severity

The mtime detection in _BucketAddFile.__post_init__ now checks isinstance(self.source, str) but the old code checked not isinstance(self.source, bytes), which covered both str and Path. Since the public batch_bucket_files API accepts Path objects as source, passing a Path will now always use time.time() instead of the actual file modification time from os.path.getmtime.


Reviewed by Cursor Bugbot for commit b46f914.
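The reported regression can be reproduced with a stripped-down stand-in for `_BucketAddFile` (the class below is illustrative, not the real one). Checking only `isinstance(source, str)` makes `Path` sources fall back to the current time; the sketch keeps the pre-PR behaviour of covering both `str` and `Path`:

```python
# Stand-in for _BucketAddFile illustrating the fix: isinstance(source, (str, Path))
# covers both cases, matching the pre-PR `not isinstance(source, bytes)` check.
import os
import time
from dataclasses import dataclass, field
from pathlib import Path
from typing import Union


@dataclass
class AddFile:
    source: Union[str, bytes, Path]
    mtime: int = field(init=False)

    def __post_init__(self) -> None:
        # Default: "now" in milliseconds (used for in-memory bytes sources)
        self.mtime = int(time.time() * 1000)
        if isinstance(self.source, (str, Path)):
            try:
                self.mtime = int(os.path.getmtime(self.source) * 1000)
            except FileNotFoundError:
                pass  # nonexistent path: keep the current time
```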

Contributor

@hanouticelina hanouticelina left a comment

Made a first pass!

if isinstance(self.source, str):
try:
self.mtime = int(os.path.getmtime(self.source) * 1000)
except FileNotFoundError:
Contributor

not sure I understand why we are catching FileNotFoundError here

Comment on lines +444 to +457
if revision is None:
revision = constants.DEFAULT_REVISION
elif remaining_parts:
maybe_special_ref = f"{unquote(revision)}/{remaining_parts[0]}"
match = SPECIAL_REFS_REVISION_REGEX.match(maybe_special_ref)
if match is not None:
special_ref = match.group()
revision = special_ref
suffix = maybe_special_ref.removeprefix(special_ref).lstrip("/")
remaining_parts = ([suffix] if suffix else []) + remaining_parts[1:]
else:
revision = unquote(revision)
else:
revision = unquote(revision)
Contributor

Suggested change
if revision is None:
revision = constants.DEFAULT_REVISION
elif remaining_parts:
maybe_special_ref = f"{unquote(revision)}/{remaining_parts[0]}"
match = SPECIAL_REFS_REVISION_REGEX.match(maybe_special_ref)
if match is not None:
special_ref = match.group()
revision = special_ref
suffix = maybe_special_ref.removeprefix(special_ref).lstrip("/")
remaining_parts = ([suffix] if suffix else []) + remaining_parts[1:]
else:
revision = unquote(revision)
else:
revision = unquote(revision)
if revision is None:
revision = constants.DEFAULT_REVISION
else:
revision = unquote(revision)
if remaining_parts:
maybe_special_ref = f"{revision}/{remaining_parts[0]}"
match = SPECIAL_REFS_REVISION_REGEX.match(maybe_special_ref)
if match is not None:
revision = match.group()
suffix = maybe_special_ref.removeprefix(revision).lstrip("/")
remaining_parts = ([suffix] if suffix else []) + remaining_parts[1:]
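The suggested control flow can be tried out in a self-contained sketch. The regex below is a simplified stand-in for `SPECIAL_REFS_REVISION_REGEX` (here it only matches `refs/pr/<n>` style refs); the real pattern lives in huggingface_hub.

```python
# Self-contained sketch of the suggested flow: unquote first, then try to
# re-join a special ref that was split across path segments.
import re
from urllib.parse import unquote

DEFAULT_REVISION = "main"
SPECIAL_REFS_RE = re.compile(r"^refs/pr/\d+")  # simplified stand-in


def resolve_revision(revision, remaining_parts):
    if revision is None:
        return DEFAULT_REVISION, remaining_parts
    revision = unquote(revision)
    if remaining_parts:
        # A special ref like "refs/pr/4" may have been split across two
        # path segments; try to re-join and match it.
        maybe_special_ref = f"{revision}/{remaining_parts[0]}"
        match = SPECIAL_REFS_RE.match(maybe_special_ref)
        if match is not None:
            revision = match.group()
            suffix = maybe_special_ref.removeprefix(revision).lstrip("/")
            remaining_parts = ([suffix] if suffix else []) + remaining_parts[1:]
    return revision, remaining_parts
```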

is not None
)

all_adds: list[_BucketAddFile | tuple[str, str]] = []
Contributor

looks like we only append tuples, no? (line 12622: all_adds.append((_download_from_repo(file.path), target_path)))

Suggested change
all_adds: list[_BucketAddFile | tuple[str, str]] = []
all_adds: list[tuple[str, str]] = []

raise typer.BadParameter("Remote-to-remote copy not supported.")
# Remote to remote copy
if src_is_hf and dst_is_hf:
assert dst is not None
Contributor

not needed, we already do dst_is_hf = dst is not None and _is_hf_handle(dst)

Suggested change
assert dst is not None

Contributor

or maybe it's the linter that's not happy? If that's the case, let's ignore the linting error instead of adding an assert

return rel_path
return f"{destination_path.rstrip('/')}/{rel_path}"

def _copy_by_hash(
Contributor

(nit) the name _copy_by_hash suggests it performs the copy, but it only builds a _BucketCopyFile; maybe we can rename it _build_copy_op or something like that

return bucket.bucket_id


@pytest.fixture(scope="function")
Contributor

function scope is the default

Suggested change
@pytest.fixture(scope="function")

Comment on lines +12547 to +12557
destination_is_directory = (
next(
iter(
self.list_bucket_tree(
destination_bucket_id, prefix=destination_path, recursive=False, token=token
)
),
None,
)
is not None
)
Contributor

(nit)

Suggested change
destination_is_directory = (
next(
iter(
self.list_bucket_tree(
destination_bucket_id, prefix=destination_path, recursive=False, token=token
)
),
None,
)
is not None
)
destination_is_directory = any(
self.list_bucket_tree(destination_bucket_id, prefix=destination_path, recursive=False, token=token)
)

Contributor

any() returns True on the first truthy element, I think
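That's right: `any()` short-circuits on the first truthy element, so it consumes at most one entry from the generator, same as `next(iter(...), None) is not None`. A quick check with a counting generator:

```python
# any() stops at the first truthy element, so checking whether a bucket
# prefix has any entry pulls at most one item from the (lazy) listing.
def counting_gen(items, counter):
    for item in items:
        counter["n"] += 1
        yield item


counter = {"n": 0}
found = any(counting_gen(["entry-1", "entry-2", "entry-3"], counter))
# found is True and only one element was consumed
```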

if len(add) + len(delete) <= _BUCKET_BATCH_ADD_CHUNK_SIZE:
self._batch_bucket_files(bucket_id, add=add or None, delete=delete or None, token=token)
if len(add) + len(copy) + len(delete) <= _BUCKET_BATCH_ADD_CHUNK_SIZE:
self._batch_bucket_files(bucket_id, add=add or None, copy=copy or None, delete=delete or None, token=token) # type: ignore
Contributor

_batch_bucket_files already handles empty lists (not introduced by this PR)

Suggested change
self._batch_bucket_files(bucket_id, add=add or None, copy=copy or None, delete=delete or None, token=token) # type: ignore
self._batch_bucket_files(bucket_id, add=add, copy=copy, delete=delete, token=token) # type: ignore
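The chunked batching described in the overview (copy → add → delete, at most `_BUCKET_BATCH_ADD_CHUNK_SIZE` operations per `/batch` call) can be sketched as below. Names and the chunk size are illustrative stand-ins, not the real implementation.

```python
# Hypothetical sketch of chunked batching: operations are flushed in
# copy -> add -> delete order, at most CHUNK_SIZE ops per /batch payload.
CHUNK_SIZE = 4  # stand-in for _BUCKET_BATCH_ADD_CHUNK_SIZE


def batch_ops(copy, add, delete, chunk_size=CHUNK_SIZE):
    """Yield {"copy": [...], "add": [...], "delete": [...]} payloads,
    each holding at most chunk_size operations total."""
    ordered = (
        [("copy", op) for op in copy]
        + [("add", op) for op in add]
        + [("delete", op) for op in delete]
    )
    for i in range(0, len(ordered), chunk_size):
        payload = {"copy": [], "add": [], "delete": []}
        for kind, op in ordered[i : i + chunk_size]:
            payload[kind].append(op)
        yield payload
```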
