[CLI] [API] Add HfApi.copy_files method to copy files remotely and update 'hf buckets cp' #3874
[CLI] [API] Add HfApi.copy_files method to copy files remotely and update 'hf buckets cp' #3874
HfApi.copy_files method to copy files remotely and update 'hf buckets cp' #3874Conversation
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
|
Current status: x-bucket copies do not work. Need an extra call to CAS to tell about the xethash being registered in destination bucket |
HfApi.copy_files method to copy files remotelyHfApi.copy_files method to copy files remotely and update 'hf buckets cp'
| else: | ||
| all_adds.append((_download_from_repo(file.path), target_path)) |
There was a problem hiding this comment.
sub-optimal: could be parallelize but that's not something we want to optimize for now
|
|
||
|
|
||
| def _parse_hf_copy_handle(hf_handle: str) -> _BucketCopyHandle | _RepoCopyHandle: | ||
| # TODO: Harmonize hf:// parsing. See https://github.com/huggingface/huggingface_hub/issues/3971 |
There was a problem hiding this comment.
yes, #3971 is getting high in my priorities 🙈
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 3 potential issues.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit b46f914. Configure here.
| try: | ||
| self.mtime = int(os.path.getmtime(self.source) * 1000) | ||
| except FileNotFoundError: | ||
| pass |
There was a problem hiding this comment.
Path objects lose file mtime due to isinstance check
Medium Severity
The mtime detection in _BucketAddFile.__post_init__ now checks isinstance(self.source, str) but the old code checked not isinstance(self.source, bytes), which covered both str and Path. Since the public batch_bucket_files API accepts Path objects as source, passing a Path will now always use time.time() instead of the actual file modification time from os.path.getmtime.
Reviewed by Cursor Bugbot for commit b46f914. Configure here.
hanouticelina
left a comment
There was a problem hiding this comment.
Made a first pass!
| if isinstance(self.source, str): | ||
| try: | ||
| self.mtime = int(os.path.getmtime(self.source) * 1000) | ||
| except FileNotFoundError: |
There was a problem hiding this comment.
not sure I understand why we are catching FileNotFoundError here
| if revision is None: | ||
| revision = constants.DEFAULT_REVISION | ||
| elif remaining_parts: | ||
| maybe_special_ref = f"{unquote(revision)}/{remaining_parts[0]}" | ||
| match = SPECIAL_REFS_REVISION_REGEX.match(maybe_special_ref) | ||
| if match is not None: | ||
| special_ref = match.group() | ||
| revision = special_ref | ||
| suffix = maybe_special_ref.removeprefix(special_ref).lstrip("/") | ||
| remaining_parts = ([suffix] if suffix else []) + remaining_parts[1:] | ||
| else: | ||
| revision = unquote(revision) | ||
| else: | ||
| revision = unquote(revision) |
There was a problem hiding this comment.
| if revision is None: | |
| revision = constants.DEFAULT_REVISION | |
| elif remaining_parts: | |
| maybe_special_ref = f"{unquote(revision)}/{remaining_parts[0]}" | |
| match = SPECIAL_REFS_REVISION_REGEX.match(maybe_special_ref) | |
| if match is not None: | |
| special_ref = match.group() | |
| revision = special_ref | |
| suffix = maybe_special_ref.removeprefix(special_ref).lstrip("/") | |
| remaining_parts = ([suffix] if suffix else []) + remaining_parts[1:] | |
| else: | |
| revision = unquote(revision) | |
| else: | |
| revision = unquote(revision) | |
| if revision is None: | |
| revision = constants.DEFAULT_REVISION | |
| else: | |
| revision = unquote(revision) | |
| if remaining_parts: | |
| maybe_special_ref = f"{revision}/{remaining_parts[0]}" | |
| match = SPECIAL_REFS_REVISION_REGEX.match(maybe_special_ref) | |
| if match is not None: | |
| revision = match.group() | |
| suffix = maybe_special_ref.removeprefix(revision).lstrip("/") | |
| remaining_parts = ([suffix] if suffix else []) + remaining_parts[1:] |
| is not None | ||
| ) | ||
|
|
||
| all_adds: list[_BucketAddFile | tuple[str, str]] = [] |
There was a problem hiding this comment.
looks like we only append tuples no? (line 12622 all_adds.append((_download_from_repo(file.path), target_path)))
| all_adds: list[_BucketAddFile | tuple[str, str]] = [] | |
| all_adds: list[tuple[str, str]] = [] |
| raise typer.BadParameter("Remote-to-remote copy not supported.") | ||
| # Remote to remote copy | ||
| if src_is_hf and dst_is_hf: | ||
| assert dst is not None |
There was a problem hiding this comment.
not needed, we already do dst_is_hf = dst is not None and _is_hf_handle(dst)
| assert dst is not None |
There was a problem hiding this comment.
or maybe it's the linter who's not happy? if it's the case, let's ignore the linting error instead of having an assert
| return rel_path | ||
| return f"{destination_path.rstrip('/')}/{rel_path}" | ||
|
|
||
| def _copy_by_hash( |
There was a problem hiding this comment.
(nit) _copy_by_hash name suggests it performs the copy but it only builds _BucketCopyFile, maybe we can rename it _build_copy_op or something like that
| return bucket.bucket_id | ||
|
|
||
|
|
||
| @pytest.fixture(scope="function") |
There was a problem hiding this comment.
function scope is the default
| @pytest.fixture(scope="function") |
| destination_is_directory = ( | ||
| next( | ||
| iter( | ||
| self.list_bucket_tree( | ||
| destination_bucket_id, prefix=destination_path, recursive=False, token=token | ||
| ) | ||
| ), | ||
| None, | ||
| ) | ||
| is not None | ||
| ) |
There was a problem hiding this comment.
(nit)
| destination_is_directory = ( | |
| next( | |
| iter( | |
| self.list_bucket_tree( | |
| destination_bucket_id, prefix=destination_path, recursive=False, token=token | |
| ) | |
| ), | |
| None, | |
| ) | |
| is not None | |
| ) | |
| destination_is_directory = any( | |
| self.list_bucket_tree(destination_bucket_id, prefix=destination_path, recursive=False, token=token) | |
| ) |
There was a problem hiding this comment.
any() returns True on first yield I think
| if len(add) + len(delete) <= _BUCKET_BATCH_ADD_CHUNK_SIZE: | ||
| self._batch_bucket_files(bucket_id, add=add or None, delete=delete or None, token=token) | ||
| if len(add) + len(copy) + len(delete) <= _BUCKET_BATCH_ADD_CHUNK_SIZE: | ||
| self._batch_bucket_files(bucket_id, add=add or None, copy=copy or None, delete=delete or None, token=token) # type: ignore |
There was a problem hiding this comment.
_batch_bucket_files already handles empty lists (not introduced by this PR )
| self._batch_bucket_files(bucket_id, add=add or None, copy=copy or None, delete=delete or None, token=token) # type: ignore | |
| self._batch_bucket_files(bucket_id, add=add, copy=copy, delete=delete, token=token) # type: ignore |


Note: requires https://github.com/huggingface-internal/moon-landing/pull/17593 to be merged first
This PR adds a new
HfApi.copy_filesAPI and extendshf buckets cpto support remote HF-handle copy workflows.If source is a file, copies it. If a directory, recursively copy files under source folder.
xet_hash: copied directly by hashxet_hash(regular small file): download then re-uploadSee https://github.com/huggingface-internal/moon-landing/pull/17593#issue-4201288199 PR description for working test.
Note
Medium Risk
Introduces new bucket mutation paths (
copyFileoperations) and expands CLI behavior to allow remote-to-remote copies, which could affect data placement/overwrites if path resolution or handle parsing is wrong. Changes are contained to bucket tooling but touch upload/batch logic and revision parsing edge-cases.Overview
Adds a new public
copy_filesAPI to copy files/folders between Hub sources and bucket destinations usinghf://...handles, performing server-side hash copies when Xet-backed and falling back to download+reupload for non-Xet repo files.Extends
batch_bucket_filesand the internal/batchpayload to supportcopyoperations (copyFile) alongsideadd/delete, adjusts batching order (copy → add → delete), and tweaks add-file mtime handling.Updates
hf buckets cpto support remotehf://→ remotehf://copy (including repo→bucket and bucket→bucket), exportscopy_filesfromhuggingface_hub, centralizesSPECIAL_REFS_REVISION_REGEX, and updates docs/tests to cover the new workflows and constraints (destination must be a bucket).Reviewed by Cursor Bugbot for commit a996a0f. Bugbot is set up for automated code reviews on this repo. Configure here.