feat(cli): add datachain bucket status command #1717

amritghimire wants to merge 14 commits into main
Conversation
Add a new CLI command and public Python API to check bucket existence and access level without listing objects. Supports S3, GCS, and Azure. Closes #1715
Pull request overview
Adds a new bucket/container “status” capability across CLI and Python to detect existence and access level (anonymous/authenticated/denied) without listing objects, for S3/GCS/Azure.
Changes:
- Introduces BucketStatus plus provider-specific bucket_status() implementations for S3, GCS, and Azure.
- Adds Python API datachain.client.bucket_status(uri, **config) and exports it (and BucketStatus) from the top-level datachain package.
- Adds CLI parsing + handler for datachain bucket status <uri> and corresponding unit/functional tests.
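For illustration, the shape of such a status result might be sketched like this (field names here are assumptions for illustration, not the actual datachain API):

```python
from dataclasses import dataclass
from typing import Literal, Optional

# Hypothetical sketch of a bucket-status result; the real BucketStatus
# in datachain may differ in fields and naming.
@dataclass
class BucketStatusSketch:
    exists: bool
    access: Optional[Literal["anonymous", "authenticated", "denied"]]

def describe(status: BucketStatusSketch) -> str:
    """Render a status roughly the way a CLI might print it."""
    if not status.exists:
        return "not found"
    return f"exists ({status.access})"

print(describe(BucketStatusSketch(exists=True, access="anonymous")))
```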
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| tests/unit/test_client_bucket_status.py | Adds unit tests for provider-specific bucket status scenarios. |
| tests/unit/test_cli_parsing.py | Adds CLI parser tests for bucket status and --anon. |
| tests/func/test_bucket_status.py | Adds functional tests for S3 bucket status behavior. |
| src/datachain/client/s3.py | Implements S3 bucket_status and S3 kwargs normalization helper. |
| src/datachain/client/gcs.py | Implements GCS bucket_status with anon→auth probing. |
| src/datachain/client/azure.py | Implements Azure bucket_status with auth→anon probing. |
| src/datachain/client/fsspec.py | Adds BucketStatus type and base Client.bucket_status contract. |
| src/datachain/client/__init__.py | Adds bucket_status() API and exports BucketStatus. |
| src/datachain/cli/parser/__init__.py | Adds bucket status CLI subcommand and arguments. |
| src/datachain/cli/commands/bucket.py | Adds CLI command implementation for printing status + exit code. |
| src/datachain/cli/commands/__init__.py | Exports bucket_status_cmd. |
| src/datachain/cli/__init__.py | Wires new bucket command into CLI dispatcher. |
| src/datachain/__init__.py | Exports BucketStatus and bucket_status at top-level. |
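The anon-then-auth probing order described for GCS (and later adopted for Azure) can be sketched generically; the probe callables and the built-in exceptions below are illustrative stand-ins for the real provider calls, not the PR's implementation:

```python
# Sketch of the two-step probe order: try anonymous access first, then
# fall back to authenticated access. FileNotFoundError / PermissionError
# stand in for provider-specific "missing" / "forbidden" errors.
def probe_access(anon_probe, auth_probe):
    try:
        anon_probe()
        return {"exists": True, "access": "anonymous"}
    except FileNotFoundError:
        return {"exists": False, "access": None}
    except PermissionError:
        pass  # bucket exists, but anonymous access is denied
    try:
        auth_probe()
        return {"exists": True, "access": "authenticated"}
    except PermissionError:
        return {"exists": True, "access": "denied"}
```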
Deploying datachain with Cloudflare Pages

| Latest commit: | 3bc2e4d |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://7a74fb0a.datachain-2g6.pages.dev |
| Branch Preview URL: | https://feat-bucket-status-cmd.datachain-2g6.pages.dev |
…nd error semantics

- S3: forward caller kwargs (endpoint_url, region, etc.) to anonymous probe
- GCS: use create_fs for anonymous probe so endpoint config is preserved
- GCS: narrow broad except Exception to google.api_core Forbidden/PermissionDenied
- Azure: forward account_name and connection kwargs to anonymous probe
- Azure: return exists=True (not False) when anonymous probe gets PermissionError
- Update test assertion for Azure no-account-name + denied scenario
- Fix bucket_status() docstring: clarify Azure behavior when account_name absent
Codecov Report: ❌ Patch coverage is
is it just me or
@amritghimire The path component in the URI is silently ignored right now - s3://my_bucket/my_dir/dir1 only checks my_bucket and ignores the dirs. It should give an error when a path is presented.
Why? There is a good chance this functionality will be extended to file/directory existence checks in the future (like test -e / test -f / test -d in Unix), and silently discarding the path would make this command backward-incompatible.
Raise ValueError when a path component is present in the URI passed to bucket_status(), preventing silent data loss and future backward-incompatibility when directory/file existence checks are added. Uses client_cls.split_url() to detect the path component.
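A minimal, self-contained sketch of that validation, using urllib.parse as a stand-in for the real client_cls.split_url() helper:

```python
from urllib.parse import urlparse

def split_url(uri: str) -> tuple[str, str]:
    # Stand-in for client_cls.split_url(); the real helper lives on the
    # provider Client classes in datachain.
    parsed = urlparse(uri)
    return parsed.netloc, parsed.path.strip("/")

def validate_bucket_uri(uri: str) -> str:
    # Reject URIs that carry a path component, so only the bare bucket
    # (or container) is accepted.
    bucket, path = split_url(uri)
    if path:
        raise ValueError("path in a bucket is not allowed")
    return bucket
```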
dmpetrov left a comment
LG but I didn't get deeper in the code
- Change ValueError message to "path in a bucket is not allowed" (dmpetrov)
- Azure: try anonymous probe first, consistent with S3/GCS (shcheklein)
- Azure: use BlobServiceClient directly for anon probe to prevent adlfs from picking up credentials via AZURE_STORAGE_CONNECTION_STRING env var
- Azure: catch ClientAuthenticationError (HTTP 401) alongside PermissionError
- Add --account-name CLI flag for Azure anonymous access detection
- Extend func tests to cover gs and azure, not just s3 (shcheklein)
- Remove redundant S3 unit tests covered by func tests (shcheklein)
- Remove section-divider comment style in tests (shcheklein)
- Add test for public container with incompatible credentials scenario
- Simplified error handling in Azure and GCS bucket status methods by removing the `anon_only` flag and associated logic.
- Updated documentation for `bucket_status` to clarify the use of `client_config`.
- Removed redundant tests related to anonymous access checks for S3, GCS, and Azure.
- Adjusted CLI parser help messages for clarity on bucket URI formats.
```python
if account_name:
    try:
        url = f"https://{account_name}.blob.core.windows.net"
        anon_client = BlobServiceClient(account_url=url)
```
Just to double check - how we build URL - is it only bucket name?
Yes, if not we raise an error, as requested by dmitry in one of the comments.
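For reference, the account URL in the diff above is built from the storage account name only; the container and any path never enter it. A self-contained restatement of that construction:

```python
def account_url(account_name: str) -> str:
    # Mirrors the URL construction in the diff: only the storage account
    # name is interpolated; the container (bucket) name is not part of
    # the account URL.
    return f"https://{account_name}.blob.core.windows.net"
```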
```python
dest="account_name",
type=str,
default=None,
help="Azure storage account name (required for anonymous access detection).",
```
is it required for non anon access check also? in some cases I think key doesn't include account name
I would leave it as it is. In case the key doesn't include the account name, it raises an error saying the same.
We still have a lot of mocks here
I hope we can do most of the tests using mock FS implementations instead
It was not possible because I wanted to cover all edge cases, like all types of exceptions, which was only possible with mocks. They are parametrized as much as possible.
The most repeated mock is MagicMock(bucket_cmd="status", uri="s3://b", account_name=None), which just simulates the parsed CLI args object for easy testing of the command handlers.
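That repeated mock could be written once as a small helper; the attribute names follow the comment above and are assumptions about what the CLI handler reads:

```python
from unittest.mock import MagicMock

def make_args(uri="s3://b", account_name=None):
    # Simulates the parsed argparse namespace the bucket-status command
    # receives; only the attributes the handler is assumed to read are
    # set explicitly, and MagicMock absorbs everything else.
    return MagicMock(bucket_cmd="status", uri=uri, account_name=account_name)
```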
Summary
Closes #1715
Adds a datachain bucket status <uri> CLI command and bucket_status() Python API to check bucket existence and access level without listing objects. Supports S3, GCS, and Azure.

Behavior
Exits 0 if the bucket exists, 1 if not found.
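Scripting against that exit-code contract might look like the sketch below; bucket_status here is a local stand-in so the example runs without cloud access, where the real call would be datachain bucket status <uri>:

```shell
#!/bin/sh
# Stand-in for `datachain bucket status <uri>`: returns 0 for a bucket
# we pretend exists, 1 otherwise, mirroring the documented exit codes.
bucket_status() {
  case "$1" in
    s3://existing-bucket) return 0 ;;
    *) return 1 ;;
  esac
}

if bucket_status "s3://existing-bucket"; then
  echo "exists"
else
  echo "not found"
fi
```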
S3
GCS
Azure
Python API
Implementation notes
- … anonymous results when credentials are present in the environment
- google.api_core.exceptions.Forbidden (not mapped to PermissionError) is handled via a broad fallback handler
- BucketStatus is exported from the top-level datachain namespace

Testing