feat(cli): add datachain bucket status command #1717

amritghimire wants to merge 14 commits into main
Conversation
Add a new CLI command and public Python API to check bucket existence and access level without listing objects. Supports S3, GCS, and Azure. Closes #1715
Pull request overview
Adds a new bucket/container “status” capability across CLI and Python to detect existence and access level (anonymous/authenticated/denied) without listing objects, for S3/GCS/Azure.
Changes:
- Introduces BucketStatus plus provider-specific bucket_status() implementations for S3, GCS, and Azure.
- Adds Python API datachain.client.bucket_status(uri, **config) and exports it (and BucketStatus) from the top-level datachain package.
- Adds CLI parsing + handler for datachain bucket status <uri> and corresponding unit/functional tests.
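For illustration, the shape of such a status result might be sketched like this (field names here are assumptions for illustration, not the actual datachain API):

```python
from dataclasses import dataclass
from typing import Literal, Optional

# Hypothetical sketch of a bucket-status result; the real BucketStatus
# in datachain may differ in fields and naming.
@dataclass
class BucketStatusSketch:
    exists: bool
    access: Optional[Literal["anonymous", "authenticated", "denied"]]

def describe(status: BucketStatusSketch) -> str:
    """Render a status roughly the way a CLI might print it."""
    if not status.exists:
        return "not found"
    return f"exists ({status.access})"

print(describe(BucketStatusSketch(exists=True, access="anonymous")))
```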
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| tests/unit/test_client_bucket_status.py | Adds unit tests for provider-specific bucket status scenarios. |
| tests/unit/test_cli_parsing.py | Adds CLI parser tests for bucket status and --anon. |
| tests/func/test_bucket_status.py | Adds functional tests for S3 bucket status behavior. |
| src/datachain/client/s3.py | Implements S3 bucket_status and S3 kwargs normalization helper. |
| src/datachain/client/gcs.py | Implements GCS bucket_status with anon→auth probing. |
| src/datachain/client/azure.py | Implements Azure bucket_status with auth→anon probing. |
| src/datachain/client/fsspec.py | Adds BucketStatus type and base Client.bucket_status contract. |
| src/datachain/client/__init__.py | Adds bucket_status() API and exports BucketStatus. |
| src/datachain/cli/parser/__init__.py | Adds bucket status CLI subcommand and arguments. |
| src/datachain/cli/commands/bucket.py | Adds CLI command implementation for printing status + exit code. |
| src/datachain/cli/commands/__init__.py | Exports bucket_status_cmd. |
| src/datachain/cli/__init__.py | Wires new bucket command into CLI dispatcher. |
| src/datachain/__init__.py | Exports BucketStatus and bucket_status at top-level. |
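The anon-then-auth probing order described for GCS (and later adopted for Azure) can be sketched generically; the probe callables and the built-in exceptions below are illustrative stand-ins for the real provider calls, not the PR's implementation:

```python
# Sketch of the two-step probe order: try anonymous access first, then
# fall back to authenticated access. FileNotFoundError / PermissionError
# stand in for provider-specific "missing" / "forbidden" errors.
def probe_access(anon_probe, auth_probe):
    try:
        anon_probe()
        return {"exists": True, "access": "anonymous"}
    except FileNotFoundError:
        return {"exists": False, "access": None}
    except PermissionError:
        pass  # bucket exists, but anonymous access is denied
    try:
        auth_probe()
        return {"exists": True, "access": "authenticated"}
    except PermissionError:
        return {"exists": True, "access": "denied"}
```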
Deploying datachain with Cloudflare Pages

| Latest commit: | 3bc2e4d |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://7a74fb0a.datachain-2g6.pages.dev |
| Branch Preview URL: | https://feat-bucket-status-cmd.datachain-2g6.pages.dev |
…nd error semantics

- S3: forward caller kwargs (endpoint_url, region, etc.) to anonymous probe
- GCS: use create_fs for anonymous probe so endpoint config is preserved
- GCS: narrow broad except Exception to google.api_core Forbidden/PermissionDenied
- Azure: forward account_name and connection kwargs to anonymous probe
- Azure: return exists=True (not False) when anonymous probe gets PermissionError
- Update test assertion for Azure no-account-name + denied scenario
- Fix bucket_status() docstring: clarify Azure behavior when account_name absent
Codecov Report: ❌ Patch coverage is
is it just me or
@amritghimire The path component in the URI is silently ignored right now - s3://my_bucket/my_dir/dir1 only checks my_bucket and ignores the dirs. It should give an error when a path is presented.
Why? There is a good chance this functionality will be extended to file/directory existence checks in the future (like test -e / test -f / test -d in Unix), and silently discarding the path would make this command backward-incompatible.
Raise ValueError when a path component is present in the URI passed to bucket_status(), preventing silent data loss and future backward-incompatibility when directory/file existence checks are added. Uses client_cls.split_url() to detect the path component.
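A minimal, self-contained sketch of that validation, using urllib.parse as a stand-in for the real client_cls.split_url() helper:

```python
from urllib.parse import urlparse

def split_url(uri: str) -> tuple[str, str]:
    # Stand-in for client_cls.split_url(); the real helper lives on the
    # provider Client classes in datachain.
    parsed = urlparse(uri)
    return parsed.netloc, parsed.path.strip("/")

def validate_bucket_uri(uri: str) -> str:
    # Reject URIs that carry a path component, so only the bare bucket
    # (or container) is accepted.
    bucket, path = split_url(uri)
    if path:
        raise ValueError("path in a bucket is not allowed")
    return bucket
```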
dmpetrov left a comment
LG but I didn't get deeper in the code
- Change ValueError message to "path in a bucket is not allowed" (dmpetrov)
- Azure: try anonymous probe first, consistent with S3/GCS (shcheklein)
- Azure: use BlobServiceClient directly for anon probe to prevent adlfs from picking up credentials via AZURE_STORAGE_CONNECTION_STRING env var
- Azure: catch ClientAuthenticationError (HTTP 401) alongside PermissionError
- Add --account-name CLI flag for Azure anonymous access detection
- Extend func tests to cover gs and azure, not just s3 (shcheklein)
- Remove redundant S3 unit tests covered by func tests (shcheklein)
- Remove section-divider comment style in tests (shcheklein)
- Add test for public container with incompatible credentials scenario
- Simplified error handling in Azure and GCS bucket status methods by removing the `anon_only` flag and associated logic.
- Updated documentation for `bucket_status` to clarify the use of `client_config`.
- Removed redundant tests related to anonymous access checks for S3, GCS, and Azure.
- Adjusted CLI parser help messages for clarity on bucket URI formats.
```python
if account_name:
    try:
        url = f"https://{account_name}.blob.core.windows.net"
        anon_client = BlobServiceClient(account_url=url)
```
Just to double check - how we build URL - is it only bucket name?
Yes, if not we raise an error, as requested by dmitry in one of the comments.
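For reference, the account URL in the diff above is built from the storage account name only; the container and any path never enter it. A self-contained restatement of that construction:

```python
def account_url(account_name: str) -> str:
    # Mirrors the URL construction in the diff: only the storage account
    # name is interpolated; the container (bucket) name is not part of
    # the account URL.
    return f"https://{account_name}.blob.core.windows.net"
```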
```python
dest="account_name",
type=str,
default=None,
help="Azure storage account name (required for anonymous access detection).",
```
is it required for non anon access check also? in some cases I think key doesn't include account name
I would leave it as it is. In case the key doesn't include the account name, it raises an error saying the same.
We still have a lot of mocks here
I hope we can do most of the tests using mock FS implementations instead
It was not possible because I wanted to cover all edge cases, like all types of exceptions, which was only possible with mocks. They are parametrized as much as possible.
The most repeated mock is MagicMock(bucket_cmd="status", uri="s3://b", account_name=None), which just simulates the parsed CLI args object for easy testing of the command handlers.
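That repeated mock could be written once as a small helper; the attribute names follow the comment above and are assumptions about what the CLI handler reads:

```python
from unittest.mock import MagicMock

def make_args(uri="s3://b", account_name=None):
    # Simulates the parsed argparse namespace the bucket-status command
    # receives; only the attributes the handler is assumed to read are
    # set explicitly, and MagicMock absorbs everything else.
    return MagicMock(bucket_cmd="status", uri=uri, account_name=account_name)
```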
Summary
Closes #1715
Adds a datachain bucket status <uri> CLI command and bucket_status() Python API to check bucket existence and access level without listing objects. Supports S3, GCS, and Azure.

Behavior
Exits 0 if the bucket exists, 1 if not found.
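Scripting against that exit-code contract might look like the sketch below; bucket_status here is a local stand-in so the example runs without cloud access, where the real call would be datachain bucket status <uri>:

```shell
#!/bin/sh
# Stand-in for `datachain bucket status <uri>`: returns 0 for a bucket
# we pretend exists, 1 otherwise, mirroring the documented exit codes.
bucket_status() {
  case "$1" in
    s3://existing-bucket) return 0 ;;
    *) return 1 ;;
  esac
}

if bucket_status "s3://existing-bucket"; then
  echo "exists"
else
  echo "not found"
fi
```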
S3
GCS
Azure
Python API
Implementation notes
- … anonymous results when credentials are present in the environment
- google.api_core.exceptions.Forbidden (not mapped to PermissionError) is handled via a broad fallback handler
- BucketStatus is exported from the top-level datachain namespace

Testing