feat(cli): add datachain bucket status command #1717

Open
amritghimire wants to merge 14 commits into main from feat/bucket-status-cmd

Conversation

@amritghimire
Contributor

@amritghimire amritghimire commented Apr 8, 2026

Summary

Closes #1715

Adds datachain bucket status <uri> CLI command and bucket_status() Python API to check bucket existence and access level without listing objects. Supports S3, GCS, and Azure.

Behavior

Exits 0 if the bucket exists, 1 if not found.

S3

# Public bucket
$ datachain bucket status s3://datachain-public
Status: exists
Access: anonymous

# Private bucket, no credentials
$ datachain bucket status s3://amrit-datachain-test
Status: exists
Access: denied
Error: Access denied to S3 bucket 'amrit-datachain-test' — check AWS credentials/permissions

# Non-existent bucket
$ datachain bucket status s3://amrit-does-not-exist
Status: not found
Error: S3 bucket 'amrit-does-not-exist' not found

# With correct credentials 
$ datachain bucket status s3://amrit-datachain-test
Status: exists
Access: authenticated

GCS

# Public bucket
$ datachain bucket status gs://datachain-demo
Status: exists
Access: anonymous

# Private bucket, no credentials
$ datachain bucket status gs://thomas-datachain-test
Status: exists
Access: denied
Error: Access denied to GCS bucket 'thomas-datachain-test' — check credentials/permissions

# Non-existent bucket
$ datachain bucket status gs://amrit-does-not-exists
Status: not found
Error: GCS bucket 'amrit-does-not-exists' not found

# Private bucket, authenticated
$ datachain bucket status gs://amrit-datachain-test
Status: exists
Access: authenticated

# Private bucket, credentials present but insufficient
$ datachain bucket status gs://thomas-datachain-test
Status: exists
Access: denied
Error: Forbidden: b/thomas-datachain-test/o ...

Azure

# No creds, no account name
$ datachain bucket status az://thomas-datachain-test
Status: not found
Error: unable to connect to account for Must provide either a connection_string or account_name with credentials!!

# Account name, no creds
$ datachain bucket status az://thomas-datachain-test --account-name datachain
Status: exists
Access: denied
Error: Access denied to Azure container 'thomas-datachain-test' — check credentials/configuration

# Authenticated
$ datachain bucket status az://amrit-test-az
Status: exists
Access: authenticated

Python API

from datachain.client import bucket_status

status = bucket_status("s3://my-bucket/")
print(status.exists)  # True / False
print(status.access)  # 'anonymous' | 'authenticated' | 'denied'
print(status.error)   # error message or None
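
To make the three fields concrete, here is a minimal sketch of how a result could be rendered in the CLI format shown under Behavior and mapped to the documented exit code. `StatusSketch`, `render` and `exit_code` are hypothetical stand-ins written for illustration, not the PR's code; the real type is `BucketStatus`, and treating `access` as optional for a missing bucket is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StatusSketch:
    """Hypothetical stand-in with the same three fields as BucketStatus."""

    exists: bool
    access: Optional[str] = None  # 'anonymous' | 'authenticated' | 'denied'
    error: Optional[str] = None


def render(status: StatusSketch) -> str:
    """Format a status the way the CLI examples print it."""
    lines = [f"Status: {'exists' if status.exists else 'not found'}"]
    if status.exists and status.access:
        lines.append(f"Access: {status.access}")
    if status.error:
        lines.append(f"Error: {status.error}")
    return "\n".join(lines)


def exit_code(status: StatusSketch) -> int:
    """0 if the bucket exists (even when access is denied), 1 otherwise."""
    return 0 if status.exists else 1
```

Note that under this reading a denied bucket still exits 0, because the exit code tracks existence rather than access.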

Implementation notes

  • S3/GCS: probes anonymous access first, falls back to authenticated
  • Azure: probes authenticated first (DefaultAzureCredential), falls back to anonymous — avoids false anonymous results when credentials are present in the environment
  • GCS catches google.api_core.exceptions.Forbidden (not mapped to PermissionError) via a broad fallback handler
  • BucketStatus is exported from the top-level datachain namespace
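
The anonymous-first order described for S3/GCS can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: `probe_order` and the two probe callables are hypothetical, and the sketch assumes a probe signals a missing bucket with `FileNotFoundError` and a refused request with `PermissionError`.

```python
from typing import Callable, Optional, Tuple

# Hypothetical probe type: returns None on success, raises on failure.
Probe = Callable[[], None]


def probe_order(
    anon_probe: Probe, auth_probe: Probe
) -> Tuple[bool, Optional[str], Optional[str]]:
    """Return (exists, access, error), trying anonymous access first."""
    try:
        anon_probe()
        return True, "anonymous", None
    except FileNotFoundError as exc:
        return False, None, str(exc)
    except PermissionError:
        pass  # bucket exists but is not public; retry with credentials

    try:
        auth_probe()
        return True, "authenticated", None
    except FileNotFoundError as exc:
        return False, None, str(exc)
    except PermissionError as exc:
        # A refusal still proves the bucket exists.
        return True, "denied", str(exc)
```

Swapping the two `try` blocks would give an authenticated-first order like the one described for Azure.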

Testing

  • 17 unit tests covering all providers and access scenarios (mock-based)
  • 2 functional tests against a live S3-compatible server
  • 3 CLI parser tests

Add a new CLI command and public Python API to check bucket existence
and access level without listing objects. Supports S3, GCS, and Azure.

Closes #1715
Contributor

Copilot AI left a comment

Pull request overview

Adds a new bucket/container “status” capability across CLI and Python to detect existence and access level (anonymous/authenticated/denied) without listing objects, for S3/GCS/Azure.

Changes:

  • Introduces BucketStatus plus provider-specific bucket_status() implementations for S3, GCS, and Azure.
  • Adds Python API datachain.client.bucket_status(uri, **config) and exports it (and BucketStatus) from the top-level datachain package.
  • Adds CLI parsing + handler for datachain bucket status <uri> and corresponding unit/functional tests.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 7 comments.

Show a summary per file
| File | Description |
| --- | --- |
| tests/unit/test_client_bucket_status.py | Adds unit tests for provider-specific bucket status scenarios. |
| tests/unit/test_cli_parsing.py | Adds CLI parser tests for bucket status and --anon. |
| tests/func/test_bucket_status.py | Adds functional tests for S3 bucket status behavior. |
| src/datachain/client/s3.py | Implements S3 bucket_status and S3 kwargs normalization helper. |
| src/datachain/client/gcs.py | Implements GCS bucket_status with anon→auth probing. |
| src/datachain/client/azure.py | Implements Azure bucket_status with auth→anon probing. |
| src/datachain/client/fsspec.py | Adds BucketStatus type and base Client.bucket_status contract. |
| src/datachain/client/__init__.py | Adds bucket_status() API and exports BucketStatus. |
| src/datachain/cli/parser/__init__.py | Adds bucket status CLI subcommand and arguments. |
| src/datachain/cli/commands/bucket.py | Adds CLI command implementation for printing status + exit code. |
| src/datachain/cli/commands/__init__.py | Exports bucket_status_cmd. |
| src/datachain/cli/__init__.py | Wires new bucket command into CLI dispatcher. |
| src/datachain/__init__.py | Exports BucketStatus and bucket_status at top-level. |

Comment thread src/datachain/client/s3.py Outdated
Comment thread src/datachain/client/azure.py Outdated
Comment thread src/datachain/client/azure.py Outdated
Comment thread src/datachain/client/gcs.py Outdated
Comment thread tests/unit/test_client_bucket_status.py Outdated
Comment thread src/datachain/client/__init__.py Outdated
Comment thread src/datachain/client/gcs.py Outdated
@amritghimire amritghimire marked this pull request as draft April 8, 2026 09:51
@cloudflare-workers-and-pages

cloudflare-workers-and-pages Bot commented Apr 8, 2026

Deploying datachain with Cloudflare Pages

Latest commit: 3bc2e4d
Status: ✅  Deploy successful!
Preview URL: https://7a74fb0a.datachain-2g6.pages.dev
Branch Preview URL: https://feat-bucket-status-cmd.datachain-2g6.pages.dev

…nd error semantics

- S3: forward caller kwargs (endpoint_url, region, etc.) to anonymous probe
- GCS: use create_fs for anonymous probe so endpoint config is preserved
- GCS: narrow broad except Exception to google.api_core Forbidden/PermissionDenied
- Azure: forward account_name and connection kwargs to anonymous probe
- Azure: return exists=True (not False) when anonymous probe gets PermissionError
- Update test assertion for Azure no-account-name + denied scenario
- Fix bucket_status() docstring: clarify Azure behavior when account_name absent
@codecov

codecov Bot commented Apr 8, 2026

Codecov Report

❌ Patch coverage is 93.93939% with 8 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/datachain/cli/commands/bucket.py | 27.27% | 8 Missing ⚠️ |

@amritghimire amritghimire marked this pull request as ready for review April 8, 2026 12:56
@ilongin
Contributor

ilongin commented Apr 9, 2026

Is it just me, or does "Status: exists" sound a little bit strange? Should existence be one of the statuses? It feels like it should be communicated differently, e.g. with an error: the console could just show a message like "Bucket doesn't exist". On the other hand, when the bucket exists, this status doesn't need to be shown, since that will be clear by default.

Contributor

@dmpetrov dmpetrov left a comment

@amritghimire The path component in the URI is silently ignored right now - s3://my_bucket/my_dir/dir1 only checks my_bucket and ignores the dirs. It should give an error when a path is presented.

Why? There is a good chance this functionality will be extended to file/directory existence checks in the future (like test -e / test -f / test -d in Unix), and silently discarding the path will make this command backward incompatible.

Raise ValueError when a path component is present in the URI passed to
bucket_status(), preventing silent data loss and future
backward-incompatibility when directory/file existence checks are added.

Uses client_cls.split_url() to detect the path component.
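
A minimal sketch of the kind of check this commit describes, assuming a generic URL parser in place of the real helper. The commit uses `client_cls.split_url()`; the `reject_path` function, its use of `urllib.parse.urlsplit`, and the error wording below are hypothetical stand-ins.

```python
from urllib.parse import urlsplit


def reject_path(uri: str) -> str:
    """Return the bucket name, or raise ValueError if the URI has a path."""
    parts = urlsplit(uri)
    # A bare trailing slash ("s3://bucket/") is not treated as a path.
    if parts.path.strip("/"):
        raise ValueError(f"URI must name a bucket only, got a path: {uri!r}")
    return parts.netloc
```

Failing loudly here keeps room for a later extension in which the path becomes meaningful.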
Contributor

@dmpetrov dmpetrov left a comment

LG but I didn't get deeper in the code

Comment thread src/datachain/client/__init__.py Outdated
Comment thread tests/unit/test_client_bucket_status.py Outdated
Comment thread tests/func/test_bucket_status.py Outdated
Comment thread tests/unit/test_client_bucket_status.py Outdated
Comment thread src/datachain/client/azure.py
Comment thread src/datachain/client/azure.py
Comment thread src/datachain/client/azure.py Outdated
Comment thread src/datachain/client/fsspec.py
amritghimire and others added 4 commits April 11, 2026 09:00
- Change ValueError message to "path in a bucket is not allowed" (dmpetrov)
- Azure: try anonymous probe first, consistent with S3/GCS (shcheklein)
- Azure: use BlobServiceClient directly for anon probe to prevent adlfs
  from picking up credentials via AZURE_STORAGE_CONNECTION_STRING env var
- Azure: catch ClientAuthenticationError (HTTP 401) alongside PermissionError
- Add --account-name CLI flag for Azure anonymous access detection
- Extend func tests to cover gs and azure, not just s3 (shcheklein)
- Remove redundant S3 unit tests covered by func tests (shcheklein)
- Remove section-divider comment style in tests (shcheklein)
- Add test for public container with incompatible credentials scenario
@amritghimire amritghimire requested a review from shcheklein April 15, 2026 09:20
Comment thread src/datachain/client/__init__.py Outdated
Comment thread src/datachain/client/azure.py
Comment thread src/datachain/client/azure.py Outdated
Comment thread src/datachain/client/gcs.py Outdated
Comment thread tests/unit/test_client_bucket_status.py Outdated
- Simplified error handling in Azure and GCS bucket status methods by removing the `anon_only` flag and associated logic.
- Updated documentation for `bucket_status` to clarify the use of `client_config`.
- Removed redundant tests related to anonymous access checks for S3, GCS, and Azure.
- Adjusted CLI parser help messages for clarity on bucket URI formats.
@amritghimire amritghimire requested a review from shcheklein April 24, 2026 13:13
if account_name:
    try:
        url = f"https://{account_name}.blob.core.windows.net"
        anon_client = BlobServiceClient(account_url=url)
Contributor

Just to double-check how we build the URL: is it only the bucket name?

Contributor Author

Yes; if not, we raise an error, as requested by Dmitry in one of the comments.

dest="account_name",
type=str,
default=None,
help="Azure storage account name (required for anonymous access detection).",
Contributor

Is it also required for the non-anonymous access check? In some cases, I think, the key doesn't include the account name.

Contributor Author

I would leave it as it is. If the key doesn't include the account name, it raises an error saying so.

Contributor

We still have a lot of mocks here

I hope we can do most of the tests using mock FS implementations instead

Contributor Author

It was not possible because I wanted to cover all edge cases, such as every exception type, which was only possible with mocks. The tests are parametrized as much as possible.
The most frequently repeated mock is MagicMock(bucket_cmd="status", uri="s3://b", account_name=None), which just simulates the parsed args so that CLI arguments are easy to test.

@amritghimire amritghimire requested a review from shcheklein April 27, 2026 02:30

Development

Successfully merging this pull request may close these issues.

Add fast bucket status check

5 participants