
feat(infrastructure): add optional ADLS Gen2 data lake storage account#398

Open
jjottar wants to merge 3 commits into microsoft:main from jjottar:feat/adls-gen2-storage

Conversation

Contributor

@jjottar jjottar commented Apr 7, 2026

Pull Request

Description

Add an optional dedicated ADLS Gen2 storage account with hierarchical namespace (HNS) for domain data (datasets, model checkpoints), separate from the existing AzureML workspace storage. The data lake is gated behind should_create_data_lake_storage (default: false) and follows existing patterns for naming, networking, RBAC, and lifecycle policies.
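
The opt-in can be sketched as a single line in terraform.tfvars (the variable name comes from this PR; everything else stays at its defaults):

```hcl
# Opt in to the dedicated ADLS Gen2 data lake (defaults to false)
should_create_data_lake_storage = true
```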

Closes #385

Type of Change

  - [ ] 🐛 Bug fix (non-breaking change fixing an issue)
  - [x] ✨ New feature (non-breaking change adding functionality)
  - [ ] 💥 Breaking change (fix or feature causing existing functionality to change)
  - [ ] 📚 Documentation update
  - [x] 🏗️ Infrastructure change (Terraform/IaC)
  - [ ] ♻️ Refactoring (no functional changes)

Component(s) Affected

  - [ ] infrastructure/terraform/prerequisites/ - Azure subscription setup
  - [x] infrastructure/terraform/ - Terraform infrastructure
  - [ ] infrastructure/setup/ - OSMO control plane / Helm
  - [ ] workflows/ - Training and evaluation workflows
  - [ ] training/ - Training pipelines and scripts
  - [x] docs/ - Documentation

Testing Performed

  - [x] Terraform plan reviewed (no unexpected changes)
  - [x] Terraform apply tested in dev environment
  - [ ] Training scripts tested locally with Isaac Sim
  - [ ] OSMO workflow submitted successfully
  - [ ] Smoke tests passed (smoke_test_azure.py)

Terraform Plan (with should_create_data_lake_storage = true)

Plan: 11 to add, 2 to change, 1 to destroy.

| Action | Resource |
| --- | --- |
| create | module.platform.azurerm_storage_account.data_lake[0] |
| create | module.platform.azurerm_storage_container.datasets[0] |
| create | module.platform.azurerm_storage_container.models[0] |
| create | module.platform.azurerm_storage_management_policy.data_lake[0] |
| create | module.platform.azurerm_private_dns_zone.core["storage_dfs"] |
| create | module.platform.azurerm_private_dns_zone_virtual_network_link.core["storage_dfs"] |
| create | module.platform.azurerm_private_endpoint.data_lake_blob[0] |
| create | module.platform.azurerm_private_endpoint.data_lake_dfs[0] |
| create | module.platform.azurerm_role_assignment.user_data_lake_blob[0] |
| create | module.platform.azurerm_role_assignment.ml_data_lake_blob[0] |
| create | module.platform.azurerm_role_assignment.osmo_data_lake_blob[0] |
| update | module.platform.azurerm_key_vault.main (in-place, pre-existing drift) |
| update | module.platform.azurerm_storage_account.main (in-place, pre-existing drift) |
| destroy | module.platform.azurerm_storage_management_policy.main |

The 2 in-place updates and the destroy are expected:

  • Updates: Pre-existing drift on Key Vault and ML storage account — not caused by this feature.
  • Destroy: ML storage lifecycle policy transitions to conditional (count = 0 when data lake is enabled). Existing deployments without the data lake retain their lifecycle rules.

Terraform Apply

All 11 data lake resources created successfully in rg-roboticsch-dev-001 (switzerlandnorth):

| Resource | Name / ID |
| --- | --- |
| Storage Account | stdlroboticschdev001 (HNS enabled) |
| Container: datasets | datasets (private) |
| Container: models | models (private) |
| Lifecycle Policy | 3 rules (raw bags delete, datasets cool, reports cool→archive) |
| DNS Zone | privatelink.dfs.core.windows.net |
| PE: blob | pe-datalake-blob-roboticsch-dev-001 |
| PE: dfs | pe-datalake-dfs-roboticsch-dev-001 |
| RBAC: current user | Storage Blob Data Contributor on stdl* |
| RBAC: ML identity | Storage Blob Data Contributor on stdl* |
| RBAC: OSMO identity | Storage Blob Data Contributor on stdl* |

Terraform Test

Total: 160 passed, 0 failed, 0 errors

Lint & Validation

| Check | Result |
| --- | --- |
| `npm run lint:tf` | 0 issues |
| `npm run lint:tf:validate` | All directories passed |
| `npm run spell-check` | 0 issues |
| `npm run lint:md` | 0 errors |

What Changed

Platform Module (infrastructure/terraform/modules/platform/)

  • storage.tf — New azurerm_storage_account.data_lake with is_hns_enabled = true, datasets and models containers, data lake lifecycle policy, blob and DFS private endpoints. ML storage lifecycle policy gated with count = var.should_create_data_lake_storage ? 0 : 1 to avoid regression for existing deployments.
  • main.tf — Added storage_dfs = "privatelink.dfs.core.windows.net" to base_dns_zones (7 base zones, up from 6).
  • variables.tf — New should_create_data_lake_storage variable (bool, default false).
  • role-assignments.tf — Added Storage Blob Data Contributor on data lake for current user, ML identity, and OSMO identity. All gated on the data lake flag.
  • outputs.tf — New data_lake_storage_account and data_lake_storage_account_access outputs (null when disabled).
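
As a sketch, the gating pattern described above looks roughly like the following (attribute values and the naming input are illustrative, not the exact contents of storage.tf):

```hcl
# Data lake account, created only when the flag is on
resource "azurerm_storage_account" "data_lake" {
  count                    = var.should_create_data_lake_storage ? 1 : 0
  name                     = "stdl${var.name_suffix}" # hypothetical naming input
  resource_group_name      = var.resource_group_name
  location                 = var.location
  account_tier             = "Standard"
  account_replication_type = "LRS"
  is_hns_enabled           = true # ADLS Gen2 hierarchical namespace
}

# ML storage lifecycle policy, retained only while the data lake is off
resource "azurerm_storage_management_policy" "main" {
  count              = var.should_create_data_lake_storage ? 0 : 1
  storage_account_id = azurerm_storage_account.main.id
  # existing rules unchanged
}
```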

Dataviewer Module (infrastructure/terraform/modules/dataviewer/)

  • variables.deps.tf — New optional data_lake_storage_account input (nullable).
  • role-assignments.tf — Conditional Storage Blob Data Contributor on data lake for dataviewer identity.
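
A minimal sketch of the nullable dependency wiring (the object shape and the dataviewer principal input are assumptions, not the module's exact interface):

```hcl
variable "data_lake_storage_account" {
  description = "Optional data lake account; null when the feature is disabled."
  type        = object({ id = string })
  default     = null
}

resource "azurerm_role_assignment" "dataviewer_data_lake_blob" {
  count                = var.data_lake_storage_account == null ? 0 : 1
  scope                = var.data_lake_storage_account.id
  role_definition_name = "Storage Blob Data Contributor"
  principal_id         = var.dataviewer_principal_id # hypothetical identity input
}
```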

Root Module (infrastructure/terraform/)

  • variables.tf — New should_create_data_lake_storage root variable.
  • main.tf — Pass should_create_data_lake_storage to platform module.
  • outputs.tf — New data_lake_storage_account root output.
  • terraform.tfvars.example — Added should_create_data_lake_storage with documentation.
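
The root-module wiring reduces to a pass-through, sketched here with the surrounding arguments omitted:

```hcl
module "platform" {
  source = "./modules/platform"
  # ... other arguments unchanged ...
  should_create_data_lake_storage = var.should_create_data_lake_storage
}

output "data_lake_storage_account" {
  value = module.platform.data_lake_storage_account # null when disabled
}
```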

Tests (infrastructure/terraform/modules/platform/tests/)

  • dns-zones.tftest.hcl — Updated zone counts (6→7 base zones).
  • security.tftest.hcl — Added data_lake_security and data_lake_disabled_by_default test runs.
  • conditionals.tftest.hcl — Added data_lake_enabled and data_lake_disabled test runs.
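
For illustration, the enabled/disabled test runs above might take roughly this shape in terraform test syntax (the assertion expressions are assumptions, not the actual test files):

```hcl
run "data_lake_disabled_by_default" {
  command = plan

  assert {
    condition     = length(azurerm_storage_account.data_lake) == 0
    error_message = "Data lake must not be created by default."
  }
}

run "data_lake_enabled" {
  command = plan

  variables {
    should_create_data_lake_storage = true
  }

  assert {
    condition     = azurerm_storage_account.data_lake[0].is_hns_enabled
    error_message = "Data lake account must have HNS enabled."
  }
}
```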

Documentation

  • docs/cloud/blob-storage-structure.md — Rewritten for two-account architecture: ML workspace storage vs data lake storage, new container/folder structure, updated lifecycle policy references.
  • .cspell/general-technical.txt — Added stdl (data lake naming prefix).

Documentation Impact

  • Documentation updated in this PR

Checklist

- add data lake storage account with HNS behind should_create_data_lake_storage flag
- add datasets and models containers with lifecycle policies
- add storage_dfs private DNS zone and data lake private endpoints
- add Storage Blob Data Contributor role assignments for ML, OSMO, user, and dataviewer identities
- update blob storage architecture docs for two-account layout

🗄️ - Generated by Copilot
@codecov-commenter

codecov-commenter commented Apr 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 50.46%. Comparing base (acfe20d) to head (44ad760).
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #398   +/-   ##
=======================================
  Coverage   50.46%   50.46%           
=======================================
  Files         267      267           
  Lines       18098    18098           
  Branches     1855     1855           
=======================================
  Hits         9134     9134           
  Misses       8674     8674           
  Partials      290      290           
| Flag | Coverage Δ | *Carryforward flag |
| --- | --- | --- |
| pester | 81.96% <ø> (ø) | |
| pytest | 6.89% <ø> (ø) | Carriedforward from b999299 |
| pytest-dataviewer | 61.97% <ø> (ø) | |
| vitest | 50.72% <ø> (ø) | |

*This pull request uses carry forward flags. Click here to find out more.


@jjottar jjottar marked this pull request as ready for review April 7, 2026 08:59
@jjottar jjottar requested a review from a team as a code owner April 7, 2026 08:59
Contributor

@katriendg katriendg left a comment


Thank you @jjottar for this contribution!

I've left one comment in the review, and two small requests:

  1. We have added Terraform docs generation (not yet documented, so not something you could have known about): could you run `npm run docs:generate:tf` locally to update the TERRAFORM.md file(s) before we merge?
  2. We typically document variables in infrastructure/terraform/terraform.tfvars.example; could you add the new one there as well?


Review thread on storage.tf:

```hcl
resource "azurerm_storage_management_policy" "main" {
  storage_account_id = azurerm_storage_account.main.id
resource "azurerm_storage_account" "data_lake" {
```
Contributor


Lifecycle policy regression — should_create_data_lake_storage = false silently removes cost controls

The removal of azurerm_storage_management_policy.main introduces a silent regression for all existing deployments that are not simultaneously opting in to the data lake.

What breaks:

On the current main branch, the three lifecycle rules (raw/ delete, converted/ cool tier, reports/ cool→archive) live on azurerm_storage_account.main. This PR moves them exclusively to azurerm_storage_management_policy.data_lake, which is gated by should_create_data_lake_storage. With should_create_data_lake_storage = false (the default), the next terraform apply on any existing deployment will:

  1. Destroy azurerm_storage_management_policy.main — raw bags will accumulate with no automated deletion, converted datasets stay on Hot tier indefinitely.
  2. Silently no-op should_enable_raw_bags_lifecycle_policy = true and the other two lifecycle variables — they wire into a data lake resource that doesn't exist, so they have no effect and raise no error.

The PR's own type-of-change checklist marks this as a non-breaking new feature, which is inconsistent with this destroy.

Recommended fix: independent policies per account

The cleanest model is two independent lifecycle policies, each managing its own concern, with no coupling between them. To avoid the regression for existing deployments during the migration window, add a fallback policy on the ML storage account that is active only when the data lake is disabled:

```hcl
resource "azurerm_storage_management_policy" "main" {
  count              = var.should_create_data_lake_storage ? 0 : 1
  storage_account_id = azurerm_storage_account.main.id
  # same rules as before
  ...
}
```

This is zero-regression: existing deployments keep their lifecycle policy until they explicitly enable the data lake. At that point Terraform destroys the ML policy and creates the data lake policy in the same apply — an intentional, visible transition. The three should_enable_*_lifecycle_policy variables remain meaningful in both states.

Alternatively, if the intent is that no one should be writing raw/, converted/, or reports/ blobs to the ML storage account anymore (architecturally correct), then the regression is acceptable — but should_enable_raw_bags_lifecycle_policy and its siblings should produce a precondition error when set to true without should_create_data_lake_storage = true, so the break is explicit rather than silent.
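
The explicit-failure alternative can be sketched with a cross-variable validation (cross-variable references in `validation` blocks require Terraform 1.9+; on older versions a lifecycle precondition on the policy resource achieves the same effect):

```hcl
variable "should_enable_raw_bags_lifecycle_policy" {
  type    = bool
  default = false

  validation {
    condition     = !var.should_enable_raw_bags_lifecycle_policy || var.should_create_data_lake_storage
    error_message = "should_enable_raw_bags_lifecycle_policy requires should_create_data_lake_storage = true."
  }
}
```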

Contributor Author

@jjottar jjottar Apr 7, 2026


Thanks a lot @katriendg for the review, the on-point catch, and the suggestions!

All three items addressed (and PR description updated accordingly):

  1. Lifecycle policy regression — Fixed in commit 837bbd8. azurerm_storage_management_policy.main is now retained with count = var.should_create_data_lake_storage ? 0 : 1, so existing deployments keep their lifecycle rules until the data lake is explicitly enabled. Zero-regression path as you described.

  2. terraform.tfvars.example — Added should_create_data_lake_storage in the same commit.

  3. TERRAFORM.md regeneration — Done in a separate commit (44ad760). Note: npm run docs:generate:tf regenerated all 9 directories, not just the 3 we modified. The extra 6 files are pre-existing formatting normalization from the new pipeline (PR feat(build): add terraform-docs generation pipeline #378). I kept them in a standalone commit so you can assess whether to include them or squash/drop that commit if you'd prefer to handle the bulk regeneration separately.

Juan Jottar added 2 commits April 7, 2026 17:24
- add conditional azurerm_storage_management_policy.main (active when data lake off)
- add should_create_data_lake_storage to terraform.tfvars.example


Development

Successfully merging this pull request may close these issues.

feat(infra): Add dedicated ADLS Gen2 storage account

3 participants