
feat(infrastructure): add optional ADLS Gen2 data lake storage account#398

Open
jjottar wants to merge 3 commits into microsoft:main from jjottar:feat/adls-gen2-storage

Conversation

Contributor

@jjottar jjottar commented Apr 7, 2026

Pull Request

Description

Add an optional dedicated ADLS Gen2 storage account with hierarchical namespace (HNS) for domain data (datasets, model checkpoints), separate from the existing AzureML workspace storage. The data lake is gated behind should_create_data_lake_storage (default: false) and follows existing patterns for naming, networking, RBAC, and lifecycle policies.
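
The opt-in can be sketched as a single line in terraform.tfvars (the variable name comes from this PR; everything else stays at its defaults):

```hcl
# Opt in to the dedicated ADLS Gen2 data lake (defaults to false)
should_create_data_lake_storage = true
```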

Closes #385

Type of Change

  - [ ] 🐛 Bug fix (non-breaking change fixing an issue)
  - [x] ✨ New feature (non-breaking change adding functionality)
  - [ ] 💥 Breaking change (fix or feature causing existing functionality to change)
  - [ ] 📚 Documentation update
  - [x] 🏗️ Infrastructure change (Terraform/IaC)
  - [ ] ♻️ Refactoring (no functional changes)

Component(s) Affected

  - [ ] infrastructure/terraform/prerequisites/ - Azure subscription setup
  - [x] infrastructure/terraform/ - Terraform infrastructure
  - [ ] infrastructure/setup/ - OSMO control plane / Helm
  - [ ] workflows/ - Training and evaluation workflows
  - [ ] training/ - Training pipelines and scripts
  - [x] docs/ - Documentation

Testing Performed

  - [x] Terraform plan reviewed (no unexpected changes)
  - [x] Terraform apply tested in dev environment
  - [ ] Training scripts tested locally with Isaac Sim
  - [ ] OSMO workflow submitted successfully
  - [ ] Smoke tests passed (smoke_test_azure.py)

Terraform Plan (with should_create_data_lake_storage = true)

Plan: 11 to add, 2 to change, 1 to destroy.

| Action | Resource |
| --- | --- |
| create | module.platform.azurerm_storage_account.data_lake[0] |
| create | module.platform.azurerm_storage_container.datasets[0] |
| create | module.platform.azurerm_storage_container.models[0] |
| create | module.platform.azurerm_storage_management_policy.data_lake[0] |
| create | module.platform.azurerm_private_dns_zone.core["storage_dfs"] |
| create | module.platform.azurerm_private_dns_zone_virtual_network_link.core["storage_dfs"] |
| create | module.platform.azurerm_private_endpoint.data_lake_blob[0] |
| create | module.platform.azurerm_private_endpoint.data_lake_dfs[0] |
| create | module.platform.azurerm_role_assignment.user_data_lake_blob[0] |
| create | module.platform.azurerm_role_assignment.ml_data_lake_blob[0] |
| create | module.platform.azurerm_role_assignment.osmo_data_lake_blob[0] |
| update | module.platform.azurerm_key_vault.main (in-place, pre-existing drift) |
| update | module.platform.azurerm_storage_account.main (in-place, pre-existing drift) |
| destroy | module.platform.azurerm_storage_management_policy.main |

The 2 in-place updates and the destroy are expected:

  • Updates: Pre-existing drift on Key Vault and ML storage account — not caused by this feature.
  • Destroy: ML storage lifecycle policy transitions to conditional (count = 0 when data lake is enabled). Existing deployments without the data lake retain their lifecycle rules.

Terraform Apply

All 11 data lake resources created successfully in rg-roboticsch-dev-001 (switzerlandnorth):

| Resource | Name / ID |
| --- | --- |
| Storage Account | stdlroboticschdev001 (HNS enabled) |
| Container: datasets | datasets (private) |
| Container: models | models (private) |
| Lifecycle Policy | 3 rules (raw bags delete, datasets cool, reports cool→archive) |
| DNS Zone | privatelink.dfs.core.windows.net |
| PE: blob | pe-datalake-blob-roboticsch-dev-001 |
| PE: dfs | pe-datalake-dfs-roboticsch-dev-001 |
| RBAC: current user | Storage Blob Data Contributor on stdl* |
| RBAC: ML identity | Storage Blob Data Contributor on stdl* |
| RBAC: OSMO identity | Storage Blob Data Contributor on stdl* |

Terraform Test

Total: 160 passed, 0 failed, 0 errors

Lint & Validation

| Check | Result |
| --- | --- |
| `npm run lint:tf` | 0 issues |
| `npm run lint:tf:validate` | All directories passed |
| `npm run spell-check` | 0 issues |
| `npm run lint:md` | 0 errors |

What Changed

Platform Module (infrastructure/terraform/modules/platform/)

  • storage.tf — New azurerm_storage_account.data_lake with is_hns_enabled = true, datasets and models containers, data lake lifecycle policy, blob and DFS private endpoints. ML storage lifecycle policy gated with count = var.should_create_data_lake_storage ? 0 : 1 to avoid regression for existing deployments.
  • main.tf — Added storage_dfs = "privatelink.dfs.core.windows.net" to base_dns_zones (7 base zones, up from 6).
  • variables.tf — New should_create_data_lake_storage variable (bool, default false).
  • role-assignments.tf — Added Storage Blob Data Contributor on data lake for current user, ML identity, and OSMO identity. All gated on the data lake flag.
  • outputs.tf — New data_lake_storage_account and data_lake_storage_account_access outputs (null when disabled).
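
As a sketch, the gating pattern described above looks roughly like the following (attribute values and the naming input are illustrative, not the exact contents of storage.tf):

```hcl
# Data lake account, created only when the flag is on
resource "azurerm_storage_account" "data_lake" {
  count                    = var.should_create_data_lake_storage ? 1 : 0
  name                     = "stdl${var.name_suffix}" # hypothetical naming input
  resource_group_name      = var.resource_group_name
  location                 = var.location
  account_tier             = "Standard"
  account_replication_type = "LRS"
  is_hns_enabled           = true # ADLS Gen2 hierarchical namespace
}

# ML storage lifecycle policy, retained only while the data lake is off
resource "azurerm_storage_management_policy" "main" {
  count              = var.should_create_data_lake_storage ? 0 : 1
  storage_account_id = azurerm_storage_account.main.id
  # existing rules unchanged
}
```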

Dataviewer Module (infrastructure/terraform/modules/dataviewer/)

  • variables.deps.tf — New optional data_lake_storage_account input (nullable).
  • role-assignments.tf — Conditional Storage Blob Data Contributor on data lake for dataviewer identity.
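
A minimal sketch of the nullable dependency wiring (the object shape and the dataviewer principal input are assumptions, not the module's exact interface):

```hcl
variable "data_lake_storage_account" {
  description = "Optional data lake account; null when the feature is disabled."
  type        = object({ id = string })
  default     = null
}

resource "azurerm_role_assignment" "dataviewer_data_lake_blob" {
  count                = var.data_lake_storage_account == null ? 0 : 1
  scope                = var.data_lake_storage_account.id
  role_definition_name = "Storage Blob Data Contributor"
  principal_id         = var.dataviewer_principal_id # hypothetical identity input
}
```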

Root Module (infrastructure/terraform/)

  • variables.tf — New should_create_data_lake_storage root variable.
  • main.tf — Pass should_create_data_lake_storage to platform module.
  • outputs.tf — New data_lake_storage_account root output.
  • terraform.tfvars.example — Added should_create_data_lake_storage with documentation.
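
The root-module wiring reduces to a pass-through, sketched here with the surrounding arguments omitted:

```hcl
module "platform" {
  source = "./modules/platform"
  # ... other arguments unchanged ...
  should_create_data_lake_storage = var.should_create_data_lake_storage
}

output "data_lake_storage_account" {
  value = module.platform.data_lake_storage_account # null when disabled
}
```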

Tests (infrastructure/terraform/modules/platform/tests/)

  • dns-zones.tftest.hcl — Updated zone counts (6→7 base zones).
  • security.tftest.hcl — Added data_lake_security and data_lake_disabled_by_default test runs.
  • conditionals.tftest.hcl — Added data_lake_enabled and data_lake_disabled test runs.
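
For illustration, the enabled/disabled test runs above might take roughly this shape in terraform test syntax (the assertion expressions are assumptions, not the actual test files):

```hcl
run "data_lake_disabled_by_default" {
  command = plan

  assert {
    condition     = length(azurerm_storage_account.data_lake) == 0
    error_message = "Data lake must not be created by default."
  }
}

run "data_lake_enabled" {
  command = plan

  variables {
    should_create_data_lake_storage = true
  }

  assert {
    condition     = azurerm_storage_account.data_lake[0].is_hns_enabled
    error_message = "Data lake account must have HNS enabled."
  }
}
```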

Documentation

  • docs/cloud/blob-storage-structure.md — Rewritten for two-account architecture: ML workspace storage vs data lake storage, new container/folder structure, updated lifecycle policy references.
  • .cspell/general-technical.txt — Added stdl (data lake naming prefix).

Documentation Impact

  • Documentation updated in this PR

Checklist

- add data lake storage account with HNS behind should_create_data_lake_storage flag
- add datasets and models containers with lifecycle policies
- add storage_dfs private DNS zone and data lake private endpoints
- add Storage Blob Data Contributor role assignments for ML, OSMO, user, and dataviewer identities
- update blob storage architecture docs for two-account layout

🗄️ - Generated by Copilot
@codecov-commenter

codecov-commenter commented Apr 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 50.46%. Comparing base (acfe20d) to head (44ad760).
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #398   +/-   ##
=======================================
  Coverage   50.46%   50.46%           
=======================================
  Files         267      267           
  Lines       18098    18098           
  Branches     1855     1855           
=======================================
  Hits         9134     9134           
  Misses       8674     8674           
  Partials      290      290           
| Flag | Coverage Δ | *Carryforward flag |
| --- | --- | --- |
| pester | 81.96% <ø> (ø) | |
| pytest | 6.89% <ø> (ø) | Carriedforward from b999299 |
| pytest-dataviewer | 61.97% <ø> (ø) | |
| vitest | 50.72% <ø> (ø) | |

*This pull request uses carry forward flags. Click here to find out more.


@jjottar jjottar marked this pull request as ready for review April 7, 2026 08:59
@jjottar jjottar requested a review from a team as a code owner April 7, 2026 08:59
Contributor

@katriendg katriendg left a comment


Thank you @jjottar for this contribution!

I've left one comment in the review, and two small requests:

  1. We have added Terraform docs generation (not yet documented, so not something you could have known about): could you run `npm run docs:generate:tf` locally to update the TERRAFORM.md file(s) before we merge?
  2. We typically document variables in infrastructure/terraform/terraform.tfvars.example; could you add the new one there as well?


Review thread on storage.tf:

```hcl
resource "azurerm_storage_management_policy" "main" {
  storage_account_id = azurerm_storage_account.main.id
resource "azurerm_storage_account" "data_lake" {
```
Contributor


Lifecycle policy regression — should_create_data_lake_storage = false silently removes cost controls

The removal of azurerm_storage_management_policy.main introduces a silent regression for all existing deployments that are not simultaneously opting in to the data lake.

What breaks:

On the current main branch, the three lifecycle rules (raw/ delete, converted/ cool tier, reports/ cool→archive) live on azurerm_storage_account.main. This PR moves them exclusively to azurerm_storage_management_policy.data_lake, which is gated by should_create_data_lake_storage. With should_create_data_lake_storage = false (the default), the next terraform apply on any existing deployment will:

  1. Destroy azurerm_storage_management_policy.main — raw bags will accumulate with no automated deletion, converted datasets stay on Hot tier indefinitely.
  2. Silently no-op should_enable_raw_bags_lifecycle_policy = true and the other two lifecycle variables — they wire into a data lake resource that doesn't exist, so they have no effect and raise no error.

The PR's own type-of-change checklist marks this as a non-breaking new feature, which is inconsistent with this destroy.

Recommended fix: independent policies per account

The cleanest model is two independent lifecycle policies, each managing its own concern, with no coupling between them. To avoid the regression for existing deployments during the migration window, add a fallback policy on the ML storage account that is active only when the data lake is disabled:

```hcl
resource "azurerm_storage_management_policy" "main" {
  count              = var.should_create_data_lake_storage ? 0 : 1
  storage_account_id = azurerm_storage_account.main.id
  # same rules as before
  ...
}
```

This is zero-regression: existing deployments keep their lifecycle policy until they explicitly enable the data lake. At that point Terraform destroys the ML policy and creates the data lake policy in the same apply — an intentional, visible transition. The three should_enable_*_lifecycle_policy variables remain meaningful in both states.

Alternatively, if the intent is that no one should be writing raw/, converted/, or reports/ blobs to the ML storage account anymore (architecturally correct), then the regression is acceptable — but should_enable_raw_bags_lifecycle_policy and its siblings should produce a precondition error when set to true without should_create_data_lake_storage = true, so the break is explicit rather than silent.
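
The explicit-failure alternative can be sketched with a cross-variable validation (cross-variable references in `validation` blocks require Terraform 1.9+; on older versions a lifecycle precondition on the policy resource achieves the same effect):

```hcl
variable "should_enable_raw_bags_lifecycle_policy" {
  type    = bool
  default = false

  validation {
    condition     = !var.should_enable_raw_bags_lifecycle_policy || var.should_create_data_lake_storage
    error_message = "should_enable_raw_bags_lifecycle_policy requires should_create_data_lake_storage = true."
  }
}
```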

Contributor Author

@jjottar jjottar Apr 7, 2026


Thanks a lot @katriendg for the review, the on-point catch, and the suggestions!

All three items addressed (and PR description updated accordingly):

  1. Lifecycle policy regression — Fixed in commit 837bbd8. azurerm_storage_management_policy.main is now retained with count = var.should_create_data_lake_storage ? 0 : 1, so existing deployments keep their lifecycle rules until the data lake is explicitly enabled. Zero-regression path as you described.

  2. terraform.tfvars.example — Added should_create_data_lake_storage in the same commit.

  3. TERRAFORM.md regeneration — Done in a separate commit (44ad760). Note: npm run docs:generate:tf regenerated all 9 directories, not just the 3 we modified. The extra 6 files are pre-existing formatting normalization from the new pipeline (PR feat(build): add terraform-docs generation pipeline #378). I kept them in a standalone commit so you can assess whether to include them or squash/drop that commit if you'd prefer to handle the bulk regeneration separately.

Juan Jottar added 2 commits April 7, 2026 17:24
- add conditional azurerm_storage_management_policy.main (active when data lake off)
- add should_create_data_lake_storage to terraform.tfvars.example


Development

Successfully merging this pull request may close these issues.

feat(infra): Add dedicated ADLS Gen2 storage account

3 participants