feat(infrastructure): add optional ADLS Gen2 data lake storage account #398

jjottar wants to merge 3 commits into microsoft:main
Conversation
- add data lake storage account with HNS behind `should_create_data_lake_storage` flag
- add `datasets` and `models` containers with lifecycle policies
- add `storage_dfs` private DNS zone and data lake private endpoints
- add Storage Blob Data Contributor role assignments for ML, OSMO, user, and dataviewer identities
- update blob storage architecture docs for two-account layout

🗄️ Generated by Copilot
Codecov Report: ✅ All modified and coverable lines are covered by tests.

```
@@           Coverage Diff           @@
##             main     #398   +/-   ##
=======================================
  Coverage   50.46%   50.46%
=======================================
  Files         267      267
  Lines       18098    18098
  Branches     1855     1855
=======================================
  Hits         9134     9134
  Misses       8674     8674
  Partials      290      290
```
This pull request uses carry forward flags.
katriendg left a comment:
Thank you @jjottar for this contribution!
I've left one comment in the review, and two small requests:
- We have added Terraform docs generation (still to be documented though, so not something you knew): could you run `npm run docs:generate:tf` locally to update `TERRAFORM.md` file(s) before we merge?
- We typically document variables in the file `infrastructure/terraform/terraform.tfvars.example`, could you add this new one there as well?
Review comment on the diff context:

```hcl
resource "azurerm_storage_management_policy" "main" {
  storage_account_id = azurerm_storage_account.main.id
...
resource "azurerm_storage_account" "data_lake" {
```
Lifecycle policy regression — `should_create_data_lake_storage = false` silently removes cost controls
The removal of azurerm_storage_management_policy.main introduces a silent regression for all existing deployments that are not simultaneously opting in to the data lake.
What breaks:
On the current main branch, the three lifecycle rules (raw/ delete, converted/ cool tier, reports/ cool→archive) live on azurerm_storage_account.main. This PR moves them exclusively to azurerm_storage_management_policy.data_lake, which is gated by should_create_data_lake_storage. With should_create_data_lake_storage = false (the default), the next terraform apply on any existing deployment will:
- Destroy `azurerm_storage_management_policy.main` — raw bags will accumulate with no automated deletion, and converted datasets stay on the Hot tier indefinitely.
- Silently no-op `should_enable_raw_bags_lifecycle_policy = true` and the other two lifecycle variables — they wire into the data lake resource that doesn't exist, so they have zero effect without an error.
The PR's own type-of-change checklist marks this as a non-breaking new feature, which is inconsistent with this destroy.
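For concreteness, one of the three rules at stake can be sketched in azurerm provider syntax. This is a sketch only: the rule name and the 30-day threshold are assumptions, not the repo's actual values, and only `should_enable_raw_bags_lifecycle_policy` is a variable name taken from this PR.

```hcl
# Sketch of the raw-bag deletion rule as it would sit inside
# azurerm_storage_management_policy.main; name and threshold are assumed.
rule {
  name    = "delete-raw-bags" # assumed name
  enabled = var.should_enable_raw_bags_lifecycle_policy

  filters {
    prefix_match = ["raw/"]
    blob_types   = ["blockBlob"]
  }

  actions {
    base_blob {
      # automated cleanup that the destroy would silently remove
      delete_after_days_since_modification_greater_than = 30 # assumed threshold
    }
  }
}
```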
Recommended fix: independent policies per account
The cleanest model is two independent lifecycle policies, each managing its own concern, with no coupling between them. To avoid the regression for existing deployments during the migration window, add a fallback policy on the ML storage account that is active only when the data lake is disabled:
```hcl
resource "azurerm_storage_management_policy" "main" {
  count              = var.should_create_data_lake_storage ? 0 : 1
  storage_account_id = azurerm_storage_account.main.id

  # same rules as before
  ...
}
```

This is zero-regression: existing deployments keep their lifecycle policy until they explicitly enable the data lake. At that point Terraform destroys the ML policy and creates the data lake policy in the same apply — an intentional, visible transition. The three `should_enable_*_lifecycle_policy` variables remain meaningful in both states.
Alternatively, if the intent is that no one should be writing `raw/`, `converted/`, or `reports/` blobs to the ML storage account anymore (architecturally correct), then the regression is acceptable — but `should_enable_raw_bags_lifecycle_policy` and its siblings should produce a precondition error when set to `true` without `should_create_data_lake_storage = true`, so the break is explicit rather than silent.
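That explicit-failure alternative could be sketched as a `precondition` attached to a resource that always exists (here assumed to be the ML storage account), so the plan fails loudly instead of no-opping. The sibling variable names would need to match the repo's actual ones; only the two variables named in this PR are used below.

```hcl
# Sketch only: fail the plan when a lifecycle variable is enabled but the
# data lake (which now hosts the lifecycle policy) is not being created.
resource "azurerm_storage_account" "main" {
  # ... existing arguments unchanged ...

  lifecycle {
    precondition {
      condition = !(var.should_enable_raw_bags_lifecycle_policy &&
        !var.should_create_data_lake_storage)
      error_message = "should_enable_raw_bags_lifecycle_policy = true requires should_create_data_lake_storage = true."
    }
  }
}
```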
Thanks a lot @katriendg for the review and on-point catch and suggestions!
All three items addressed (and PR description updated accordingly):
- Lifecycle policy regression — Fixed in commit 837bbd8. `azurerm_storage_management_policy.main` is now retained with `count = var.should_create_data_lake_storage ? 0 : 1`, so existing deployments keep their lifecycle rules until the data lake is explicitly enabled. Zero-regression path as you described.
- terraform.tfvars.example — Added `should_create_data_lake_storage` in the same commit.
- TERRAFORM.md regeneration — Done in a separate commit (44ad760). Note: `npm run docs:generate:tf` regenerated all 9 directories, not just the 3 we modified. The extra 6 files are pre-existing formatting normalization from the new pipeline (PR feat(build): add terraform-docs generation pipeline #378). I kept them in a standalone commit so you can assess whether to include them or squash/drop that commit if you'd prefer to handle the bulk regeneration separately.
- add conditional `azurerm_storage_management_policy.main` (active when data lake off)
- add `should_create_data_lake_storage` to terraform.tfvars.example
- …npm run docs:generate:tf
Pull Request
Description
Add an optional dedicated ADLS Gen2 storage account with hierarchical namespace (HNS) for domain data (datasets, model checkpoints), separate from the existing AzureML workspace storage. The data lake is gated behind `should_create_data_lake_storage` (default: `false`) and follows existing patterns for naming, networking, RBAC, and lifecycle policies.

Closes #385
Type of Change
Component(s) Affected
- `infrastructure/terraform/prerequisites/` - Azure subscription setup
- `infrastructure/terraform/` - Terraform infrastructure
- `infrastructure/setup/` - OSMO control plane / Helm
- `workflows/` - Training and evaluation workflows
- `training/` - Training pipelines and scripts
- `docs/` - Documentation

Testing Performed
- `plan` reviewed (no unexpected changes)
- `apply` tested in dev environment
- smoke test (`smoke_test_azure.py`)

Terraform Plan (with `should_create_data_lake_storage = true`)

Plan: 11 to add, 2 to change, 1 to destroy.
To add (11):
- `module.platform.azurerm_storage_account.data_lake[0]`
- `module.platform.azurerm_storage_container.datasets[0]`
- `module.platform.azurerm_storage_container.models[0]`
- `module.platform.azurerm_storage_management_policy.data_lake[0]`
- `module.platform.azurerm_private_dns_zone.core["storage_dfs"]`
- `module.platform.azurerm_private_dns_zone_virtual_network_link.core["storage_dfs"]`
- `module.platform.azurerm_private_endpoint.data_lake_blob[0]`
- `module.platform.azurerm_private_endpoint.data_lake_dfs[0]`
- `module.platform.azurerm_role_assignment.user_data_lake_blob[0]`
- `module.platform.azurerm_role_assignment.ml_data_lake_blob[0]`
- `module.platform.azurerm_role_assignment.osmo_data_lake_blob[0]`

To change (2, in-place, pre-existing drift):
- `module.platform.azurerm_key_vault.main`
- `module.platform.azurerm_storage_account.main`

To destroy (1):
- `module.platform.azurerm_storage_management_policy.main`

The 2 in-place updates and the destroy are expected:
- The destroyed resource is the ML storage lifecycle policy (`count = 0` when the data lake is enabled). Existing deployments without the data lake retain their lifecycle rules.

Terraform Apply
All 11 data lake resources created successfully in `rg-roboticsch-dev-001` (switzerlandnorth):
- storage account `stdlroboticschdev001` (HNS enabled)
- containers `datasets` (private) and `models` (private)
- private DNS zone `privatelink.dfs.core.windows.net`
- private endpoints `pe-datalake-blob-roboticsch-dev-001` and `pe-datalake-dfs-roboticsch-dev-001`
- three role assignments scoped to the `stdl*` account

Terraform Test
Lint & Validation
- `npm run lint:tf`
- `npm run lint:tf:validate`
- `npm run spell-check`
- `npm run lint:md`

What Changed
Platform Module (`infrastructure/terraform/modules/platform/`)
- `storage.tf` - New `azurerm_storage_account.data_lake` with `is_hns_enabled = true`, `datasets` and `models` containers, data lake lifecycle policy, blob and DFS private endpoints. ML storage lifecycle policy gated with `count = var.should_create_data_lake_storage ? 0 : 1` to avoid regression for existing deployments.
- `main.tf` - Added `storage_dfs = "privatelink.dfs.core.windows.net"` to `base_dns_zones` (7 base zones, up from 6).
- `variables.tf` - New `should_create_data_lake_storage` variable (bool, default `false`).
- `role-assignments.tf` - Added Storage Blob Data Contributor on data lake for current user, ML identity, and OSMO identity. All gated on the data lake flag.
- `outputs.tf` - New `data_lake_storage_account` and `data_lake_storage_account_access` outputs (null when disabled).
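As a rough sketch of this gating pattern (the resource names, container names, the `stdl` prefix, and `is_hns_enabled = true` come from the PR; the naming wiring, tier, and replication arguments are assumptions):

```hcl
resource "azurerm_storage_account" "data_lake" {
  count = var.should_create_data_lake_storage ? 1 : 0

  name                     = "stdl${var.suffix}"      # "stdl" prefix per PR; suffix variable is assumed
  resource_group_name      = var.resource_group_name  # assumed wiring
  location                 = var.location             # assumed wiring
  account_tier             = "Standard"               # assumed
  account_replication_type = "ZRS"                    # assumed
  is_hns_enabled           = true                     # ADLS Gen2 hierarchical namespace
}

resource "azurerm_storage_container" "datasets" {
  count                 = var.should_create_data_lake_storage ? 1 : 0
  name                  = "datasets"
  storage_account_name  = azurerm_storage_account.data_lake[0].name
  container_access_type = "private"
}
```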
Dataviewer Module (`infrastructure/terraform/modules/dataviewer/`)
- `variables.deps.tf` - New optional `data_lake_storage_account` input (nullable).
- `role-assignments.tf` - Conditional Storage Blob Data Contributor on data lake for dataviewer identity.
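The nullable-input pattern for the dataviewer module could look like this sketch (the input name and role come from the PR; the object shape and the principal-id input name are assumptions):

```hcl
# variables.deps.tf (sketch): optional data lake dependency, null when disabled.
variable "data_lake_storage_account" {
  type = object({
    id = string
  })
  default  = null
  nullable = true
}

# role-assignments.tf (sketch): grant only when the data lake exists.
resource "azurerm_role_assignment" "dataviewer_data_lake_blob" {
  count                = var.data_lake_storage_account != null ? 1 : 0
  scope                = var.data_lake_storage_account.id
  role_definition_name = "Storage Blob Data Contributor"
  principal_id         = var.dataviewer_principal_id # assumed input name
}
```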
Root Module (`infrastructure/terraform/`)
- `variables.tf` - New `should_create_data_lake_storage` root variable.
- `main.tf` - Pass `should_create_data_lake_storage` to platform module.
- `outputs.tf` - New `data_lake_storage_account` root output.
- `terraform.tfvars.example` - Added `should_create_data_lake_storage` with documentation.
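The terraform.tfvars.example entry might look like the following sketch (the variable name and default come from the PR; the comment wording is illustrative):

```hcl
# Optional dedicated ADLS Gen2 data lake storage account (HNS enabled) with
# `datasets` and `models` containers, private endpoints, and RBAC.
# Default: false (no data lake resources are created).
should_create_data_lake_storage = false
```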
Tests (`infrastructure/terraform/modules/platform/tests/`)
- `dns-zones.tftest.hcl` - Updated zone counts (6→7 base zones).
- `security.tftest.hcl` - Added `data_lake_security` and `data_lake_disabled_by_default` test runs.
- `conditionals.tftest.hcl` - Added `data_lake_enabled` and `data_lake_disabled` test runs.
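A run like data_lake_disabled_by_default could be sketched in native `terraform test` syntax (the run name and variable come from the PR; the specific assertions are assumptions):

```hcl
run "data_lake_disabled_by_default" {
  command = plan

  variables {
    should_create_data_lake_storage = false
  }

  assert {
    condition     = length(azurerm_storage_account.data_lake) == 0
    error_message = "Data lake account must not be created by default."
  }

  assert {
    condition     = length(azurerm_storage_management_policy.main) == 1
    error_message = "ML lifecycle policy must remain when the data lake is disabled."
  }
}
```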
Documentation
- `docs/cloud/blob-storage-structure.md` - Rewritten for two-account architecture: ML workspace storage vs data lake storage, new container/folder structure, updated lifecycle policy references.
- `.cspell/general-technical.txt` - Added `stdl` (data lake naming prefix).

Documentation Impact
Checklist