feat: add opensource vLLM deployment pathway with container apps and gpu #184
Open
Portal Preview
Deployed by run #358; destroyed automatically on PR close.
Implements azureml_registry as a second model source option in the
vllm-service module, complementing the default huggingface source.
Changes:
- modules/vllm-service/variables.tf: add model_source enum var and
azureml_registry nullable object var; update descriptions
- modules/vllm-service/main.tf:
- new locals: use_azureml_registry_source, azureml_download_parent,
azureml_model_root, vllm_model_arg, azureml_init_image
- azurerm_user_assigned_identity.azureml_downloader (count-conditional)
created before Container App to avoid RBAC propagation race
- null_resource.build_azureml_init_image builds Dockerfile.azureml-init
into module ACR; filemd5 triggers on Dockerfile + script changes
- dynamic identity block on Container App for UA identity
- vllm_model_arg replaces hardcoded var.model_id in args; adds
--served-model-name for azureml_registry source to keep API name stable
- HF secret, HF_HUB_OFFLINE/TRANSFORMERS_OFFLINE, HF_TOKEN env gated on
model_source == huggingface
- dynamic init_container block: downloads model via azureml-init image,
mounts model-cache volume, passes UA client_id for managed identity login
- check block validates azureml_registry != null when source is azureml_registry
- modules/vllm-service/Dockerfile.azureml-init: pins azure-cli:2.67.0,
  pre-bakes the az ml extension, and sets azureml-init.sh as the CMD
- modules/vllm-service/azureml-init.sh: idempotent download with
.download-complete marker, atomic move, config.json validation
- modules/vllm-service/outputs.tf: add azureml_downloader_principal_id;
update descriptions
- stacks/vllm/locals.tf: add use_azureml_source local
- stacks/vllm/main.tf: thread model_source + azureml_registry from
vllm_config; add azurerm_role_assignment.azureml_registry_user
- stacks/vllm/README.md: document AzureML Registry Source section
- model-deployments.md: document azureml_registry source option with
registration steps, format requirements, and source comparison table
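A minimal sketch of the two new variables described above, assuming an object shape of registry_name/model_name/model_version (the exact attribute names and the defaults are not spelled out in this PR and are illustrative):

```hcl
# Sketch only: the enum values come from this PR; the object's
# attribute names and both defaults are assumptions.
variable "model_source" {
  description = "Where the vLLM model is pulled from."
  type        = string
  default     = "huggingface"

  validation {
    condition     = contains(["huggingface", "azureml_registry"], var.model_source)
    error_message = "model_source must be \"huggingface\" or \"azureml_registry\"."
  }
}

variable "azureml_registry" {
  description = "AzureML registry model reference; required when model_source is \"azureml_registry\"."
  type = object({
    registry_name = string
    model_name    = string
    model_version = string
  })
  default = null
}
```

With default = null the object variable is nullable, which is what lets the check block described above enforce "non-null when source is azureml_registry".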
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
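The idempotent download flow in azureml-init.sh (marker file, staging directory, config.json validation, atomic move) can be sketched as a small shell function. The real script wraps `az ml model download`; here the download command is parameterized and the function name is hypothetical, so the pattern itself is runnable:

```shell
# Sketch of the idempotency pattern, not the actual script.
# Usage: ensure_model /path/to/model <download-cmd ...>
# The download command receives the staging dir as its last argument.
ensure_model() {
  ROOT="$1"; shift                    # target model directory
  MARKER="$ROOT/.download-complete"
  STAGING="$ROOT.staging"

  if [ -f "$MARKER" ]; then          # marker present: skip re-download
    echo "cached"
    return 0
  fi

  rm -rf "$STAGING"
  mkdir -p "$STAGING"
  "$@" "$STAGING"                     # e.g. az ml model download ...

  # Refuse to publish a tree that lacks config.json.
  if [ ! -f "$STAGING/config.json" ]; then
    echo "config.json missing" >&2
    return 1
  fi

  rm -rf "$ROOT"
  mv "$STAGING" "$ROOT"               # single rename: never a half-copied tree
  touch "$MARKER"
  echo "downloaded"
}
```

Because the marker is written only after the rename succeeds, a crash mid-download leaves no marker and the next start retries cleanly.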
RBAC race (MAJOR): move azurerm_role_assignment.azureml_registry_user from
stacks/vllm into modules/vllm-service so it can be placed in the Container
App depends_on chain. Previously the init container could start before the
AzureML Registry User assignment propagated; the Container App now
explicitly waits on the role assignment resource. Remove the
use_azureml_source local from stacks/vllm/locals.tf; it is no longer
referenced after the role assignment moved into the module.

NSG bypass (MAJOR): replace the broad AllowVnetInbound-* dynamic rules
(all address spaces) on the vllm-aca NSG with three targeted rules:
- AllowAzureLoadBalancerInbound (priority 200): ACA health probes
- AllowPeSubnetInbound (priority 210): APIM backend calls via the PE NIC
- AllowApimSubnetInbound (priority 220): direct APIM calls before PE DNS
Broad VNet inbound is removed; only APIM (via PE and direct) and platform
health probes can reach the vLLM Container App.

Update the stacks/vllm/README.md RBAC wiring section and the
.github/skills/network/references/REFERENCE.md NSG table accordingly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
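One of the three targeted rules could look roughly like this; the NSG resource name and resource group variable are assumptions, and the other two rules would differ only in name, priority, and source_address_prefix (the PE and APIM subnet CIDRs respectively):

```hcl
# Sketch: rule name and priority come from the commit above;
# everything else is illustrative.
resource "azurerm_network_security_rule" "allow_lb_inbound" {
  name                        = "AllowAzureLoadBalancerInbound"
  priority                    = 200
  direction                   = "Inbound"
  access                      = "Allow"
  protocol                    = "Tcp"
  source_port_range           = "*"
  destination_port_range      = "*"
  source_address_prefix       = "AzureLoadBalancer"
  destination_address_prefix  = "*"
  resource_group_name         = var.resource_group_name
  network_security_group_name = azurerm_network_security_group.vllm_aca.name
}
```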
…, and align network docs
- Add vLLM Open-Source Models section to services.html with nav card,
model table, and cold-start warning
- Add vllm-service module and vllm stack sections to terraform-reference.html
(also added missing pii-redaction stack)
- Add vllm to Phase 3 deployment table in iac-coder SKILL.md
- Align network SKILL.md with vllm-aca-subnet: update output contract,
doc sync rules, code locations, known subnets, env allocation tables
- Add vllm-aca-subnet to network variables.tf validation block
- Add vLLM subnet-allocation-{env}.json entry to workflows.html
- Rebuild published docs via build.sh
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…offline mode

Add a HuggingFace download init container that pre-downloads the model to
the persistent Azure Files cache before vLLM starts. A .download-complete
marker skips re-download on subsequent restarts. The main container then
runs with HF_HUB_OFFLINE=1 and TRANSFORMERS_OFFLINE=1, guaranteeing zero
network calls at runtime regardless of model source.

Changes:
- offline_mode default changed from false to true (module + stack fallback)
- HF init container added (conditional on model_source=huggingface + offline_mode)
- HF_HUB_OFFLINE/TRANSFORMERS_OFFLINE now set for all sources when offline
- HF token secret created whenever a token is provided (the init container needs it)
- AzureML init validates tokenizer assets (tokenizer_config.json + tokenizer)
- huggingface-cli falls back to python snapshot_download if the CLI is unavailable
- Updated stacks/vllm/README.md with a Model Caching section
- Updated services.html and terraform-reference.html docs

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
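The "offline env vars for all sources" gating could be sketched with a dynamic block inside the Container App's container definition; the variable name and iterator are illustrative:

```hcl
# Sketch, placed inside the azurerm_container_app container block:
# when offline_mode is on, both offline flags are set to "1"
# regardless of model_source.
dynamic "env" {
  for_each = var.offline_mode ? ["HF_HUB_OFFLINE", "TRANSFORMERS_OFFLINE"] : []
  content {
    name  = env.value
    value = "1"
  }
}
```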
…tate guard, tenant-info

- Fix the gpu_memory_utilization error message to match the exclusive upper bound (< 1)
- Add a completion_tokens assertion to the SSE streaming integration test
- Add a check block warning when a tenant enables vLLM but the stack is not deployed
- Gate vllm_models in the tenant-info response on the vllm_enabled flag

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
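The corrected message reads as an exclusive upper limit; a sketch of the matching validation (the description and default value are assumptions, the variable name comes from the commit above):

```hcl
variable "gpu_memory_utilization" {
  description = "Fraction of GPU memory vLLM may use."
  type        = number
  default     = 0.9

  validation {
    condition     = var.gpu_memory_utilization > 0 && var.gpu_memory_utilization < 1
    error_message = "gpu_memory_utilization must be greater than 0 and less than 1."
  }
}
```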
AI Hub Infra Changes
Summary: 2 to add, 20 to change, 0 to destroy (across 4 stack(s))
Updated by CI: plan against testenvironment (run #358) at 2026-04-13 03:22:30 UTC.