[Feature]: UAT assert-recipe.yaml tests are too weak — assert expected component names, not just count > 0 #498
Description
Prerequisites
- I searched existing issues
Feature Summary
Problem
The UAT assert-recipe.yaml files across all CUJ tests only check that the recipe has a non-zero number of components:
(length(componentRefs) > `0`): true
This is a trivially passing assertion: a recipe with a single wrong component would pass. These tests give a false sense of coverage: they verify
the recipe exists but not that it contains the right components.
Affected files (all four share the same weakness):
- tests/uat/aws/tests/cuj1-training/assert-recipe.yaml
- tests/uat/aws/tests/cuj2-inference/assert-recipe.yaml
- tests/uat/azure/tests/cuj1-training/assert-recipe.yaml
- tests/uat/azure/tests/cuj2-inference/assert-recipe.yaml
Step 3 ("Validate deployment against live snapshot") doesn't validate much
The validate-deployment step runs aicr validate but:
- Uses || true to swallow non-zero exit codes, so validation failures are silently ignored
- The assertion only checks that the output file exists and contains reportFormat: CTRF
- No assertion on pass/fail counts, individual test names, or that any tests actually passed
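To illustrate why the || true suffix hides failures, here is a minimal shell sketch. The fake_validate function is a hypothetical stand-in for the real aicr validate invocation (its arguments are not shown in this issue); it simply fails with exit code 1.

```shell
#!/bin/sh
# Hypothetical stand-in for `aicr validate`: always fails with exit code 1.
fake_validate() {
  return 1
}

# With `|| true`, the compound command succeeds even though validation failed.
fake_validate || true
masked_status=$?

# The test step only ever sees the masked status, which is 0.
echo "status seen by the test step: $masked_status"
```

Because the step's status is always 0, any later assertion that only checks for the report file cannot tell a failed validation from a successful one.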
The assert-validate-multiphase.yaml has the same issue — it only asserts summary.tests > 0, not that any tests passed:
results:
  summary:
    tests: (@ > 0)
Problem/Use Case
This pattern is being copy-pasted to new cloud/accelerator variants. The Azure AKS tests (#476) were copied directly from the AWS tests with the same weak assertions. As we add H200 and other variants, every new UAT suite will inherit these gaps and false assurances unless we fix the template now.
Proposed Solution
Proposed fix
1. Assert expected component names in assert-recipe.yaml
Each CUJ should assert the specific component names expected for that recipe configuration. For example, the EKS/H100/training/kubeflow recipe
assertion should look something like:
kind: RecipeResult
apiVersion: aicr.nvidia.com/v1alpha1
criteria:
  service: eks
  accelerator: h100
  intent: training
  os: ubuntu
  platform: kubeflow
# Assert expected components are present by name
(componentRefs[?name == 'gpu-operator']): (length(@) > `0`)
(componentRefs[?name == 'network-operator']): (length(@) > `0`)
# ... all expected components for this recipe
The exact component list should be derived from what aicr recipe actually produces for each CUJ configuration.
2. Strengthen validation assertions
- Remove || true from validate steps, or at minimum assert on the exit code
- Assert summary.passed > 0 (not just summary.tests > 0) so all-fail runs don't silently pass
- Consider asserting individual check names exist in the results, similar to the component name approach
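If removing || true outright is not practical (for example, because later steps still need the report file), one option is to record the exit code instead of discarding it. This is a minimal sketch under assumptions: fake_validate is a hypothetical stand-in for aicr validate, and the exit_code: output line is an illustrative convention, not something aicr or the test harness defines.

```shell
#!/bin/sh
# Hypothetical stand-in for `aicr validate`: fails with exit code 3.
fake_validate() {
  return 3
}

# Record the exit code instead of discarding it with `|| true`.
# The step keeps running, but the failure is no longer lost.
validate_status=0
fake_validate || validate_status=$?

# Emit the code so a later assertion can require it to be 0.
echo "exit_code: $validate_status"
```

A follow-up assertion can then require the recorded code to be 0, alongside the summary.passed check, so a failing validation run can no longer pass unnoticed.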
Success Criteria
- CUJ assertions check the actual expected content of specific, important recipes
- Validation assertions check that the tests pass
Alternatives Considered
No response
Component
Multiple components
Priority
Important (would improve my workflow)
Compatibility / Breaking Changes
No response
Operational Considerations
No response
Are you willing to contribute?
Yes, I can open a PR