feat: workflowtasksets for resource monitors. fixes #16126#16125
Open
isubasinghe wants to merge 17 commits into
Open
feat: workflowtasksets for resource monitors. fixes #16126#16125isubasinghe wants to merge 17 commits into
isubasinghe wants to merge 17 commits into
Conversation
Signed-off-by: isubasinghe <isitha@pipekit.io>
…h/list on workflows Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
c37c799 to
2011809
Compare
Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
- agent Role now has secrets:get so archiveAgentLogs can init the artifact driver to upload main-logs (TestResourceLog/Basic). - include NodeTypeResourceMonitor wherever HTTP/Plugin are handled: getOutboundNodes, isExecutionNode (cli + workflow/util), progress updater.executable, and tmpl.GetNodeType so initialiseExecutableNode resolves to ResourceMonitor instead of NodeTypePod. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
…ugin sidecars to agent pod runKubectl mutates os.Args and kubectlutil.BehaviorOnFatal — both process globals — so concurrent agent task workers were racing and writing each other's manifest paths, leaving created resources with the wrong monitored-resource node-ID label and stranding events in handleDone. archiveAgentLogs called plugin.NewDriver unbounded; the driver polls 120s for a unix socket, dwarfing the 90s e2e timeout and silently hanging every resource template that resolved to a plugin-backed archive location. Wrap NewDriver in a 30s context. Resource templates with plugin-backed archive locations couldn't archive main-logs at all because the agent pod had no plugin sidecars. Scan the workflow's resource templates for plugin needs and attach the same sidecar + emptyDir + socket-mount triple that the wait container uses, plus EnvVarArtifactPluginNames so the driver path keys off it identically. Signed-off-by: isubasinghe <isitha@pipekit.io> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e at agent pod for resource templates, drop hardcoded /tmp - agent init now copies argoexec to /var/run/argo when artifact plugin sidecars are attached; without it the sidecar crashloops on exec: "/var/run/argo/argoexec": no such file or directory and archiveAgentLogs hangs the resource template (TestResourceLogPlugin). - pod.name for resource templates now resolves to the agent pod (the pod actually running kubectl), fixing the k8s-patch-pod/-merge/-json examples that previously patched a non-existent resource pod name. - Replace hardcoded /tmp in the agent's manifest path with os.TempDir() so the integration tests (TestProcessTask_ResourceTemplate_*) pass on Windows runners where /tmp doesn't exist. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: isubasinghe <isitha@pipekit.io>
…monitor labels into JSON patch ops - The artifact-plugin sidecar's command is /var/run/argo/argoexec, but artifactSidecarContainer() only adds the socket-dir mount. Workflow pods get var-run-argo added in the createWorkflowPod loop; the agent pod has no such loop, so the sidecar crashlooped on missing argoexec even though the init container staged it. Mount it explicitly. - k8s-patch-json-pod failed because injectMonitoredLabel unmarshaled the manifest as a Pod, but JSON Patch (mergeStrategy=json) is an array of ops. Append `add` ops for the monitor labels using JSON Pointer ~1 escaping so kubectl applies them alongside the user's. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: isubasinghe <isitha@pipekit.io>
The agent pod runs with AutomountServiceAccountToken: false, so plugin sidecars had no token at /var/run/secrets/kubernetes.io/serviceaccount. Artifact driver plugins (e.g. artifact-driver-s3) that build an in-cluster client to fetch credential Secrets failed during Save, causing main-logs archival for resource templates with plugin-backed artifact repos to be silently skipped (TestArtifactsSuite/TestResourceLogPlugin). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: isubasinghe <isitha@pipekit.io>
…in-logs archiveAgentLogs created /tmp/agent-main-logs-*.log inside the agent main container's emptyDir and asked the plugin sidecar to upload that path. The sidecar opens the path on its own filesystem and gets ENOENT, dropping the main-logs artifact. The default S3/in-process drivers worked because Save runs in-process; the plugin path didn't because the sidecar is a separate container. Wire a single agent-plugin-share emptyDir on the agent pod (only when there are plugin sidecars). The agent main mounts the root at common.AgentPluginShareDir; each sidecar mounts SubPath=<plugin-name> at the same path. archiveAgentLogs writes plugin-bound temp files under <root>/<plugin-name>/ so the path string the agent passes to driver.Save resolves to the same bytes inside the sidecar. Each sidecar only sees its own slice. Fixes ArtifactsSuite/TestResourceLogPlugin/Basic. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: isubasinghe <isitha@pipekit.io>
Member
Author
|
These defaults are wayy too permisive to let the tests pass. Opening this PR not because this is the state I want it merged in but more to open a discussion about the direction to go in order to reduce the permissions here. We could also consider drop support for logs in this mode of execution perhaps. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #16126
Motivation
Resource templates currently spawn a wait pod per invocation just to poll the created object. For workflow-of-workflows that's a pod per child, which adds up fast.
This moves the polling into the workflow's agent pod, same way HTTP and Plugin templates already work.
Modifications
NodeTypeResourceMonitor. The operator no longer creates a pod for resource templates; it puts the template on the WorkflowTaskSet and the agent picksit up.
workflows.argoproj.io/monitored-resource(workflow name, used as the informer's selector) andworkflows.argoproj.io/monitored-resource-node-id(used to route events back to the originating task).
MonitoredResourceInformer: one dynamic informer per GVR, label-filtered, with a single handler that dispatches by node-ID label.kubectlfor the action, infers the GVR from the response, then watches. Terminal phase is posted when a success/failure condition matches.jsonPath,jqFilter, anddefault.manifestFrom.artifactis resolved inside the agent via the artifact driver. Single-file archives only; multi-file archives are rejected with a pointer back to thelegacy path.
/tmpso kubectl scratch files work underReadOnlyRootFilesystem=true.ResourceMonitoradded to the DAG genre map.Verification
make test, including new informer unit tests and an agent integration test (fake dynamic client plus stubbed kubectl) that covers success match, failure match,JSONPath/JQ outputs, and the delete short-circuit.
list/watchfrom the agentrole and confirmed the failure mode matches what the docs say.
Documentation
docs/workflow-rbac.mdhas the new RBAC section with a worked example role. The walk-through points at it. The quick-start manifest matches the docs.AI
The majority of the code was written by me. Claude Code (Opus) was used for boilerplate around the informer and for help drafting docs and this PR description.