feat: workflowtasksets for resource monitors. fixes #16126 #16125

Open
isubasinghe wants to merge 17 commits into argoproj:main from pipekit:feat-workflowtasksets-for-resource-monitors

@isubasinghe
Member

@isubasinghe isubasinghe commented May 21, 2026

Fixes #16126

Motivation

Resource templates currently spawn a wait pod per invocation just to poll the created object. For workflow-of-workflows that's a pod per child, which adds up fast.

This moves the polling into the workflow's agent pod, the same way HTTP and Plugin templates already work.

Modifications

  • New node type NodeTypeResourceMonitor. The operator no longer creates a pod for resource templates; it puts the template on the WorkflowTaskSet and the agent picks
    it up.
  • Two new label keys: workflows.argoproj.io/monitored-resource (workflow name, used as the informer's selector) and workflows.argoproj.io/monitored-resource-node-id
    (used to route events back to the originating task).
  • Agent gets a MonitoredResourceInformer: one dynamic informer per GVR, label-filtered, with a single handler that dispatches by node-ID label.
  • Agent shells out to kubectl for the action, infers the GVR from the response, then watches. Terminal phase is posted when a success/failure condition matches.
  • Output parameter extraction supports jsonPath, jqFilter, and default.
  • manifestFrom.artifact is resolved inside the agent via the artifact driver. Single-file archives only; multi-file archives are rejected with a pointer back to the
    legacy path.
  • Agent pod gets a 64Mi tmpfs /tmp so kubectl scratch files work under ReadOnlyRootFilesystem=true.
  • UI: ResourceMonitor added to the DAG genre map.
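The "single handler that dispatches by node-ID label" idea from the list above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code in this PR: the `event` and `dispatcher` types and their methods are hypothetical, and a real handler would receive an unstructured Kubernetes object from a dynamic informer. Only the label key is taken from the description.

```go
package main

import (
	"fmt"
	"sync"
)

// Label key as named in the PR description.
const nodeIDLabel = "workflows.argoproj.io/monitored-resource-node-id"

// event is a hypothetical stand-in for an informer notification.
type event struct {
	Labels map[string]string
	Phase  string
}

// dispatcher (hypothetical) routes events to the task registered for a node ID.
type dispatcher struct {
	mu    sync.Mutex
	tasks map[string]chan event
}

func newDispatcher() *dispatcher {
	return &dispatcher{tasks: make(map[string]chan event)}
}

// register returns a buffered channel on which events for nodeID arrive.
func (d *dispatcher) register(nodeID string) <-chan event {
	d.mu.Lock()
	defer d.mu.Unlock()
	ch := make(chan event, 16)
	d.tasks[nodeID] = ch
	return ch
}

// handle is the single shared handler. It reads the node-ID label and
// forwards the event to the matching task. It returns false when the event
// carries no node-ID label, no task is registered, or the buffer is full.
func (d *dispatcher) handle(ev event) bool {
	nodeID, ok := ev.Labels[nodeIDLabel]
	if !ok {
		return false
	}
	d.mu.Lock()
	ch, ok := d.tasks[nodeID]
	d.mu.Unlock()
	if !ok {
		return false
	}
	select {
	case ch <- ev:
		return true
	default:
		return false // drop rather than block the caller
	}
}

func main() {
	d := newDispatcher()
	ch := d.register("node-a")
	d.handle(event{Labels: map[string]string{nodeIDLabel: "node-a"}, Phase: "Succeeded"})
	fmt.Println((<-ch).Phase)
}
```

Keeping one handler per GVR and routing by label means the number of watches grows with the number of resource kinds rather than with the number of tasks, which appears to be the motivation for the two labels.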

Verification

  • make test, including new informer unit tests and an agent integration test (fake dynamic client plus stubbed kubectl) that covers success match, failure match,
    JSONPath/JQ outputs, and the delete short-circuit.
  • Ran the new examples on a kind cluster against the quick-start manifests. Workflow-of-workflows runs without per-step wait pods. Stripped list/watch from the agent
    role and confirmed the failure mode matches what the docs say.

Documentation

docs/workflow-rbac.md has the new RBAC section with a worked example role. The walk-through points at it. The quick-start manifest matches the docs.

AI

The majority of the code was written by me. Claude Code (Opus) was used for boilerplate around the informer and for help drafting docs and this PR description.

Signed-off-by: isubasinghe <isitha@pipekit.io>
…h/list on workflows

Signed-off-by: isubasinghe <isitha@pipekit.io>
@isubasinghe isubasinghe changed the title Feat workflowtasksets for resource monitors feat: workflowtasksets for resource monitors May 21, 2026
@isubasinghe isubasinghe changed the title feat: workflowtasksets for resource monitors feat: workflowtasksets for resource monitors. fixes #16126 May 21, 2026
Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
@isubasinghe isubasinghe force-pushed the feat-workflowtasksets-for-resource-monitors branch from c37c799 to 2011809 Compare May 22, 2026 00:55
isubasinghe and others added 10 commits May 22, 2026 12:32
Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
- agent Role now has secrets:get so archiveAgentLogs can init the
  artifact driver to upload main-logs (TestResourceLog/Basic).
- include NodeTypeResourceMonitor wherever HTTP/Plugin are handled:
  getOutboundNodes, isExecutionNode (cli + workflow/util), progress
  updater.executable, and tmpl.GetNodeType so initialiseExecutableNode
  resolves to ResourceMonitor instead of NodeTypePod.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
…ugin sidecars to agent pod

runKubectl mutates os.Args and kubectlutil.BehaviorOnFatal — both process
globals — so concurrent agent task workers were racing and writing each
other's manifest paths, leaving created resources with the wrong
monitored-resource node-ID label and stranding events in handleDone.

archiveAgentLogs called plugin.NewDriver unbounded; the driver polls 120s
for a unix socket, dwarfing the 90s e2e timeout and silently hanging
every resource template that resolved to a plugin-backed archive
location. Wrap NewDriver in a 30s context.

Resource templates with plugin-backed archive locations couldn't archive
main-logs at all because the agent pod had no plugin sidecars. Scan the
workflow's resource templates for plugin needs and attach the same
sidecar + emptyDir + socket-mount triple that the wait container uses,
plus EnvVarArtifactPluginNames so the driver path keys off it identically.

Signed-off-by: isubasinghe <isitha@pipekit.io>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e at agent pod for resource templates, drop hardcoded /tmp

- agent init now copies argoexec to /var/run/argo when artifact plugin
  sidecars are attached; without it the sidecar crashloops on
  exec: "/var/run/argo/argoexec": no such file or directory and
  archiveAgentLogs hangs the resource template (TestResourceLogPlugin).
- pod.name for resource templates now resolves to the agent pod (the
  pod actually running kubectl), fixing the k8s-patch-pod/-merge/-json
  examples that previously patched a non-existent resource pod name.
- Replace hardcoded /tmp in the agent's manifest path with os.TempDir()
  so the integration tests (TestProcessTask_ResourceTemplate_*) pass on
  Windows runners where /tmp doesn't exist.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: isubasinghe <isitha@pipekit.io>
…monitor labels into JSON patch ops

- The artifact-plugin sidecar's command is /var/run/argo/argoexec, but
  artifactSidecarContainer() only adds the socket-dir mount. Workflow
  pods get var-run-argo added in the createWorkflowPod loop; the agent
  pod has no such loop, so the sidecar crashlooped on missing argoexec
  even though the init container staged it. Mount it explicitly.
- k8s-patch-json-pod failed because injectMonitoredLabel unmarshaled
  the manifest as a Pod, but JSON Patch (mergeStrategy=json) is an
  array of ops. Append `add` ops for the monitor labels using JSON
  Pointer ~1 escaping so kubectl applies them alongside the user's.
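The JSON Patch handling in the second item can be illustrated with the sketch below. It is an assumption-laden illustration rather than the PR's `injectMonitoredLabel`: `escapePointer`, `appendLabelOps`, and `addOp` are hypothetical names, and the escaping follows RFC 6901 (`~` becomes `~0`, then `/` becomes `~1`). The user's ops are kept as raw JSON so that fields such as `from` are preserved.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// addOp is a JSON Patch "add" operation with a string value.
type addOp struct {
	Op    string `json:"op"`
	Path  string `json:"path"`
	Value string `json:"value"`
}

// escapePointer escapes one JSON Pointer reference token per RFC 6901.
// "~" must be replaced before "/" so that "~1" is not double-escaped.
func escapePointer(token string) string {
	token = strings.ReplaceAll(token, "~", "~0")
	return strings.ReplaceAll(token, "/", "~1")
}

// appendLabelOps parses patch as a JSON Patch array and appends one "add"
// op per label, in sorted key order, leaving the existing ops untouched.
func appendLabelOps(patch []byte, labels map[string]string) ([]byte, error) {
	var ops []json.RawMessage
	if err := json.Unmarshal(patch, &ops); err != nil {
		return nil, fmt.Errorf("manifest is not a JSON Patch array: %w", err)
	}
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		raw, err := json.Marshal(addOp{
			Op:    "add",
			Path:  "/metadata/labels/" + escapePointer(k),
			Value: labels[k],
		})
		if err != nil {
			return nil, err
		}
		ops = append(ops, raw)
	}
	return json.Marshal(ops)
}

func main() {
	patch := []byte(`[{"op":"replace","path":"/spec/containers/0/image","value":"nginx"}]`)
	out, err := appendLabelOps(patch, map[string]string{
		"workflows.argoproj.io/monitored-resource": "my-wf",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(out))
}
```

Note that an `add` op at `/metadata/labels/<key>` fails if the target object has no `labels` map at all, so a real implementation may need to handle that case separately.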

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: isubasinghe <isitha@pipekit.io>
The agent pod runs with AutomountServiceAccountToken: false, so plugin
sidecars had no token at /var/run/secrets/kubernetes.io/serviceaccount.
Artifact driver plugins (e.g. artifact-driver-s3) that build an in-cluster
client to fetch credential Secrets failed during Save, causing main-logs
archival for resource templates with plugin-backed artifact repos to be
silently skipped (TestArtifactsSuite/TestResourceLogPlugin).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: isubasinghe <isitha@pipekit.io>
…in-logs

archiveAgentLogs created /tmp/agent-main-logs-*.log inside the agent main
container's emptyDir and asked the plugin sidecar to upload that path. The
sidecar opens the path on its own filesystem and gets ENOENT, dropping the
main-logs artifact. The default S3/in-process drivers worked because Save
runs in-process; the plugin path didn't because the sidecar is a separate
container.

Wire a single agent-plugin-share emptyDir on the agent pod (only when there
are plugin sidecars). The agent main mounts the root at
common.AgentPluginShareDir; each sidecar mounts SubPath=<plugin-name> at the
same path. archiveAgentLogs writes plugin-bound temp files under
<root>/<plugin-name>/ so the path string the agent passes to driver.Save
resolves to the same bytes inside the sidecar. Each sidecar only sees its
own slice.

Fixes ArtifactsSuite/TestResourceLogPlugin/Basic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: isubasinghe <isitha@pipekit.io>
@isubasinghe
Member Author

These defaults are way too permissive; they are set this way only so that the tests pass.

I'm opening this PR not because this is the state I want it merged in, but to start a discussion about how to reduce the permissions here.

We could also consider dropping support for logs in this mode of execution.

@isubasinghe isubasinghe marked this pull request as ready for review May 26, 2026 01:57
@isubasinghe isubasinghe requested review from a team as code owners May 26, 2026 01:57
Development

Successfully merging this pull request may close these issues.

Run resource monitoring pods as agents.