Skip to content

ENGINE-1383: Handle Windows exit code 3010 and fix reboot after MCR install#624

Open
james-nesbitt wants to merge 9 commits intomainfrom
ENGINE-1383-installps1-update
Open

ENGINE-1383: Handle Windows exit code 3010 and fix reboot after MCR install#624
james-nesbitt wants to merge 9 commits intomainfrom
ENGINE-1383-installps1-update

Conversation

@james-nesbitt
Copy link
Copy Markdown
Collaborator

Summary

Fixes two bugs in the Windows MCR installation path and adds a reliable reboot mechanism.

Bug 1: Exit code 3010 treated as hard failure

When the MCR installer exits with code 3010 (ERROR_SUCCESS_REBOOT_REQUIRED), rig wraps the non-zero exit as ErrCommandFailed. InstallMCR was returning immediately on any non-zero exit, so the reboot-detection logic was unreachable.

Fix: Intercept the error in InstallMCR via isExitCode3010() and treat it as a reboot-required signal instead of a hard failure.

Bug 2: Successful reboot fell through to error return

After a successful rh.Reboot() call, execution fell through to the errRebootRequired return, making every successful reboot appear as a failure.

Fix: Return nil after a successful reboot.

Bug 3: rig's Windows Reboot uses shutdown /r /t 5

rig's Windows.Reboot() issues shutdown /r /t 5 which is silently ignored when Windows has a pending-reboot state (as is the case after a 3010 exit). The host never goes offline and the wait loop times out.

Fix: Override Reboot() on WindowsConfigurer to use Restart-Computer -Force via PowerShell, which bypasses pending-reboot locks.

A TODO comment marks this for upstreaming into k0sproject/rig.

Additional fix: GetInstaller dead code

A dead-code branch in GetInstaller made the installer download path unreachable when no cached path existed.

Fix: Remove the unreachable branch.

Files changed

File Change
pkg/configurer/installer.go Remove dead-code branch blocking installer download
pkg/configurer/common.go Hoist rebootable interface to package scope
pkg/configurer/windows.go Handle exit 3010, fix reboot fall-through, override Reboot() with Restart-Computer -Force

Testing

End-to-end tested against fresh AWS infrastructure (1x Ubuntu 22.04 manager, 3x Windows workers: 2019/2022/2025) using the ENGINE1383 custom installer script.

- installer.go: remove dead-code branch that blocked installer download
  path; GetInstaller() now always attempts the download when no cached
  path exists.

- common.go: hoist rebootable interface to package scope so it is
  accessible from both windows.go and any future configurers.

- windows.go (InstallMCR):
  * Detect exit code 3010 (ERROR_SUCCESS_REBOOT_REQUIRED) via
    isExitCode3010() helper and treat it as a reboot-required signal
    instead of a hard failure.
  * Preserve fallback: if the installer exits 0 but prints
    'Your machine needs to be rebooted', still trigger a reboot.
  * Fix reboot success fall-through: after rh.Reboot() succeeds,
    return nil instead of falling through to the 'host isn't
    rebootable' error return.
rig's Windows Reboot implementation uses 'shutdown /r /t 5' which is
silently ignored when Windows has a pending-reboot state (e.g. after an
MCR install that exits 3010). Override Reboot() on WindowsConfigurer to
use PowerShell's Restart-Computer -Force which bypasses pending-reboot
locks and reliably triggers the restart.

TODO: move this fix upstream into k0sproject/rig.
@james-nesbitt
Copy link
Copy Markdown
Collaborator Author

The unit tests did not fail, but rather the unit test tear down failed after the tests passed.

shutdown /r /t 0 returns immediately (exit 0) before Windows has begun
tearing down its network stack. waitForHost starts polling right after
Reboot() returns, so all 60 echo probes (3s apart) succeed before the
host ever drops WinRM — the offline window is never observed.

Adding a 15-second sleep gives Windows time to start its shutdown
sequence so the subsequent waitForHost(false) poll loop actually catches
the host going offline.
The /f flag forces running applications to close, which is necessary on
Windows Server 2025 where processes can block a pending reboot. Without
it, 'shutdown /r /t 0' completes but the host never actually reboots.
AWS EC2 WinRM sessions run under a filtered Administrator token that
lacks SeShutdownPrivilege. 'shutdown /r /f /t 0' succeeds (exit 0) but
is silently ignored because the token has insufficient privilege.

Fix: create a one-shot scheduled task running as SYSTEM (which always
holds SeShutdownPrivilege) and trigger it immediately. SYSTEM-context
tasks bypass the WinRM token restriction and reliably trigger the
reboot.
/sc once /st 00:00 causes schtasks to write a warning to stderr when
the scheduled time is in the past. Rig treats any stderr output as an
error, causing Reboot() to fail even though the task was created
successfully. /sc onstart requires no start time and creates the task
silently.
Each worker locked mutex at entry (line 106) and deferred unlock (line
107), then attempted a second mutex.Lock() on the error path (line 114).
The second lock deadlocked the goroutine since it already held the mutex.
workerpool.StopWait() then blocked forever waiting for the deadlocked
worker to finish.

Fix: remove the outer lock/defer and only lock when recording an error,
using an early-return guard so only the first error is kept.
The schtask is scheduled with /sc onstart, meaning it fires on every
system startup. Without cleanup, the task triggers a second reboot when
the machine comes back up after the MCR-install reboot, causing
docker swarm join (and any subsequent operation) to fail because the
host reboots again mid-flight.

Delete the task immediately after rh.Reboot() returns (machine is back
up and WinRM is reconnected) to prevent it from firing on subsequent
startups.
The ONSTART schtask fires on every startup, causing repeated reboots.
The post-reboot cleanup in InstallMCR is too late -- the task has already
triggered a second reboot by the time the cleanup runs.

Fix: use /t 5 (5-second countdown) so the task can be deleted immediately
after it is triggered but before the OS actually executes the shutdown.
This prevents the task from re-firing on subsequent startups.

The post-reboot cleanup in InstallMCR is kept as a fallback in case the
pre-delete fails (e.g. the WinRM session is dropped in the 5s window).
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes multiple issues in the Windows Mirantis Container Runtime (MCR) install/reboot path (reboot-required exit code handling, reboot fall-through, and a more reliable reboot implementation), and removes dead code that prevented installer downloads.

Changes:

  • Treat Windows installer exit code 3010 as “reboot required” and return nil after a successful reboot.
  • Replace/override Windows reboot behavior with a scheduled-task-driven restart approach and task cleanup.
  • Remove dead branch in GetInstaller that made the download path unreachable.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

File Description
pkg/docker/image.go Adjusts parallel image pull error handling (but introduces a return-value timing issue; see comment).
pkg/configurer/windows.go Handles exit code 3010, fixes reboot fall-through, and adds a Windows reboot implementation plus cleanup.
pkg/configurer/installer.go Removes unreachable early-return to allow installer download; concurrency concern remains (see comment).
pkg/configurer/common.go Hoists rebootable interface to package scope for reuse.
Comments suppressed due to low confidence (1)

pkg/configurer/installer.go:32

  • downloadedInstallers is a package-level map accessed without synchronization. GetInstaller is called from Windows install/uninstall paths that run across hosts in parallel, so concurrent reads/writes can panic with "concurrent map read and map write". Protect this cache with a sync.Mutex/sync.RWMutex, use sync.Map, or remove the global cache and let callers manage caching.
func GetInstaller(source string) (string, error) {
	path, ok := downloadedInstallers[source]
	if ok {
		return path, nil
	}

	path, getErr := downloadInstaller(source)
	if getErr != nil {
		return "", fmt.Errorf("%w, installer download failed; %s", ErrInstallerDownloadFailed, getErr.Error())
	}
	downloadedInstallers[source] = path
	return path, nil

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/docker/image.go
Comment on lines 103 to 107
for _, image := range images {
i := image // So we can safely pass i forward to pool without it getting mutated
wp.Submit(func() {
mutex.Lock()
defer mutex.Unlock()
if lastError != nil {
return
}

err := i.Pull(h)
if err != nil {
if err := i.Pull(h); err != nil {
mutex.Lock()
Copy link

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

defer wp.StopWait() means the expression in return lastError is evaluated before the pool has finished running the submitted pulls. As a result, this function can return nil even if a later worker sets lastError. Call wp.StopWait() (or wp.StopWait() + wp.Stop()) before reading/returning lastError, or switch to a named return value so the deferred wait sees the final error.

Copilot uses AI. Check for mistakes.
Comment thread pkg/configurer/windows.go
Comment on lines +121 to +125
// isExitCode3010 checks if the error is a command failure with Windows exit
// code 3010 (ERROR_SUCCESS_REBOOT_REQUIRED).
func isExitCode3010(err error) bool {
return err != nil && strings.Contains(err.Error(), "3010")
}
Copy link

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isExitCode3010 currently checks strings.Contains(err.Error(), "3010"), which is overly broad and can misclassify unrelated failures whose error strings happen to include those digits. Prefer extracting the actual exit code from the underlying error type (e.g., via errors.As to a command/exit error type) or at least match an exact token like "exit code 3010"/"exit status 3010" with a word-boundary regex.

Copilot uses AI. Check for mistakes.
Comment thread pkg/configurer/windows.go
Comment on lines +127 to +152
// Reboot triggers an immediate forced restart by scheduling a SYSTEM-context
// task that runs 'shutdown /r /f /t 0', then immediately triggering it.
// Running via a scheduled task bypasses the filtered Administrator token used
// by WinRM sessions on AWS EC2, which lacks SeShutdownPrivilege. The rig
// implementation uses 'shutdown /r /t 5' directly in the WinRM session,
// which is silently ignored in that context.
//
// /sc onstart is used instead of /sc once to avoid schtasks writing a
// stderr warning about the start time being in the past, which rig treats
// as an error.
//
// After scheduling the task we sleep briefly so that Windows has time to
// begin its shutdown sequence before the caller's waitForHost poll loop starts.
//
// TODO: move this fix upstream into the k0sproject/rig Windows configurer.
func (c WindowsConfigurer) Reboot(h os.Host) error {
const taskName = "LaunchpadReboot"
// Create a SYSTEM-context ONSTART task that runs 'shutdown /r /f /t 5'.
// The 5-second delay gives us time to delete the task before the OS
// actually executes the reboot, preventing it from firing again on the
// next startup.
create := fmt.Sprintf(`schtasks /create /tn "%s" /tr "shutdown /r /f /t 5" /sc onstart /f /ru SYSTEM`, taskName)
if err := h.Exec(create); err != nil {
return fmt.Errorf("failed to create reboot task: %w", err)
}
run := fmt.Sprintf(`schtasks /run /tn "%s"`, taskName)
Copy link

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description says Windows reboot is overridden to use Restart-Computer -Force, but this implementation still relies on shutdown (via schtasks). Either update the implementation to match the described Restart-Computer -Force approach (and explain why it works under pending-reboot states), or adjust the PR description/comments to reflect the actual mechanism.

Copilot uses AI. Check for mistakes.
Comment thread pkg/configurer/windows.go
Comment on lines +144 to +168
// Create a SYSTEM-context ONSTART task that runs 'shutdown /r /f /t 5'.
// The 5-second delay gives us time to delete the task before the OS
// actually executes the reboot, preventing it from firing again on the
// next startup.
create := fmt.Sprintf(`schtasks /create /tn "%s" /tr "shutdown /r /f /t 5" /sc onstart /f /ru SYSTEM`, taskName)
if err := h.Exec(create); err != nil {
return fmt.Errorf("failed to create reboot task: %w", err)
}
run := fmt.Sprintf(`schtasks /run /tn "%s"`, taskName)
if err := h.Exec(run); err != nil {
// Tolerate connection-level errors; the OS may kill WinRM as it starts
// rebooting before the run command returns.
errMsg := err.Error()
if !strings.Contains(errMsg, "connection") && !strings.Contains(errMsg, "closed") && !strings.Contains(errMsg, "EOF") {
return fmt.Errorf("failed to run reboot task: %w", err)
}
}
// Delete the task immediately while the 5-second shutdown timer is still
// counting down. This prevents it from re-firing on subsequent startups.
del := fmt.Sprintf(`schtasks /delete /tn "%s" /f`, taskName)
if err := h.Exec(del); err != nil {
// Best-effort: warn but don't fail — the post-reboot cleanup in
// InstallMCR will attempt deletion again once the host is back up.
log.Warnf("%v: failed to pre-delete reboot task (will retry after reboot): %s", h, err)
}
Copy link

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using an /sc onstart scheduled task that performs shutdown /r /f /t 5 can create a reboot loop if the task isn't deleted before the next boot (e.g., if WinRM drops and the pre-delete fails). The post-reboot cleanup in InstallMCR may not run within 5 seconds of startup, so the machine can immediately reboot again and never become reachable. Consider using an /sc once task scheduled slightly in the future (compute /st dynamically) or increase the delay substantially (or use Restart-Computer -Force) so there's enough time to reconnect and delete the task reliably.

Copilot uses AI. Check for mistakes.
Comment thread pkg/configurer/windows.go
Comment on lines +127 to +148
// Reboot triggers an immediate forced restart by scheduling a SYSTEM-context
// task that runs 'shutdown /r /f /t 0', then immediately triggering it.
// Running via a scheduled task bypasses the filtered Administrator token used
// by WinRM sessions on AWS EC2, which lacks SeShutdownPrivilege. The rig
// implementation uses 'shutdown /r /t 5' directly in the WinRM session,
// which is silently ignored in that context.
//
// /sc onstart is used instead of /sc once to avoid schtasks writing a
// stderr warning about the start time being in the past, which rig treats
// as an error.
//
// After scheduling the task we sleep briefly so that Windows has time to
// begin its shutdown sequence before the caller's waitForHost poll loop starts.
//
// TODO: move this fix upstream into the k0sproject/rig Windows configurer.
func (c WindowsConfigurer) Reboot(h os.Host) error {
const taskName = "LaunchpadReboot"
// Create a SYSTEM-context ONSTART task that runs 'shutdown /r /f /t 5'.
// The 5-second delay gives us time to delete the task before the OS
// actually executes the reboot, preventing it from firing again on the
// next startup.
create := fmt.Sprintf(`schtasks /create /tn "%s" /tr "shutdown /r /f /t 5" /sc onstart /f /ru SYSTEM`, taskName)
Copy link

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says the scheduled task runs shutdown /r /f /t 0, but the actual command created is shutdown /r /f /t 5. Please fix the comment (or the command) so they match, since the timing is important for the subsequent task deletion logic.

Copilot uses AI. Check for mistakes.
@smerkviladze
Copy link
Copy Markdown

@james-nesbitt Overall, it looks good. Could you please address the Copilot review comments?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants