Run per-ASG AWS calls in parallel across rolling tasks by erimicel · Pull Request #17 · KentaaNL/capistrano-asg-rolling

erimicel · 2026-05-19T15:50:26Z

Summary

The rolling deployment tasks currently run their AWS calls one ASG at a time, but the work is independent per ASG. With N Auto Scaling Groups the wall-clock time scales linearly. This change uses the existing Parallel.run helper so the total time is bounded by the slowest single ASG instead.

What runs in parallel now:

rolling:setup — instance launches (each previously blocked on wait_until_running, ~30-90s)
rolling:update and rolling:trigger_instance_refresh — start_instance_refresh calls
rolling:cleanup — old Launch Template versions, AMIs and snapshots
AMI#delete — EBS snapshot deletions
AutoscaleGroups#launch_templates — per-group launch template lookups

Notes

Dedupe of rolling:setup launches by image_id is preserved by pre-filtering before the parallel run.
The deleted_amis tracker in cleanup is now a Concurrent::Array so AMIs referenced by multiple Launch Templates are still only deleted once.
Parallel.run already aggregates errors (Aggregate errors in Parallel.run instead of abandoning in-flight threads #14), so one failing ASG no longer abandons in-flight work.
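
The error aggregation described in the last note can be sketched roughly as follows. This is not the gem's `Parallel.run`: `ParallelSketch`, `AggregateError` and the block signature are hypothetical names, and the sketch assumes one thread per work item.

```ruby
# Hypothetical stand-in for the gem's Parallel.run helper; the names and the
# signature are assumptions made for illustration only.
module ParallelSketch
  # Raised after all threads have finished if at least one of them failed.
  class AggregateError < StandardError
    attr_reader :errors

    def initialize(errors)
      @errors = errors
      super("#{errors.size} task(s) failed: #{errors.map(&:message).join('; ')}")
    end
  end

  # Runs the block once per item, each in its own thread, and returns the
  # results in input order. Every thread is joined before any error is raised.
  def self.run(items)
    threads = items.map do |item|
      Thread.new do
        [:ok, yield(item)]
      rescue StandardError => e
        [:error, e]
      end
    end

    outcomes = threads.map(&:value) # blocks until each thread has finished
    errors = outcomes.select { |status, _| status == :error }.map(&:last)
    raise AggregateError, errors unless errors.empty?

    outcomes.map(&:last)
  end
end

results = ParallelSketch.run(%w[asg-a asg-b]) { |name| name.upcase }
puts results.inspect
```

Because the error is raised only after every thread has returned, a failure in one item cannot cut short the work of its siblings.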

Test plan

bundle exec rspec (149 examples, 0 failures)
bundle exec rubocop clean
Smoke-test against a multi-ASG deployment

Opening as a draft for review before marking ready.

Line 210 of rolling.rake, introduced in KentaaNL#15, used a U+2014 em-dash inside the new logger.warning for the AWS throttling recovery branch. Capistrano loads .rake files in environments where the Ruby source encoding falls back to US-ASCII, which then aborts the deploy with:

SyntaxError: .../lib/capistrano/asg/tasks/rolling.rake:210: syntax error found
> 210 | ... - retrying on next poll."
^~~ invalid multibyte character 0xE2

Swap the em-dash for an ASCII hyphen so the file stays ASCII-clean regardless of source-encoding defaults on the host.
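
A check along these lines could flag such characters before a deploy hits them. It is a minimal sketch and not part of the PR; `non_ascii_lines` is a hypothetical helper name, and the sample string only imitates the offending log message.

```ruby
# Returns [line_number, line] pairs for every line that contains a non-ASCII
# character. `non_ascii_lines` is a hypothetical helper name.
def non_ascii_lines(source)
  source.each_line.with_index(1)
        .reject { |line, _| line.ascii_only? }
        .map { |line, number| [number, line.chomp] }
end

# The \u2014 escape produces the same character that broke the rake file.
sample = "logger.info 'ok'\nlogger.info \"throttled \u2014 retrying on next poll.\"\n"
non_ascii_lines(sample).each do |number, line|
  puts "line #{number}: #{line}"
end
```

To scan a file on disk, the same helper can be fed `File.binread(path)`, since `String#ascii_only?` also works on binary strings.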

The rolling deployment was running its expensive AWS calls one ASG at a time, even though each call is independent: instance launches in `rolling:setup` waited on `wait_until_running` (~30-90s) per group, `start_instance_refresh` ran serially in both `rolling:update` and `rolling:trigger_instance_refresh`, and `rolling:cleanup` deleted old Launch Template versions, AMIs and EBS snapshots one after another. For a fleet with N ASGs this multiplied wall-clock time by N. Run all of these via the existing `Parallel.run` helper so total time is bounded by the slowest single operation. Dedupe of launches by image_id is preserved by pre-filtering before the parallel run, and the shared `deleted_amis` tracker in cleanup is now a `Concurrent::Array` so AMIs referenced by multiple Launch Templates are still only deleted once. `AutoscaleGroups#launch_templates` resolves each group's launch template concurrently as well.
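
The "only deleted once" guarantee relies on checking and recording an AMI id as a single atomic step. The sketch below shows that idea with the standard library's `Mutex` and `Set` rather than the `Concurrent::Array` the commit uses; `DeletedTracker` and `claim` are hypothetical names.

```ruby
require 'set'

# Hypothetical tracker: lets exactly one caller "claim" each AMI id, so the
# delete call runs once even when several threads reference the same AMI.
class DeletedTracker
  def initialize
    @seen = Set.new
    @lock = Mutex.new
  end

  # Returns true only for the first caller that passes a given id.
  def claim(id)
    @lock.synchronize { !@seen.add?(id).nil? }
  end
end

tracker = DeletedTracker.new
deleted = Queue.new

# Three launch template versions, two of which point at the same AMI.
%w[ami-1 ami-2 ami-1].map do |ami_id|
  Thread.new do
    # Stand-in for the real delete call.
    deleted << ami_id if tracker.claim(ami_id)
  end
end.each(&:join)

puts deleted.size # => 2
```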

The existing snapshot deletion spec only exercised a single-snapshot AMI, which falls through the `image_snapshots.size > 1` guard. Add a stub with two EBS mappings and assert both `DeleteSnapshot` calls are issued.

`Parallel.run` re-raises only after every thread has joined, which meant the post-loop `config.instances << instance` in the previous version was skipped on partial failure. That left successfully-launched EC2 instances unrecorded, so the at_exit cleanup hook could not find them to terminate, and they would survive indefinitely in AWS. Move the registration inside the parallel block (guarded by a mutex, because `Instances#<<` wraps a plain Array) so every successful launch is tracked the moment it completes. The Capistrano `server` registration stays after the parallel run because the DSL is not safe to call from multiple threads.
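
The effect of registering inside the parallel block can be illustrated with plain threads. This is a sketch under assumptions: `launch_all` and `launcher` are hypothetical stand-ins for the rake task and the EC2 launch call, and the `tracked` array plays the role of the registry read by the at_exit hook.

```ruby
# Hypothetical sketch: `launcher` stands in for the real EC2 launch call and
# the returned `tracked` array for the registry that the at_exit hook reads.
def launch_all(groups, launcher)
  tracked = []
  lock = Mutex.new

  threads = groups.map do |group|
    Thread.new do
      instance = launcher.call(group)
      # Register immediately, so a failure in another thread cannot
      # prevent this instance from being tracked.
      lock.synchronize { tracked << instance }
      nil
    rescue StandardError => e
      e
    end
  end

  errors = threads.map(&:value).compact
  [tracked, errors]
end

launcher = lambda do |group|
  raise "launch failed for #{group}" if group == 'asg-b'

  "i-#{group}"
end

tracked, errors = launch_all(%w[asg-a asg-b asg-c], launcher)
puts "tracked: #{tracked.sort.join(', ')}"
puts "failed: #{errors.map(&:message).join('; ')}"
```

Even though one launch fails, the two successful instances end up in `tracked`, which is the property the cleanup hook depends on.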

Parallel rolling:setup made every instance finish wait_until_running at roughly the same moment, but sshd typically only binds port 22 some 20-40s later once cloud-init completes. The subsequent SSH probe loop in the rake task then spun through dozens of 'Connection refused' retries waiting for sshd. Add a `wait_until(:instance_status_ok)` after `wait_until_running` so the launch only returns once AWS reports system + instance status as 'ok' (i.e. boot has finished). This shifts the wait inside the parallel launch block, which still runs concurrently, and lets the SSH probe succeed on its first attempt. A waiter timeout is downgraded to a warning so the SSH probe (which has its own 5-minute budget) can still catch genuine failures.
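
For reference, the condition the SSH probe is waiting for (something accepting connections on port 22) can be checked with a plain TCP connect. The rake task itself goes through Net::SSH; the following is a standard-library stand-in, and `port_open?` and `wait_for_port` are hypothetical names.

```ruby
require 'socket'

# True if something accepts TCP connections on host:port right now.
def port_open?(host, port, timeout: 1)
  Socket.tcp(host, port, connect_timeout: timeout) { true }
rescue SystemCallError, SocketError, IOError
  # Connection refused, unreachable host, timeout and similar failures.
  false
end

# Polls port_open? until it succeeds or the budget (in seconds) runs out.
def wait_for_port(host, port, budget: 300, interval: 1)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + budget
  until port_open?(host, port)
    return false if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline

    sleep interval
  end
  true
end
```

A refused connection is treated as "not ready yet" rather than as an error, which is why the retries are harmless even when they are noisy in the log.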

A QA deploy with the status-check waiter in place showed the pre-deploy phase actually got slower, not faster: clean log (no SSH retry noise) but the wall-clock for rolling:setup grew by roughly 60 seconds compared to the version with the loud SSH probe loop.

Two reasons the waiter loses:

* `instance_status_ok` polls AWS every 15 seconds by default, while the SSH probe in `rolling:setup` polls the instance directly every 1 second. The SSH probe sees sshd become available much sooner.
* The waiter requires both system and instance status checks to report 'ok'. That is a higher bar than 'sshd is bound to port 22', which is all the deploy actually needs.

The "Connection refused" lines were stderr chatter from Net::SSH; the underlying loop was already correct and fast. Reverting the change so we keep the cheap probe and the better wall-clock time.
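
The polling-interval part of this can be made concrete with a little arithmetic. The figures below are assumed for illustration (sshd binding at t = 22s is not a measured value), and `detection_time` is a hypothetical helper.

```ruby
# Time at which a poller that checks every `interval` seconds (starting at
# t = 0) first observes a condition that becomes true at `ready_at` seconds.
def detection_time(ready_at, interval)
  (ready_at.to_f / interval).ceil * interval
end

# Assumed figure for illustration: sshd binds at t = 22s.
puts detection_time(22, 1)   # => 22
puts detection_time(22, 15)  # => 30
```

The coarser interval alone can add up to one full period of delay per instance. The rest of the observed slowdown would come from the second reason: the status checks turn 'ok' later than sshd binds its port.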

erimicel · 2026-05-19T20:46:59Z

It saves 20-30 seconds when 4 ASGs run at the same time, but sometimes the gain is trivial. Happy with any decision you make on this 🙏

ppostma · 2026-05-22T14:16:02Z

Thanks for your PR! Definitely an interesting idea to parallelize more tasks. I think parallel instance launches would probably provide the biggest benefit here. For the other tasks, I’m not yet convinced the gains outweigh the added complexity.

At the moment, the code in rolling.rake is completely untested. Before making changes to this logic, I’d really like us to add proper test coverage first. AI could probably help generate an initial set of tests to get us started.

Once we have tests in place, I think it makes sense to start by parallelizing the instance launches first, and then evaluate whether it’s worth applying the same approach to the other tasks as well.

@erimicel

Launch instances for rolling Auto Scaling Groups in parallel using `Parallel.run`, reducing setup time when multiple groups are configured. Groups sharing the same launch template image are deduplicated via a new `AutoscaleGroups#with_unique_images` method, which replaces the previous per-image tracking check. Two new `AutoscaleGroups` methods — `#rolling` and `#standard` — replace the inline partition loop, making the rake task a thin orchestration script. `Instances#<<` is now backed by `Concurrent::Array` for thread-safe concurrent appends. `Configuration` state is eagerly initialized at module load time, eliminating lazy-init races without requiring synchronization. Thanks to @erimicel for the original idea and PR (#17).
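
As a rough model of the three methods named above, the sketch below filters a list of groups with `select`, `reject` and `uniq`. It is not the gem's code: `Group`, `GroupList` and the `rolling` flag are assumptions made for illustration, and the real `AutoscaleGroups` class may be structured differently.

```ruby
# Hypothetical model of the methods named above; the real classes in the gem
# may differ. `Group` and its fields are assumptions for illustration.
Group = Struct.new(:name, :image_id, :rolling, keyword_init: true)

class GroupList
  include Enumerable

  def initialize(groups)
    @groups = groups
  end

  def each(&block)
    @groups.each(&block)
  end

  # Groups flagged as rolling.
  def rolling
    self.class.new(select(&:rolling))
  end

  # Groups not flagged as rolling.
  def standard
    self.class.new(reject(&:rolling))
  end

  # Keeps the first group seen for each image id.
  def with_unique_images
    self.class.new(uniq(&:image_id))
  end
end

groups = GroupList.new([
  Group.new(name: 'web', image_id: 'ami-1', rolling: true),
  Group.new(name: 'worker', image_id: 'ami-1', rolling: true),
  Group.new(name: 'cron', image_id: 'ami-2', rolling: false)
])
puts groups.rolling.with_unique_images.map(&:name).join(', ')
```

Deduplicating before the parallel run means no shared "already launched" state has to be consulted from inside the threads.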

erimicel added 2 commits May 19, 2026 16:22

erimicel force-pushed the speed-improvements branch from b132ea6 to 42cd92a May 19, 2026 15:51

erimicel added 4 commits May 19, 2026 16:53

Test AMI#delete parallel snapshot path.

3aa9615


erimicel marked this pull request as ready for review May 19, 2026 20:47

Merge branch 'KentaaNL:master' into speed-improvements

e695525

erimicel wants to merge 7 commits into KentaaNL:master from OLIOEX:speed-improvements
