Run per-ASG AWS calls in parallel across rolling tasks #17

Open
erimicel wants to merge 7 commits into KentaaNL:master from OLIOEX:speed-improvements

@erimicel

Contributor

Summary

The rolling deployment tasks currently run their AWS calls one ASG at a time, but the work is independent per ASG. With N Auto Scaling Groups the wall-clock time scales linearly. This change uses the existing Parallel.run helper so the total time is bounded by the slowest single ASG instead.

What runs in parallel now:

  • rolling:setup — instance launches (each previously blocked on wait_until_running, ~30-90s)
  • rolling:update and rolling:trigger_instance_refresh - start_instance_refresh calls
  • rolling:cleanup — old Launch Template versions, AMIs and snapshots
  • AMI#delete — EBS snapshot deletions
  • AutoscaleGroups#launch_templates — per-group launch template lookups

Notes

  • Dedupe of rolling:setup launches by image_id is preserved by pre-filtering before the parallel run.
  • The deleted_amis tracker in cleanup is now a Concurrent::Array so AMIs referenced by multiple Launch Templates are still only deleted once.
  • Parallel.run already aggregates errors (#14, "Aggregate errors in Parallel.run instead of abandoning in-flight threads"), so one failing ASG no longer abandons in-flight work.
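
To illustrate the behaviour the summary and notes rely on, here is a minimal sketch of a thread-per-item runner that waits for every thread and raises the collected failures together. `SketchParallel` and `AggregateError` are hypothetical names written for this example with the standard library only; the gem's actual `Parallel.run` is not shown in this PR and may differ.

```ruby
# Hypothetical stand-in for the gem's Parallel.run helper (stdlib only).
module SketchParallel
  class AggregateError < StandardError
    attr_reader :errors

    def initialize(errors)
      @errors = errors
      super("#{errors.size} task(s) failed: #{errors.map(&:message).join('; ')}")
    end
  end

  # Runs the block once per item, each in its own thread, and waits for all
  # of them. Failures are collected and raised together only after every
  # thread has finished, so one failing item does not abandon the others.
  def self.run(items)
    errors = Queue.new
    threads = items.map do |item|
      Thread.new do
        yield item
      rescue StandardError => e
        errors << e
      end
    end
    threads.each(&:join)
    return if errors.empty?

    raise AggregateError.new(Array.new(errors.size) { errors.pop })
  end
end

# Three simulated per-ASG calls of 0.2s each finish in roughly 0.2s overall,
# not 0.6s, because the total is bounded by the slowest single call.
finished = Queue.new
started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
SketchParallel.run(%w[asg-a asg-b asg-c]) do |name|
  sleep 0.2
  finished << name
end
elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
puts format('%d groups done in %.2fs', finished.size, elapsed)
```

The sleeps stand in for AWS calls; real gains depend on how much of each task is spent waiting on the network.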

Test plan

  • bundle exec rspec (149 examples, 0 failures)
  • bundle exec rubocop clean
  • Smoke-test against a multi-ASG deployment

Opening as a draft for review before marking ready.

erimicel added 2 commits May 19, 2026 16:22
Line 210 of rolling.rake, introduced in KentaaNL#15, used a U+2014 em-dash
inside the new logger.warning for the AWS throttling recovery branch.
Capistrano loads .rake files in environments where the Ruby source
encoding falls back to US-ASCII, which then aborts the deploy with:

    SyntaxError: .../lib/capistrano/asg/tasks/rolling.rake:210:
    syntax error found
    > 210 | ... - retrying on next poll."
                ^~~ invalid multibyte character 0xE2

Swap the em-dash for an ASCII hyphen so the file stays ASCII-clean
regardless of source-encoding defaults on the host.
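
A quick way to catch this class of problem before deploying is to scan a source file for bytes outside 7-bit ASCII. The helper below is a sketch written for this discussion (`non_ascii_line_numbers` is a hypothetical name, not part of the gem), and the sample text only imitates the offending line.

```ruby
# Returns the 1-based numbers of all lines that contain a non-ASCII byte.
def non_ascii_line_numbers(source)
  source.b.each_line.with_index(1)
        .reject { |line, _number| line.ascii_only? }
        .map { |_line, number| number }
end

# "\u2014" is the em-dash; its UTF-8 encoding starts with byte 0xE2, the
# byte named in the SyntaxError above.
sample = "puts 'ok'\nputs \"throttled \u2014 retrying on next poll.\"\n"
offending = non_ascii_line_numbers(sample)
puts "non-ASCII bytes on line(s): #{offending.join(', ')}"
```

In practice the input would come from something like `File.binread('lib/capistrano/asg/tasks/rolling.rake')`.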

The rolling deployment was running its expensive AWS calls one ASG
at a time, even though each call is independent: instance launches in
`rolling:setup` waited on `wait_until_running` (~30-90s) per group,
`start_instance_refresh` ran serially in both `rolling:update` and
`rolling:trigger_instance_refresh`, and `rolling:cleanup` deleted old
Launch Template versions, AMIs and EBS snapshots one after another.
For a fleet with N ASGs this multiplied wall-clock time by N.

Run all of these via the existing `Parallel.run` helper so total time
is bounded by the slowest single operation. Dedupe of launches by
image_id is preserved by pre-filtering before the parallel run, and
the shared `deleted_amis` tracker in cleanup is now a `Concurrent::Array`
so AMIs referenced by multiple Launch Templates are still only deleted
once. `AutoscaleGroups#launch_templates` resolves each group's launch
template concurrently as well.
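
The "delete each AMI only once" requirement can be sketched as follows. This is not the code from the commit: it swaps the `Concurrent::Array` mentioned above for a stdlib `Set` guarded by a `Mutex`, so that checking and recording an id happen as one atomic step. `DeletedTracker` and all ids are made up for illustration.

```ruby
require 'set'

# Hypothetical tracker: claim(id) returns true for exactly one caller per id.
class DeletedTracker
  def initialize
    @seen = Set.new
    @lock = Mutex.new
  end

  def claim(id)
    # Set#add? returns nil when the id was already present.
    @lock.synchronize { !@seen.add?(id).nil? }
  end
end

# Launch Template versions and the AMI each one references (made-up ids).
versions = { 'lt-1:v3' => 'ami-aaa', 'lt-2:v7' => 'ami-aaa', 'lt-3:v2' => 'ami-bbb' }
tracker = DeletedTracker.new
deleted = Queue.new

threads = versions.map do |_version, ami_id|
  # Only the thread that wins the claim would issue the delete call.
  Thread.new { deleted << ami_id if tracker.claim(ami_id) }
end
threads.each(&:join)

deleted_ids = Array.new(deleted.size) { deleted.pop }.sort
puts "deleted once each: #{deleted_ids.join(', ')}"
```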
@erimicel force-pushed the speed-improvements branch from b132ea6 to 42cd92a on May 19, 2026 15:51
erimicel added 4 commits May 19, 2026 16:53
The existing snapshot deletion spec only exercised a single-snapshot AMI,
which falls through the `image_snapshots.size > 1` guard. Add a stub with
two EBS mappings and assert both `DeleteSnapshot` calls are issued.

`Parallel.run` re-raises only after every thread has joined, which meant
the post-loop `config.instances << instance` in the previous version was
skipped on partial failure. That left successfully-launched EC2 instances
unrecorded, so the at_exit cleanup hook could not find them to terminate,
and they would survive indefinitely in AWS.

Move the registration inside the parallel block (guarded by a mutex,
because `Instances#<<` wraps a plain Array) so every successful launch
is tracked the moment it completes. The Capistrano `server` registration
stays after the parallel run because the DSL is not safe to call from
multiple threads.
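
The difference this commit describes can be sketched with plain threads. Everything here is illustrative: `launched` stands in for `config.instances`, and the `launch` lambda is a hypothetical launcher that fails for one group.

```ruby
launched = []   # stands in for config.instances, which wraps a plain Array
errors = []
lock = Mutex.new

# Hypothetical launcher: fails for one group, succeeds for the others.
launch = lambda do |group|
  raise "launch failed for #{group}" if group == 'asg-b'

  "i-#{group}"
end

threads = %w[asg-a asg-b asg-c].map do |group|
  Thread.new do
    instance = launch.call(group)
    # Record inside the thread, as soon as the instance exists.
    lock.synchronize { launched << instance }
  rescue StandardError => e
    lock.synchronize { errors << e }
  end
end
threads.each(&:join)

# One launch failed, yet both successful instances are tracked, so a cleanup
# hook could still find and terminate them.
puts "tracked: #{launched.sort.join(', ')}; failures: #{errors.size}"
```

Had the append happened only after the joins and after re-raising, the two successful instances would never have been recorded.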

Parallel rolling:setup made every instance finish wait_until_running at
roughly the same moment, but sshd typically only binds port 22 some
20-40s later once cloud-init completes. The subsequent SSH probe loop
in the rake task then spun through dozens of 'Connection refused'
retries waiting for sshd.

Add a `wait_until(:instance_status_ok)` after `wait_until_running` so
the launch only returns once AWS reports system + instance status as
'ok' (i.e. boot has finished). This shifts the wait inside the parallel
launch block, which still runs concurrently, and lets the SSH probe
succeed on its first attempt. A waiter timeout is downgraded to a
warning so the SSH probe (which has its own 5-minute budget) can still
catch genuine failures.

A QA deploy with the status-check waiter in place showed the pre-deploy
phase actually got slower, not faster: clean log (no SSH retry noise) but
the wall-clock for rolling:setup grew by roughly 60 seconds compared to
the version with the loud SSH probe loop.

Two reasons the waiter loses:

  * `instance_status_ok` polls AWS every 15 seconds by default, while the
    SSH probe in `rolling:setup` polls the instance directly every 1
    second. The SSH probe sees sshd become available much sooner.
  * The waiter requires both system and instance status checks to report
    'ok'. That is a higher bar than 'sshd is bound to port 22', which is
    all the deploy actually needs.

The "Connection refused" lines were stderr chatter from Net::SSH; the
underlying loop was already correct and fast. Reverting the change so we
keep the cheap probe and the better wall-clock time.
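
The first reason (polling interval) can be put in numbers with a small model. It assumes polls happen at t = 0, interval, 2 * interval, ... and that readiness occurs at a single instant; the 37-second figure is a made-up example, not a measurement from the QA deploy. The model does not cover the second reason, the stricter readiness bar of the status checks.

```ruby
# Time at which a poller first notices readiness, assuming polls at
# 0, interval, 2 * interval, ... seconds and readiness at ready_at seconds.
def detected_at(ready_at, interval)
  (ready_at.to_f / interval).ceil * interval
end

ready_at = 37                     # hypothetical: sshd binds port 22 at t = 37s
slow = detected_at(ready_at, 15)  # 15s polling, the waiter default cited above
fast = detected_at(ready_at, 1)   # 1s polling, like the SSH probe
puts "15s polling notices at t=#{slow}s, 1s polling at t=#{fast}s"
# prints "15s polling notices at t=45s, 1s polling at t=37s"
```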
@erimicel

Contributor Author

It saves 20-30 seconds when 4 ASGs run at the same time, but sometimes the gain is trivial. Happy with any decision you make on this 🙏

@erimicel erimicel marked this pull request as ready for review May 19, 2026 20:47
@ppostma

ppostma commented May 22, 2026

Member

Thanks for your PR! Definitely an interesting idea to parallelize more tasks. I think parallel instance launches would probably provide the biggest benefit here. For the other tasks, I’m not yet convinced the gains outweigh the added complexity.

At the moment, the code in rolling.rake is completely untested. Before making changes to this logic, I’d really like us to add proper test coverage first. AI could probably help generate an initial set of tests to get us started.

Once we have tests in place, I think it makes sense to start by parallelizing the instance launches first, and then evaluate whether it’s worth applying the same approach to the other tasks as well.

ppostma added a commit that referenced this pull request Jun 5, 2026
Launch instances for rolling Auto Scaling Groups in parallel using
`Parallel.run`, reducing setup time when multiple groups are configured.
Groups sharing the same launch template image are deduplicated via a new
`AutoscaleGroups#with_unique_images` method, which replaces the previous
per-image tracking check.

Two new `AutoscaleGroups` methods — `#rolling` and `#standard` — replace
the inline partition loop, making the rake task a thin orchestration
script. `Instances#<<` is now backed by `Concurrent::Array` for
thread-safe concurrent appends.

`Configuration` state is eagerly initialized at module load time,
eliminating lazy-init races without requiring synchronization.

Thanks to @erimicel for the original idea and PR (#17).
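
A rough sketch of what the three collection methods named in this commit message might look like. The method names `rolling`, `standard` and `with_unique_images` come from the message; `Group`, `GroupList`, the `rolling` flag and `names` are hypothetical, since the real `AutoscaleGroups` implementation is not shown here.

```ruby
# Hypothetical value object standing in for an Auto Scaling Group.
Group = Struct.new(:name, :image_id, :rolling, keyword_init: true)

# Hypothetical collection; each filter returns a new collection so calls chain.
class GroupList
  def initialize(groups)
    @groups = groups
  end

  def rolling
    GroupList.new(@groups.select(&:rolling))
  end

  def standard
    GroupList.new(@groups.reject(&:rolling))
  end

  # Keeps the first group seen for each image id.
  def with_unique_images
    GroupList.new(@groups.uniq(&:image_id))
  end

  def names
    @groups.map(&:name)
  end
end

groups = GroupList.new([
  Group.new(name: 'web', image_id: 'ami-1', rolling: true),
  Group.new(name: 'worker', image_id: 'ami-1', rolling: true),
  Group.new(name: 'cron', image_id: 'ami-2', rolling: false)
])

to_launch = groups.rolling.with_unique_images.names
puts "launch one instance for: #{to_launch.join(', ')}"
```

Filtering before the parallel run means no shared "already launched" state has to be consulted from inside the threads.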
2 participants