Run per-ASG AWS calls in parallel across rolling tasks #17
Line 210 of rolling.rake, introduced in KentaaNL#15, used a U+2014 em-dash inside the new logger.warning for the AWS throttling recovery branch. Capistrano loads .rake files in environments where the Ruby source encoding falls back to US-ASCII, which then aborts the deploy with:
    SyntaxError: .../lib/capistrano/asg/tasks/rolling.rake:210: syntax error found
    > 210 | ... - retrying on next poll."
                ^~~ invalid multibyte character 0xE2
Swap the em-dash for an ASCII hyphen so the file stays ASCII-clean regardless of source-encoding defaults on the host.
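A check along the following lines could catch a stray non-ASCII character before it reaches a deploy host. It is only a sketch and not part of this PR: `non_ascii_lines` is a hypothetical helper name, and the sample string stands in for the contents of a `.rake` file.

```ruby
# Hypothetical helper (not part of the gem): returns [line_number, text]
# pairs for every line that contains a non-ASCII byte.
def non_ascii_lines(source)
  source.each_line.with_index(1)
        .reject { |line, _number| line.ascii_only? }
        .map { |line, number| [number, line.chomp] }
end

# Illustrative input; in practice this would be File.read('path/to/file.rake').
sample = "logger.info 'ok - fine'\nlogger.info 'bad \u2014 dash'\n"

non_ascii_lines(sample).each do |number, text|
  puts "line #{number}: #{text}"
end
```

Run over the task files in CI, a check like this would flag the offending line number instead of failing at load time on the host.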
The rolling deployment was running its expensive AWS calls one ASG at a time, even though each call is independent: instance launches in `rolling:setup` waited on `wait_until_running` (~30-90s) per group, `start_instance_refresh` ran serially in both `rolling:update` and `rolling:trigger_instance_refresh`, and `rolling:cleanup` deleted old Launch Template versions, AMIs and EBS snapshots one after another. For a fleet with N ASGs this multiplied wall-clock time by N. Run all of these via the existing `Parallel.run` helper so total time is bounded by the slowest single operation. Dedupe of launches by image_id is preserved by pre-filtering before the parallel run, and the shared `deleted_amis` tracker in cleanup is now a `Concurrent::Array` so AMIs referenced by multiple Launch Templates are still only deleted once. `AutoscaleGroups#launch_templates` resolves each group's launch template concurrently as well.
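The shape of this change can be sketched with plain threads. The snippet below is an illustration under stated assumptions, not the gem's code: `run_in_parallel` is a hypothetical stand-in for `Parallel.run` (whose real signature is not shown in this PR), the group hashes are made-up data, and `sleep` stands in for the blocking `wait_until_running` call.

```ruby
# Hypothetical stand-in for the gem's Parallel.run helper: one thread per
# item, join every thread, then re-raise one of the collected errors.
def run_in_parallel(items)
  errors = Queue.new
  threads = items.map do |item|
    Thread.new do
      begin
        yield item
      rescue StandardError => e
        errors << e
      end
    end
  end
  threads.each(&:join)
  raise errors.pop unless errors.empty?
end

# Launches one (simulated) instance per distinct image and reports timing.
def launch_unique(groups, delay: 0.3)
  # Pre-filter before going parallel so no two threads launch the same image.
  unique = groups.uniq { |group| group[:image_id] }
  launched = Queue.new
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  run_in_parallel(unique) do |group|
    sleep delay # stands in for the blocking AWS wait
    launched << group[:image_id]
  end
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  [Array.new(launched.size) { launched.pop }.sort, elapsed]
end

# Illustrative data: two groups share an image, so only two launches happen.
groups = [
  { name: 'web',    image_id: 'ami-1' },
  { name: 'worker', image_id: 'ami-1' },
  { name: 'cron',   image_id: 'ami-2' }
]

ids, elapsed = launch_unique(groups)
puts "launched #{ids.inspect} in #{elapsed.round(2)}s"
```

Because the deduplication happens before any thread starts, the parallel block itself needs no coordination about which image has already been launched.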
b132ea6 to 42cd92a
The existing snapshot deletion spec only exercised a single-snapshot AMI, which falls through the `image_snapshots.size > 1` guard. Add a stub with two EBS mappings and assert both `DeleteSnapshot` calls are issued.
`Parallel.run` re-raises only after every thread has joined, which meant the post-loop `config.instances << instance` in the previous version was skipped on partial failure. That left successfully-launched EC2 instances unrecorded, so the at_exit cleanup hook could not find them to terminate, and they would survive indefinitely in AWS. Move the registration inside the parallel block (guarded by a mutex, because `Instances#<<` wraps a plain Array) so every successful launch is tracked the moment it completes. The Capistrano `server` registration stays after the parallel run because the DSL is not safe to call from multiple threads.
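The failure mode and the fix can be reproduced in a few lines. This is a minimal sketch, assuming a plain Array as the registry and a block that plays the role of the launch call; `launch_all` and the `i-...` identifiers are hypothetical and do not come from the gem.

```ruby
# Hypothetical sketch: each thread registers its instance as soon as the
# launch succeeds, so a failure elsewhere cannot cause it to be forgotten.
def launch_all(image_ids, registry)
  lock = Mutex.new     # the registry is a plain Array, so appends are locked
  errors = Queue.new

  threads = image_ids.map do |image_id|
    Thread.new do
      begin
        instance = yield(image_id)                # may raise
        lock.synchronize { registry << instance } # recorded immediately
      rescue StandardError => e
        errors << e
      end
    end
  end

  threads.each(&:join)
  # Raising only after every thread has joined mirrors the behaviour
  # described for Parallel.run.
  raise errors.pop unless errors.empty?
  registry
end

registry = []
begin
  launch_all(%w[ami-1 ami-bad ami-2], registry) do |image_id|
    raise ArgumentError, "cannot launch #{image_id}" if image_id == 'ami-bad'
    "i-#{image_id}"
  end
rescue ArgumentError => e
  warn "launch failed: #{e.message}"
end

# The two successful launches are still tracked and can be cleaned up.
puts registry.sort.inspect
```

Had the append been placed after `launch_all` returned, the raised error would have skipped it and the registry would be empty.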
Parallel rolling:setup made every instance finish wait_until_running at roughly the same moment, but sshd typically only binds port 22 some 20-40s later once cloud-init completes. The subsequent SSH probe loop in the rake task then spun through dozens of 'Connection refused' retries waiting for sshd. Add a `wait_until(:instance_status_ok)` after `wait_until_running` so the launch only returns once AWS reports system + instance status as 'ok' (i.e. boot has finished). This shifts the wait inside the parallel launch block, which still runs concurrently, and lets the SSH probe succeed on its first attempt. A waiter timeout is downgraded to a warning so the SSH probe (which has its own 5-minute budget) can still catch genuine failures.
A QA deploy with the status-check waiter in place showed the pre-deploy
phase actually got slower, not faster: clean log (no SSH retry noise) but
the wall-clock for rolling:setup grew by roughly 60 seconds compared to
the version with the loud SSH probe loop.
Two reasons the waiter loses:
* `instance_status_ok` polls AWS every 15 seconds by default, while the
SSH probe in `rolling:setup` polls the instance directly every 1
second. The SSH probe sees sshd become available much sooner.
* The waiter requires both system and instance status checks to report
'ok'. That is a higher bar than 'sshd is bound to port 22', which is
all the deploy actually needs.
The "Connection refused" lines were stderr chatter from Net::SSH; the
underlying loop was already correct and fast. Reverting the change so we
keep the cheap probe and the better wall-clock time.
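The difference between the two approaches comes down to polling the instance directly at a short interval. The sketch below shows that idea with a bare TCP connect loop from Ruby's standard library. It is an approximation: the actual task probes over Net::SSH, and `wait_for_port` and its parameters are hypothetical names chosen for illustration, with the 5-minute default taken from the budget mentioned earlier.

```ruby
require 'socket'

# Hypothetical helper: polls host:port until a TCP connection succeeds or
# the time budget runs out. Returns true on success, false on timeout.
def wait_for_port(host, port, timeout: 300, interval: 1)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    begin
      # Socket.tcp closes the socket when the block exits.
      Socket.tcp(host, port, connect_timeout: interval) { return true }
    rescue SystemCallError
      # Connection refused, unreachable or timed out: not listening yet.
    end
    return false if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline

    sleep interval
  end
end

# Demonstration against a throwaway local listener instead of a real host.
server = TCPServer.new('127.0.0.1', 0)
port = server.addr[1]
puts "port #{port} reachable: #{wait_for_port('127.0.0.1', port, timeout: 2, interval: 0.1)}"
server.close
```

A probe like this returns as soon as the port accepts connections, which is the only condition the deploy depends on, whereas the status-check waiter also has to wait for its next polling interval.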
It saves 20-30 seconds when 4 ASGs run at the same time, but sometimes the gain is trivial. Happy with any decision you make on this 🙏
Thanks for your PR! Definitely an interesting idea to parallelize more tasks. I think parallel instance launches would probably provide the biggest benefit here. For the other tasks, I’m not yet convinced the gains outweigh the added complexity. At the moment, the code in Once we have tests in place, I think it makes sense to start by parallelizing the instance launches first, and then evaluate whether it’s worth applying the same approach to the other tasks as well.
Launch instances for rolling Auto Scaling Groups in parallel using `Parallel.run`, reducing setup time when multiple groups are configured. Groups sharing the same launch template image are deduplicated via a new `AutoscaleGroups#with_unique_images` method, which replaces the previous per-image tracking check. Two new `AutoscaleGroups` methods — `#rolling` and `#standard` — replace the inline partition loop, making the rake task a thin orchestration script. `Instances#<<` is now backed by `Concurrent::Array` for thread-safe concurrent appends. `Configuration` state is eagerly initialized at module load time, eliminating lazy-init races without requiring synchronization. Thanks to @erimicel for the original idea and PR (#17).
Summary
The rolling deployment tasks currently run their AWS calls one ASG at a time, but the work is independent per ASG. With N Auto Scaling Groups the wall-clock time scales linearly. This change uses the existing `Parallel.run` helper so the total time is bounded by the slowest single ASG instead.
What runs in parallel now:
* `rolling:setup`: instance launches (each previously blocked on `wait_until_running`, ~30-90s)
* `rolling:update` and `rolling:trigger_instance_refresh`: `start_instance_refresh` calls
* `rolling:cleanup`: old Launch Template versions, AMIs and snapshots
* `AMI#delete`: EBS snapshot deletions
* `AutoscaleGroups#launch_templates`: per-group launch template lookups
Notes
* Dedupe of `rolling:setup` launches by image_id is preserved by pre-filtering before the parallel run.
* The shared `deleted_amis` tracker in cleanup is now a `Concurrent::Array` so AMIs referenced by multiple Launch Templates are still only deleted once.
* `Parallel.run` already aggregates errors (Aggregate errors in Parallel.run instead of abandoning in-flight threads #14), so one failing ASG no longer abandons in-flight work.
Test plan
* `bundle exec rspec` (149 examples, 0 failures)
* `bundle exec rubocop` clean
Opening as a draft for review before marking ready.