Add check_aws_events health_checks plugin (IMDSv2 maintenance event poll) (#139)
Merged
meta-codesync[bot] merged 1 commit into facebookresearch:main on May 7, 2026
@kenerwin88 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D104060213.
@kenerwin88 thanks for the contribution! Overall LGTM, just 2 things:
…oll) (facebookresearch#139)

Summary:
Adds `check-aws-events`, a new GCM `health_checks` Click subcommand that polls EC2 IMDSv2 (`/latest/meta-data/events/maintenance/scheduled`) for pending instance maintenance / retirement events scheduled against the local node. It surfaces them as a node condition via NPD's exit-code translation, so operators can drain / cordon / replace the instance ahead of AWS's enforced `NotBefore` rather than letting workloads be killed when AWS rotates the host.

Endpoint, method, and headers match the AWS IMDSv2 spec used by `aws-node-termination-handler` for the scheduled-events endpoint (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-instances-status-check_sched.html#viewing_scheduled_events).

Conservative fail-safe semantics: any transport / unreachable / non-list / non-dict / malformed-payload response returns `ExitCode.OK`, so a transient IMDS blip can never trigger a fleet-wide drain. Only a `200 + non-empty events array` exits `WARN` with a one-line summary (`Code NotBefore=... State=... EventId=...`), which NPD records as the condition message.

Files:
- `gcm/health_checks/checks/check_aws_events.py` (new): Click subcommand plus two helpers (`fetch_imds_token`, `fetch_scheduled_events`), for which the test suite injects fakes via `click.pass_obj`. Always passes `proxies={"http": "", "https": ""}` so that `requests` never honors `HTTP_PROXY` and accidentally queries a proxy server's metadata instead of the local instance's.
- `gcm/health_checks/checks/__init__.py` + `gcm/health_checks/cli/health_checks.py`: register the new subcommand.
- `gcm/schemas/health_check/health_check_name.py`: new `HealthCheckName.CHECK_AWS_EVENTS = "check aws events"` for telemetry.
- `gcm/tests/health_checks_tests/test_check_aws_events.py` (new): 19 tests covering token fetch (happy / off-EC2 / 5xx / empty body / proxies bypass / trailing-slash), events response (200-empty / 404 / one-pending / multi-pending / non-list / non-dict / unreachable / 5xx / garbage / proxies bypass / trailing-slash), and full Click command exit codes (off-EC2 → OK, pending → WARN with summary).
- `BUCK`: add `requests` to the `:health_checks` library deps, and `requests` + `requests-mock` to `:health_checks_pytest`.

Differential Revision: D104060213
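The fail-safe semantics described in the summary can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from `check_aws_events.py`: the `ExitCode` enum and its numeric values, and the `classify`, `summarize`, and `fetch` names, are hypothetical stand-ins. The sketch uses the standard library's `urllib` with an empty `ProxyHandler` in place of `requests` with `proxies={"http": "", "https": ""}`, and it treats a list containing any non-dict entry as malformed, which is one possible reading of "non-dict". The link-local IMDS address and the two IMDSv2 token headers are the standard ones.

```python
from __future__ import annotations

import json
import urllib.error
import urllib.request
from enum import Enum
from typing import Optional, Tuple

IMDS = "http://169.254.169.254"  # standard link-local IMDS address


class ExitCode(Enum):
    # Hypothetical stand-in for GCM's ExitCode; the real values are not shown in this PR.
    OK = 0
    WARN = 1


def summarize(event: dict) -> str:
    """Render one event as `Code NotBefore=... State=... EventId=...`."""
    return "{} NotBefore={} State={} EventId={}".format(
        event.get("Code", "unknown"),
        event.get("NotBefore", "unknown"),
        event.get("State", "unknown"),
        event.get("EventId", "unknown"),
    )


def classify(status: Optional[int], body: str) -> Tuple[ExitCode, str]:
    """Map an IMDS response to an exit code; status=None models a transport failure."""
    if status != 200:
        return ExitCode.OK, "no usable IMDS response"
    try:
        payload = json.loads(body)
    except ValueError:  # json.JSONDecodeError is a ValueError subclass
        return ExitCode.OK, "unparseable payload"
    if not isinstance(payload, list):
        return ExitCode.OK, "payload is not a list"
    if not all(isinstance(entry, dict) for entry in payload):
        # Assumption: any non-dict entry makes the whole payload malformed.
        return ExitCode.OK, "payload contains non-dict entries"
    if not payload:
        return ExitCode.OK, "no scheduled events"
    return ExitCode.WARN, "; ".join(summarize(entry) for entry in payload)


def fetch(timeout: float = 2.0) -> Tuple[Optional[int], str]:
    """Fetch scheduled events via IMDSv2, ignoring any proxy environment variables."""
    # An empty ProxyHandler disables proxy lookup, so HTTP_PROXY is never honored.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        token_request = urllib.request.Request(
            IMDS + "/latest/api/token",
            method="PUT",
            headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
        )
        with opener.open(token_request, timeout=timeout) as response:
            token = response.read().decode().strip()
        if not token:
            return None, ""  # empty token body: treat as unreachable
        events_request = urllib.request.Request(
            IMDS + "/latest/meta-data/events/maintenance/scheduled",
            headers={"X-aws-ec2-metadata-token": token},
        )
        with opener.open(events_request, timeout=timeout) as response:
            return response.status, response.read().decode()
    except urllib.error.HTTPError as exc:
        return exc.code, ""  # 4xx / 5xx: classify() maps these to OK
    except (OSError, ValueError):
        return None, ""  # off-EC2, timeout, or undecodable body
```

In this sketch a caller would run `classify(*fetch())` and exit with the resulting code. Keeping `classify` free of I/O is what lets every failure mode in the test list be exercised without touching the network.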
faab1b9 to 454af8a
Added killswitch and docs :). Ty @luccabb!
c2396d3 into facebookresearch:main
21 of 23 checks passed