feat: detect endpoint-switch deployments and persist rollback metadata (Step 1)#1776
feat: detect endpoint-switch deployments and persist rollback metadata (Step 1)#1776Vipul-5517 wants to merge 1 commit intomainfrom
Conversation
| Path filePath = getDeploymentDirectoryPath().resolve(ENDPOINT_SWITCH_METADATA_FILE); | ||
| AtomicReference<EndpointSwitchMetadata> ref = new AtomicReference<>(); | ||
| CommitableReader.of(filePath).read(in -> { | ||
| ref.set(SerializerFactory.getFailSafeJsonObjectMapper().readValue(in, EndpointSwitchMetadata.class)); |
There was a problem hiding this comment.
Recommendation generated by Amazon CodeGuru Reviewer. Leave feedback on this recommendation by replying to the comment or by reacting to the comment using emoji.
The ObjectMapper.readValue() method can throw a JsonProcessingException, but this exception is not handled properly here. By simply logging the error and returning an empty value, the exception is ignored. This can mask potential JSON parsing issues, leading to unexpected application behavior or data corruption because the system may continue with invalid or incomplete data. To fix this, catch JsonProcessingException and either handle it appropriately or rethrow it as a custom runtime exception. https://www.ibm.com/support/pages/best-practice-catching-and-re-throwing-java-exceptions
|
The addition of |
| } catch (IOException e) { | ||
| logger.atWarn().setCause(e).log("Failed to check endpoint switch metadata"); | ||
| return false; |
There was a problem hiding this comment.
IOException is being swallowed here. If the metadata file was persisted but can't be read (broken symlink, disk error), the deployment silently loses rollback protection — no snapshot, no rollback on failure. The device could get stuck with an unreachable endpoint and no way to recover. Consider letting the IOException propagate and failing the deployment with FAILED_NO_STATE_CHANGE instead, since no config has been applied yet.
08feeb7 to
b2e75fb
Compare
|
Binary incompatibility detected for commit 7d6caed. com.aws.greengrass.tes.CredentialRequestHandler is binary incompatible and is source incompatible because of FIELD_REMOVED Produced by binaryCompatability.py |
|
Unit Tests Coverage Report
Minimum allowed coverage is Generated by 🐒 cobertura-action against 7d6caed |
|
Integration Tests Coverage Report
Minimum allowed coverage is Generated by 🐒 cobertura-action against 7d6caed |
| boolean endpointSwitch; | ||
| try { | ||
| endpointSwitch = isEndpointSwitchDeployment(); | ||
| } catch (IOException e) { | ||
| logger.atError(MERGE_CONFIG_EVENT_KEY).setCause(e) | ||
| .log("Failed to check endpoint switch metadata"); | ||
| totallyCompleteFuture.complete( | ||
| new DeploymentResult(DeploymentResult.DeploymentStatus.FAILED_NO_STATE_CHANGE, e)); | ||
| return; | ||
| } | ||
| if (endpointSwitch) { | ||
| logger.atInfo(MERGE_CONFIG_EVENT_KEY) | ||
| .log("Endpoint switch deployment: forcing config snapshot and rollback support"); | ||
| } |
There was a problem hiding this comment.
This logic is essentially to trigger a rollback on endpoint switch deployment irrespective of auto rollback failure handling policy configured (ROLLBACK or DO_NOTHING). We can add a comment clarifying it. Also, how about executing this logic only if isAutoRollbackRequested(deploymentDocument) is false so that it doesn't get triggered on every endpoint switch deployment
There was a problem hiding this comment.
Addressed in the new revision
| /** | ||
| * Safe variant for use after config is applied — defaults to true (attempt rollback) on IO error. | ||
| */ | ||
| private boolean isEndpointSwitchSafe() { |
There was a problem hiding this comment.
Why have a separate method isEndpointSwitchSafe instead of re-using isEndpointSwitchDeployment?
There was a problem hiding this comment.
Addressed in the new revision
| static final String BOOTSTRAP_TASK_FILE = "bootstrap_task.json"; | ||
| static final String ROLLBACK_BOOTSTRAP_TASK_FILE = "rollback_bootstrap_task.json"; | ||
| static final String DEPLOYMENT_METADATA_FILE = "deployment_metadata.json"; | ||
| static final String ENDPOINT_SWITCH_METADATA_FILE = "endpoint_switch_metadata.json"; |
There was a problem hiding this comment.
Have we considered writing source endpoint info to config.tlog?
There was a problem hiding this comment.
Discussed it offline. There is a room for error if another file is being introduced to be managed by Greengrass Nucleus. Instead, we can have the source IoT data endpoint written to runtime configuration within config.tlog file
There was a problem hiding this comment.
Addressed in the new revision
|
I am concerned if we are introducing race conditions through some edge cases. I am looking at scenarios where
Let's talk about situations where a bad endpoint change gets deployed. Specifically, Nucleus loses all connectivity if it is configured to use this endpoint. That is, its bricked. For the cancellation scenario, have we considered implementing a forced rollback if Nucleus cancels a deployment between config update and waiting for services to start? For the power loss scenario, have we considered what happens if rollback snapshot is stale? |
Discussed this offline. The risk of bad IoT data endpoint being deployed is offset by having the pre-flight MQTT connectivity check to verify if the device is successfully able to establish the connectivity. |
4212ba0 to
7c7f1f8
Compare
| // Endpoint switch deployments force config snapshot and rollback regardless of FailureHandlingPolicy, | ||
| // so that we can restore the original endpoint if the new one is unreachable after applying the change. | ||
| // Only check when auto-rollback is not already requested to avoid redundant metadata reads. | ||
| boolean endpointSwitch = !autoRollback && isEndpointSwitch(); |
There was a problem hiding this comment.
If autoRollback is set to true and isEndpointSwitch() is false, then essentially endpointSwitch variable would be set to false. This would be confusing.
The main purpose of defining this variable is to know whether config snapshot should be taken or not. We can name it accordingly instead of endpointSwitch
7c7f1f8 to
055c8ab
Compare
| // so the MQTT BatchedSubscriber does not trigger reconnection. Step 5 uses this value to | ||
| // report deployment status back to the source account. clearSourceIotDataEndpoint() is | ||
| // called after status reporting; if cleanup is skipped (crash, Step 5 not yet implemented), |
There was a problem hiding this comment.
This is talking about the future implementation steps. We can update the comment to not have any step details
There was a problem hiding this comment.
this is transient state so should be fine to keep it here as before final merge this will be removed
| /** | ||
| * Persist the source IoT data endpoint for an endpoint-switch deployment. | ||
| * | ||
| * @param endpoint the source IoT data endpoint to persist | ||
| */ | ||
| public void setSourceIotDataEndpoint(String endpoint) { | ||
| config.lookup(DEVICE_PARAM_SOURCE_IOT_DATA_ENDPOINT).withValue(endpoint); | ||
| } | ||
|
|
||
| /** | ||
| * Remove the persisted source IoT data endpoint after deployment completes. | ||
| */ | ||
| public void clearSourceIotDataEndpoint() { | ||
| Topic t = config.find(DEVICE_PARAM_SOURCE_IOT_DATA_ENDPOINT); | ||
| if (t != null) { | ||
| t.remove(); | ||
| } | ||
| } |
There was a problem hiding this comment.
sourceIotDataEndpoint is internal deployment state, not a customer-facing Nucleus config parameter. It should live under services.DeploymentService.runtime — persisted to config.tlog so it survives device crashes mid-endpoint-switch, and not customer-configurable.
fee0b75 to
c5c4730
Compare
…a (Step 1) - Detect endpoint-switch deployments in DeploymentConfigMerger when iotDataEndpoint value differs from current DeviceConfiguration - Persist sourceIotDataEndpoint in config.tlog (distinct key, does not trigger MQTT BatchedSubscriber reconnection) - Force config snapshot for endpoint-switch deployments regardless of FailureHandlingPolicy - Add DeviceConfiguration.set/get/clearSourceIotDataEndpoint() methods - Guard isEndpointSwitch() with !autoRollback to skip redundant reads - Consolidate isEndpointSwitchDeployment()/isEndpointSwitchSafe() into single isEndpointSwitch() method PR review changes: - Comment #1 (sahith): skip endpoint switch check when autoRollback already true - Comment #2 (sahith): consolidate into single method, no IOException - Comment #3 (saranyailla): store in config.tlog instead of separate endpoint_switch_metadata.json file
c5c4730 to
7d6caed
Compare
This is addressed and reviewed by Sahith
src/main/java/com/aws/greengrass/deployment/activator/DefaultActivator.java
Show resolved
Hide resolved
| public static final String MERGE_ERROR_LOG_EVENT_KEY = "config-update-error"; | ||
| public static final String DEPLOYMENT_ID_LOG_KEY = "deploymentId"; | ||
| public static final String SERVICE_NAME_LOG_KEY = "serviceName"; | ||
| public static final String SOURCE_IOT_DATA_ENDPOINT_KEY = "sourceIotDataEndpoint"; |
There was a problem hiding this comment.
nit: the data endpoint is a device configuration and not related to the deployment configuration merger. Please move this to the device configuration class.
alter-mage
left a comment
There was a problem hiding this comment.
Left a nitpicking comment but the overall PR looks good to me
| public static final String MERGE_ERROR_LOG_EVENT_KEY = "config-update-error"; | ||
| public static final String DEPLOYMENT_ID_LOG_KEY = "deploymentId"; | ||
| public static final String SERVICE_NAME_LOG_KEY = "serviceName"; | ||
| public static final String SOURCE_IOT_DATA_ENDPOINT_KEY = "sourceIotDataEndpoint"; |
There was a problem hiding this comment.
nit: do we need sourceIotCredEndpoint?
IoT Endpoint Switch — Step 1: Detect endpoint-switch deployments and persist rollback metadata
Summary
Adds the foundation for IoT endpoint switch deployments in Greengrass Nucleus Classic. When a deployment changes
iotDataEndpoint, the deployment service detects it, persists the source endpoint for later status reporting and rollback, and forces config snapshot creation regardless of the deployment'sFailureHandlingPolicy.This is Step 1 of the IoT Endpoint Switch design.
Changes
DeploymentConfigMerger— AddisEndpointSwitchDeployment()to detect when incoming config changesiotDataEndpoint. Persist source endpoint metadata before activation.DeploymentDirectoryManager— AddpersistSourceEndpoints(),readSourceEndpoints(),hasEndpointSwitchMetadata()for endpoint switch metadata lifecycle.DefaultActivator— Force config snapshot and rollback support for endpoint-switch deployments even whenFailureHandlingPolicyisDO_NOTHING.EndpointSwitchMetadata— Model class for persisted source endpoint data.E2E Test Results (14/14 pass)
Tested with locally built Nucleus 2.16.0-SNAPSHOT in Podman containers, cross-region endpoint switch (us-east-1 → us-west-2).
Testing
isEndpointSwitchDeployment, metadata persistence, and forced rollback.DeploymentDirectoryManagerconstructor parameter.Related