Conversation
| CONSTRAINT failure_reason_iff_failed CHECK ( | ||
| (failure_reason IS NOT NULL AND state = 'failed') | ||
| OR (failure_reason IS NULL) |
There was a problem hiding this comment.
I skimmed through other Omicron bits alongside this change and I think with this change there really is no way an instance should be failed without a reason?
I guess this is not NOT (failure_reason IS NULL AND state == 'failed')` because you don't want a bug in Nexus to trip the constraint and prevent us from failing an instance that should be failed. Since failing an instance means something is Really Broken, refusing to be broken because of a bug would be bad?
that seems worth a note on the schema maybe, even though this is just for debugging?
There was a problem hiding this comment.
yeah, that's correct, i should probably write that down here
| WHERE state = 'failed' | ||
| AND failure_reason IS NULL; |
There was a problem hiding this comment.
oh, right, I suppose any failures we've migrated into the future, we don't have a "legacy" reason.. though maybe could be "failure_reason = "predates precise information"?
| // /// Returns the runtime state of this VMM. | ||
| // pub fn runtime(&self) -> VmmRuntimeState { | ||
| // VmmRuntimeState { | ||
| // time_state_updated: self.time_state_updated, | ||
| // generation: self.generation, | ||
| // state: self.state, | ||
| // } | ||
| // } |
There was a problem hiding this comment.
I'm surprised this isn't used? but also what led you to noticing this, and do you want to just delete it lol
especially since it seems like Instance::runtime is the .runtime() that's actually used?
There was a problem hiding this comment.
yeah it should be deleted. i was surprised to discover this also
| state: s.state.into(), | ||
| generation: s.gen_, | ||
| time_updated: s.time_updated, | ||
| failure_reason: None, // TODO(eliza) |
There was a problem hiding this comment.
fwiw I'm ... not sure what this means. taking a swing: "vmm_register() or vmm_get_state() returned a SledVmmState that was immediately Failed"? would we expect Propolis to fill in a reason here, or would we expect sled-agent to invent a reason on Propolis' behalf when it sees the failure?
| const UPDATED: &'static str = " updated at"; | ||
| const INSTANCE_ID: &'static str = "instance ID"; | ||
| const SLED_ID: &'static str = "sled ID"; | ||
| const SLED_SERIAL: &'static str = "sled serial"; | ||
| const CPU_PLATFORM: &'static str = "CPU platform"; | ||
| const ADDRESS: &'static str = "propolis address"; | ||
| const STATE: &'static str = "state"; | ||
| const FAILURE_REASON: &'static str = " failed because:"; |
There was a problem hiding this comment.
dumb question, but aren't we already juggling the ident below? Why add spaces as prefixes here?
|
|
||
| /// Attempts to transition an instance to to the `Failed` state in | ||
| /// response to an error returned from a call to a sled | ||
| /// agent instance API, supplied in `reason`. |
There was a problem hiding this comment.
| /// agent instance API, supplied in `error`. |
| time_state_updated: chrono::Utc::now(), | ||
| generation: db::model::Generation(vmm.generation.next()), | ||
| failure_reason: Some( | ||
| "VMM no longer known to sled-agent".to_string(), |
There was a problem hiding this comment.
Should we be using error to populate this string?
|
|
||
| self.fail_vmm_and_terminate().await; | ||
| self.fail_vmm_and_terminate( | ||
| "propolis denies knowledge of VM", |
There was a problem hiding this comment.
nit; I might change the wording here. By itself this string is a bit mysterious?
Maybe: "Proposlis refused to change state; VM is missing (might have restarted?)"
In preparation for working on #10275, I thought it would be nice if there was a way for Nexus to write down why a VMM record was moved to the
Failedstate, for debugging purposes. This seemed worthwhile since #10275 proposes that we add more cases under which Nexus will mark a VMM as failed, and it would be useful to have a record for debugging any potential issues. This is intended to be displayed in OMDB, rather than presented directly to the user --- for user-facing diagnostics, I think we will want something beyond an unstructured string that gets jammed into thevmmdatabase record when it's moved toFailed, especially because that is a circumstance under which that record will shortly be soft-deleted. Thus, this change is not intended to solve #9185; I've started (and still must finish) a proposal for a user-facing instance event log in RFD 631. Instead, this is a stop-gap temporary solution to record something for Oxide support and engineering to use in possible debugging scenarios, which is why it's so stringly-typed.