DAOS-19008 control: Erase formatting after failed format --replace #18446

Open
tanabarr wants to merge 3 commits into master from tanabarr/control-fmtreplace-rank-erase

Conversation

@tanabarr tanabarr commented Jun 5, 2026

Contributor

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

@tanabarr tanabarr requested review from a team as code owners June 5, 2026 14:11
github-actions Bot commented Jun 5, 2026


Ticket title is 'Aurora daos_user: PMEM Device should Unmount and revert the --replace operation fully if it fails'
Status is 'In Progress'
Labels: 'ALCF'
https://daosio.atlassian.net/browse/DAOS-19008

Comment thread src/control/cmd/dmg/storage.go Outdated
cmd.Debugf("Invoking SystemErase to clean up after failed format operation")

eraseReq := &control.SystemEraseReq{}
eraseResp, err := control.SystemErase(ctx, cmd.ctlInvoker, eraseReq)

Contributor

I don't think this will work... SystemErase doesn't allow you to choose ranks or nodes.

I think you'll need to handle this from the daos_server that owns the engine. If the engine fails to join, and it's a replace operation, blow the storage away. The failure that triggered this request was happening at the join stage.

If the format itself fails, I don't think there's any risk of the engine coming up. If there's a partial failure, it's not a bad idea to clean up, but I think that would have to happen from the server side, too.

Contributor Author

Right, this needs a rework.

@tanabarr tanabarr force-pushed the tanabarr/control-fmtreplace-rank-erase branch 3 times, most recently from d8cf409 to 6785032 Compare June 8, 2026 20:46
@daosbuild3

Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18446/4/testReport/

@daosbuild3

Collaborator

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18446/4/execution/node/1394/log

@tanabarr tanabarr force-pushed the tanabarr/control-fmtreplace-rank-erase branch from 6785032 to 309615f Compare June 9, 2026 11:34
@tanabarr tanabarr force-pushed the tanabarr/control-fmtreplace-rank-erase branch from 309615f to a593fcb Compare June 9, 2026 13:55
@tanabarr tanabarr changed the title from "Erase formatting after failed format --replace" to "DAOS-19008 control: Erase formatting after failed format --replace" on Jun 9, 2026
@tanabarr tanabarr force-pushed the tanabarr/control-fmtreplace-rank-erase branch 2 times, most recently from 21e8fb7 to 862f9d5 Compare June 9, 2026 15:57
tanabarr added 3 commits June 10, 2026 12:25
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@tanabarr tanabarr force-pushed the tanabarr/control-fmtreplace-rank-erase branch from 862f9d5 to 550d853 Compare June 12, 2026 20:06
3 participants