state: Don't delete .new files in State::load() #2675

Open
m-blaha wants to merge 1 commit into rpm-software-management:main from m-blaha:system-state-race

@m-blaha m-blaha commented Apr 7, 2026

When multiple libdnf5 processes run concurrently (e.g. dnf5 transaction + PackageKit/GNOME Software loading the system repo), the .new file recovery code in State::load() can race with State::save() in another process. The recovery code would delete or rename .new files that the concurrent save() just wrote, causing save()'s subsequent rename() to fail with "cannot copy/rename" errors.

Fix by making load() purely read-only with respect to .new files. If any .new files are present, load() only logs a warning and keeps using the non-.new state files from the last successful save().

An alternative approach using Locker to synchronize State::load() and State::save() was considered. This would preserve crash recovery for the case where save() was interrupted during the rename phase (all .new files fully written). However, a crash during the write phase still leaves state inconsistent with rpmdb, requiring a full state rebuild (see #1610). Also, Locker requires write access to create the lock file in the state directory, which would break non-root read-only operations (e.g. dnf5 repoquery) unless fallback logic was added. The added complexity was not justified given these limitations.

Resolves: #2601
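
The write-then-rename scheme and the read-only load() described above can be sketched in a few lines of C++. This is a minimal illustration under stated assumptions, not the libdnf5 implementation: `save_state` and `load_state` are hypothetical helpers, and the only things taken from the description are that save() writes a `.new` file and then renames it, and that load() leaves any `.new` file in place and reads the last fully saved file.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper (not libdnf5 API): write `content` to "<path>.new",
// then rename it over `path`.
void save_state(const fs::path & path, const std::string & content) {
    assert(!path.empty());
    fs::path tmp = path;
    tmp += ".new";
    {
        std::ofstream out(tmp, std::ios::trunc);
        out << content;
    }  // the stream is closed here, before the rename
    fs::rename(tmp, path);
}

// Hypothetical helper (not libdnf5 API): read-only load. A ".new" file is
// reported but never deleted or renamed, because it may belong to a save
// that another process has not finished yet.
std::string load_state(const fs::path & path) {
    assert(!path.empty());
    fs::path tmp = path;
    tmp += ".new";
    if (fs::exists(tmp)) {
        std::cerr << "warning: " << tmp << " is present, ignoring it\n";
    }
    std::ifstream in(path);
    return std::string(std::istreambuf_iterator<char>(in), std::istreambuf_iterator<char>());
}
```

With this split, a reader that runs while another process is between its write and its rename() sees the previous state, and the writer's rename() still finds its `.new` file where it left it.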

Signed-off-by: Marek Blaha <mblaha@redhat.com>
@m-blaha m-blaha requested a review from a team as a code owner April 7, 2026 13:03
@m-blaha m-blaha requested review from evan-goode and removed request for a team April 7, 2026 13:03

@evan-goode evan-goode left a comment

Thanks for the investigation and welcome back from the Packit exchange :)

The fix looks correct to me.

@evan-goode

> An alternative approach using Locker to synchronize State::load() and State::save() was considered. This would preserve crash recovery for the case where save() was interrupted during the rename phase (all .new files fully written). However, a crash during the write phase still leaves state inconsistent with rpmdb, requiring a full state rebuild (see #1610).

IMO ideally we would have locking in addition to this change.

> Also, Locker requires write access to create the lock file in the state directory, which would break non-root read-only operations (e.g. dnf5 repoquery) unless fallback logic was added. The added complexity was not justified given these limitations.

For the "system repo" lock (#2519), we worked around that by having the lock file (/usr/lib/sysimage/libdnf5/system-repo.lock) be persistent and owned by root with 0664. Then unprivileged users can obtain read locks but not write locks.

Maybe standard practice should be to use the system repo lock for these state files too? The system repo lock is obtained in Context::load_repos; it's not automatically obtained for libdnf5 API users. The libdnf5 tutorial was updated to recommend obtaining the lock. I guess dnf5daemon consumers and the new PackageKit backend are not using it (yet).

I would hesitate to somehow enforce obtaining the system repo lock in libdnf5, since there are use cases where it's better to read a soon-to-be-invalid state than to wait for a long DNF5 process to finish and release a write lock. But maybe it could be opt-out instead of opt-in.
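
To illustrate why a persistent, root-owned lock file with mode 0664 lets unprivileged users take read locks but not write locks, here is a minimal sketch using POSIX `fcntl` record locks. It assumes the lock is implemented with `fcntl(F_SETLK)`; whether libdnf5's Locker uses exactly this mechanism is an assumption here, and `try_lock` is a hypothetical helper. The underlying rule is from POSIX: a read lock needs a descriptor open for reading, a write lock needs one open for writing.

```cpp
#include <cassert>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper (not libdnf5 API): try to take a non-blocking lock on
// the whole file behind `fd`. `type` is F_RDLCK (shared) or F_WRLCK
// (exclusive). Returns true if the lock was obtained.
bool try_lock(int fd, int type) {
    assert(type == F_RDLCK || type == F_WRLCK);
    struct flock fl {};
    fl.l_type = static_cast<short>(type);
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;  // 0 means "until end of file", so the whole file is covered
    return fcntl(fd, F_SETLK, &fl) == 0;
}
```

A user who can only open the lock file with `O_RDONLY` can call `try_lock(fd, F_RDLCK)` successfully, while `try_lock(fd, F_WRLCK)` on the same descriptor fails because the descriptor is not open for writing.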

Development

Successfully merging this pull request may close these issues.

Intermittent 'cannot copy' errors for /usr/lib/sysimage/libdnf5/packages.toml/new in dnf5-5.3.0.0-7.fc44