state: Don't delete .new files in State::load()#2675
state: Don't delete .new files in State::load()#2675m-blaha wants to merge 1 commit intorpm-software-management:mainfrom
Conversation
When multiple libdnf5 processes run concurrently (e.g. dnf5 transaction + PackageKit/GNOME Software loading the system repo), the .new file recovery code in State::load() can race with State::save() in another process. The recovery code would delete or rename .new files that the concurrent save() just wrote, causing save()'s subsequent rename() to fail with "cannot copy/rename" errors. Fix by making load() purely read-only with respect to .new files. In case there are any .new files present, just log a warning and keep using the non-.new state files from the last successful save(). An alternative approach using Locker to synchronize State::load() and State::save() was considered. This would preserve crash recovery for the case where save() was interrupted during the rename phase (all .new files fully written). However, a crash during the write phase still leaves state inconsistent with rpmdb, requiring a full state rebuild (see rpm-software-management#1610). Also, Locker requires write access to create the lock file in the state directory, which would break non-root read-only operations (e.g. dnf5 repoquery) unless fallback logic was added. The added complexity was not justified given these limitations. Resolves: rpm-software-management#2601 Signed-off-by: Marek Blaha <mblaha@redhat.com>
evan-goode
left a comment
There was a problem hiding this comment.
Thanks for the investigation and welcome back from the Packit exchange :)
The fix looks correct to me.
IMO ideally we would have locking in addition to this change.
For the "system repo" lock (#2519), we worked around that by having the lock file ( Maybe standard practice should be to use the system repo lock for these state files too? The system repo lock is obtained in I would hesitate to somehow enforce obtaining the system repo lock in libdnf5, since there are use cases where it's better to read a soon-to-be-invalid state than to wait for a long DNF5 process to finish and release a write lock. But maybe it could be opt-out instead of opt-in. |
When multiple libdnf5 processes run concurrently (e.g. dnf5 transaction + PackageKit/GNOME Software loading the system repo), the .new file recovery code in State::load() can race with State::save() in another process. The recovery code would delete or rename .new files that the concurrent save() just wrote, causing save()'s subsequent rename() to fail with "cannot copy/rename" errors.
Fix by making load() purely read-only with respect to .new files. In case there are any .new files present, just log a warning and keep using the non-.new state files from the last successful save().
An alternative approach using Locker to synchronize State::load() and State::save() was considered. This would preserve crash recovery for the case where save() was interrupted during the rename phase (all .new files fully written). However, a crash during the write phase still leaves state inconsistent with rpmdb, requiring a full state rebuild (see #1610). Also, Locker requires write access to create the lock file in the state directory, which would break non-root read-only operations (e.g. dnf5 repoquery) unless fallback logic was added. The added complexity was not justified given these limitations.
Resolves: #2601