Adjust mod multiplier recalculation processes to minimise number of required database writes#374
Conversation
By itself this does nothing, the old API worked *fine* as far as the backpopulation was concerned. However doing this allows for an opportunity of automatically handling the change to mania key mod multipliers that was done live without backwards adjustments. More on this later.
…y number of rows As the inline commentary states, the number of rows that require writes of `total_score_without_mods` can be significantly reduced because of two facts: - Firstly, when it comes to scores set on stable, `total_score_without_mods` as well as `total_score` are quantities derived directly from `legacy_total_score` as well as beatmap attributes. Therefore, it is not actually necessary to write `total_score_without_mods` for those scores, and instead the score conversion algorithm can be used to recalculate the multipliers instead. - Secondly, when it comes to lazer scores specifically, this change assumes the invariant that if there are no mods on a score, then `total_score == total_score_without_mods` - so there is no point writing out values.
… key mod multipliers Migrating to the new multiplier API, which already has to handle multiplier versioning and the directly-applied change to mania key mod multipliers, means that it is now possible to leverage the versioned logic and automatically correct for the key mod multiplier change by filling out `ScoreInfo.ClientVersion` using `osu_builds`.
…necessary number of rows with `total_score_without_mods` Adds an alternate processing path for stable scores which leverages the algorithm of converting from legacy total score to implement the multiplier recalculation. The command was already skipping scores with no mods at all, so no adjustment required there.
Written using real-world cases pulled from live production scores (via data.ppy.sh dumps plus some manual trawling for the mania key mod cases).
| return 0; | ||
| } | ||
|
|
||
| private static BeatmapScoringAttributes? getScoringAttributesFor(SoloScore score, MySqlConnection conn) |
There was a problem hiding this comment.
I wonder if we want to expose and use the cached path for this.
There was a problem hiding this comment.
I considered it. My primary concern was running out of memory because the aforementioned static dictionary is never cleared other than via BeatmapStatusWatcher.StartPollingAsync() which is only relevant when the actual data changes.
I'm not even sure why this seemingly doesn't explode in BatchInserter, really.
There was a problem hiding this comment.
There's a finite number of ranked beatmaps in the mentioned case (BatchInserter should only be dealing with ranked beatmap scores). It can most definitely fit in memory.
For our purposes here, I think we also consider unranked beatmaps, which may tip the scales.
There was a problem hiding this comment.
For our purposes here, I think we also consider unranked beatmaps
There are presumably lazer scores set on unranked beatmaps, yes. I am not aware of any process culling them at this time.
This will have to be done again when the new mod multipliers land, but I guess the tag is already there to use.
|
I'll undraft and check off the boxes from the OP now, given that the indicated plan was to dry run this and check whether this new adjusted process is preserving existing multipliers (with the one exception of mania key mods). Still bit scared of all this but testing added in 9b87267 should make sure I haven't screwed up dry run mode at least. Will re-test once more with new multiplier values. |
This mirrors the updated server flow (ppy/osu-queue-score-statistics#374) pretty closely.
PRing early as draft primarily for visibility reasons.
The original plan for deploying the mod multiplier changes was to first ensure that all scores in the database have
total_score_without_modspresent, and then write out the updatedtotal_scoreacross the whole table as the new multipliers times that populatedtotal_score_without_modsquantity.Aside: Why is the migration not run in-place?
The primary reason for me approaching things in this way rather than running the recalculation in-place is because running the recalculation in-place would be an inherently lossy process.
Assuming everything goes right, because total score is rounded to full integers and mod multipliers are very much not integers, the possibility of rounding errors and floating-point inaccuracies is very real. Therefore, any in-place recalculation of multipliers would be a process that ideally would only ever occur once and never again, because the more times it is performed, the bigger the degradation incurred with every in-place conversion.
And that is assuming everything goes right. In case of any mistake or incorrectly-applied migration, the original data would be lost in an in-place migration, and would not be recoverable quickly or at all because of how complicated the scoring algorithm is at the explicit demand of the user base. The closest thing to a recovery procedure in that case would be replays - when there are replays - and even that would be slow.
It is my former professional experience that tells me you never want to run huge data migrations in-place which incur chances of irretrievable or difficult-to-recover-from data loss.
In recent days, an attempt was made to run the backpopulation of
total_score_without_modson the productonscorestable. Unfortunately at this time the original plan appears to be mostly unviable.The number of rows in the production
scorestable that require this backfill is estimated to be around 2.38 billion rows. This would incur a storage overhead in the hundreds of gigabytes (not only from the actual rows themselves, but also from a transient spike in the size of binlog, which is relevant when replication is depended upon) and take roughly a month to execute.This PR contains a set of changes that is designed to reduce the number of writes required to execute this migration. The cost here is the increased complexity of the migration process. While I am somewhat confident this is going to work, I will want to test this more thoroughly on the actual changes and maybe even do some test runs locally using data dumps to make sure I have not missed anything.
Additionally, comments from #269 (review) are addressed here.
As for the gory details:
Switch to new mod multiplier calculation API in backpopulation command
By itself this does nothing, the old API worked fine as far as the backpopulation was concerned. However doing this allows for an opportunity of automatically handling the change to mania key mod multipliers that was done live without backwards adjustments. More on this later.
Adjust query in backpopulation command to target the minimum necessary number of rows
As the inline commentary states, the number of rows that require writes of
total_score_without_modscan be significantly reduced because of two facts:total_score_without_modsas well astotal_scoreare quantities derived directly fromlegacy_total_scoreas well as beatmap attributes. Therefore, it is not actually necessary to writetotal_score_without_modsfor those scores, and instead the score conversion algorithm can be used to recalculate the multipliers instead.total_score == total_score_without_mods- so there is no point writing out values.Adjust backpopulation command to automatically handle change to mania key mod multipliers
Migrating to the new multiplier API, which already has to handle multiplier versioning and the directly-applied change to mania key mod multipliers, means that it is now possible to leverage the versioned logic and automatically correct for the key mod multiplier change by filling out
ScoreInfo.ClientVersionusingosu_builds.Adjust mod multiplier recalculation command to work with the minimum necessary number of rows with
total_score_without_modsAdds an alternate processing path for stable scores which leverages the algorithm of converting from legacy total score to implement the multiplier recalculation.
The command was already skipping scores with no mods at all, so no adjustment required there.
Add test coverage of commands
Written using real-world cases pulled from live production scores (via data.ppy.sh dumps plus some manual trawling for the mania key mod cases).
In the current state the most useful tests are the ones checking correctness of the mania key mod recalculation. Everything else will need updating once the new proposed values are out.