chore: Improve performance on reserved values [DHIS2-21253]#23520
Conversation
...dhis-service-core/src/main/java/org/hisp/dhis/reservedvalue/DefaultReservedValueService.java
```java
do {
  deleted =
      requireNonNullElse(
          transactionTemplate.execute(s -> reservedValueStore.removeExpiredValues()), 0);
```
Also wondering if this transactionTemplate is necessary? Can we set `@Transactional` on the store to achieve the same behaviour?
We could, and it would behave the same way.
I just didn’t do it because I thought our convention was to handle transactions at the service level, not at the store level.
```xml
</class>

<sql-query name="getRandomGeneratedValuesNotAvailableNamedQuery">
```
Is this query used somewhere? I feel this may not scale that well. Probably not something for this PR anyway.
It is used, and I’m working on it.
I forgot to mention it in the PR description, but this PR focuses solely on the deletion of reserved values.
I’m now working on the generation and reservation parts.
Doing everything together felt too large to explain in a single PR.
ameenhere left a comment
The job changes look good to me. The evidence from the performance test results is also clear.
```java
log.info("... Completed deleting expired or used reserved values");

@Override
public int removeUsedValues() {
```
I am wondering if we should remove this completely.
The only way to add a trackedEntityAttributeValue is through Tracker Importer and we are calling useReservedValue when persisting an attribute.
This query should always return 0 rows.
I cannot think of a case in which we would have used values that were not removed.
In theory that's true, but I'm not sure whether that still holds for implementations that have been running for some years.
To be honest, I'm not comfortable enough with this feature to confidently remove it.
```java
    total += deleted;
  } while (deleted > 0);

log.info("Deleted {} reserved values", total);
```
This log was here before, so maybe it is not for this PR, but it would be good if we could delegate this info to the job itself. When you run the job, you can complete a stage passing some information that will be logged in the job's status.
```java
} catch (TimeoutException ex) {
  log.warn(
      String.format(
          "Generation and reservation of values for %s wih uid %s timed out. %s values was reserved. You might be running low on available values",
```
There is a typo in the message ("wih" → "with", "values was" → "values were"), and please use parameterized logging instead of `String.format`: https://www.baeldung.com/slf4j-parameterized-logging
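For what it's worth, the parameterized form would read `log.warn("Generation and reservation of values for {} with uid {} timed out. {} values were reserved. ...", owner, uid, count)` (the argument names are hypothetical). A dependency-free sketch of how the `{}` substitution works:

```java
public class Slf4jStyleDemo {
    // Minimal stand-in for SLF4J's "{}" placeholder substitution, to show the
    // style the reviewer is asking for. Real SLF4J additionally skips the
    // formatting work entirely when the log level is disabled.
    static String format(String template, Object... args) {
        StringBuilder sb = new StringBuilder();
        int argIdx = 0;
        int from = 0;
        int at;
        while ((at = template.indexOf("{}", from)) >= 0 && argIdx < args.length) {
            sb.append(template, from, at).append(args[argIdx++]);
            from = at + 2;
        }
        sb.append(template.substring(from));
        return sb.toString();
    }

    public static void main(String[] args) {
        // With SLF4J this would be:
        // log.warn("Generation and reservation of values for {} with uid {} timed out. ...",
        //     owner, uid, count);
        System.out.println(format(
            "Generation and reservation of values for {} with uid {} timed out. {} values were reserved.",
            "attribute", "abc123", 42));
    }
}
```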
```java
+ "SELECT rv.reservedvalueid FROM reservedvalue rv "
+ "JOIN trackedentityattribute tea ON rv.owneruid = tea.uid "
+ "JOIN trackedentityattributevalue teav ON teav.trackedentityattributeid = tea.trackedentityattributeid "
+ "AND lower(teav.value) = lower(rv.value) "
```
This was `r.value = teav.plainValue` before. Was it wrong?
How do large TEAV tables perform?
The existing index on TEAV is (trackedentityinstanceid, trackedentityattributeid, lower(value)) which cannot be used here as trackedentityinstanceid is the leading column and isn't in this join. There's also in_trackedentityattributevalue_attributeid on (trackedentityattributeid) which helps with the first join column, but the lower(value) comparison then requires a filter, not an index lookup.
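If this join does become a bottleneck, a functional index matching the predicate could help. This is a sketch only, assuming PostgreSQL; the index name is made up and it is not part of the PR:

```sql
-- Hypothetical index supporting the join on attribute id plus lower(value).
CREATE INDEX CONCURRENTLY in_teav_attributeid_lower_value
    ON trackedentityattributevalue (trackedentityattributeid, lower(value));
```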
```java
    requireNonNullElse(
        transactionTemplate.execute(s -> reservedValueStore.removeExpiredValues()), 0);
    total += deleted;
  } while (deleted > 0);
```
If the batch returns fewer than 500k rows, there's nothing left to delete, so comparing against DELETE_BATCH_SIZE instead of 0 avoids the final full table scan that only confirms there's no more work.
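The suggestion can be sketched with a self-contained loop in which the store call is stubbed by a queue of batch results (`DELETE_BATCH_SIZE` and the counts here are illustrative, not the real store):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class BatchDeleteLoop {
    static final int DELETE_BATCH_SIZE = 500_000;

    /** Runs the delete loop; returns {total deleted, number of store calls}. */
    static int[] run(Deque<Integer> batchResults) {
        int total = 0;
        int calls = 0;
        int deleted;
        do {
            // Stand-in for reservedValueStore.removeExpiredValues()
            deleted = batchResults.isEmpty() ? 0 : batchResults.poll();
            calls++;
            total += deleted;
            // Exiting when a batch comes back short skips the final
            // "confirm nothing is left" scan that `deleted > 0` would need.
        } while (deleted == DELETE_BATCH_SIZE);
        return new int[] {total, calls};
    }

    public static void main(String[] args) {
        // 1.2M expired rows: two full batches plus a short one -> 3 calls, not 4.
        Deque<Integer> batches = new ArrayDeque<>();
        batches.add(DELETE_BATCH_SIZE);
        batches.add(DELETE_BATCH_SIZE);
        batches.add(200_000);
        int[] result = run(batches);
        System.out.println(result[0] + " deleted in " + result[1] + " calls");
        // prints "1200000 deleted in 3 calls"
    }
}
```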
```java
public int removeExpiredValues() {
  return jdbcTemplate.update(
      "DELETE FROM reservedvalue WHERE reservedvalueid IN "
          + "(SELECT reservedvalueid FROM reservedvalue WHERE expirydate < now() LIMIT ?)",
```
I think there is no index on expirydate. I don't know this reserved value feature well enough to say whether the scenario is plausible, but: when the job runs regularly, the table is mostly clean and only a small fraction of rows are expired. An index on `expirydate` would let Postgres jump straight to the expired rows instead of scanning millions of non-expired ones to find the few that need deleting.
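If that scenario matters, the index could look like this (a sketch, assuming PostgreSQL; not part of this PR):

```sql
-- Hypothetical index letting the batched delete's subquery locate expired rows
-- directly instead of scanning the whole table.
CREATE INDEX CONCURRENTLY in_reservedvalue_expirydate
    ON reservedvalue (expirydate);
```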



The `REMOVE_USED_OR_EXPIRED_RESERVED_VALUES` job previously executed a single HQL `DELETE` statement that combined expired and used values with a correlated subquery. This caused the query planner to execute a subplan for every row in `reservedvalue`, resulting in an extremely expensive operation that does not scale and effectively becomes unprocessable on large datasets like the one in DRC, where they have ~11M rows in that table.

Changes

- Deletion is split into two queries: one for expired values and one for used values, the latter found by joining against `trackedentityattributevalue`.
- Deletes are batched with a `LIMIT` in a subquery, looping until no rows remain.
- Each batch runs in its own transaction (`PROPAGATION_REQUIRES_NEW`), reducing lock duration and replication pressure. Large single transactions previously held row locks for the entire duration. With batching, locks are released frequently and failures only affect a single batch instead of the entire job. That was likely the issue in DRC.
- Fixed a bug in `saveGeneratedValues` where `numberOfReservations` (the original total) was used instead of the remaining count, which could lead to over-reservation during retries.

Test plan
I replicated a production-like dataset from DRC in the `reservedvalue` table to simulate realistic load conditions. In total, the dataset contains 11 million reserved values, of which 8 million are expired and the remaining 3 million are still valid.
There are 1M tracked entities and 10M TEAVs (≈10 per tracked entity), of which 20k match reserved values and are picked up by `removeUsedValues()`.
Performance test
Candidate
(screenshot of job run time)
Baseline
(screenshot of job run time)
That's roughly a 7x speedup for deleting 8M expired + 20k used values.
Here's the link to the test.
Pgbench test
Two scenarios were tested to simulate a healthy table (cleanup job runs regularly) and a degraded table (large backlog of expired values). Each scenario was run at two scales: 500k and 5M rows.
The "expired" rows represent values whose expiry date has passed and should be deleted. The "not expired" rows represent values still in use or reserved for future use.
Approaches compared: the original correlated-subquery `DELETE`, a batched delete with a 100k batch size, and a batched delete with a 500k batch size.
The original query degrades badly as the table grows and, counterintuitively, performs worse as fewer rows are expired. That's because the correlated subquery runs once per non-expired row. At 5M rows, it never finishes within the test timeout, regardless of the expired ratio.
Both batched approaches are orders of magnitude faster. The 500k batch outperforms the 100k batch in every scenario because fewer round-trips to the database outweigh the higher cost per individual batch. The difference is more pronounced at 5M rows (27s vs. 37s at 90% expired, 10.4s vs. 14.5s at 20% expired), where the number of iterations matters more.
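To make the round-trip arithmetic concrete, here is a small calculation of how many `DELETE` statements each batch size issues (pure arithmetic; it assumes the loop stops only after a batch returns zero, matching the `deleted > 0` condition, and the method names are made up):

```java
public class BatchIterations {
    // DELETE statements issued by the `while (deleted > 0)` loop:
    // ceil(rows / batch) non-empty batches plus one final empty batch.
    static long callsLoopOnZero(long rows, long batch) {
        return (rows + batch - 1) / batch + 1;
    }

    // DELETE statements with the suggested `while (deleted == batch)` condition:
    // the loop stops as soon as a batch comes back short.
    static long callsLoopOnShortBatch(long rows, long batch) {
        return rows % batch == 0 ? rows / batch + 1 : (rows + batch - 1) / batch;
    }

    public static void main(String[] args) {
        long rows = 4_500_000; // the 5M-row scenario at 90% expired
        System.out.println("500k batches: " + callsLoopOnZero(rows, 500_000)
            + " vs 100k batches: " + callsLoopOnZero(rows, 100_000));
        // prints "500k batches: 10 vs 100k batches: 46"
    }
}
```

The 500k batch size issues roughly a fifth as many statements, which is consistent with the smaller gap per batch measured above.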
The 20% expired scenario is consistently faster than 90% for the batched approaches because there are fewer rows to delete overall. For the original query, the opposite is true: more non-expired rows mean more correlated subquery executions.
This PR focuses only on the deletion part. I’ll create a separate PR to handle the generation of reserved values.