Add test to demonstrate xincrease bug#629
Conversation
Signed-off-by: yeya24 <benye@amazon.com>
|
@fpetkovski @GiedriusS would appreciate if you can take a look on this one. IMO I agree with Ben that for sparse samples, if we have a single sample within the user-given-time-range sample, then we should just return the sample value without considering the |
|
I didn't work on this for a very long time and I don't remember all the details. This function is "our own" so we can kind of do what we want. In the original proposal there were at least 3 different implementations IIRC. In my opinion, this either way is a heuristic and it is not guaranteed to be "correct" in all cases. As far as I remember, the extended Select() time range and this whole logic was exactly because we want to guess whether some metric existed in the past or not. The key part to me is this:
How do you define "ingested"? The Prometheus team is also trying to solve this with the created timestamp functionality. Your proposed change looks fine, I can't see anything wrong right now but I will think about cases where it might not work correctly. Could you also elaborate a bit on how/why the way it works right now blocks some of your work? Also, should we invest any more time into this when prometheus/prometheus#16457 modifiers are gaining traction in the Prometheus world? Do you see a future for this function & code? |
|
Thanks @GiedriusS. This behavior looks like a bug from our users' perspective. It is not really blocking us but we'd like to understand the right behavior.
Yeah I think created timestamp seems the right way to go and at least it solves our usecase. Also agree that we should try to support the new modifiers in the thanos engine to support this and align with the community with the right semantics. For xrate it is quite hard to actually define its behavior. |
We are testing xfunctions in our set up and here is one issue we hit when testing with super sparse metrics.
Here is the problem we got and I created a simple test to reproduce the problem and added to this PR.
xincrease(http_requests_total[1m]) > 0to have value at the steps when sparse metrics were ingested.xincreasevalue is 0 for the sample at t2 if my query start time is before 2025-09-07T07:48:38Z. If my query start time is equal to or after 2025-09-07T07:48:38Z,xincreaseshows the correct value.2025-09-07T07:48:38Zis kind of the magic timestamp because it is t1 + 60m + 1m + 1s, where 60m is the ext lookback delta and 1m is the select range inhttp_requests_total[1m].The issue in
extendedRateis in this line. If the query start time is within [t1, t1 + ext lookback delta + select range],metricAppearedTsfor the series is set to t1 according to this. Thenif stepTime-offset <= untilcondition will not meet when evaluating another sample at a later timestamp.Proposed fix for this issue:
For each step, when there is only 1 sample selected within the select range, it is expected for
xincreaseto always return the sample value. It shouldn't be impacted by the start time of the query and if there is any sample close enough.Current logic. Depends on first sample timestamp.
Proposed logic, this doesn't rely on first sample timestamp. Also when there is only one sample within the select range, return its value. When there are more than 1 sample but they have the same value, return 0.