Support long range _over_time functions for distributed execution. #195

Open
fpetkovski opened this issue Mar 1, 2023 · 7 comments

@fpetkovski
Collaborator

fpetkovski commented Mar 1, 2023

Queries like sum_over_time[1d] can be challenging to distribute, since one engine might not have a large enough data range to calculate the result for a given step. For example, this exact query against a Prometheus with 6h retention will produce incorrect results.

I can think of two ways we can address this problem:

  • If the range is small enough (e.g. 5m), we can increase the start time in each engine by that amount. This makes sure the window for the first few steps evaluated by that engine has enough samples to produce a correct result. The values for the skipped steps would have to be calculated by an older engine (see the sketch after this list).
  • Figure out a way to distribute _over_time/rate functions. Maybe they can be transformed into something like
sum_over_time(append(sum_over_time([30m]), sum_over_time([30m]))[1h])

similar to how subqueries work.
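
A minimal sketch of the first approach, assuming a hypothetical coordinator that knows each remote engine's earliest available timestamp. The types and names below are illustrative only, not the actual promql-engine API:

```go
package main

import (
	"fmt"
	"time"
)

// rangeQuery describes a range query that evaluates an *_over_time selector.
type rangeQuery struct {
	Start, End  time.Time
	Step        time.Duration
	SelectRange time.Duration // e.g. 1d for sum_over_time(metric[1d])
}

// adjustForEngine pushes the query start forward so that every step the
// remote engine evaluates has a complete [t-SelectRange, t] window inside
// the engine's retention. Steps before the returned start have to be
// computed by an engine holding older data.
func adjustForEngine(q rangeQuery, engineMinTime time.Time) rangeQuery {
	earliestSafe := engineMinTime.Add(q.SelectRange)
	if !q.Start.Before(earliestSafe) {
		return q // the whole query is already safe for this engine
	}
	// Round up to the next step boundary at or after earliestSafe.
	offset := earliestSafe.Sub(q.Start)
	steps := int64((offset + q.Step - 1) / q.Step)
	q.Start = q.Start.Add(time.Duration(steps) * q.Step)
	return q
}

func main() {
	now := time.Now().Truncate(time.Hour)
	q := rangeQuery{
		Start:       now.Add(-8 * time.Hour),
		End:         now,
		Step:        time.Hour,
		SelectRange: 6 * time.Hour,
	}
	// An engine holding only the last ~10h of data cannot answer the
	// earliest steps of a 6h-range selector starting 8h ago.
	adjusted := adjustForEngine(q, now.Add(-10*time.Hour))
	fmt.Println("remote engine evaluates from:", adjusted.Start)
	fmt.Println("earlier steps must come from an engine with older data")
}
```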

@GiedriusS
Member

Be careful with (2) because you might run into grafana/mimir#2599. Not sure what would be a good way to solve that.

@fpetkovski
Collaborator Author

fpetkovski commented Mar 2, 2023

Thanks for pointing me to that issue! The trick might be to always have an overlap of one step.

So instead of having (3 - 1 + 41 - 40) / 120 as in their example, we can try (3 - 1 + 41 - 3) / 120.
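
To make the arithmetic concrete, here is one possible reading of those numbers as a toy sketch (the values come from the grafana/mimir#2599 discussion; the interpretation is an assumption): the first shard sees samples going from 1 to 3, the second from 40 to 41, and the full range covers 120s.

```go
package main

import "fmt"

func main() {
	const window = 120.0 // seconds covered by the full range

	// Summing the per-shard increases loses the increase that happens
	// between the shards (from 3 to 40), so the rate comes out far too low.
	withoutOverlap := ((3 - 1) + (41 - 40)) / window

	// With an overlap of one step, the second shard starts at the last
	// sample of the first shard (value 3), so no increase is lost.
	withOverlap := ((3 - 1) + (41 - 3)) / window

	fmt.Println(withoutOverlap) // 0.025
	fmt.Println(withOverlap)    // ~0.333, i.e. the full 1 -> 41 increase over the window
}
```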

@GiedriusS
Member

Mhm, it sounds like that could work! It will complicate the implementation, of course, but it should solve that issue. Perhaps it would also be cool to implement this in a way that makes it easy to add instant query sharding to the engine in the future.

@nishchay-veer
Contributor

Hey @fpetkovski, can I work on this issue?

@nishchay-veer
Contributor

One possible way to implement the first approach is to shift the start time by the select range, computing startTime := f.StepTime - f.SelectRange for each step. This ensures each engine knows the full window of time it needs to calculate the result for that step.
Next, we check whether that start time is earlier than the timestamp at which the metric appeared, using if f.MetricAppearedTs != nil && startTime < *f.MetricAppearedTs. If it is, the missing values need to be calculated by an older engine; for example, we could query the older engine for values between MetricAppearedTs and startTime and use them to fill in the missing steps.

Finally, we return the result for the current engine.
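
A minimal sketch of that check, assuming a hypothetical step evaluator with millisecond timestamps; the field names mirror the ones mentioned above, but the real operator in the engine (see #246) differs in the details:

```go
package main

import "fmt"

// stepEvaluator is a stand-in for the engine's *_over_time operator state.
type stepEvaluator struct {
	StepTime         int64  // evaluation timestamp of the current step (ms)
	SelectRange      int64  // range of the *_over_time selector (ms)
	MetricAppearedTs *int64 // earliest timestamp this engine has for the series, if known
}

// windowStart returns the start of the current step's window and whether
// this engine can compute the step from its own data alone.
func (f *stepEvaluator) windowStart() (int64, bool) {
	startTime := f.StepTime - f.SelectRange
	if f.MetricAppearedTs != nil && startTime < *f.MetricAppearedTs {
		// The window reaches back before the data this engine holds, so
		// the missing part has to come from an engine with older data.
		return startTime, false
	}
	return startTime, true
}

func main() {
	appeared := int64(6_000_000)
	f := &stepEvaluator{StepTime: 7_000_000, SelectRange: 3_600_000, MetricAppearedTs: &appeared}
	start, complete := f.windowStart()
	fmt.Println(start, complete) // 3400000 false -> delegate this step to an older engine
}
```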

@fpetkovski
Collaborator Author

fpetkovski commented May 18, 2023

Hi @nishchay-veer, this is a possible solution and it is already implemented in #246. Feel free to review that PR and leave comments if you have any.

@nishchay-veer
Contributor

Which approach do you think is better? Just being curious.
