Support long range _over_time functions for distributed execution. #195

Open
fpetkovski opened this issue Mar 1, 2023 · 7 comments

@fpetkovski
Collaborator

fpetkovski commented Mar 1, 2023

Queries like sum_over_time[1d] can be challenging to distribute, since one engine might not have a large enough data range to calculate the result for a given step. For example, this exact query against a Prometheus with 6h retention will produce incorrect results.

I can think of two ways we can address this problem:

  • If the range is small enough (e.g. 5m), we can increase the start time in each engine by that amount. This makes sure the window for the first few steps evaluated by that engine has enough samples to produce a correct result. The values for the skipped steps would have to be calculated by an older engine (see the sketch after this list).
  • Figure out a way to distribute _over_time/rate functions. Maybe they can be transformed into something like
sum_over_time(append(sum_over_time([30m]), sum_over_time([30m]))[1h])

similar to how subqueries work.
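
A minimal sketch of the first approach, assuming a hypothetical coordinator that knows each remote engine's earliest available timestamp. The types and names below are illustrative only, not the actual promql-engine API:

```go
package main

import (
	"fmt"
	"time"
)

// rangeQuery describes a range query that evaluates an *_over_time selector.
type rangeQuery struct {
	Start, End  time.Time
	Step        time.Duration
	SelectRange time.Duration // e.g. 1d for sum_over_time(metric[1d])
}

// adjustForEngine pushes the query start forward so that every step the
// remote engine evaluates has a complete [t-SelectRange, t] window inside
// the engine's retention. Steps before the returned start have to be
// computed by an engine holding older data.
func adjustForEngine(q rangeQuery, engineMinTime time.Time) rangeQuery {
	earliestSafe := engineMinTime.Add(q.SelectRange)
	if !q.Start.Before(earliestSafe) {
		return q // the whole query is already safe for this engine
	}
	// Round up to the next step boundary at or after earliestSafe.
	offset := earliestSafe.Sub(q.Start)
	steps := int64((offset + q.Step - 1) / q.Step)
	q.Start = q.Start.Add(time.Duration(steps) * q.Step)
	return q
}

func main() {
	now := time.Now().Truncate(time.Hour)
	q := rangeQuery{
		Start:       now.Add(-8 * time.Hour),
		End:         now,
		Step:        time.Hour,
		SelectRange: 6 * time.Hour,
	}
	// An engine holding only the last ~10h of data cannot answer the
	// earliest steps of a 6h-range selector starting 8h ago.
	adjusted := adjustForEngine(q, now.Add(-10*time.Hour))
	fmt.Println("remote engine evaluates from:", adjusted.Start)
	fmt.Println("earlier steps must come from an engine with older data")
}
```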

@GiedriusS
Member

Be careful with (2) because you might run into grafana/mimir#2599. Not sure what would be a good way to solve that.

@fpetkovski
Collaborator Author

fpetkovski commented Mar 2, 2023

Thanks for pointing me to that issue! The trick might be to always have an overlap of one step.

So instead of having (3 - 1 + 41 - 40) / 120 as in their example, we can try (3 - 1 + 41 - 3) / 120.
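
To make the arithmetic concrete, here is one possible reading of those numbers as a toy sketch (the values come from the grafana/mimir#2599 discussion; the interpretation is an assumption): the first shard sees samples going from 1 to 3, the second from 40 to 41, and the full range covers 120s.

```go
package main

import "fmt"

func main() {
	const window = 120.0 // seconds covered by the full range

	// Summing the per-shard increases loses the increase that happens
	// between the shards (from 3 to 40), so the rate comes out far too low.
	withoutOverlap := ((3 - 1) + (41 - 40)) / window

	// With an overlap of one step, the second shard starts at the last
	// sample of the first shard (value 3), so no increase is lost.
	withOverlap := ((3 - 1) + (41 - 3)) / window

	fmt.Println(withoutOverlap) // 0.025
	fmt.Println(withOverlap)    // ~0.333, i.e. the full 1 -> 41 increase over the window
}
```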

@GiedriusS
Member

Mhm, it sounds like that could work! It will complicate the implementation, of course, but it should solve that issue. Perhaps it would also be cool to implement this in a way that makes it easy to add instant query sharding to the engine in the future.

@nishchay-veer
Contributor

Hey @fpetkovski, can I work on this issue?

@nishchay-veer
Contributor

One possible way to implement the first approach is to shift the start time by the select range, computing startTime := f.StepTime - f.SelectRange for each step. This ensures each engine knows the full window of time it needs to calculate the result for that step.
Next, we check whether that start time is earlier than the timestamp at which the metric appeared, using if f.MetricAppearedTs != nil && startTime < *f.MetricAppearedTs. If it is, the missing values need to be calculated by an older engine; for example, we could query the older engine for values between MetricAppearedTs and startTime and use them to fill in the missing steps.

Finally, we return the result for the current engine.
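
A minimal sketch of that check, assuming a hypothetical step evaluator with millisecond timestamps; the field names mirror the ones mentioned above, but the real operator in the engine (see #246) differs in the details:

```go
package main

import "fmt"

// stepEvaluator is a stand-in for the engine's *_over_time operator state.
type stepEvaluator struct {
	StepTime         int64  // evaluation timestamp of the current step (ms)
	SelectRange      int64  // range of the *_over_time selector (ms)
	MetricAppearedTs *int64 // earliest timestamp this engine has for the series, if known
}

// windowStart returns the start of the current step's window and whether
// this engine can compute the step from its own data alone.
func (f *stepEvaluator) windowStart() (int64, bool) {
	startTime := f.StepTime - f.SelectRange
	if f.MetricAppearedTs != nil && startTime < *f.MetricAppearedTs {
		// The window reaches back before the data this engine holds, so
		// the missing part has to come from an engine with older data.
		return startTime, false
	}
	return startTime, true
}

func main() {
	appeared := int64(6_000_000)
	f := &stepEvaluator{StepTime: 7_000_000, SelectRange: 3_600_000, MetricAppearedTs: &appeared}
	start, complete := f.windowStart()
	fmt.Println(start, complete) // 3400000 false -> delegate this step to an older engine
}
```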

@fpetkovski
Collaborator Author

fpetkovski commented May 18, 2023

Hi @nishchay-veer, this is a possible solution and it is already implemented in #246. Feel free to review that PR and leave comments if you have any.

@nishchay-veer
Contributor

Which approach do you think is better? Just being curious.
