Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

doc: add 20240508_temporal_filter_expressions #27001

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

maddyblue
Copy link
Contributor

Design doc to improve mz_now in temporal filters.

See #26184

Motivation

  • This PR adds a known-desirable feature.

Checklist

@benesch
Copy link
Member

benesch commented May 9, 2024

Neat! Who are the right people to review this? I tagged in @frankmcsherry as a start. I certainly don't have the subtleties of temporal filters and error handling paged in.

@maddyblue
Copy link
Contributor Author

Who are the right people to review this?

Certainly Frank, or someone else from compute who might be aware of hidden complexities. @antiguru perhaps?

Copy link
Contributor

@frankmcsherry frankmcsherry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments landed. Mostly, the expression transformations don't seem appropriate for MZ expression internals, but may be appropriate as a higher level rewrite, done on behalf of the user and in plain sight, vs behind the scenes.

Comment on lines +37 to +39
1. Teach temporal filters in `linear.rs` to recognize expressions like `mz_now() + some_mz_timestamp > x` and convert them to `mz_now() > x - some_mz_timestamp`.
(`-` and the other inequality operators also supported.)
This logic can happen repeatedly, so `mz_now() + x - y <= z` becomes `mz_now() <= z - x + y`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How strongly do you feel that it should be taught in linear.rs? My sense is that the proposed transformations are altering the behavior of the query, and while that may be desirable we should do it outside of the core engine rather than amidst it. For example, if the SQL layer wants to perform and own the transformation, and communicate to the rest of the system that when one types WHERE mz_now() + foo < bar it actually means WHERE mz_now() < bar - foo, that makes sense to me. But asking linear.rs to introduce this seems wrong, to me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By way of analogy, many folks want ORDER BY to be preserved across views. I think we've agreed that while this may be a good idea it isn't part of the core SQL spec, and that action (copy any ORDER BY clause to views that select from the ordered view) can be done, but it should be outside of the core SQL planning.

A concern about this is that the underlying data gain additional restrictions in order to not produce errors.
For example, `mz_now() - '10 years'::interval > x` is converted into `mz_now() > x + '10 years'::interval`,
so if `x` is within 10 years of u64::MAX milliseconds, the entire materialized view will error and be unreadable at any time.
Due to the range of `mz_timestamp` this seems unlikely to ever matter.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems unlikely to ever matter

What about x - '10 years'::interval dipping below u64::MIN? I'm not wild about this argument, that it won't matter, and would love instead to have the action localized to some part of the system, and an owner there who can curate the experience so that when it does happen we can be clear about what went wrong (vs a long debug spree to find the change in linear.rs that results in the behavior).

Comment on lines +45 to +48
We already have a tiny version of this problem too: the `=`, `<=`, and `>` inequalities all add 1 to the non-`mz_now()` side,
so if the underlying data contain `u64::MAX` and one of those inequalities, the materialized view will error.
Because we already have a small version of this problem and the `mz_timestamp` range is so large (~500M years),
allowing the users to reduce that range seems acceptable.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds like a bug, and we should fix it! Or at least file it. I don't believe compute's current posture is "this is fine" as much as "this is not known".

Comment on lines +55 to +57
## Alternatives


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would help to have alternatives listed! For example, avoid modifying linear.rs, and instead have a SQL-to-SQL layer that performs semantics altering rewrites on behalf of the user, and e.g. records these actions and warns users about them. I currently strongly prefer that, and would like to see it discussed.


## Open questions

- Is it always safe to move expressions around?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope! :D

At least, if the goal is to do so without altering semantics. I think we don't have a framework in place for resolving that tension: where is it "safe" to modify user queries, and in what framework.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get the feeling from this comment that even if I can come up with a better place to do the expression modification, we may still decide it's not safe to do. Thinking through this with the example:

WHERE mz_now() + y > x

this would error when mz_now() + y overflows for some positive y and underflows for some negative y. If we had a rewriter that converted it to

WHERE mz_now() > x - y

this would never error based on mz_now()'s value, but would instead error based on x's value in relation to y. Is this the type of thing you mean by "Nope, at least, if the goal is to do so without altering semantics."?

@maddyblue
Copy link
Contributor Author

See #27059 for fixing the temporal filter off-by-one error.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants