Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for aggregate expressions with filters #585

Merged
merged 9 commits into from Sep 8, 2022

Conversation

andygrove
Copy link
Collaborator

Closes #577

@coveralls
Copy link

coveralls commented Aug 19, 2022

Pull Request Test Coverage Report for Build 2980739859

  • 31 of 34 (91.18%) changed or added relevant lines in 6 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage increased (+0.01%) to 85.558%

Changes Missing Coverage Covered Lines Changed/Added Lines %
src/dialect/hive.rs 1 2 50.0%
src/dialect/mod.rs 1 2 50.0%
src/dialect/postgresql.rs 1 2 50.0%
Files with Coverage Reduction New Missed Lines %
src/ast/mod.rs 1 77.79%
Totals Coverage Status
Change from base Build 2957846519: 0.01%
Covered Lines: 9728
Relevant Lines: 11370

💛 - Coveralls

@andygrove andygrove marked this pull request as draft August 19, 2022 12:14
@andygrove andygrove changed the title Hive dialect cannot parse WHERE nested in FILTER WIP: Hive dialect cannot parse WHERE nested in FILTER Aug 19, 2022
@andygrove
Copy link
Collaborator Author

Code needs cleaning up but maybe this approach can work?


#[test]
fn filter_as_alias() {
let rename = "SELECT name filter FROM region";
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need a different style test - this fails because it is re-written as SELECT name AS filter FROM region

@xrl
Copy link

xrl commented Aug 19, 2022

This was the approach I was crawling towards. Hopefully I'm not being too obvious, but the filtering_during_aggregation test case fails:

#[test]
fn filtering_during_aggregation() {
    let rename = "SELECT array_agg(name) FILTER (WHERE name IS NOT NULL) FROM region";
    println!("{}", hive().verified_stmt(rename));
}

fails with

called `Result::unwrap()` on an `Err` value: ParserError("Expected end of statement, found: name")
thread 'filtering_during_aggregation' panicked at 'called `Result::unwrap()` on an `Err` value: ParserError("Expected end of statement, found: name")', src/test_utils.rs:86:61
stack backtrace:

@andygrove
Copy link
Collaborator Author

This was the approach I was crawling towards. Hopefully I'm not being too obvious, but the filtering_during_aggregation test case fails:

Ugh, this was passing at one point. I will try and take another look over the weekend.

@xrl
Copy link

xrl commented Aug 22, 2022

I think this is failing because the get_next_precedence function doesn't consider the FILTER token.

[2022-08-22T18:16:08Z DEBUG sqlparser::parser] Parsing sql 'SELECT array_agg(name) FILTER (WHERE name IS NOT NULL) FROM region'...
[2022-08-22T18:16:08Z DEBUG sqlparser::parser] parse query
[2022-08-22T18:16:08Z DEBUG sqlparser::parser] parsing expr
[2022-08-22T18:16:08Z DEBUG sqlparser::parser] parsing expr
[2022-08-22T18:16:08Z DEBUG sqlparser::parser] prefix: Identifier(Ident { value: "name", quote_style: None })
[2022-08-22T18:16:08Z DEBUG sqlparser::parser] get_next_precedence() RParen
[2022-08-22T18:16:08Z DEBUG sqlparser::parser] 0: ) 1: FILTER 2: (
[2022-08-22T18:16:08Z DEBUG sqlparser::parser] next precedence: 0
[2022-08-22T18:16:08Z DEBUG sqlparser::parser] breaking!
[2022-08-22T18:16:08Z DEBUG sqlparser::parser] prefix: Function(Function { name: ObjectName([Ident { value: "array_agg", quote_style: None }]), args: [Unnamed(Expr(Identifier(Ident { value: "name", quote_style: None })))], over: None, distinct: false, special: false })
[2022-08-22T18:16:08Z DEBUG sqlparser::parser] get_next_precedence() Word(Word { value: "FILTER", quote_style: None, keyword: FILTER })
[2022-08-22T18:16:08Z DEBUG sqlparser::parser] 0: FILTER 1: ( 2: WHERE
[2022-08-22T18:16:08Z DEBUG sqlparser::parser] next precedence: 0
[2022-08-22T18:16:08Z DEBUG sqlparser::parser] breaking!

@xrl
Copy link

xrl commented Aug 22, 2022

Hmm, I think the FILTER is applied to the specific expression in the select. It's not for the whole select, but for each column. The example of this SQL that led me down this rabbit hole:

   SELECT
     "organization_id"
   , "provider_pg_id" "provider_id"
   , "array_agg"("status") FILTER (WHERE ("badge_type" = 'AW'))[1] "aw_badge"
   , "array_agg"("status") FILTER (WHERE ("badge_type" = 'HBS'))[1] "hbs_badge"
   , "array_agg"("status") FILTER (WHERE ("badge_type" = 'RWE'))[1] "rwe_badge"
[[[ SNIP ]]]

so I think the logic has to be pushed "deeper" in to the parser logic, pushed from parse_select down to parse_select_item

@@ -36,4 +36,8 @@ impl Dialect for HiveDialect {
|| ch == '{'
|| ch == '}'
}

fn supports_filter_during_aggregation(&self) -> bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we would want to add this to src/dialect/postgres.rs as well since postgresql supports FILTER also https://www.postgresql.org/docs/current/sql-expressions.html

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@andygrove
Copy link
Collaborator Author

I will resume work on this tomorrow

@andygrove andygrove changed the title WIP: Hive dialect cannot parse WHERE nested in FILTER Add support for aggregate expressions with filters Sep 2, 2022
@andygrove andygrove marked this pull request as ready for review September 2, 2022 17:15
@andygrove
Copy link
Collaborator Author

@xrl @jdye64 @alamb This is now ready for review

@xrl
Copy link

xrl commented Sep 2, 2022

This work got me closer but I'm unable to parse expressions with the filter proceeded a [1] term.

    #[test]
    fn filter_with_array_access_sql() {
        env_logger::init();

        let sql = r#"SELECT
     "organization_id"
   , "provider_pg_id" "provider_id"
   , "array_agg"("status") FILTER (WHERE ("badge_type" = 'AW'))[1] "aw_badge"
   FROM
   asdf"#;
        sqlparser::parser::Parser::parse_sql(&sqlparser::dialect::HiveDialect {}, sql).unwrap();
    }

using this PR as a git crate dependency now yields:

     Running unittests src/lib.rs (target/debug/deps/sparkle_view_analyzer-f90def7b85a0f85e)
[2022-09-02T20:54:48Z DEBUG sqlparser::parser] Parsing sql 'SELECT
         "organization_id"
       , "provider_pg_id" "provider_id"
       , "array_agg"("status") FILTER (WHERE ("badge_type" = 'AW'))[1] "aw_badge"
       FROM
       asdf'...
[2022-09-02T20:54:48Z DEBUG sqlparser::parser] parsing expr
[2022-09-02T20:54:48Z DEBUG sqlparser::parser] prefix: Identifier(Ident { value: "organization_id", quote_style: Some('"') })
[2022-09-02T20:54:48Z DEBUG sqlparser::parser] get_next_precedence() Comma
[2022-09-02T20:54:48Z DEBUG sqlparser::parser] 0: , 1: "provider_pg_id" 2: "provider_id"
[2022-09-02T20:54:48Z DEBUG sqlparser::parser] next precedence: 0
[2022-09-02T20:54:48Z DEBUG sqlparser::parser] parsing expr
[2022-09-02T20:54:48Z DEBUG sqlparser::parser] prefix: Identifier(Ident { value: "provider_pg_id", quote_style: Some('"') })
[2022-09-02T20:54:48Z DEBUG sqlparser::parser] get_next_precedence() Word(Word { value: "provider_id", quote_style: Some('"'), keyword: NoKeyword })
[2022-09-02T20:54:48Z DEBUG sqlparser::parser] 0: "provider_id" 1: , 2: "array_agg"
[2022-09-02T20:54:48Z DEBUG sqlparser::parser] next precedence: 0
[2022-09-02T20:54:48Z DEBUG sqlparser::parser] parsing expr
[2022-09-02T20:54:48Z DEBUG sqlparser::parser] parsing expr
[2022-09-02T20:54:48Z DEBUG sqlparser::parser] prefix: Identifier(Ident { value: "status", quote_style: Some('"') })
[2022-09-02T20:54:48Z DEBUG sqlparser::parser] get_next_precedence() RParen
[2022-09-02T20:54:48Z DEBUG sqlparser::parser] 0: ) 1: FILTER 2: (
[2022-09-02T20:54:48Z DEBUG sqlparser::parser] next precedence: 0
[2022-09-02T20:54:48Z DEBUG sqlparser::parser] prefix: Function(Function { name: ObjectName([Ident { value: "array_agg", quote_style: Some('"') }]), args: [Unnamed(Expr(Identifier(Ident { value: "status", quote_style: Some('"') })))], over: None, distinct: false, special: false })
[2022-09-02T20:54:48Z DEBUG sqlparser::parser] get_next_precedence() Word(Word { value: "FILTER", quote_style: None, keyword: FILTER })
[2022-09-02T20:54:48Z DEBUG sqlparser::parser] 0: FILTER 1: ( 2: WHERE
[2022-09-02T20:54:48Z DEBUG sqlparser::parser] next precedence: 0
[2022-09-02T20:54:48Z DEBUG sqlparser::parser] parsing expr
[2022-09-02T20:54:48Z DEBUG sqlparser::parser] parsing expr
[2022-09-02T20:54:48Z DEBUG sqlparser::parser] prefix: Identifier(Ident { value: "badge_type", quote_style: Some('"') })
[2022-09-02T20:54:48Z DEBUG sqlparser::parser] get_next_precedence() Eq
[2022-09-02T20:54:48Z DEBUG sqlparser::parser] 0: = 1: 'AW' 2: )
[2022-09-02T20:54:48Z DEBUG sqlparser::parser] next precedence: 20
[2022-09-02T20:54:48Z DEBUG sqlparser::parser] parsing expr
[2022-09-02T20:54:48Z DEBUG sqlparser::parser] prefix: Value(SingleQuotedString("AW"))
[2022-09-02T20:54:48Z DEBUG sqlparser::parser] get_next_precedence() RParen
[2022-09-02T20:54:48Z DEBUG sqlparser::parser] 0: ) 1: ) 2: [
[2022-09-02T20:54:48Z DEBUG sqlparser::parser] next precedence: 0
[2022-09-02T20:54:48Z DEBUG sqlparser::parser] get_next_precedence() RParen
[2022-09-02T20:54:48Z DEBUG sqlparser::parser] 0: ) 1: ) 2: [
[2022-09-02T20:54:48Z DEBUG sqlparser::parser] next precedence: 0
[2022-09-02T20:54:48Z DEBUG sqlparser::parser] prefix: Nested(BinaryOp { left: Identifier(Ident { value: "badge_type", quote_style: Some('"') }), op: Eq, right: Value(SingleQuotedString("AW")) })
[2022-09-02T20:54:48Z DEBUG sqlparser::parser] get_next_precedence() RParen
[2022-09-02T20:54:48Z DEBUG sqlparser::parser] 0: ) 1: [ 2: 1
[2022-09-02T20:54:48Z DEBUG sqlparser::parser] next precedence: 0

called `Result::unwrap()` on an `Err` value: ParserError("Expected end of statement, found: [")
thread 'graph::tests::filter_with_array_access_sql' panicked at 'called `Result::unwrap()` on an `Err` value: ParserError("Expected end of statement, found: [")', sparkle-view-analyzer/src/graph/mod.rs:336:88

background: https://trino.io/docs/current/functions/aggregate.html

array_agg(x) → array<[same as input]>

    Returns an array created from the input x elements.

@xrl
Copy link

xrl commented Sep 3, 2022

Should the select item ("some_name")[1][20] be parseable? Right now it's not surviving a round trip.

Copy link
Contributor

@jdye64 jdye64 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@andygrove
Copy link
Collaborator Author

Should the select item ("some_name")[1][20] be parseable? Right now it's not surviving a round trip.

Hi @xrl. I am not sure I understand this. Is the goal to extract one array element from the aggregated array? I tried this in postgres and it also did not like this syntax:

postgres=# select array_agg(foo)[1] from arrays;
ERROR:  syntax error at or near "["
LINE 1: select array_agg(foo)[1] from arrays;
                             ^
postgres=# select array_agg(foo) filter (where foo is not null)[1] from arrays;
ERROR:  syntax error at or near "["
LINE 1: ...lect array_agg(foo) filter (where foo is not null)[1] from a...

Is this query working for you in Hive/Trino?

Maybe you could do this with a subquery for the aggregation? This worked for me in Postgres:

select x[1] from (select array_agg(foo) as x from arrays) y;

@andygrove
Copy link
Collaborator Author

@alamb Could you review when you have time

@xrl
Copy link

xrl commented Sep 6, 2022

Is this query working for you in Hive/Trino?

Yes, array agg then pull out an element. And the query does work on AWS Athena (presto under the hood).

image

@andygrove
Copy link
Collaborator Author

Is this query working for you in Hive/Trino?

Yes, array agg then pull out an element. And the query does work on AWS Athena (presto under the hood).

Interesting. Ok, I filed a new issue for supporting this. I think this is separate from the FILTER (WHERE) support.

@andygrove andygrove merged commit 303f80f into sqlparser-rs:main Sep 8, 2022
@andygrove andygrove deleted the hive-filter branch September 8, 2022 19:08
Copy link
Collaborator

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Every time I look at a PR in this repo I feel like I learn something new about SQL. TIL FILTER (where) 🤯

Thanks @andygrove this PR looks reasonable to me

@@ -42,6 +42,10 @@ impl Dialect for PostgreSqlDialect {
None
}
}

fn supports_filter_during_aggregation(&self) -> bool {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For what it is worth, I think the DataFusion actually uses the Generic Dialect, https://github.com/apache/arrow-datafusion/blob/0084aeb686b318cbdb49cab00cb8f15c9f520d1e/datafusion/sql/src/parser.rs#L102

Thus we may want to add return true from supports_filter_during_aggregation in https://github.com/sqlparser-rs/sqlparser-rs/blob/main/src/dialect/generic.rs as well

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Dask SQL has their own dialect where they want to use this, but it makes sense to add to the dialect that DataFusion is using as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Hive dialect cannot parse WHERE nested in FILTER
5 participants