
Add benchmarks for testing row filtering #3769

Merged
merged 6 commits into master from row-filter-benchmarks on Oct 12, 2022

Conversation

thinkharderdev
Contributor

Which issue does this PR close?

Closes #3457

Rationale for this change

We need a set of benchmarks for evaluating the performance implications of parquet predicate pushdown. This PR sets up some very basic benchmarks that can be used for that purpose. Thanks to @tustvold for cooking up a script to generate synthetic datasets.

What changes are included in this PR?

Adds a new benchmark script, `parquet_filter_pushdown`, which executes a series of ParquetExec plans with different filter predicates. For each predicate in the suite, the plan is executed with all three ParquetScanOptions configurations (see the sketch after this list):

  1. Disable filter pushdown entirely (currently the default behavior).
  2. Enable filter pushdown with predicate reordering.
  3. Enable filter pushdown without predicate reordering.
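
A minimal sketch of the three configurations, assuming two boolean knobs per scan; the struct and field names below are illustrative stand-ins, not the actual DataFusion API:

```rust
fn main() {
    // Illustrative stand-in for DataFusion's ParquetScanOptions; the real
    // struct and field names in the benchmark may differ.
    #[derive(Clone, Copy, Debug)]
    struct ScanConfig {
        pushdown_filters: bool,
        reorder_predicates: bool,
    }

    // The three configurations each predicate in the suite is run under.
    let configs = [
        // 1. Filter pushdown disabled entirely (the current default)
        ScanConfig { pushdown_filters: false, reorder_predicates: false },
        // 2. Pushdown enabled with predicate reordering
        ScanConfig { pushdown_filters: true, reorder_predicates: true },
        // 3. Pushdown enabled without predicate reordering
        ScanConfig { pushdown_filters: true, reorder_predicates: false },
    ];

    for config in configs {
        println!("{config:?}");
    }
}
```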

Are there any user-facing changes?

No

@thinkharderdev
Contributor Author

@alamb @tustvold @Ted-Jiang

];

let filter_matrix = vec![
// Selective-ish filter
Member

well-defined test case and test data! 👍
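
(For readers following along: entries in `filter_matrix` are DataFusion `Expr` values built with the `col`/`lit` helpers, as the later hunks show. A hypothetical "selective-ish" entry might look like this:)

```rust
use datafusion::prelude::{col, lit};

// Hypothetical example of one filter_matrix entry: keeps only non-GET
// requests, filtering out a large share of the generated rows.
let selective_ish = col("request_method").not_eq(lit("GET"));
```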

path: PathBuf,

/// Batch size when reading Parquet files
#[structopt(short = "s", long = "batch-size", default_value = "8192")]
Member

I think there are two short options 's'.

Contributor

In fact, when you run the example in debug mode it panics on exactly this problem:

alamb@aal-dev:~/arrow-datafusion$ cargo run --bin parquet_filter_pushdown -- --path ./data --scale-factor 1.0
...
     Running `target/debug/parquet_filter_pushdown --path ./data --scale-factor 1.0`
thread 'main' panicked at 'Argument short must be unique

	-s is already in use', /home/alamb/.cargo/registry/src/github.com-1ecc6299db9ec823/clap-2.34.0/src/app/parser.rs:190:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Contributor Author

Yeah, just removed one of them. I don't think batch size needs to be a CLI option in this benchmark.
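
A hedged sketch of the collision and the fix; the `scale_factor` field is inferred from the command line in the log above, and the exact field definitions in the benchmark may differ:

```rust
use structopt::StructOpt;

#[derive(Debug, StructOpt)]
struct Opt {
    /// Scale factor for generated data (inferred from `--scale-factor` above)
    #[structopt(short = "s", long = "scale-factor", default_value = "1.0")]
    scale_factor: f64,

    // A second field also claiming `short = "s"` is what makes clap 2.x
    // panic at runtime with "Argument short must be unique":
    //
    //     #[structopt(short = "s", long = "batch-size", default_value = "8192")]
    //     batch_size: usize,
    //
    // The fix here was simply to drop the batch-size CLI option.
}

fn main() {
    let opt = Opt::from_args();
    println!("{:?}", opt);
}
```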

@Ted-Jiang
Member

@thinkharderdev thanks for your great benchmark.
I ran parquet-tools locally against the generated file (1.0 GB) and got:

(venv) yangjiang@LM-SHC-15009782 data % parquet-tools column-index  ./logs.parquet                                                     
row group 0:
column index for column service:
Boudary order: UNORDERED
                      null count  min                                       max                                     
page-0                         0  backend                                   frontend                                

offset index for column service:
                          offset   compressed size       first row index
page-0                        62               117                     0

column index for column host:
Boudary order: UNORDERED
                      null count  min                                       max                                     
page-0                         0  i-1ec3ca3151468928.ec2.internal           i-1ec408f54dbd3750.ec2.internal         

offset index for column host:
                          offset   compressed size       first row index
page-0                       566               125                     0

column index for column pod:
Boudary order: UNORDERED
                      null count  min                                       max                                     
page-0                         0  aejowuublavflbbsvlfozigwpmrxldvhaollk     zxxlzhdrucrhpicpdgxtfpyuknvviimggtq     

offset index for column pod:
                          offset   compressed size       first row index
page-0                      6689               602                     0

column index for column container:
Boudary order: UNORDERED
                      null count  min                                       max                                     
page-0                         0  backend_container_0                       frontend_container_1                    

offset index for column container:
                          offset   compressed size       first row index
page-0                      7602               593                     0

There are at most two pages per column. I think if we adjust the writer to produce more pages per column (e.g. by reducing the page size), we will see a greater benefit from enabling `enable_page_index`, since there will be more opportunities to skip whole pages without decoding! 🤔

FYI, I see Impala chooses a fixed number of rows per page when benchmarking, to get good performance.
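
A hedged sketch of that adjustment, using the parquet crate's writer properties (the method name matches the parquet 24-era builder API; newer releases also offer a row-count-based page limit):

```rust
use parquet::file::properties::WriterProperties;

// Shrink the data page size so each column chunk spans many pages,
// giving the page index more opportunities to skip whole pages.
let props = WriterProperties::builder()
    .set_data_pagesize_limit(16 * 1024) // ~16 KiB pages vs. the 1 MiB default
    .build();
```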

@alamb
Contributor

alamb commented Oct 11, 2022

Thank you @thinkharderdev -- I plan to review this PR in detail later today

@alamb (Contributor) left a comment

This looks great -- thank you @thinkharderdev

I also verified the parquet file that was created

$ du -s -h /tmp/data/logs.parquet 
988M    /tmp/data/logs.parquet

It looks good to me (using the neat pqrs tool from @manojkarthick)

alamb@aal-dev:~/2022-10-05-slow-query-high-cardinality$ pqrs cat --csv /tmp/data/logs.parquet  | head 

############################
File: /tmp/data/logs.parquet
############################

service,host,pod,container,image,time,client_addr,request_duration_ns,request_user_agent,request_method,request_host,request_bytes,response_bytes,response_status
backend,i-1ec3ca3151468928.ec2.internal,aqcathnxqsphdhgjtgvxsfyiwbmhlmg,backend_container_0,backend_container_0@sha256:30375999bf03beec2187843017b10c9e88d8b1a91615df4eb6350fb39472edd9,1970-01\
-01T00:00:00.000000000+00:00,127.216.178.64,-1261239112,rkxttrfiiietlsaygzphhwlqcgngnumuphliejmxfdznuurswhdcicrlprbnocibvsbukiohjjbjdygwbfhxqvurm,PUT,https://backend.mydomain.com,-312099516,1\
448834362,200
backend,i-1ec3ca3151468928.ec2.internal,aqcathnxqsphdhgjtgvxsfyiwbmhlmg,backend_container_0,backend_container_0@sha256:30375999bf03beec2187843017b10c9e88d8b1a91615df4eb6350fb39472edd9,1970-01\
-01T00:00:00.000001024+00:00,187.49.24.179,1374800428,sdxkctvmvuqxhwigrhjaouwdzvasqlqphymcgqvfmsbjswswnzgvanmalnmvsvruakcudmqvzateabhlya,PATCH,https://backend.mydomain.com,-1363067408,1111765\
98,204
backend,i-1ec3ca3151468928.ec2.internal,aqcathnxqsphdhgjtgvxsfyiwbmhlmg,backend_container_0,backend_container_0@sha256:30375999bf03beec2187843017b10c9e88d8b1a91615df4eb6350fb39472edd9,1970-01\
-01T00:00:00.000002048+00:00,14.29.229.168,-1795280692,bhlvymbbtgcqrwzujukyotusnsoidygnklhx,GET,https://backend.mydomain.com,-1323615082,-705662117,400
backend,i-1ec3ca3151468928.ec2.internal,aqcathnxqsphdhgjtgvxsfyiwbmhlmg,backend_container_0,backend_container_0@sha256:30375999bf03beec2187843017b10c9e88d8b1a91615df4eb6350fb39472edd9,1970-01\
-01T00:00:00.000003072+00:00,180.188.29.17,-717290117,hjaynltdswdekcguqmrkucsepzqjhasklmimkibabijihitimmsglgettywifdzmraipvyvekczuwxettayslrffyz,HEAD,https://backend.mydomain.com,-1847395296,\
1206750179,200
backend,i-1ec3ca3151468928.ec2.internal,aqcathnxqsphdhgjtgvxsfyiwbmhlmg,backend_container_0,backend_container_0@sha256:30375999bf03beec2187843017b10c9e88d8b1a91615df4eb6350fb39472edd9,1970-01\
-01T00:00:00.000004096+00:00,68.92.115.208,759902764,yupopowlaqbwskdwvtlitugpzzxoajhvnmndhca,DELETE,https://backend.mydomain.com,-50170254,-415949533,403
backend,i-1ec3ca3151468928.ec2.internal,aqcathnxqsphdhgjtgvxsfyiwbmhlmg,backend_container_0,backend_container_0@sha256:30375999bf03beec2187843017b10c9e88d8b1a91615df4eb6350fb39472edd9,1970-01\
-01T00:00:00.000005120+00:00,230.160.203.201,-1271567754,pwbruedgdgtsavjuksxwkecxulbnjbsaltuvcjxcmblhnraawouvrunwwsmvjbq,GET,https://backend.mydomain.com,-1193079450,1281912293,204
backend,i-1ec3ca3151468928.ec2.internal,aqcathnxqsphdhgjtgvxsfyiwbmhlmg,backend_container_0,backend_container_0@sha256:30375999bf03beec2187843017b10c9e88d8b1a91615df4eb6350fb39472edd9,1970-01\
-01T00:00:00.000006144+00:00,249.254.50.191,-971196614,amtuqookzibtvrtqfnyzuyesikbrafhcfnjhoaoedvmlwpkypfsedtbbwlbnzigwgjpzcwdxtwhrykcibmhlxnkckynvgli,PATCH,https://backend.mydomain.com,-2627\
74709,-1695212300,204
backend,i-1ec3ca3151468928.ec2.internal,aqcathnxqsphdhgjtgvxsfyiwbmhlmg,backend_container_0,backend_container_0@sha256:30375999bf03beec2187843017b10c9e88d8b1a91615df4eb6350fb39472edd9,1970-01\
-01T00:00:00.000007168+00:00,77.183.81.164,-547300163,ogkufdxssjqzjphxwvegwvofchpsgntbyslgarcyqcawokzfoppdftoctmtlwcvikazwrujlgrzrlqueaaceibxvdicfhp,HEAD,https://backend.mydomain.com,-1349820\
595,-327759246,204
backend,i-1ec3ca3151468928.ec2.internal,aqcathnxqsphdhgjtgvxsfyiwbmhlmg,backend_container_0,backend_container_0@sha256:30375999bf03beec2187843017b10c9e88d8b1a91615df4eb6350fb39472edd9,1970-01\
-01T00:00:00.000008192+00:00,63.17.88.115,-88404773,ogardohhoorttptpnkxmvyenqfzvvkjabcrfwapoywttjdunvmlgwgstmsjbefxqta,HEAD,https://backend.mydomain.com,1830978558,,200
Error: ArrowReadWriteError(CsvError("Broken pipe (os error 32)"))

let generator = Generator::new();

let file = File::create(&path).unwrap();
let mut writer = ArrowWriter::try_new(file, generator.schema.clone(), None).unwrap();
Contributor

I wonder if we should make the properties used here explicit?

Like maybe explicitly setting what type of statistics are created, as well as potentially setting ZSTD compression.

https://docs.rs/parquet/24.0.0/parquet/file/properties/struct.WriterPropertiesBuilder.html
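
A hedged sketch of what that might look like against the linked parquet 24 builder API; the `EnabledStatistics::Page` choice is an assumption, not something the PR settled on:

```rust
use parquet::basic::Compression;
use parquet::file::properties::{EnabledStatistics, WriterProperties};

// Make the writer configuration explicit instead of passing `None`:
// page-level statistics plus ZSTD compression.
let props = WriterProperties::builder()
    .set_compression(Compression::ZSTD)
    .set_statistics_enabled(EnabledStatistics::Page)
    .build();

let mut writer = ArrowWriter::try_new(file, generator.schema.clone(), Some(props)).unwrap();
```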

Contributor Author

Yeah, I'll need to revisit this again once apache/arrow-rs#2854 is released and pulled in, so we can generate the files with proper page sizes (which should make a significant difference).

thinkharderdev and others added 2 commits October 11, 2022 15:44
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
@thinkharderdev
Contributor Author

> This looks great -- thank you @thinkharderdev
>
> I also verified the parquet file that was created [...]

I stole the gen code from @tustvold so you know it works :)

@alamb
Contributor

alamb commented Oct 12, 2022

The failing CI check is unrelated: #3798

combine_filters(&[
col("request_method").not_eq(lit("GET")),
col("response_status").eq(lit(400_u16)),
// TODO this fails in the FilterExec with Error: Internal("The type of Dictionary(Int32, Utf8) = Utf8 of binary physical should be same")
Contributor

Coercion!
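
For context, the error comes from comparing a `Dictionary(Int32, Utf8)` column against a plain `Utf8` literal without type coercion. A hypothetical workaround (not part of this PR) is to cast the column explicitly:

```rust
use datafusion::arrow::datatypes::DataType;
use datafusion::logical_expr::cast;
use datafusion::prelude::{col, lit};

// Casting the dictionary-encoded column to Utf8 before comparing avoids the
// Dictionary(Int32, Utf8) vs. Utf8 physical-type mismatch in FilterExec.
let predicate = cast(col("request_method"), DataType::Utf8).not_eq(lit("GET"));
```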

@alamb alamb merged commit fb39d5d into master Oct 12, 2022
@alamb alamb deleted the row-filter-benchmarks branch October 12, 2022 10:45
@ursabot

ursabot commented Oct 12, 2022

Benchmark runs are scheduled for baseline = ae5b23e and contender = fb39d5d. fb39d5d is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
