Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update thrift v0.16 and vendor parquet-format (#2502) #2626

Merged
merged 9 commits into from Sep 10, 2022

Conversation

tustvold
Copy link
Contributor

@tustvold tustvold commented Sep 1, 2022

Which issue does this PR close?

Closes #2502

Rationale for this change

See ticket

What changes are included in this PR?

Vendors the generated thrift code, and updates to thrift v0.16. Unfortunately thrift no longer generates enumerations, instead generating constants. This requires some churn in the conversion logic, and making some things fallible that previously weren't.

Are there any user-facing changes?

Yes, although only to the low-level interfaces.

@tustvold tustvold added the api-change Changes to the arrow API label Sep 1, 2022
@github-actions github-actions bot added the parquet Changes to the parquet crate label Sep 1, 2022
@alamb alamb changed the title Update thrift v0.16 vendor parquet-format (#2502) Update thrift v0.16 and vendor parquet-format (#2502) Sep 9, 2022
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me 👍

I didn't review the generated thift code.

I also could not verify I could re-create them with the instructions in CONTRIBUTING.md - I think those instructions should be updated while this is still fresh in our minds

r/src/arrowExports.cpp linguist-generated=true
r/man/*.Rd linguist-generated=true

parquet/src/format.rs linguist-generated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

parquet/CONTRIBUTING.md Outdated Show resolved Hide resolved
bytes = { version = "1.1", default-features = false, features = ["std"] }
thrift = { version = "0.13", default-features = false }
thrift = { version = "0.16", default-features = false }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -61,6 +61,9 @@ macro_rules! experimental {
pub mod errors;
pub mod basic;

#[allow(clippy::derivable_impls, clippy::match_single_binding)]
pub mod format;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pub mod format;
/// Automatically generated code for reading parquet thrift definition.
// see parquet/CONTRIBUTING.md for instructions on regenerating
pub mod format;

parquet::Type::DOUBLE => Type::DOUBLE,
parquet::Type::BYTE_ARRAY => Type::BYTE_ARRAY,
parquet::Type::FIXED_LEN_BYTE_ARRAY => Type::FIXED_LEN_BYTE_ARRAY,
_ => return Err(general_err!("unexpected type: {}", value.0)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
_ => return Err(general_err!("unexpected type: {}", value.0)),
_ => return Err(general_err!("unexpected parquet type: {}", value.0)),

parquet::FieldRepetitionType::REQUIRED => Repetition::REQUIRED,
parquet::FieldRepetitionType::OPTIONAL => Repetition::OPTIONAL,
parquet::FieldRepetitionType::REPEATED => Repetition::REPEATED,
_ => return Err(general_err!("unexpected repetition type: {}", value.0)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
_ => return Err(general_err!("unexpected repetition type: {}", value.0)),
_ => return Err(general_err!("unexpected parquet repetition type: {}", value.0)),

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am suggesting adding parquet to all the errors thinking of the poor end user who will see this without the potential stack to know what the context is

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

potential stack to know what the context is

If only there was a well established solution to this conundrum... grumble 😆

parquet::Encoding::DELTA_BYTE_ARRAY => Encoding::DELTA_BYTE_ARRAY,
parquet::Encoding::RLE_DICTIONARY => Encoding::RLE_DICTIONARY,
parquet::Encoding::BYTE_STREAM_SPLIT => Encoding::BYTE_STREAM_SPLIT,
_ => return Err(general_err!("unexpected encoding: {}", value.0)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
_ => return Err(general_err!("unexpected encoding: {}", value.0)),
_ => return Err(general_err!("unexpected parquet encoding: {}", value.0)),

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was curious about the other values (as in what variants are _ matching) so I commented it out:

   Compiling parquet v22.0.0 (/Users/alamb/Software/arrow-rs/parquet)
error[E0004]: non-exhaustive patterns: `Encoding(i32::MIN..=-1_i32)`, `Encoding(1_i32)` and `Encoding(10_i32..=i32::MAX)` not covered
   --> parquet/src/basic.rs:769:18
    |
769 |         Ok(match value {
    |                  ^^^^^ patterns `Encoding(i32::MIN..=-1_i32)`, `Encoding(1_i32)` and `Encoding(10_i32..=i32::MAX)` not covered
    |

In case any other reviewer was interestd

parquet::PageType::INDEX_PAGE => PageType::INDEX_PAGE,
parquet::PageType::DICTIONARY_PAGE => PageType::DICTIONARY_PAGE,
parquet::PageType::DATA_PAGE_V2 => PageType::DATA_PAGE_V2,
_ => return Err(general_err!("unexpected page type: {}", value.0)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
_ => return Err(general_err!("unexpected page type: {}", value.0)),
_ => return Err(general_err!("unexpected parquet page type: {}", value.0)),

parquet/CONTRIBUTING.md Outdated Show resolved Hide resolved
# docker build just builds a docker image with thrift dependencies
$ docker build -t thrift build/docker/ubuntu-bionic
# build/docker/scripts/cmake.sh actually compiles thrift
$ docker run -v $(pwd):/thrift/src -it thrift build/docker/scripts/cmake.sh && wget https://raw.githubusercontent.com/apache/parquet-format/apache-parquet-format-2.9.0/src/main/thrift/parquet.thrift && ./cmake_build/compiler/cpp/bin/thrift --gen rs parquet.thrift
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This command did not complete successfully for me.

...
441: ----------------------------------------------------------------------
441: Ran 2 tests in 0.012s
441: 
441: OK
441: Traceback (most recent call last):
441:   File "/thrift/src/test/py/TestServer.py", line 403, in <module>
441:     from thrift.TMultiplexedProcessor import TMultiplexedProcessor
441:   File "/thrift/src/lib/py/build/lib.linux-x86_64-2.7/thrift/TMultiplexedProcessor.py", line 20, in <module>
441:     from thrift.Thrift import TProcessor, TMessageType
441: ImportError: No module named Thrift
441: t.py
441: ----
441: ----------------
441:  Executing Client/Server tests with various generated code directories
441:  Servers to be tested: TSimpleServer, TThreadedServer, TThreadPoolServer, TNonblockingServer, THttpServer, TProcessPoolServer, TForkingServer
441:  Directories to be tested: gen-py-default, gen-py-slots, gen-py-oldstyle, gen-py-no_utf8strings, gen-py-dynamic, gen-py-dynamicslots
441:  Protocols to be tested: accel, accelc, binary, compact, json, header
441:  Options to be tested: ZLIB(yes/no), SSL(yes/no)
441: ----------------
441: 
441: Test run #0:  (includes gen-py-default) Server=TSimpleServer,  Proto=accel,  zlib=False,  SSL=False
441: Testing server TSimpleServer: /usr/bin/python /thrift/src/test/py/TestServer.py --protocol=accel --port=9090 TSimpleServer
441: FAIL: Server process (/usr/bin/python /thrift/src/test/py/TestServer.py --protocol=accel --port=9090 TSimpleServer) failed with retcode 1
441: Traceback (most recent call last):
441:   File "/thrift/src/test/py/RunClientServer.py", line 323, in <module>
441:     sys.exit(main())
441:   File "/thrift/src/test/py/RunClientServer.py", line 315, in main
441:     tests.test_feature('gendir', generated_dirs)
441:   File "/thrift/src/test/py/RunClientServer.py", line 230, in test_feature
441:     if self.run(conf, test_count):
441:   File "/thrift/src/test/py/RunClientServer.py", line 219, in run
441:     runServiceTest(self.libdir, self.genbase, genpydir, try_server, try_proto, self.port, with_zlib, with_ssl, self.verbose)
441:   File "/thrift/src/test/py/RunClientServer.py", line 157, in runServiceTest
441:     ensureServerAlive()
441:   File "/thrift/src/test/py/RunClientServer.py", line 140, in ensureServerAlive
441:     % (server_class, ' '.join(server_args)))
441: Exception: Server subprocess TSimpleServer died, args: /usr/bin/python /thrift/src/test/py/TestServer.py --protocol=accel --port=9090 TSimpleServer
441/441 Test #441: python_test .......................***Failed   28.49 sec

99% tests passed, 6 tests failed out of 441

Total Test time (real) = 250.09 sec

The following tests FAILED:
	382 - TInterruptTest (Failed)
	401 - TNonblockingSSLServerTest (Failed)
	403 - SecurityTest (Failed)
	404 - SecurityFromBufferTest (Failed)
	433 - PythonTestSSLSocket (Failed)
	441 - python_test (Failed)
Errors while running CTest

Copy link
Contributor Author

@tustvold tustvold Sep 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I ended up just using the version vended by Arch linux... Perhaps I will just update the instructions to do that 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah -- maybe docker run a arch linux image?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, although of course the formatting is now different... But at least we now have a one-liner

To generate the parquet format (thrift definitions) code run from the repository root run

```
$ docker run -v $(pwd):/thrift/src -it archlinux pacman -Sy --noconfirm thrift && wget https://raw.githubusercontent.com/apache/parquet-format/apache-parquet-format-2.9.0/src/main/thrift/parquet.thrift -O /tmp/parquet.thrift && thrift --gen rs /tmp/parquet.thrift && sed -i '/use thrift::server::TProcessor;/d' parquet.rs && mv parquet.rs parquet/src/format.rs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried this and it works great:

(arrow_dev) alamb@MacBook-Pro-8:~/Software/arrow-rs$ diff parquet.rs parquet/src/format.rs 
26d25
< use thrift::server::TProcessor;

Copy link
Member

@sunchao sunchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I only managed to skim through it but overall looks great. Thanks @tustvold !

@tustvold tustvold merged commit d88ed6a into apache:master Sep 10, 2022
@ursabot
Copy link

ursabot commented Sep 10, 2022

Benchmark runs are scheduled for baseline = 2f360e1 and contender = d88ed6a. d88ed6a is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@alamb
Copy link
Contributor

alamb commented Sep 11, 2022

🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Inline Generated Thift Code Into Parquet Crate
4 participants