Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add page index reader test for all types and support empty index. #2012

Merged
merged 7 commits into from Jul 8, 2022

Conversation

Ted-Jiang
Copy link
Member

Which issue does this PR close?

Closes #2010.

Rationale for this change

After apache/parquet-testing#25 merged, there will be standard page index test file in parquet-testing.All types support page index in parquet-format has add test in this pr.

Another change:
I found there is a situation one col has pageLocation but without min_max index(馃槀 i missed it). So add EMPTY_ARRAY in enum Index.

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jul 6, 2022
@Ted-Jiang Ted-Jiang changed the title Add page index reader test and support empty index. Add page index reader test for all types and support empty index. Jul 6, 2022
@codecov-commenter
Copy link

codecov-commenter commented Jul 7, 2022

Codecov Report

Merging #2012 (d18f459) into master (2309157) will decrease coverage by 0.03%.
The diff coverage is 96.90%.

@@            Coverage Diff             @@
##           master    #2012      +/-   ##
==========================================
- Coverage   83.58%   83.55%   -0.04%     
==========================================
  Files         222      222              
  Lines       57529    58059     +530     
==========================================
+ Hits        48088    48510     +422     
- Misses       9441     9549     +108     
Impacted Files Coverage 螖
parquet/src/file/page_index/index.rs 69.23% <酶> (+46.15%) 猬嗭笍
parquet/src/file/serialized_reader.rs 95.00% <96.73%> (+0.04%) 猬嗭笍
parquet/src/file/metadata.rs 95.39% <100.00%> (酶)
parquet/src/file/page_index/index_reader.rs 90.90% <100.00%> (+10.60%) 猬嗭笍
parquet/src/arrow/array_reader/empty_array.rs 45.45% <0.00%> (-10.11%) 猬囷笍
arrow/src/compute/kernels/arity.rs 81.05% <0.00%> (-9.86%) 猬囷笍
parquet/src/arrow/array_reader/struct_array.rs 78.99% <0.00%> (-9.69%) 猬囷笍
parquet/src/arrow/array_reader/map_array.rs 58.82% <0.00%> (-8.98%) 猬囷笍
parquet/src/column/reader/decoder.rs 69.23% <0.00%> (-7.05%) 猬囷笍
parquet/src/arrow/array_reader/null_array.rs 81.48% <0.00%> (-6.52%) 猬囷笍
... and 25 more

Continue to review full report at Codecov.

Legend - Click here to learn more
螖 = absolute <relative> (impact), 酶 = not affected, ? = missing data
Powered by Codecov. Last update 2309157...d18f459. Read the comment docs.

@Ted-Jiang Ted-Jiang closed this Jul 7, 2022
@Ted-Jiang Ted-Jiang reopened this Jul 7, 2022
@Ted-Jiang
Copy link
Member Author

@alamb @tustvold PTAL 馃槃

@tustvold tustvold added the api-change Changes to the arrow API label Jul 7, 2022
Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking really nice 馃憤

Only some minor nits and then this can go in

parquet/src/file/page_index/index.rs Outdated Show resolved Hide resolved
Ted-Jiang and others added 2 commits July 8, 2022 10:31
Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>
@Ted-Jiang Ted-Jiang closed this Jul 8, 2022
@Ted-Jiang Ted-Jiang reopened this Jul 8, 2022
@Ted-Jiang
Copy link
Member Author

@tustvold do you know this error in IT

==========================================================
Testing file duration
==========================================================
Traceback (most recent call last):
  File "/arrow/dev/archery/archery/integration/runner.py", line 246, in _run_ipc_test_case
    run_binaries(producer, consumer, test_case)
  File "/arrow/dev/archery/archery/integration/runner.py", line 286, in _produce_consume
    consumer.stream_to_file(producer_stream_path, consumer_file_path)
  File "/arrow/dev/archery/archery/integration/tester_csharp.py", line 63, in stream_to_file
    self.run_shell_command(cmd)
  File "/arrow/dev/archery/archery/integration/tester.py", line 49, in run_shell_command
    subprocess.check_call(cmd, shell=True)
  File "/opt/conda/envs/arrow/lib/python3.10/subprocess.py", line 369, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '/arrow/csharp/artifacts/Apache.Arrow.IntegrationTest/Debug/net6.0/Apache.Arrow.IntegrationTest --mode stream-to-file -a /tmp/tmp3okdpdw5/124c7ca4_generated_datetime.consumer_stream_as_file < /tmp/tmp3okdpdw5/124c7ca4_generated_datetime.producer_file_as_stream' returned non-zero exit status 1.
################# FAILURES #################
FAILED TEST: datetime Java producing,  C# consuming
1 failures
1
Error: `docker-compose --file /home/runner/work/arrow-rs/arrow-rs/docker-compose.yml run --rm -e ARCHERY_INTEGRATION_WITH_RUST=1 conda-integration` exited with a non-zero exit code 1, see the process log above.
The docker-compose command was invoked with the following parameters:
Defaults defined in .env:
  ALMALINUX: 8
  ARCH: amd64
  ARCH_ALIAS: x86_64
  ARCH_SHORT: amd64
  ARROW_R_DEV: TRUE
  BUILDKIT_INLINE_CACHE: 1
  CLANG_TOOLS: 12
  COMPOSE_DOCKER_CLI_BUILD: 1
  CONAN: gcc10
  CUDA: 9.1
  DASK: latest
  DEBIAN: 11
  DEVTOOLSET_VERSION: -1
  DOCKER_BUILDKIT: 1
  DOCKER_VOLUME_PREFIX: 
  DOTNET: 6.0
  FEDORA: 35
  GCC_VERSION: 
  GO: 1.16
  HDFS: 3.2.1
  JDK: 8
  KARTOTHEK: latest
  LLVM: 13
  MAVEN: 3.5.4
  NODE: 16
  NUMPY: latest
  PANDAS: latest
  PYTHON: 3.8
  PYTHON_WHEEL_WINDOWS_IMAGE_REVISION: 2022-06-12
  R: 4.2
  REPO: apache/arrow-dev
  R_CUSTOM_CCACHE: false
  R_IMAGE: ubuntu-gcc-release
  R_ORG: rhub
  R_PRUNE_DEPS: FALSE
  R_TAG: latest
  SPARK: master
  TURBODBC: latest
  TZ: UTC
  UBUNTU: 20.04
  ULIMIT_CORE: -1
  VCPKG: 38bb87c
Archery was called with:

@tustvold
Copy link
Contributor

tustvold commented Jul 8, 2022

See #1931

/// Sometimes reading page index from parquet file
/// will only return pageLocations without min_max index,
/// Use `EMPTY_ARRAY` representing None will be more convenient.
EMPTY_ARRAY,
Copy link
Contributor

@tustvold tustvold Jul 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I would prefer None to indicate absence of index information. EMPTY_ARRAY makes me think of an empty byte array or something

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make sense.

Ted-Jiang and others added 2 commits July 8, 2022 23:25
Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>
@tustvold tustvold merged commit 373ac81 into apache:master Jul 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Enable column page index read test for all types
3 participants