[BUG] chunked parquet reader is not factoring empty dataframes with >0 columns present #15743

Closed
galipremsagar opened this issue May 14, 2024 · 2 comments · Fixed by #15757
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code.

Comments

@galipremsagar
Contributor

Describe the bug
A dataframe can have >0 columns when it has 0 rows. There are two issues at play here:

  1. The chunked parquet reader returns False from has_next, yet read_chunk correctly returns an empty dataframe.
  2. When a parquet file is completely exhausted, has_next returns False and read_chunk raises a RuntimeError, as expected. But in the case of empty dataframes, has_next returns False and read_chunk keeps returning the empty dataframe endlessly without any error.
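
To illustrate why the first issue matters, here is a minimal sketch of the usual consumer loop, which only calls read_chunk while has_next is true. It uses a hypothetical pure-Python stand-in (`ToyChunkedReader`, not part of cuDF) that models a table as a dict of column lists. The `buggy` flag is an assumption made only to mimic the behavior reported here.

```python
class ToyChunkedReader:
    """Hypothetical stand-in for a chunked reader; not the cuDF implementation."""

    def __init__(self, table, chunk_rows=2, buggy=False):
        self._table = table            # dict: column name -> list of values
        self._chunk_rows = chunk_rows
        self._buggy = buggy            # assumption: mimics the reported behavior
        self._pos = 0
        self._emitted = False          # True once any chunk has been returned
        self._num_rows = len(next(iter(table.values()), []))

    def has_next(self):
        if self._pos < self._num_rows:
            return True
        # Zero-row table: the reported behavior says "nothing to read", while
        # the expected behavior offers one empty chunk that carries the schema.
        return self._num_rows == 0 and not self._emitted and not self._buggy

    def read_chunk(self):
        end = min(self._pos + self._chunk_rows, self._num_rows)
        chunk = {name: col[self._pos:end] for name, col in self._table.items()}
        self._pos = end
        self._emitted = True
        return chunk


def read_all(reader):
    """Typical consumer loop: read only while has_next() is true."""
    chunks = []
    while reader.has_next():
        chunks.append(reader.read_chunk())
    return chunks
```

With the reported behavior, `read_all` on a zero-row table yields no chunks at all, so the column names `a` and `b` never reach the caller. With the expected behavior it yields a single empty chunk that still carries both columns.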

Steps/Code to reproduce bug

# Non-empty - working case

In [1]: import cudf

In [2]: df = cudf.DataFrame({'a': [1, 1, 1, 2, 2], 'b': [1,2 ,3, 4, 5]})

In [3]: df.to_parquet("a.parquet")

In [4]: from cudf._lib.parquet import ParquetReader


In [6]: reader = ParquetReader(["a.parquet"])

In [8]: reader._has_next()
Out[8]: True

In [9]: reader._read_chunk()
Out[9]: 
   a  b
0  1  1
1  1  2
2  1  3
3  2  4
4  2  5

In [10]: reader._has_next()
Out[10]: False

In [11]: reader._read_chunk()
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[11], line 1
----> 1 reader._read_chunk()

File parquet.pyx:851, in cudf._lib.parquet.ParquetReader._read_chunk()

RuntimeError: Fatal CUDA error encountered at: /nvme/0/pgali/cudf/cpp/include/cudf/detail/utilities/vector_factories.hpp:277: 700 cudaErrorIllegalAddress an illegal memory access was encountered




# Empty - bug case
In [1]: import cudf

In [2]: df = cudf.DataFrame({'a': [], 'b': []})

In [3]: df
Out[3]: 
Empty DataFrame
Columns: [a, b]
Index: []

In [4]: df.to_parquet("a.parquet")

In [5]: from cudf._lib.parquet import ParquetReader

In [6]: reader = ParquetReader(["a.parquet"])

In [7]: reader._has_next()  
Out[7]: False     # Expected: True

In [8]: reader._read_chunk()
Out[8]: 
Empty DataFrame
Columns: [a, b]
Index: []

In [9]: reader._read_chunk()    # Expected: RuntimeError
Out[9]: 
Empty DataFrame
Columns: [a, b]
Index: []

In [10]: reader._read_chunk()   # Expected: RuntimeError
Out[10]: 
Empty DataFrame
Columns: [a, b]
Index: []

In [11]: reader._read_chunk()    # Expected: RuntimeError
Out[11]: 
Empty DataFrame
Columns: [a, b]
Index: []

In [12]: reader._has_next()  
Out[12]: False

Environment overview (please complete the following information)

  • Environment location: [Bare-metal]
  • Method of cuDF install: [from source]
@galipremsagar galipremsagar added bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. labels May 14, 2024
@mhaseeb123 mhaseeb123 self-assigned this May 14, 2024
@mhaseeb123
Member

mhaseeb123 commented May 14, 2024

Hi @galipremsagar I have been looking into this and it just involves handling a bunch of logic. I have a couple of small questions before I implement the solution.

  1. Should a reader.read_chunk() call made after all chunks have been read result in the following error, or should it be handled gracefully via CUDF_LOG_CRITICAL?
RuntimeError: Fatal CUDA error encountered at: /nvme/0/pgali/cudf/cpp/include/cudf/detail/utilities/vector_factories.hpp:277: 700 cudaErrorIllegalAddress an illegal memory access was encountered
  2. What should be the expected behavior of has_next() and read_chunk() when the table is empty and also has no columns? Note that we might still have an index_column (handled in the Cython layer) even if no other columns are present.

CC: @nvdbaranec @GregoryKimball for vis

@galipremsagar
Contributor Author

> Hi @galipremsagar I have been looking into this and it just involves handling a bunch of logic. I have a couple of small questions before I implement the solution.
>
>   1. Should a reader.read_chunk() call made after all chunks have been read result in the following error, or should it be handled gracefully via CUDF_LOG_CRITICAL?
>
> RuntimeError: Fatal CUDA error encountered at: /nvme/0/pgali/cudf/cpp/include/cudf/detail/utilities/vector_factories.hpp:277: 700 cudaErrorIllegalAddress an illegal memory access was encountered

It is always better not to throw a RuntimeError, but that is less pressing for me. If we can fix it, it'll be a bonus.

>   2. What should be the expected behavior of has_next() and read_chunk() when the table is empty and also has no columns? Note that we might still have an index_column (handled in the Cython layer) even if no other columns are present.
>
> CC: @nvdbaranec @GregoryKimball for vis

I would expect has_next to return True and read_chunk to return an empty table from the libcudf layer. read_chunk does return an empty table right now, but it does so indefinitely while has_next keeps returning False.

rapids-bot bot pushed a commit that referenced this issue May 16, 2024
…read (#15757)

Fixes #15743 

This PR solves two problems. 

First, it no longer throws a CUDA failure or exception when an invalid (out-of-bounds) chunk is read via `chunked_parquet_reader::read_chunk()`; it returns an empty chunk instead.

Second, for empty tables, `has_next()` returns true until the first call to `chunked_parquet_reader::read_chunk()`. After that, `has_next()` returns false, but `chunked_parquet_reader::read_chunk()` keeps returning empty chunks.

Authors:
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #15757
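
The semantics described in the commit message above can be modeled roughly as follows. This is a sketch under stated assumptions, not libcudf code: `PostFixToyReader` and its fields are hypothetical names, and a chunk is reduced to a `(column names, row count)` pair for illustration.

```python
class PostFixToyReader:
    """Hypothetical model of the post-fix semantics; not the libcudf reader."""

    def __init__(self, columns, num_rows, chunk_rows):
        self.columns = list(columns)
        self.num_rows = num_rows
        self.chunk_rows = chunk_rows
        self.pos = 0
        self.read_called = False

    def has_next(self):
        if self.num_rows == 0:
            # Empty table: true only until the first read_chunk() call.
            return not self.read_called
        return self.pos < self.num_rows

    def read_chunk(self):
        # Returns (column names, number of rows in this chunk).
        self.read_called = True
        rows = min(self.chunk_rows, self.num_rows - self.pos)
        self.pos += rows
        # Past the end, rows is 0: an empty chunk rather than an error.
        return self.columns, rows
```

In this model, a `while reader.has_next()` loop over a zero-row file sees exactly one empty chunk that keeps its column names, and a stray `read_chunk()` call after exhaustion returns an empty chunk rather than failing.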