Unable to load Feather v2 files created by pyarrow and pandas. #286

ghuls · 2021-05-11T17:56:19Z

Describe the bug

The original bug report is here (against polars, which was using arrow-rs to parse Feather v2 (IPC) files):
pola-rs/polars#623

Unable to load Feather v2 files created by pyarrow and pandas.

Those files can be loaded fine by pyarrow and pandas itself.

To Reproduce
Steps to reproduce the behavior:

Try to load the attached Feather files:
test_feather_file.zip

test_pandas.feather: Original Feather file
test_arrow.feather: test_pandas.feather loaded with pyarrow (df_pa = pa.feather.read_feather('test_pandas.feather')) and saved with pyarrow
test_polars.feather: test_pandas.feather loaded with pyarrow and saved with polars (this one can be read by arrow-rs)
test_pandas_from_polars.feather: test_polars.feather loaded with polars and converted with the to_pandas option

Expected behavior

Feather v2 files can be opened by arrow-rs.

Additional context

import polars as pl
import pyarrow as pa
import pyarrow.feather  # ensures the pa.feather submodule is loaded
import pandas as pd

# Reading the Feather file created with pandas works fine with pyarrow.
# Note: read_feather returns a pandas DataFrame.
df_pa = pa.feather.read_feather('test_pandas.feather')

# Write that dataframe to a Feather file.
df_pa.to_feather('test_arrow.feather')

# Convert the dataframe to a polars dataframe.
df_pl = pl.DataFrame(df_pa)

# Convert the polars dataframe to a pandas dataframe.
df_pd = df_pl.to_pandas()

# Write the pandas dataframe to a Feather file.
df_pd.to_feather('test_pandas_from_polars.feather')


In [88]: df_pa
Out[88]: 
   motif1  motif2  motif3  motif4 regions
0     1.2     3.0     0.3     5.6    reg1
1     6.7     3.0     4.3     5.6    reg2
2     3.5     3.0     0.0     0.0    reg3
3     0.0     3.0     0.0     5.6    reg4
4     2.4     3.0     7.8     1.2    reg5
5     2.4     3.0     0.6     0.0    reg6
6     2.4     3.0     7.7     0.0    reg7

In [89]: df_pl
Out[89]: 
shape: (7, 5)
╭────────┬────────┬────────┬────────┬─────────╮
│ motif1 ┆ motif2 ┆ motif3 ┆ motif4 ┆ regions │
│ ---    ┆ ---    ┆ ---    ┆ ---    ┆ ---     │
│ f64    ┆ f64    ┆ f64    ┆ f64    ┆ str     │
╞════════╪════════╪════════╪════════╪═════════╡
│ 1.2    ┆ 3      ┆ 0.3    ┆ 5.6    ┆ "reg1"  │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 6.7    ┆ 3      ┆ 4.3    ┆ 5.6    ┆ "reg2"  │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 3.5    ┆ 3      ┆ 0.0    ┆ 0.0    ┆ "reg3"  │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 0.0    ┆ 3      ┆ 0.0    ┆ 5.6    ┆ "reg4"  │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 2.4    ┆ 3      ┆ 7.8    ┆ 1.2    ┆ "reg5"  │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 2.4    ┆ 3      ┆ 0.6    ┆ 0.0    ┆ "reg6"  │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 2.4    ┆ 3      ┆ 7.7    ┆ 0.0    ┆ "reg7"  │
╰────────┴────────┴────────┴────────┴─────────╯

In [90]: df_pd
Out[90]: 
   motif1  motif2  motif3  motif4 regions
0     1.2     3.0     0.3     5.6    reg1
1     6.7     3.0     4.3     5.6    reg2
2     3.5     3.0     0.0     0.0    reg3
3     0.0     3.0     0.0     5.6    reg4
4     2.4     3.0     7.8     1.2    reg5
5     2.4     3.0     0.6     0.0    reg6
6     2.4     3.0     7.7     0.0    reg7



In [103]: pl.read_ipc('test_polars.feather')
Out[103]: 
shape: (7, 5)
╭────────┬────────┬────────┬────────┬─────────╮
│ motif1 ┆ motif2 ┆ motif3 ┆ motif4 ┆ regions │
│ ---    ┆ ---    ┆ ---    ┆ ---    ┆ ---     │
│ f64    ┆ f64    ┆ f64    ┆ f64    ┆ str     │
╞════════╪════════╪════════╪════════╪═════════╡
│ 1.2    ┆ 3      ┆ 0.3    ┆ 5.6    ┆ "reg1"  │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 6.7    ┆ 3      ┆ 4.3    ┆ 5.6    ┆ "reg2"  │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 3.5    ┆ 3      ┆ 0.0    ┆ 0.0    ┆ "reg3"  │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 0.0    ┆ 3      ┆ 0.0    ┆ 5.6    ┆ "reg4"  │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 2.4    ┆ 3      ┆ 7.8    ┆ 1.2    ┆ "reg5"  │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 2.4    ┆ 3      ┆ 0.6    ┆ 0.0    ┆ "reg6"  │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 2.4    ┆ 3      ┆ 7.7    ┆ 0.0    ┆ "reg7"  │
╰────────┴────────┴────────┴────────┴─────────╯

In [104]: pl.read_ipc('test_arrow.feather')
thread '<unnamed>' panicked at 'assertion failed: prefix.is_empty() && suffix.is_empty()', /github/home/.cargo/git/checkouts/arrow-rs-3b86e19e889d5acc/d008f31/arrow/src/buffer/immutable.rs:179:9
---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
<ipython-input-104-f9a22f9a0eb1> in <module>
----> 1 pl.read_ipc('test_arrow.feather')

~/software/anaconda3/envs/create_cistarget_databases/lib/python3.8/site-packages/polars/functions.py in read_ipc(file)
    278     """
    279     file = _prepare_file_arg(file)
--> 280     return DataFrame.read_ipc(file)
    281 
    282 

~/software/anaconda3/envs/create_cistarget_databases/lib/python3.8/site-packages/polars/frame.py in read_ipc(file)
    235         """
    236         self = DataFrame.__new__(DataFrame)
--> 237         self._df = PyDataFrame.read_ipc(file)
    238         return self
    239 

PanicException: assertion failed: prefix.is_empty() && suffix.is_empty()

In [105]: pl.read_ipc('test_pandas.feather')
thread '<unnamed>' panicked at 'assertion failed: prefix.is_empty() && suffix.is_empty()', /github/home/.cargo/git/checkouts/arrow-rs-3b86e19e889d5acc/d008f31/arrow/src/buffer/immutable.rs:179:9
---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
<ipython-input-105-35809d9ae65f> in <module>
----> 1 pl.read_ipc('test_pandas.feather')

~/software/anaconda3/envs/create_cistarget_databases/lib/python3.8/site-packages/polars/functions.py in read_ipc(file)
    278     """
    279     file = _prepare_file_arg(file)
--> 280     return DataFrame.read_ipc(file)
    281 
    282 

~/software/anaconda3/envs/create_cistarget_databases/lib/python3.8/site-packages/polars/frame.py in read_ipc(file)
    235         """
    236         self = DataFrame.__new__(DataFrame)
--> 237         self._df = PyDataFrame.read_ipc(file)
    238         return self
    239 

PanicException: assertion failed: prefix.is_empty() && suffix.is_empty()

In [106]: pl.read_ipc('test_pandas_from_polars.feather')
thread '<unnamed>' panicked at 'assertion failed: prefix.is_empty() && suffix.is_empty()', /github/home/.cargo/git/checkouts/arrow-rs-3b86e19e889d5acc/d008f31/arrow/src/buffer/immutable.rs:179:9
---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
<ipython-input-107-d0a17f51c6ac> in <module>
----> 1 pl.read_ipc('test_pandas_from_polars.feather')

~/software/anaconda3/envs/create_cistarget_databases/lib/python3.8/site-packages/polars/functions.py in read_ipc(file)
    278     """
    279     file = _prepare_file_arg(file)
--> 280     return DataFrame.read_ipc(file)
    281 
    282 

~/software/anaconda3/envs/create_cistarget_databases/lib/python3.8/site-packages/polars/frame.py in read_ipc(file)
    235         """
    236         self = DataFrame.__new__(DataFrame)
--> 237         self._df = PyDataFrame.read_ipc(file)
    238         return self
    239 

PanicException: assertion failed: prefix.is_empty() && suffix.is_empty()


jorgecarleitao · 2021-05-12T05:16:13Z

I did not know this: is feather compatible with IPC?

ghuls · 2021-05-12T05:39:42Z

It should be IPC on disk with optional compression with lz4 or zstd:

https://arrow.apache.org/docs/python/feather.html
https://ursalabs.org/blog/2020-feather-v2/

Feather v1 is indeed a totally different format (header bytes: FEA1 instead of ARROW1).
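As a rough illustration of the header difference, here is a minimal Python sketch that tells the two formats apart from the first bytes of a file. The function name `feather_version_from_header` is made up for this example and is not part of pyarrow or arrow-rs.

```python
from typing import Optional

FEATHER_V1_MAGIC = b"FEA1"    # legacy Feather v1 header bytes
ARROW_IPC_MAGIC = b"ARROW1"   # Arrow IPC file format (Feather v2) header bytes


def feather_version_from_header(header: bytes) -> Optional[int]:
    """Guess the Feather version from the leading bytes of a file.

    Returns 2 for an Arrow IPC file, 1 for legacy Feather, None otherwise.
    """
    if header.startswith(ARROW_IPC_MAGIC):
        return 2
    if header.startswith(FEATHER_V1_MAGIC):
        return 1
    return None
```

To try it on a real file, pass the first six bytes, for example `feather_version_from_header(open('test_pandas.feather', 'rb').read(6))`.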

ghuls · 2021-05-12T06:07:36Z

Here is the original commit that introduced Feather v2 support in Arrow: apache/arrow@e03251c

jorgecarleitao · 2021-05-12T06:11:34Z

Nice, learnt something new today. Thanks for the explanation

This is indeed a bug, and a dangerous one because that prefix and suffix imply that we allowed misaligned bytes to go to the MutableBuffer (that check is like the last line of defense against UB).

jorgecarleitao · 2021-05-12T06:49:20Z

I investigated this and there is something funny going on: the file reports that there is an array whose buffer of type u8 has 201326592 slots, but the buffers' total length is 51. This happens on the 5th column, which is a Utf8.

This behavior is consistent between test_pandas.feather and test_arrow.feather in the zip.

That number of slots seems incorrect. I need to check whether this is a problem while reading those slots from the file or whether they were already written that way.

jorgecarleitao · 2021-05-12T07:13:38Z

More details: in both files, I am getting the following:

Reading Utf8
field_node: FieldNode { length: 7, null_count: 0 }
offset buffer: Buffer { offset: 200, length: 55 }
offsets: [32, 0, 407708164, 545407072, 8388608, 67108864, 134217728, 201326592]
values buffer: Buffer { offset: 256, length: 51 }

offsets[0] != 0 indicates a problem: offsets are expected to start from zero on any array with offsets.
offsets[i+1] < offsets[i] for some i, which indicates a problem: offsets are expected to be monotonically increasing

I do not have a root cause yet, these are just observations.
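The observations above can be written as a small check. This is only a sketch in Python: `check_offsets` is a hypothetical helper, not an arrow-rs function, and the third check (last offset versus values buffer length) is added here to cover the "201326592 slots versus 51 bytes" mismatch noted earlier.

```python
def check_offsets(offsets, values_len):
    """Return a list of problems found in a Utf8 offsets buffer."""
    if not offsets:
        return ["offsets buffer is empty"]

    problems = []

    # Offsets are expected to start from zero.
    if offsets[0] != 0:
        problems.append(f"offsets[0] is {offsets[0]}, expected 0")

    # Offsets are expected to be monotonically increasing.
    for i in range(len(offsets) - 1):
        if offsets[i + 1] < offsets[i]:
            problems.append(
                f"offsets[{i + 1}] < offsets[{i}] "
                f"({offsets[i + 1]} < {offsets[i]})"
            )

    # The last offset cannot point past the end of the values buffer.
    if offsets[-1] > values_len:
        problems.append(
            f"last offset {offsets[-1]} exceeds values length {values_len}"
        )

    return problems


# Values reported above for the Utf8 column.
observed = [32, 0, 407708164, 545407072, 8388608, 67108864, 134217728, 201326592]
for problem in check_offsets(observed, values_len=51):
    print(problem)
```

A well-formed buffer for seven 4-byte strings, such as `[0, 4, 8, 12, 16, 20, 24, 28]` with `values_len=28`, produces no problems.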

ghuls · 2021-05-12T07:36:39Z

It makes sense that you see the same in the Feather file created by pyarrow and pandas as pandas uses the same pyarrow.feather code: https://github.com/pandas-dev/pandas/blob/059c8bac51e47d6eaaa3e36d6a293a22312925e6/pandas/io/feather_format.py

ghuls · 2021-05-12T08:30:06Z

Could it be that the difference you see is due to the streaming IPC vs random access IPC format?

For most cases, it is most convenient to use the RecordBatchStreamReader or RecordBatchFileReader class, depending on which variant of the IPC format you want to read. The former requires a InputStream source, while the latter requires a RandomAccessFile.

Reading Arrow IPC data is inherently zero-copy if the source allows it. For example, a BufferReader or MemoryMappedFile can typically be zero-copy. Exceptions are when the data must be transformed on the fly, e.g. when buffer compression has been enabled on the IPC stream or file.

https://arrow.apache.org/docs/cpp/ipc.html

ghuls · 2021-05-18T10:53:00Z

IPC File Format

We define a “file format” supporting random access that is build with the stream format. The file starts and ends with a magic string ARROW1 (plus padding). What follows in the file is identical to the stream format. At the end of the file, we write a footer containing a redundant copy of the schema (which is a part of the streaming format) plus memory offsets and sizes for each of the data blocks in the file. This enables random access any record batch in the file. See File.fbs for the precise details of the file footer.

Schematically we have:

<magic number "ARROW1">
<empty padding bytes [to 8 byte boundary]>
<STREAMING FORMAT with EOS>
<FOOTER>
<FOOTER SIZE: int32>
<magic number "ARROW1">

In the file format, there is no requirement that dictionary keys should be defined in a DictionaryBatch before they are used in a RecordBatch, as long as the keys are defined somewhere in the file. Further more, it is invalid to have more than one non-delta dictionary batch per dictionary ID (i.e. dictionary replacement is not supported). Delta dictionaries are applied in the order they appear in the file footer.

https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format
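To make the quoted layout concrete, here is a minimal Python sketch that checks the magic string at both ends and locates the footer from the trailing size field. `locate_footer` is a hypothetical name, and reading the footer size as a little-endian int32 is an assumption of this sketch. It does not parse the footer itself, which is a Flatbuffers message described in File.fbs.

```python
import struct

MAGIC = b"ARROW1"


def locate_footer(data: bytes):
    """Return (footer_start, footer_size) for an Arrow IPC file held in memory."""
    # Smallest possible file: 8-byte padded header, int32 size, trailing magic.
    if len(data) < 8 + 4 + len(MAGIC):
        raise ValueError("too short to be an Arrow IPC file")
    if not data.startswith(MAGIC) or not data.endswith(MAGIC):
        raise ValueError("missing ARROW1 magic at start or end")

    # <FOOTER SIZE: int32> sits directly before the trailing magic.
    size_pos = len(data) - len(MAGIC) - 4
    (footer_size,) = struct.unpack_from("<i", data, size_pos)

    footer_start = size_pos - footer_size
    if footer_size <= 0 or footer_start < 8:
        raise ValueError("footer size is inconsistent with the file length")
    return footer_start, footer_size
```

For a real file the bytes could come from `open(path, 'rb').read()`. A reader would then decode `data[footer_start:footer_start + footer_size]` to find the record batch blocks.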

ghuls · 2021-06-16T14:29:47Z

@jorgecarleitao There is a recent commit on arrow that improves the documentation of the arrow IPC file format:
apache/arrow@59c5781#diff-8b68cf6859e881f2357f5df64bb073135d7ff6eeb51f116418660b3856564c60L1011-R1023

IPC File Format
---------------

- We define a "file format" supporting random access that is build with
- the stream format. The file starts and ends with a magic string
- ``ARROW1`` (plus padding). What follows in the file is identical to
- the stream format. At the end of the file, we write a *footer*
- containing a redundant copy of the schema (which is a part of the
- streaming format) plus memory offsets and sizes for each of the data
- blocks in the file. This enables random access any record batch in the
- file. See `File.fbs`_ for the precise details of the file footer.
+ We define a "file format" supporting random access that is an extension of
+ the stream format. The file starts and ends with a magic string ``ARROW1``
+ (plus padding). What follows in the file is identical to the stream format.
+ At the end of the file, we write a *footer* containing a redundant copy of
+ the schema (which is a part of the streaming format) plus memory offsets and
+ sizes for each of the data blocks in the file. This enables random access to
+ any record batch in the file. See `File.fbs`_ for the precise details of the
+ file footer.

ghuls · 2021-06-21T22:25:23Z

@jorgecarleitao I think I might have figured out the problem.

import polars as pl
import pyarrow as pa
import pandas as pd

# Read the Feather file written with pandas into a Polars dataframe, using pa.feather.read_feather (wrapped inside pl.read_ipc).
df_pl = pl.read_ipc('test_pandas.feather', use_pyarrow=True)

# Convert Polars dataframe to arrow table and write to Feather v2 file without compression (with pyarrow).
pa.feather.write_feather(df_pl.to_arrow(), 'test_polars_to_arrow_uncompressed.feather', compression='uncompressed', version=2)

# Convert Polars dataframe to arrow table and write to Feather v2 file with lz4 compression (with pyarrow).
pa.feather.write_feather(df_pl.to_arrow(), 'test_polars_to_arrow_lz4.feather', compression='lz4', version=2)

# Convert Polars dataframe to arrow table and convert arrow table to pandas dataframe and write to Feather v2 file without compression (with pyarrow).
pa.feather.write_feather(df_pl.to_arrow().to_pandas(), 'test_polars_to_arrow_to_pandas_uncompressed.feather', compression='uncompressed', version=2)

# Convert Polars dataframe to arrow table and convert arrow table to pandas dataframe and write to Feather v2 file with lz4 compression (with pyarrow).
pa.feather.write_feather(df_pl.to_arrow().to_pandas(), 'test_polars_to_arrow_to_pandas_lz4.feather', compression='lz4', version=2)


# Now try to read all those files with polars without using the pyarrow Feather reading code, but the arrow-rs code instead.

# Reading Feather v2 file without compression containing saved arrow table data, works.
In [9]: pl.read_ipc('test_polars_to_arrow_uncompressed.feather', use_pyarrow=False)
Out[9]: 
shape: (7, 5)
╭────────────────────┬────────┬─────────────────────┬────────────────────┬─────────╮
│ motif1             ┆ motif2 ┆ motif3              ┆ motif4             ┆ regions │
│ ---                ┆ ---    ┆ ---                 ┆ ---                ┆ ---     │
│ f32                ┆ f32    ┆ f32                 ┆ f32                ┆ str     │
╞════════════════════╪════════╪═════════════════════╪════════════════════╪═════════╡
│ 1.2000000476837158 ┆ 3      ┆ 0.30000001192092896 ┆ 5.599999904632568  ┆ "reg1"  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 6.699999809265137  ┆ 3      ┆ 4.300000190734863   ┆ 5.599999904632568  ┆ "reg2"  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 3.5                ┆ 3      ┆ 0.0                 ┆ 0.0                ┆ "reg3"  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 0.0                ┆ 3      ┆ 0.0                 ┆ 5.599999904632568  ┆ "reg4"  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 2.4000000953674316 ┆ 3      ┆ 7.800000190734863   ┆ 1.2000000476837158 ┆ "reg5"  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 2.4000000953674316 ┆ 3      ┆ 0.6000000238418579  ┆ 0.0                ┆ "reg6"  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 2.4000000953674316 ┆ 3      ┆ 7.699999809265137   ┆ 0.0                ┆ "reg7"  │
╰────────────────────┴────────┴─────────────────────┴────────────────────┴─────────╯


# Reading Feather v2 file without compression containing saved pandas dataframe, works.
In [10]: pl.read_ipc('test_polars_to_arrow_to_pandas_uncompressed.feather', use_pyarrow=False)
Out[10]: 
shape: (7, 5)
╭────────────────────┬────────┬─────────────────────┬────────────────────┬─────────╮
│ motif1             ┆ motif2 ┆ motif3              ┆ motif4             ┆ regions │
│ ---                ┆ ---    ┆ ---                 ┆ ---                ┆ ---     │
│ f32                ┆ f32    ┆ f32                 ┆ f32                ┆ str     │
╞════════════════════╪════════╪═════════════════════╪════════════════════╪═════════╡
│ 1.2000000476837158 ┆ 3      ┆ 0.30000001192092896 ┆ 5.599999904632568  ┆ "reg1"  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 6.699999809265137  ┆ 3      ┆ 4.300000190734863   ┆ 5.599999904632568  ┆ "reg2"  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 3.5                ┆ 3      ┆ 0.0                 ┆ 0.0                ┆ "reg3"  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 0.0                ┆ 3      ┆ 0.0                 ┆ 5.599999904632568  ┆ "reg4"  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 2.4000000953674316 ┆ 3      ┆ 7.800000190734863   ┆ 1.2000000476837158 ┆ "reg5"  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 2.4000000953674316 ┆ 3      ┆ 0.6000000238418579  ┆ 0.0                ┆ "reg6"  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 2.4000000953674316 ┆ 3      ┆ 7.699999809265137   ┆ 0.0                ┆ "reg7"  │
╰────────────────────┴────────┴─────────────────────┴────────────────────┴─────────╯


# Reading Feather v2 file with lz4 compression containing saved pandas dataframe, gives the error from the first post.
In [11]: pl.read_ipc('test_polars_to_arrow_to_pandas_lz4.feather', use_pyarrow=False)
thread '<unnamed>' panicked at 'assertion failed: prefix.is_empty() && suffix.is_empty()', /github/home/.cargo/git/checkouts/arrow-rs-3b86e19e889d5acc/9f56afb/arrow/src/buffer/immutable.rs:179:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
<ipython-input-11-04613b1d0975> in <module>
----> 1 pl.read_ipc('test_polars_to_arrow_to_pandas_lz4.feather', use_pyarrow=False)
/software/miniconda3/envs/cisTopic/lib/python3.7/site-packages/polars/functions.py in read_ipc(file, use_pyarrow)
    337     """
    338     file = _prepare_file_arg(file)
--> 339     return DataFrame.read_ipc(file, use_pyarrow)
    340 
    341 

/software/miniconda3/envs/cisTopic/lib/python3.7/site-packages/polars/frame.py in read_ipc(file, use_pyarrow)
    302 
    303         self = DataFrame.__new__(DataFrame)
--> 304         self._df = PyDataFrame.read_ipc(file)
    305         return self
    306 

PanicException: assertion failed: prefix.is_empty() && suffix.is_empty()


# Reading Feather v2 file with lz4 compression containing saved pyarrow table kills IPython because it tries to allocate a far too large buffer.
In [12]: pl.read_ipc('test_polars_to_arrow_lz4.feather', use_pyarrow=False)
Out[12]: memory allocation of 2702793507844465093 bytes failed
Aborted

So to me it looks like arrow-rs is not detecting that pyarrow saved the Feather file with lz4 compression, and I guess it is reading data (or offsets) from the wrong locations.
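One way to sanity-check this reading, as a sketch rather than a confirmed diagnosis: in the Arrow IPC format a compressed buffer starts with an 8-byte little-endian uncompressed length, followed by the compressed bytes, and an LZ4 frame starts with the magic number 0x184D2204. Reinterpreting the int32 "offsets" printed earlier in this thread as raw bytes gives exactly that pattern. The variable names below are illustrative only.

```python
import struct

# "offsets" printed earlier in this thread, i.e. the buffer read as int32 values.
observed = [32, 0, 407708164, 545407072, 8388608, 67108864, 134217728, 201326592]

# Turn the int32 values back into the raw little-endian bytes they came from.
raw = struct.pack("<8i", *observed)

# A compressed IPC buffer begins with the uncompressed length as an int64.
(uncompressed_len,) = struct.unpack_from("<q", raw, 0)

# The compressed payload follows; an LZ4 frame begins with this magic number.
LZ4_FRAME_MAGIC = 0x184D2204
(magic,) = struct.unpack_from("<I", raw, 8)

# A 7-row Utf8 column needs 7 + 1 int32 offsets.
expected_len = (7 + 1) * 4

print("uncompressed length:", uncompressed_len, "expected:", expected_len)
print("looks like an LZ4 frame:", magic == LZ4_FRAME_MAGIC)
```

The length prefix comes out as 32, which matches eight int32 offsets for a 7-row column, and the next four bytes equal the LZ4 frame magic. That is consistent with the buffer being LZ4-compressed and read as if it were uncompressed.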

In [6]: ?pa.feather.write_feather
Signature:
pa.feather.write_feather(
    df,
    dest,
    compression=None,
    compression_level=None,
    chunksize=None,
    version=2,
)
Docstring:
Write a pandas.DataFrame to Feather format.

Parameters
----------
df : pandas.DataFrame or pyarrow.Table
    Data to write out as Feather format.
dest : str
    Local destination path.
compression : string, default None
    Can be one of {"zstd", "lz4", "uncompressed"}. The default of None uses
    LZ4 for V2 files if it is available, otherwise uncompressed.
compression_level : int, default None
    Use a compression level particular to the chosen compressor. If None
    use the default compression level
chunksize : int, default None
    For V2 files, the internal maximum size of Arrow RecordBatch chunks
    when writing the Arrow IPC file format. None means use the default,
    which is currently 64K
version : int, default 2
    Feather file version. Version 2 is the current. Version 1 is the more
    limited legacy format
File:      /software/miniconda3/envs/cisTopic/lib/python3.7/site-packages/pyarrow/feather.py
Type:      function

Feather files are attached:
test_feather_polars_to_pyarrow.zip

nevi-me · 2021-06-22T09:30:52Z

@ghuls compression isn't supported, see #70 and https://issues.apache.org/jira/browse/ARROW-8676. I had a PR for this, but struggled with getting integration tests to pass, so I abandoned it as I didn't have more time for it.

Here's the PR: apache/arrow#9137

ghuls · 2021-06-22T10:00:35Z

@nevi-me A pity it is not supported (yet) as Pandas and pyarrow will write Feather files with lz4 compression by default (at least when using the official packages). At least arrow-rs should detect that a compression codec is used that it does not support yet, instead of doing the wrong thing and reading compressed data as uncompressed data.

ghuls · 2023-02-15T08:06:07Z

I guess it is solved now, if I read https://arrow.apache.org/blog/2023/02/13/rust-32.0.0/

IPC File Compression: Arrow IPC file compression with ZSTD and LZ4 is now fully supported.

correctly.

tustvold · 2023-03-16T13:04:07Z

I believe this was closed by #2369. Feel free to reopen if I am mistaken.

ghuls added the bug label May 11, 2021

ghuls mentioned this issue Jun 23, 2021

Does arrow2 support codecs when reading IPC files? jorgecarleitao/arrow2#163

Closed

ghuls mentioned this issue Oct 5, 2021

read_ipc won't load Feather V2 files pola-rs/polars#1488

Closed

tustvold closed this as completed Mar 16, 2023
