Implement `StringViewArray` and `BinaryViewArray` reading/writing in parquet #5530

alamb · 2024-03-18T19:25:13Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

This is part of the larger project to implement StringViewArray -- see #5374

In #5481 we added support for StringViewArray and ByteViewArray.

The parquet crate has a reader and writer for reading/writing parquet data to arrow:

Describe the solution you'd like
I would like to be able to read a StringViewArray and BinaryViewArray directly from the reader and writer with no data copies (so the raw byte values are not copied).

Add functionality
Add tests

Describe alternatives you've considered

For example, I think we need to add the support to the writer here

arrow-rs/parquet/src/arrow/arrow_writer/mod.rs

Lines 719 to 732 in f41c2a4

    
               ArrowDataType::Dictionary(_, value_type) => match value_type.as_ref() { 
        
                   ArrowDataType::Utf8 | ArrowDataType::LargeUtf8 | ArrowDataType::Binary | ArrowDataType::LargeBinary => { 
        
                       out.push(bytes(leaves.next().unwrap())) 
        
                   } 
        
                   _ => { 
        
                       out.push(col(leaves.next().unwrap())) 
        
                   } 
        
               } 
        
              _ => return Err(ParquetError::NYI( 
        
                  format!( 
        
                      "Attempting to write an Arrow type {data_type:?} to parquet that is not yet implemented" 
        
                  ) 
        
              )) 
        
           }

Additional context

The reader/writer already handles DictionaryArrays which I think could serve as a model for the view arrays.

@ariesdevil reports they are working on this feature #5374 (comment)

The text was updated successfully, but these errors were encountered:

alamb · 2024-04-12T12:41:49Z

@mapleFU brought up a good point on #5557 (comment) that I wanted to record here. The observation is that that the special handling StringView / BinaryView will be substantially more code in the parquet decoder.

The reason for adding the special case is to avoid a copy

For example as I understand it, from the parquet encodings doc

...
BYTE_ARRAY: length in 4 bytes little endian followed by the bytes contained in the array
...

So the data looks like this (length prefix, followed by the bytes):

\3\0\0\0foo\26\0\0\0abcdefghijklmnoprstuvwxyz

To make a StringArray, those bytes must be copied to a new buffer so they are contiguous:

offets: [0, 3, 29]
data: fooabcdefghijklmnoprstuvwxyz

However, for a StringView array, the raw bytes can be used without copying

views: [(len: 3, data:"foo"), (len:26, offset8)] 
\3\0\0\0foo\26\0\0\0abcdefghijklmnoprstuvwxyz

mapleFU · 2024-04-12T13:59:22Z

In parquet, ARROW:SCHEMA is used to identify the schema and extended info in parquet file. I think the trick point is that, without key-value metadata, both view and string are stored as same thing. So I think we can add optimization for "read string as stringview", but I also think maybe storing a stringview as string can make some legacy reader not confused about it.

For some view, this might work for some encoding( like PLAIN and DELTA_LENGTH_BYTE_ARRAY), but might be tricky for DELTA_BYTE_ARRAY.

Besides, do we need to "copy" the buffer when we'd like to clone or concat the array?

alamb · 2024-04-12T16:55:11Z

In parquet, ARROW:SCHEMA is used to identify the schema and extended info in parquet file. I think the trick point is that, without key-value metadata, both view and string are stored as same thing. So I think we can add optimization for "read string as stringview", but I also think maybe storing a stringview as string can make some legacy reader not confused about it.

This is a good point. Maybe something we could consider is to support setting the desired type directly on the reader
https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.ArrowReaderBuilder.html

Something like

let reader = ArrowReaderBuilder::try_new(...)
  // override any schema declared in the file
  .with_schema(schema)
  .build()?;

If we supported a similar API for the writer than we could write StringViewArary without any extra copies but the metadata stored in the file could say "read this as a StringView" 🤔

mapleFU · 2024-04-12T16:57:50Z

Yes, my view is that:

writer side: cast string view/binary view to string/binary, keeping the "large" and metadata. This also not confuse legacy reader
reader side: user can choose read by string or string view

alamb · 2024-04-12T17:37:54Z

I think the current writer has the very nice property that RecordBatches round trp cleanly -- they can be written to parquet and then re-read and will be equal.

Thus, yy personal preference is that the writer should (by default) write the metadata that matches the data it was passed (so in this case specify StringView) . The reader should likewise read out the type that was written unless the user overrides the behavior for some reason.

I think it would be valuable to add an option on the writer to write in "compatiblity" mode (and store string/binary data with StringArray metadata, for example)

mapleFU · 2024-04-12T17:59:29Z

This may differ from how we treat "view"( For both list view and string view ). If we think it's a part of string or list, it might better be string/list in metadata. If we regard it as a completely different type, we should regard it as string-view/list-view

For ipc, they're completely different type. For parquet, they stored nothing different, however parquet uses ipc schema here. ::arrow::ipc::ReadSchema might work bad in this scenerio, which might regard "large-string-view-with-metadata" to another type. I'm ok with this but I think at least we can testing this

I believe a new file can be read well, but the nested info might be lost. But anyway this might not severe

alamb · 2024-04-12T18:12:54Z

however parquet uses ipc schema here. ::arrow::ipc::ReadSchema might work bad in this scenerio,

Thanks @mapleFU -- I don't fully understand the concern -- if the embedded metadata in a parquet file says it was written as a StringViewArray I would expect the parqet readers that support that type to read it out as a StringViewArray unless they don't support the type (old readers) or the user specifies they want to read the data from parquet as a different type

mapleFU · 2024-04-12T18:20:01Z

unless they don't support the type

IMO at least, a release without StringView/BinaryView's spec (rather than don't support read StringViewArray) they might meet a "parse failed" and cannot parse an ARROW:SCHEMA. This doesn't matter if we think it's ok.

alamb · 2024-04-13T12:19:56Z

unless they don't support the type

IMO at least, a release without StringView/BinaryView's spec (rather than don't support read StringViewArray) they might meet a "parse failed" and cannot parse an ARROW:SCHEMA. This doesn't matter if we think it's ok.

I see -- I think we are both in agreement that the user of the Rust parquet crate should be able to decide how they want to handle this situation. We may just have different opinions on what a better default is.

As long as both options are possible, I could be convinced either way on the default value

mapleFU · 2024-04-13T15:07:58Z

Yes, I'm also ok with writing a string view type, and maybe we can generate a test file first

alamb mentioned this issue Mar 18, 2024

Implement StringViewArray and BinaryViewArray #5374

Open

13 tasks

ariesdevil linked a pull request Mar 26, 2024 that will close this issue

feat: support reading and writingStringView and BinaryView in parquet (part 2) #5557

Open

alamb mentioned this issue Apr 9, 2024

feat: support reading and writingStringView and BinaryView in parquet (part 1) #5618

Merged

alamb mentioned this issue Apr 12, 2024

Add more invalid utf8 parquet reader tests #5639

Merged

alamb mentioned this issue Apr 16, 2024

Account for Timezone when Casting Timestamp to Date32 #5605

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement `StringViewArray` and `BinaryViewArray` reading/writing in parquet #5530

Implement `StringViewArray` and `BinaryViewArray` reading/writing in parquet #5530

alamb commented Mar 18, 2024 •

edited

alamb commented Apr 12, 2024

mapleFU commented Apr 12, 2024 •

edited

alamb commented Apr 12, 2024

mapleFU commented Apr 12, 2024

alamb commented Apr 12, 2024 •

edited

mapleFU commented Apr 12, 2024 •

edited

alamb commented Apr 12, 2024

mapleFU commented Apr 12, 2024 •

edited

alamb commented Apr 13, 2024

mapleFU commented Apr 13, 2024

Implement StringViewArray and BinaryViewArray reading/writing in parquet #5530

Implement StringViewArray and BinaryViewArray reading/writing in parquet #5530

Comments

alamb commented Mar 18, 2024 • edited

alamb commented Apr 12, 2024

mapleFU commented Apr 12, 2024 • edited

alamb commented Apr 12, 2024

mapleFU commented Apr 12, 2024

alamb commented Apr 12, 2024 • edited

mapleFU commented Apr 12, 2024 • edited

alamb commented Apr 12, 2024

mapleFU commented Apr 12, 2024 • edited

alamb commented Apr 13, 2024

mapleFU commented Apr 13, 2024

Implement `StringViewArray` and `BinaryViewArray` reading/writing in parquet #5530

Implement `StringViewArray` and `BinaryViewArray` reading/writing in parquet #5530

alamb commented Mar 18, 2024 •

edited

mapleFU commented Apr 12, 2024 •

edited

alamb commented Apr 12, 2024 •

edited

mapleFU commented Apr 12, 2024 •

edited

mapleFU commented Apr 12, 2024 •

edited