Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change binary array type from LargeBinaryArray to BinaryArray #3913

Closed
evenyag opened this issue May 11, 2024 · 0 comments
Closed

Change binary array type from LargeBinaryArray to BinaryArray #3913

evenyag opened this issue May 11, 2024 · 0 comments
Assignees
Labels
C-enhancement Category Enhancements good first issue Good for newcomers

Comments

@evenyag
Copy link
Contributor

evenyag commented May 11, 2024

What type of enhancement is this?

Tech debt reduction

What does the enhancement do?

Due to historical reasons, we use LargeBinaryArray for the binary type but StringArray for the string type.

pub type BinaryArray = arrow::array::LargeBinaryArray;
pub type MutableBinaryArray = arrow::array::LargeBinaryBuilder;
pub type StringArray = arrow::array::StringArray;
pub type MutableStringArray = arrow::array::StringBuilder;

As storing and processing are inefficient, I suggest that we could change the default binary array type to BinaryArray.

Implementation challenges

Since the BinaryArray is a type alias, we could change it to LargeBinaryArray and fix the remaining compiler errors.

We need to change the data type mapping to the arrow type.

fn as_arrow_type(&self) -> ArrowDataType {
ArrowDataType::LargeBinary
}

We use our data type and schema while reading the parquet file so we should be able to read old parquet files with binary type columns.

// Computes the field levels.
let hint = Some(read_format.arrow_schema().fields());
let field_levels =
parquet_to_arrow_field_levels(parquet_schema_desc, projection_mask.clone(), hint)
.context(ReadParquetSnafu { path: &file_path })?;

We might use the old DB to generate a file and read it using the new DB. If this is a breaking change, we should find some places to cast the array to the target type.

Note that we should also modify the helper to cast the binary array correctly.

ArrowDataType::LargeBinary => Arc::new(BinaryVector::try_from_arrow_array(array)?),
ArrowDataType::FixedSizeBinary(_) | ArrowDataType::Binary => {
let array = arrow::compute::cast(array.as_ref(), &ArrowDataType::LargeBinary)
.context(crate::error::ArrowComputeSnafu)?;
Arc::new(BinaryVector::try_from_arrow_array(array)?)
}

I also found a bug while casting the LargeUtf8 type. We are converting a string array into a binary vector......

ArrowDataType::LargeUtf8 => {
let array = arrow::compute::cast(array.as_ref(), &ArrowDataType::Utf8)
.context(crate::error::ArrowComputeSnafu)?;
Arc::new(BinaryVector::try_from_arrow_array(array)?)
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-enhancement Category Enhancements good first issue Good for newcomers
Projects
Status: Done
Development

No branches or pull requests

2 participants