Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add FixedSizeBinaryArray::try_from_sparse_iter_with_size #3054

Merged
merged 12 commits into from Nov 13, 2022
Merged
72 changes: 62 additions & 10 deletions arrow-array/src/array/fixed_size_binary_array.rs
Expand Up @@ -129,17 +129,29 @@ impl FixedSizeBinaryArray {
/// # Errors
///
/// Returns error if argument has length zero, or sizes of nested slices don't match.
pub fn try_from_sparse_iter<T, U>(mut iter: T) -> Result<Self, ArrowError>
pub fn try_from_sparse_iter<T, U>(iter: T) -> Result<Self, ArrowError>
where
T: Iterator<Item = Option<U>>,
U: AsRef<[u8]>,
{
Self::try_from_sparse_iter_with_size(iter, None)
}

pub fn try_from_sparse_iter_with_size<T, U>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a doc comment similar to the one above would be good, especially to explain what the size argument is

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pub fn try_from_sparse_iter_with_size<T, U>(
pub fn try_from_sparse_iter_with_size<T, U>(iter: T, size: i32) -> Result<Self, ArrowError> {
try_from_sparse_iter_with_size_opt::<T, U>(iter, Some(size))
}
fn try_from_sparse_iter_with_size_opt<T, U>(

I think this makes for better ergonomics

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made the parameter Optional instead

mut iter: T,
asserted_size: Option<i32>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't follow the need for a separate asserted_size and detected_size. Looking at the logic I think we could just have mut size: Option<i32> and everything would work correctly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we were to just have a single size option we would need to have separate implementations of try_from_sparse_iter and try_from_sparse_iter_with_size.

(IMO, I think the former is problematic and should be deprecated / removed)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see why you need to distinguish between the two different sizes in the implementation, it doesn't change if the size is determined by being passed in or from the first non-null element. They must all be equal regardless?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I separated the instances of try_from_sparse_iter and try_from_sparse_iter_with_size because I think the corner cases of making the two implementations match up wasn't worth the complexity.

) -> Result<Self, ArrowError>
where
T: Iterator<Item = Option<U>>,
U: AsRef<[u8]>,
{
let mut len = 0;
let mut size = None;
let mut detected_size = None;
let mut byte = 0;
let mut null_buf = MutableBuffer::from_len_zeroed(0);
let mut buffer = MutableBuffer::from_len_zeroed(0);
let mut prepend = 0;

iter.try_for_each(|item| -> Result<(), ArrowError> {
// extend null bitmask by one byte per each 8 items
if byte == 0 {
Expand All @@ -150,21 +162,21 @@ impl FixedSizeBinaryArray {

if let Some(slice) = item {
let slice = slice.as_ref();
if let Some(size) = size {
if size != slice.len() {
if let Some(detected_size) = detected_size {
if detected_size != slice.len() {
return Err(ArrowError::InvalidArgumentError(format!(
"Nested array size mismatch: one is {}, and the other is {}",
size,
detected_size,
slice.len()
)));
}
} else {
size = Some(slice.len());
detected_size = Some(slice.len());
buffer.extend_zeros(slice.len() * prepend);
}
bit_util::set_bit(null_buf.as_slice_mut(), len);
buffer.extend_from_slice(slice);
} else if let Some(size) = size {
} else if let Some(size) = detected_size {
buffer.extend_zeros(size);
} else {
prepend += 1;
Expand All @@ -175,16 +187,43 @@ impl FixedSizeBinaryArray {
Ok(())
})?;

if len == 0 {
if len == 0 && asserted_size.is_none() {
return Err(ArrowError::InvalidArgumentError(
"Input iterable argument has no data".to_owned(),
));
}

let size = size.unwrap_or(0);
let size = if let Some(detected_size) = detected_size {
// Either asserted size is zero (because we are entering this
// function by way of FixedSizeBinaryArray::try_from_sparse_iter), or
// the detected size has to be equal to the asserted size. If neither
// of these conditions hold then we need to report an error.
if let Some(asserted_size) = asserted_size {
if detected_size as i32 != asserted_size {
return Err(ArrowError::InvalidArgumentError(format!(
"Nested array size mismatch: detected size is {}, and the asserted size is {}",
detected_size,
asserted_size
)));
}
asserted_size
} else {
detected_size as i32
}
} else {
if let Some(asserted_size) = asserted_size {
// If we _haven't_ detected a size, then our provided iterator has
// only produced `None` values. We need to extend the buffer to the
// expected size else we'll segfault if we attempt to use this array.
buffer.extend_zeros(len * (asserted_size as usize));
}

asserted_size.unwrap_or(0)
};

let array_data = unsafe {
ArrayData::new_unchecked(
DataType::FixedSizeBinary(size as i32),
DataType::FixedSizeBinary(size),
len,
None,
Some(null_buf.into()),
Expand Down Expand Up @@ -582,6 +621,19 @@ mod tests {
assert_eq!(5, arr.len())
}

#[test]
fn test_fixed_size_binary_array_from_sparse_iter_with_size_all_none() {
let input_arg = vec![None, None, None, None, None] as Vec<Option<Vec<u8>>>;

let arr = FixedSizeBinaryArray::try_from_sparse_iter_with_size(
input_arg.into_iter(),
Some(16),
)
.unwrap();
assert_eq!(16, arr.value_length());
assert_eq!(5, arr.len())
}

#[test]
fn test_fixed_size_binary_array_from_vec() {
let values = vec!["one".as_bytes(), b"two", b"six", b"ten"];
Expand Down
7 changes: 4 additions & 3 deletions arrow-select/src/take.rs
Expand Up @@ -207,12 +207,12 @@ where
DataType::LargeBinary => {
Ok(Arc::new(take_bytes(as_generic_binary_array::<i64>(values), indices)?))
}
DataType::FixedSizeBinary(_) => {
DataType::FixedSizeBinary(size) => {
let values = values
.as_any()
.downcast_ref::<FixedSizeBinaryArray>()
.unwrap();
Ok(Arc::new(take_fixed_size_binary(values, indices)?))
Ok(Arc::new(take_fixed_size_binary(values, indices, *size)?))
}
DataType::Null => {
// Take applied to a null array produces a null array.
Expand Down Expand Up @@ -769,6 +769,7 @@ where
fn take_fixed_size_binary<IndexType>(
values: &FixedSizeBinaryArray,
indices: &PrimitiveArray<IndexType>,
size: i32,
) -> Result<FixedSizeBinaryArray, ArrowError>
where
IndexType: ArrowPrimitiveType,
Expand All @@ -789,7 +790,7 @@ where
.collect::<Result<Vec<_>, ArrowError>>()?
.into_iter();

FixedSizeBinaryArray::try_from_sparse_iter(array_iter)
FixedSizeBinaryArray::try_from_sparse_iter_with_size(array_iter, Some(size))
}

/// `take` implementation for dictionary arrays
Expand Down
12 changes: 12 additions & 0 deletions arrow/src/compute/kernels/comparison.rs
Expand Up @@ -2270,6 +2270,18 @@ macro_rules! typed_compares {
as_largestring_array($RIGHT),
$OP,
),
(DataType::FixedSizeBinary(_), DataType::FixedSizeBinary(_)) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change does not appear to be related?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's not, but it was something we were hoping to upstream and I think it accidentally fell into the branch. Would you like me to update this PR to mention this change?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think either we need tests of it, or we need to roll it back

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests added.

let lhs = $LEFT
.as_any()
.downcast_ref::<FixedSizeBinaryArray>()
.unwrap();
let rhs = $RIGHT
.as_any()
.downcast_ref::<FixedSizeBinaryArray>()
.unwrap();

compare_op(lhs, rhs, $OP)
}
(DataType::Binary, DataType::Binary) => compare_op(
as_generic_binary_array::<i32>($LEFT),
as_generic_binary_array::<i32>($RIGHT),
Expand Down