Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add overflow-checking variants of arithmetic dyn kernels #2740

Merged
merged 17 commits into from Sep 20, 2022

Conversation

viirya
Copy link
Member

@viirya viirya commented Sep 15, 2022

Which issue does this PR close?

Closes #2739.

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Sep 15, 2022
@viirya
Copy link
Member Author

viirya commented Sep 15, 2022

cc @sunchao

@viirya
Copy link
Member Author

viirya commented Sep 16, 2022

The CI failure looks unrelated.

arrow/src/compute/kernels/arithmetic.rs Outdated Show resolved Hide resolved
arrow/src/compute/kernels/arithmetic.rs Outdated Show resolved Hide resolved
arrow/src/compute/kernels/arithmetic.rs Outdated Show resolved Hide resolved
math_checked_op_dict
)
}
DataType::Date32 => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I wonder if the handling of Date32 and Date64 can be extracted as a separate function so it can be shared with add_dyn, but feel free to ignore if this is not easy.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, let me leave it as it is now. I'm wonder if we should do checking/non-checking behavior on Data32/Date64.

@@ -522,67 +548,86 @@ macro_rules! typed_dict_math_op {
}};
}

/// Helper function to perform math lambda function on values from two dictionary arrays, this
/// version does not attempt to use SIMD explicitly (though the compiler may auto vectorize)
macro_rules! math_dict_op {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This macro is unnecessary and can be inlined. Removed it.

arrow/src/compute/kernels/arithmetic.rs Outdated Show resolved Hide resolved
arrow/src/compute/kernels/arithmetic.rs Outdated Show resolved Hide resolved
arrow/src/compute/kernels/arity.rs Show resolved Hide resolved
Comment on lines 300 to 301
if a.null_count() == 0 && b.null_count() == 0 {
let values = a.values().iter().zip(b.values()).map(|(l, r)| op(*l, *r));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This latest optimization cause a bit trouble on the change from PrimitiveArray to ArrayAccessor as function parameter.

I can only add one new trait ArrayValuesAccessor to coordinate it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rewrite this as external iteration, i.e. for i in 0..array.len() this is easier for the compiler, tends to optimise better and is simpler to understand. MutableBuffer::push_unchecked will behave equivalently

The trusted len iterators are not necessary imo, and are probably best avoided in general.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Sounds good to do. So I can get rid of ArrayValuesAccessor which is just needed to get values.

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would move away from using trusted len iter, it's basically a hack and internal iteration will optimise better or the same

FYI @Dandandan


/// Returns a values accessor [`ArrayValuesAccessor`] for this [`ArrayAccessor`] if
/// it supports. Returns [`None`] if it doesn't support accessing values directly.
fn get_values_accessor(&self) -> Option<&dyn ArrayValuesAccessor<Item = Self::Item>> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the point of the dyn indirection here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want to return a Option<&[Self::Item]> directly as the semantics look weird (i.e. some ArrayAccessor doesn't provide values). This is to return self as ArrayValuesAccessor so the caller can call values.

But I think I will remove it based on #2740 (comment).

Comment on lines 300 to 301
if a.null_count() == 0 && b.null_count() == 0 {
let values = a.values().iter().zip(b.values()).map(|(l, r)| op(*l, *r));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rewrite this as external iteration, i.e. for i in 0..array.len() this is easier for the compiler, tends to optimise better and is simpler to understand. MutableBuffer::push_unchecked will behave equivalently

The trusted len iterators are not necessary imo, and are probably best avoided in general.

/// This is similar to `math_op` as it performs given operation between two input primitive arrays.
/// But the given operation can return `Err` if overflow is detected. For the case, this function
/// returns an `Err`.
fn math_checked_op<LT, RT, F>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not seem to be necessary?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is for type bound only.

Changed the math_checked_op to try_binary, got:

error[E0284]: type annotations needed: cannot satisfy `<_ as native::ArrowPrimitiveType>::Native == i8`
   --> arrow/src/compute/kernels/arithmetic.rs:848:21
    |
848 |                     try_binary(left, right, |a, b| a.add_checked(b)).map(|a| Arc::new(a) as ArrayRef)
    |                     ^^^^^^^^^^ cannot satisfy `<_ as native::ArrowPrimitiveType>::Native == i8`

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it not just a matter of providing the necessary type hint?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is wrapped in downcast_primitive_array macro call. It seems no way to add type hint.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see, it is because the return type needs to be type hinted. Darn...

)));
}
/// Perform given operation on two `DictionaryArray`s.
/// Returns an error if the two arrays have different value type
Copy link
Contributor

@tustvold tustvold Sep 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The performance of this will be pretty terrible, and results in a huge amount of codegen. I'm not entirely sure of the use-case tbh... Perhaps it is worth exploring putting this behind a feature flag

Edit: I wouldn't be surprised if hydrating the dictionary to its values, performing the operation and casting back was faster

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, math_op_dict? The change just inlines macro math_dict_op into the function math_op_dict. So I suppose you are not meaning the change but math_op_dict itself?

This can be done in follow-up to optimize it. The current change diff is quite big.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough, the compile times are currently despair inducing and I have a chip on my shoulder about them 😅

buffer.append_n_zeroed(len);
let slice = buffer.as_slice_mut();

for idx in 0..len {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rewrite it with a for loop.

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we run the arithmetic benchmarks possibly?

// `values` is an iterator with a known size from a PrimitiveArray
return Ok(unsafe { build_primitive_array(len, buffer, 0, None) });
let mut buffer = BufferBuilder::<O::Native>::new(len);
buffer.append_n_zeroed(len);
Copy link
Contributor

@tustvold tustvold Sep 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be faster to reserve the correct capacity and use push_unchecked. Avoids zero allocating.
As written I suspect this is a performance regression as this if block is effectively identical to the one below now

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, if I get it correctly, I did something like:

for idx in 0..len {
    unsafe {
        buffer.push_unchecked(op(a.value_unchecked(idx), b.value_unchecked(idx))?);
    };
}

But still see regression.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall I roll back to the ArrayValuesAccessor way? I benchmarked it and the performance keeps the same.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll have a brief play tomorrow, I would really like to avoid adding further APIs, especially ones that aren't a generalisable abstraction, if we can possibly avoid it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried another way to iterate the values from both side. But still the regression.

I think when we call value_unchecked to get values iteratively, it is an indirect access pattern compared with values where we simply process on a slice. It makes LLVM hard to optimize it as vectorized access.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slice iterators boil down to the same thing, in fact the external iteration is harder for llvm. Something else is going on here.

Copy link
Contributor

@tustvold tustvold Sep 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this appears to boil down to the whims of LLVMs inlining heuristics, try_from_trusted_len_iter doesn't appear to get fully inlined and therefore gets optimised properly (although I have a really hard time understanding the generated code). Potentially rustc is doing something to help LLVM here.

However, when the loop is defined in the function body of try_binary LLVM doesn't optimize it properly. If you split it out into a free function with inline(never), it optimises correctly and is actually marginally faster than the iterator.

Annoyingly using try_from_trusted_len_iter but doing something like (0..a.len()).map(|idx| ...) to construct the iterator also doesn't work, unless split into a free function.

I would therefore suggest splitting the no null variant into a free function marked inline never, as this appears to work...

FYI @Dandandan as I wonder if we're running into this elsewhere.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, interesting...let me split it out to a free function and benchmark again.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, the suggested never inline function works.

/// This is similar to `math_op` as it performs given operation between two input primitive arrays.
/// But the given operation can return `Err` if overflow is detected. For the case, this function
/// returns an `Err`.
fn math_checked_op<LT, RT, F>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it not just a matter of providing the necessary type hint?

)));
}
/// Perform given operation on two `DictionaryArray`s.
/// Returns an error if the two arrays have different value type
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough, the compile times are currently despair inducing and I have a chip on my shoulder about them 😅

{
math_dict_op!(left, right, op, PrimitiveArray<T>)
let left = left.downcast_dict::<PrimitiveArray<T>>().unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This possibly needs to check the actual type, currently it will panic

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is called by a macro which matches the value types of both sides, so it should be matched. But let me add one check.

@Dandandan
Copy link
Contributor

I would move away from using trusted len iter, it's basically a hack and internal iteration will optimise better or the same

FYI @Dandandan

Sure - if we get a similar performance or better I've no problems with that.
I am not sure I would call it a hack though - the Rust standard library uses the same technique.

@tustvold
Copy link
Contributor

tustvold commented Sep 17, 2022

Agreed, iterators can be made to perform the same, but it requires a significant amount of trickery and is subject to the whims of LLVM. For the purposes of arrow kernels, its just unnecessary imo.

https://medium.com/@veedrac/rust-is-slow-and-i-am-the-cure-32facc0fdcb may also be of interest, though sadly it isnt't possible to implement try_for_each currently.

@viirya
Copy link
Member Author

viirya commented Sep 17, 2022

Could we run the arithmetic benchmarks possibly?

Yea, checked kernels with no null values show obvious regression.

O: ArrowPrimitiveType,
F: Fn(A::Item, B::Item) -> Result<O::Native>,
{
let mut buffer = MutableBuffer::new(len);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The length parameter for MutableBuffer::new is in bytes, so I think this needs to be multiplied by O::get_byte_width()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missed it. Thanks for catching it.

/// This is similar to `math_op` as it performs given operation between two input primitive arrays.
/// But the given operation can return `Err` if overflow is detected. For the case, this function
/// returns an `Err`.
fn math_checked_op<LT, RT, F>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see, it is because the return type needs to be type hinted. Darn...

@tustvold tustvold merged commit 9599178 into apache:master Sep 20, 2022
@ursabot
Copy link

ursabot commented Sep 20, 2022

Benchmark runs are scheduled for baseline = 3bf6eb9 and contender = 9599178. 9599178 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@viirya
Copy link
Member Author

viirya commented Sep 20, 2022

Thanks for review!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add overflow-checking variants of arithmetic dyn kernels
7 participants