Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support DictionaryArray in unary kernel #1990

Merged
merged 7 commits into from Jul 6, 2022
Merged

Conversation

viirya
Copy link
Member

@viirya viirya commented Jul 2, 2022

Which issue does this PR close?

Closes #1989.

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jul 2, 2022
@codecov-commenter
Copy link

codecov-commenter commented Jul 2, 2022

Codecov Report

Merging #1990 (a7a5b78) into master (2309157) will increase coverage by 0.00%.
The diff coverage is 85.03%.

❗ Current head a7a5b78 differs from pull request most recent head ef80266. Consider uploading reports for the commit ef80266 to get more accurate results

@@           Coverage Diff            @@
##           master    #1990    +/-   ##
========================================
  Coverage   83.58%   83.58%            
========================================
  Files         222      222            
  Lines       57529    57639   +110     
========================================
+ Hits        48088    48180    +92     
- Misses       9441     9459    +18     
Impacted Files Coverage Δ
parquet/src/record/api.rs 96.82% <ø> (ø)
arrow/src/compute/kernels/arity.rs 81.05% <78.08%> (-9.86%) ⬇️
...row/src/array/builder/fixed_size_binary_builder.rs 85.52% <94.44%> (+21.42%) ⬆️
arrow/src/array/array_binary.rs 93.18% <0.00%> (-1.11%) ⬇️
parquet_derive/src/parquet_field.rs 65.98% <0.00%> (+0.22%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2309157...ef80266. Read the comment docs.

@viirya
Copy link
Member Author

viirya commented Jul 5, 2022

cc @sunchao I forgot this is not merged yet, I need to use this when implementing dictionary and scalar arithmetic kernels.

@viirya
Copy link
Member Author

viirya commented Jul 5, 2022

Unrelated error:

################# FAILURES #################
FAILED TEST: datetime Java producing,  C# consuming

arrow/src/compute/kernels/arity.rs Outdated Show resolved Hide resolved
}

/// A helper function that applies an unary function to a dictionary array with primitive value type.
fn unary_dict<K, F, T>(array: &DictionaryArray<K>, op: F) -> Result<PrimitiveArray<T>>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this need to be public so it can be used by other mods like arithmetic.rs?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They will use unary_dyn directly.

.as_any()
.downcast_ref::<$value_ty>()
.unwrap()
.take_iter_unchecked($array.keys_iter())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, is it possible to directly apply the op on dictionary values? if values are large strings, the current approach will need to first decode the dictionary and convert it to a "plain" array, and then apply the op to each value in there, which is expensive.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it too, but didn't try. Let me try if it is feasible.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I remember it. It is feasible, but it has one con. Because unary_dict and unary_dyn must return ArrayRef, the compiler cannot infer T so the caller must specify it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made the change in latest commit. You can see if this is okay.


let values = dict_values
.iter()
.map(|v| v.map(|value| op(value)))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although the clippy complains about this

error: redundant closure
   --> arrow/src/compute/kernels/arity.rs:101:24
    |
101 |         .map(|v| v.map(|value| op(value)))
    |                        ^^^^^^^^^^^^^^^^^ help: replace the closure with the function itself: `op`
    |

But I cannot follow the suggestion. The compiler fails to compile because op cannot moved.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea I tried the same too - this should be fine.

Copy link
Member

@sunchao sunchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good!

T: ArrowPrimitiveType,
F: Fn(T::Native) -> T::Native,
{
let dict_values = array
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍


let values = dict_values
.iter()
.map(|v| v.map(|value| op(value)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea I tried the same too - this should be fine.

arrow/src/compute/kernels/arity.rs Outdated Show resolved Hide resolved
arrow/src/compute/kernels/arity.rs Outdated Show resolved Hide resolved

let mut data = ArrayData::builder(array.data_type().clone())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another question: similar to unary, do we plan to support different output value type than T?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about it. I think for mostly usage in query execution, input/output types are the same for unary. A difficulty to specify output type for unary_dict is, unlike unary where we can bind output type on the produced PrimitiveArray<O> and F: Fn(T::Native) -> T::Native. In unary_dict, the output type O cannot be bound on the produce dictionary array. So the compiler cannot infer it. Then we need to specify O when invoking unary_dict/unary_dyn. I feel it is harder to use. So if we only use it for same input/output types, I just leave it without a separate output type parameter.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, make sense.

Copy link
Member

@sunchao sunchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@viirya
Copy link
Member Author

viirya commented Jul 6, 2022

Thanks @sunchao

@viirya viirya merged commit 62053a8 into apache:master Jul 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support DictionaryArray in unary kernel
3 participants