Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support IN list on Dictionary #3975

Closed
wants to merge 3 commits into from

Conversation

NGA-TRAN
Copy link
Contributor

Which issue does this PR close?

Closes #3936

Rationale for this change

To support the predicate where col IN (values_of_dictionary_col)

@github-actions github-actions bot added core Core datafusion crate physical-expr Physical Expressions labels Oct 26, 2022
@@ -489,6 +496,196 @@ impl InListExpr {
contains_null
))
}

fn evaluate_non_dict(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a refactor to move all available on into this function so it can be reused with Dictionary data

@@ -714,188 +912,31 @@ impl PhysicalExpr for InListExpr {
};

match value_data_type {
DataType::Float32 => {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the moved code

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I worry that the implementation as written will have extremely poor memory and CPU performance.

I think a more optimised implementation might do the following:

  • Compute the set of used dictionary keys to a BooleanArray, e.g. using BooleanBufferBuilder
  • And null mask of the Dictionary values with the computed mask
  • Call InList on that new array of values

.unwrap();
let mut dict_vals = Vec::with_capacity(dict_array.len());
for i in 0..dict_array.len() {
let (values_array, values_index) =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will perform a downcast for each element, it would be better to access dict_array.keys, and dict_array.values, and then iterate over the keys

.as_any()
.downcast_ref::<DictionaryArray<Int32Type>>()
.unwrap();
let mut dict_vals = Vec::with_capacity(dict_array.len());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This acts to hydrate the dictionary which is likely extremely inefficient

// Get values from the dictionary that include nulls for none values
let dict_array = array
.as_any()
.downcast_ref::<DictionaryArray<Int32Type>>()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will panic if key_type is not DataType::Int32. We should either add this to the match block, or use something like https://docs.rs/arrow/latest/arrow/macro.downcast_dictionary_array.html to handle all cases

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I forgot to make it general. Working on it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will make this general after the null data work. I tried to move this inside a function with trait to make if <DictionaryArray<K>>. It works for the downcast_ref but it does not work with take. I need to write more code for all cases if we use take

// Look up value from Index
let value = match values_index {
Some(values_index) => {
ScalarValue::try_from_array(values_array, values_index)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also performs a fairly expensive downcast operation for every dictionary element

@@ -436,6 +436,13 @@ impl InListExpr {
ScalarValue::Utf8(None) => None,
ScalarValue::LargeUtf8(Some(v)) => Some(v.as_str()),
ScalarValue::LargeUtf8(None) => None,
ScalarValue::Dictionary(_, v) => match v.as_ref() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this modification

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It allows getting the underlying value out of a (typed) ScalarValue::Dictionary

@@ -428,9 +428,8 @@ async fn csv_in_set_test() -> Result<()> {
}

#[tokio::test]
#[ignore]
// https://github.com/apache/arrow-datafusion/issues/3936
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// https://github.com/apache/arrow-datafusion/issues/3936

@@ -436,6 +436,13 @@ impl InListExpr {
ScalarValue::Utf8(None) => None,
ScalarValue::LargeUtf8(Some(v)) => Some(v.as_str()),
ScalarValue::LargeUtf8(None) => None,
ScalarValue::Dictionary(_, v) => match v.as_ref() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It allows getting the underlying value out of a (typed) ScalarValue::Dictionary

)
DataType::Dictionary(_key_type, value_type) => {
// Get values from the dictionary that include nulls for none values
let dict_array = array
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like you should be able to evaluate the IN list only on the dictionary values rather than continually looking up the same elements over and over again in the dictionary and use 'take' https://docs.rs/arrow/latest/arrow/compute/kernels/take/fn.take.html to form the final array

Something like thus pseudo code maybe

let values = dict_array.values();

// recursively evaluate IN <..> on the value array
let values_result = evaluate_set(values, list_values);

// Then form the final boolean array by calling take on the indices
compute::take(values_result, dict_array.keys())

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb and @tustvold
I did what as above but it only works well if there are no nulls in the data. I may miss something or there will be a lot more work to make it work this way.

FYI // 1 is the new implementation per your suggestion. // 2 is the expensive one I temporarily keep to run some tests

)
DataType::Dictionary(_key_type, value_type) => {
// Get values from the dictionary that include nulls for none values
let dict_array = array
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb and @tustvold
I did what as above but it only works well if there are no nulls in the data. I may miss something or there will be a lot more work to make it work this way.

FYI // 1 is the new implementation per your suggestion. // 2 is the expensive one I temporarily keep to run some tests

// Get values from the dictionary that include nulls for none values
let dict_array = array
.as_any()
.downcast_ref::<DictionaryArray<Int32Type>>()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will make this general after the null data work. I tried to move this inside a function with trait to make if <DictionaryArray<K>>. It works for the downcast_ref but it does not work with take. I need to write more code for all cases if we use take

}

#[tokio::test]
async fn in_list_string_dictionaries_with_null() -> Result<()> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test do not pass with method // 1

.unwrap();
let keys = dict_array.keys();

let values_result = evaluate_set(&array, list_values).unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let values_result = evaluate_set(&array, list_values).unwrap();
let values_result = evaluate_set(dict_array.values().as_ref(), list_values).unwrap();

? I'm surprised as written this doesn't result in a stack overflow?

}

// Return a boolean array indicating whether the value is in list_values
fn evaluate_set(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this a new function, I think @alamb 's suggestion was to recurse into InListExpr::evaluate (by pulling it into a free function)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was suggesting that since InList was already implemented for non dictionary types, we re-use that implementation (though I think that will require some restructuring of how evaluate is written

.collect::<Vec<_>>();
let list_array = ScalarValue::iter_to_array(scalars).unwrap();

let cmp = build_compare(&array, &list_array).unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is known not to handle nulls correctly - apache/arrow-rs#2687

@tustvold tustvold marked this pull request as draft October 27, 2022 06:00
@tustvold
Copy link
Contributor

Chatted with @NGA-TRAN, I'm going to take a stab at this first thing tomorrow (NZ time)

@tustvold
Copy link
Contributor

tustvold commented Nov 1, 2022

Alternative implementation that builds upon #4057 can be found 9973b03.

Going to close this one

@tustvold tustvold closed this Nov 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core datafusion crate physical-expr Physical Expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Error using IN list on dictionary encoded data: InList does not support datatype Dictionary(Int32, Utf8).
3 participants