feat: support IN list on Dictionary #3975

NGA-TRAN · 2022-10-26T18:28:13Z

Which issue does this PR close?

Rationale for this change

To support predicates of the form: WHERE col IN (values_of_dictionary_col)

NGA-TRAN · 2022-10-26T18:29:02Z

datafusion/physical-expr/src/expressions/in_list.rs

@@ -489,6 +496,196 @@ impl InListExpr {
            contains_null
        ))
    }
+
+    fn evaluate_non_dict(


This is just a refactor that moves the existing evaluation code into this function so it can be reused for Dictionary data.

NGA-TRAN · 2022-10-26T18:29:27Z

datafusion/physical-expr/src/expressions/in_list.rs

@@ -714,188 +912,31 @@ impl PhysicalExpr for InListExpr {
            };

            match value_data_type {
-                DataType::Float32 => {


This is the moved code

tustvold

I worry that the implementation as written will have extremely poor memory and CPU performance.

I think a more optimised implementation might do the following:

1. Compute the set of used dictionary keys as a BooleanArray, e.g. using BooleanBufferBuilder
2. AND the null mask of the Dictionary values with the computed mask
3. Call InList on that new array of values
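
The three steps above can be sketched without Arrow at all. This is a minimal illustration, not the PR's code: `Dict`, `used_mask`, `mask_values` and `in_list` are hypothetical names, plain vectors stand in for Arrow arrays, and the IN list is assumed to contain no nulls.

```rust
/// Hypothetical stand-in for a dictionary-encoded string array:
/// `keys[i]` indexes into `values`, and `None` marks a null.
struct Dict<'a> {
    keys: Vec<Option<usize>>,
    values: Vec<Option<&'a str>>,
}

/// Step 1: mark which dictionary entries are referenced by at least one key.
fn used_mask(dict: &Dict) -> Vec<bool> {
    let mut used = vec![false; dict.values.len()];
    for key in dict.keys.iter().flatten() {
        used[*key] = true;
    }
    used
}

/// Step 2: AND the values' validity with the used mask, so that
/// unreferenced entries become null and need no comparison.
fn mask_values<'a>(dict: &Dict<'a>, used: &[bool]) -> Vec<Option<&'a str>> {
    dict.values
        .iter()
        .zip(used)
        .map(|(v, &u)| if u { *v } else { None })
        .collect()
}

/// Step 3: evaluate the IN list once per dictionary entry.
/// Null entries yield null; a null-free list is assumed here.
fn in_list(values: &[Option<&str>], list: &[&str]) -> Vec<Option<bool>> {
    values
        .iter()
        .map(|v| v.map(|s| list.contains(&s)))
        .collect()
}
```

The result has one entry per dictionary value rather than per row, so the cost of the comparison scales with the dictionary size, not the row count.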

tustvold · 2022-10-26T18:43:35Z

datafusion/physical-expr/src/expressions/in_list.rs

+                        .unwrap();
+                    let mut dict_vals = Vec::with_capacity(dict_array.len());
+                    for i in 0..dict_array.len() {
+                        let (values_array, values_index) =


This performs a downcast for each element. It would be better to access dict_array.keys and dict_array.values once, and then iterate over the keys.
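
To illustrate the point with std only: the sketch below uses a hypothetical `ErasedDict` whose values sit behind `dyn Any`, loosely mimicking a dynamically typed array. The downcast happens once, outside the loop over the keys. None of these names come from arrow-rs.

```rust
use std::any::Any;

/// Hypothetical type-erased dictionary: the values are stored behind
/// `dyn Any`, so they must be downcast before elements can be read.
struct ErasedDict {
    keys: Vec<Option<usize>>,
    values: Box<dyn Any>,
}

/// Downcast the values once, then walk the keys.
/// Returns `None` if the values are not a `Vec<String>`.
fn lookup_all(dict: &ErasedDict) -> Option<Vec<Option<&str>>> {
    // One downcast for the whole array instead of one per element.
    let values = dict.values.downcast_ref::<Vec<String>>()?;
    Some(
        dict.keys
            .iter()
            .map(|key| key.map(|k| values[k].as_str()))
            .collect(),
    )
}
```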

tustvold · 2022-10-26T18:44:28Z

datafusion/physical-expr/src/expressions/in_list.rs

+                        .as_any()
+                        .downcast_ref::<DictionaryArray<Int32Type>>()
+                        .unwrap();
+                    let mut dict_vals = Vec::with_capacity(dict_array.len());


This effectively hydrates (fully materializes) the dictionary, which is likely to be extremely inefficient.

tustvold · 2022-10-26T18:45:59Z

datafusion/physical-expr/src/expressions/in_list.rs

+                    // Get values from the dictionary that include nulls for none values
+                    let dict_array = array
+                        .as_any()
+                        .downcast_ref::<DictionaryArray<Int32Type>>()


This will panic if key_type is not DataType::Int32. We should either add this to the match block, or use something like https://docs.rs/arrow/latest/arrow/macro.downcast_dictionary_array.html to handle all cases

NGA-TRAN replied: Good catch. I forgot to make it general. Working on it.
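
One way to avoid hard-coding a single key type is to dispatch on the key type once and share a generic body. The sketch below shows that shape with std only. The `Keys` enum, `lookup` and `lookup_any` are hypothetical, and mapping an out-of-range or negative key to null rather than panicking is a choice of this sketch, not something the thread specifies.

```rust
use std::convert::TryInto;

/// Hypothetical dictionary keys of several integer widths, standing in
/// for the different key types a dictionary array may be declared with.
enum Keys {
    Int8(Vec<Option<i8>>),
    Int32(Vec<Option<i32>>),
    UInt64(Vec<Option<u64>>),
}

/// Generic lookup shared by every key width. Nulls, and keys that do not
/// fit in `usize` or fall outside `values`, map to `None` instead of panicking.
fn lookup<'a, K>(keys: &[Option<K>], values: &'a [String]) -> Vec<Option<&'a str>>
where
    K: Copy + TryInto<usize>,
{
    keys.iter()
        .map(|key| {
            let index: usize = (*key)?.try_into().ok()?;
            values.get(index).map(String::as_str)
        })
        .collect()
}

/// Dispatch once on the key type, then run the same generic code.
fn lookup_any<'a>(keys: &Keys, values: &'a [String]) -> Vec<Option<&'a str>> {
    match keys {
        Keys::Int8(k) => lookup(k, values),
        Keys::Int32(k) => lookup(k, values),
        Keys::UInt64(k) => lookup(k, values),
    }
}
```

In arrow-rs the match arms would presumably be generated by the downcast_dictionary_array macro linked above rather than written by hand.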

tustvold · 2022-10-26T18:47:13Z

datafusion/physical-expr/src/expressions/in_list.rs

+                        // Look up value from Index
+                        let value = match values_index {
+                            Some(values_index) => {
+                                ScalarValue::try_from_array(values_array, values_index)


This also performs a fairly expensive downcast operation for every dictionary element

tustvold · 2022-10-26T18:48:35Z

datafusion/physical-expr/src/expressions/in_list.rs

@@ -436,6 +436,13 @@ impl InListExpr {
                    ScalarValue::Utf8(None) => None,
                    ScalarValue::LargeUtf8(Some(v)) => Some(v.as_str()),
                    ScalarValue::LargeUtf8(None) => None,
+                    ScalarValue::Dictionary(_, v) => match v.as_ref() {


I don't understand this modification


alamb · 2022-10-26T20:28:20Z

datafusion/core/tests/sql/predicates.rs

@@ -428,9 +428,8 @@ async fn csv_in_set_test() -> Result<()> {
 }

 #[tokio::test]
-#[ignore]
 // https://github.com/apache/arrow-datafusion/issues/3936


Suggested change

// https://github.com/apache/arrow-datafusion/issues/3936

alamb · 2022-10-26T20:30:33Z

datafusion/physical-expr/src/expressions/in_list.rs

@@ -436,6 +436,13 @@ impl InListExpr {
                    ScalarValue::Utf8(None) => None,
                    ScalarValue::LargeUtf8(Some(v)) => Some(v.as_str()),
                    ScalarValue::LargeUtf8(None) => None,
+                    ScalarValue::Dictionary(_, v) => match v.as_ref() {


It allows getting the underlying value out of a (typed) ScalarValue::Dictionary

alamb · 2022-10-26T21:01:53Z

datafusion/physical-expr/src/expressions/in_list.rs

-                        )
+                DataType::Dictionary(_key_type, value_type) => {
+                    // Get values from the dictionary that include nulls for none values
+                    let dict_array = array


I feel like you should be able to evaluate the IN list only on the dictionary values, rather than repeatedly looking up the same elements in the dictionary, and then use 'take' (https://docs.rs/arrow/latest/arrow/compute/kernels/take/fn.take.html) to form the final array.

Something like this pseudocode, maybe:

    let values = dict_array.values();
    // recursively evaluate IN <..> on the value array
    let values_result = evaluate_set(values, list_values);
    // Then form the final boolean array by calling take on the indices
    compute::take(values_result, dict_array.keys())
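
The same idea can be written as runnable std-only Rust. This is a sketch under assumptions: plain slices stand in for Arrow arrays, the local `take` function is a minimal stand-in for the real kernel (not its actual signature), and the IN list is assumed to contain no nulls.

```rust
/// Evaluate `value IN list` once per distinct dictionary value.
/// A null value yields a null result; a null-free list is assumed.
fn in_list_values(values: &[Option<&str>], list: &[&str]) -> Vec<Option<bool>> {
    values
        .iter()
        .map(|v| v.map(|s| list.contains(&s)))
        .collect()
}

/// Minimal stand-in for a `take` kernel: gather `source[key]` for each key,
/// propagating null keys as null results.
fn take(source: &[Option<bool>], keys: &[Option<usize>]) -> Vec<Option<bool>> {
    keys.iter().map(|key| key.and_then(|k| source[k])).collect()
}

/// Dictionary IN list: compare each distinct value once, then map to rows.
fn dict_in_list(
    keys: &[Option<usize>],
    values: &[Option<&str>],
    list: &[&str],
) -> Vec<Option<bool>> {
    let values_result = in_list_values(values, list);
    take(&values_result, keys)
}
```

Each distinct value is compared against the list once, however many rows reference it.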

NGA-TRAN · 2022-10-27T05:36:34Z

datafusion/physical-expr/src/expressions/in_list.rs

-                        )
+                DataType::Dictionary(_key_type, value_type) => {
+                    // Get values from the dictionary that include nulls for none values
+                    let dict_array = array


@alamb and @tustvold
I did as suggested above, but it only works well if there are no nulls in the data. I may be missing something, or it may take a lot more work to make it work this way.

FYI: // 1 is the new implementation per your suggestion. // 2 is the expensive one, which I am temporarily keeping to run some tests.
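
The thread does not spell out which null behaviour the failing test expects. Assuming standard SQL three-valued logic for IN, a null-aware version of the values-then-keys approach might look like the sketch below, where `sql_in` and `dict_sql_in` are hypothetical names and `None` models NULL.

```rust
/// `value IN (list)` under SQL three-valued logic, with `None` as NULL:
/// - NULL value                  -> NULL
/// - some list element is equal  -> TRUE
/// - no match, list has a NULL   -> NULL (the NULL might have matched)
/// - otherwise                   -> FALSE
fn sql_in(value: Option<&str>, list: &[Option<&str>]) -> Option<bool> {
    let v = value?;
    if list.iter().any(|item| *item == Some(v)) {
        Some(true)
    } else if list.iter().any(|item| item.is_none()) {
        None
    } else {
        Some(false)
    }
}

/// Apply `sql_in` once per dictionary value, then map the per-value
/// results to rows; a null key produces a null row.
fn dict_sql_in(
    keys: &[Option<usize>],
    values: &[Option<&str>],
    list: &[Option<&str>],
) -> Vec<Option<bool>> {
    let per_value: Vec<Option<bool>> = values.iter().map(|v| sql_in(*v, list)).collect();
    keys.iter().map(|key| key.and_then(|k| per_value[k])).collect()
}
```

Nulls can enter from three places here (a null key, a null dictionary value, a null list element), and each needs to end up as a null row rather than FALSE.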

NGA-TRAN · 2022-10-27T05:40:22Z

datafusion/physical-expr/src/expressions/in_list.rs

+                    // Get values from the dictionary that include nulls for none values
+                    let dict_array = array
+                        .as_any()
+                        .downcast_ref::<DictionaryArray<Int32Type>>()


I will make this general after the null data work. I tried to move this inside a function with a trait bound to make it generic over DictionaryArray<K>. That works for the downcast_ref, but it does not work with take. If we use take, I will need to write more code to cover all cases.

NGA-TRAN · 2022-10-27T05:41:24Z

datafusion/core/tests/sql/predicates.rs

+}
+
+#[tokio::test]
+async fn in_list_string_dictionaries_with_null() -> Result<()> {


This test does not pass with method // 1

tustvold · 2022-10-27T05:45:37Z

datafusion/physical-expr/src/expressions/in_list.rs

+                        .unwrap();
+                    let keys = dict_array.keys();
+
+                    let values_result = evaluate_set(&array, list_values).unwrap();


Suggested change

-                    let values_result = evaluate_set(&array, list_values).unwrap();
+                    let values_result = evaluate_set(dict_array.values().as_ref(), list_values).unwrap();

I'm surprised that, as written, this doesn't result in a stack overflow?

tustvold · 2022-10-27T05:49:19Z

datafusion/physical-expr/src/expressions/in_list.rs

+}
+
+// Return a boolean array indicating whether the value is in list_values
+fn evaluate_set(


Why is this a new function? I think @alamb's suggestion was to recurse into InListExpr::evaluate (by pulling it out into a free function).

alamb replied: Yeah, I was suggesting that, since InList was already implemented for non-dictionary types, we reuse that implementation (though I think that will require some restructuring of how evaluate is written).
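
A rough picture of the "recurse through a free function" shape, using a hypothetical `Array` enum instead of Arrow's array types. It is only meant to show the control flow, and it assumes a null-free list.

```rust
/// Hypothetical array model: either plain strings or a dictionary whose
/// values are themselves an array (and so can be handled recursively).
enum Array {
    Utf8(Vec<Option<String>>),
    Dictionary {
        keys: Vec<Option<usize>>,
        values: Box<Array>,
    },
}

/// Free function so the dictionary arm can call back into the same code.
fn in_list(array: &Array, list: &[&str]) -> Vec<Option<bool>> {
    match array {
        // Existing non-dictionary path.
        Array::Utf8(items) => items
            .iter()
            .map(|item| item.as_deref().map(|s| list.contains(&s)))
            .collect(),
        // Recurse on the *values*, not on `array` itself: recursing on the
        // dictionary again would never reach the base case.
        Array::Dictionary { keys, values } => {
            let values_result = in_list(values, list);
            keys.iter()
                .map(|key| key.and_then(|k| values_result[k]))
                .collect()
        }
    }
}
```

Because the dictionary arm delegates to the same function, any value type the plain path already supports would work for dictionaries without a second implementation.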

tustvold · 2022-10-27T05:51:31Z

datafusion/physical-expr/src/expressions/in_list.rs

+        .collect::<Vec<_>>();
+    let list_array = ScalarValue::iter_to_array(scalars).unwrap();
+
+    let cmp = build_compare(&array, &list_array).unwrap();


This is known not to handle nulls correctly - apache/arrow-rs#2687

tustvold · 2022-10-27T06:01:34Z

Chatted with @NGA-TRAN, I'm going to take a stab at this first thing tomorrow (NZ time)

tustvold · 2022-11-01T23:46:09Z

An alternative implementation that builds upon #4057 can be found in 9973b03.

Going to close this one

feat: support IN list on Dictionary (d032076)

github-actions bot added the core (Core datafusion crate) and physical-expr (Physical Expressions) labels on Oct 26, 2022

chore: remove empty line (4d517cb)

NGA-TRAN commented Oct 26, 2022

tustvold reviewed Oct 26, 2022

alamb reviewed Oct 26, 2022

refactor: use dictionary value to match In list per reviewer's suggestion (535a10b)

NGA-TRAN commented Oct 27, 2022

tustvold reviewed Oct 27, 2022

tustvold marked this pull request as draft October 27, 2022 06:00

tustvold closed this Nov 1, 2022
