"Optimize" Dictionary contents in DictionaryArray / concat_batches #506

Closed
alamb opened this issue Jun 28, 2021 · 16 comments · Fixed by #3558
Labels: arrow (Changes to the arrow crate), enhancement (Any new improvement worthy of an entry in the changelog)

Comments

alamb commented Jun 28, 2021

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Our project (IOx) uses dictionary arrays heavily to compress arrays of strings (that are mostly the same value). When DictionaryArrays are concatenated, the contents of their dictionaries are also concatenated, resulting in duplicate dictionary entries as well as entries that are unused.

Similarly, when filtering a DictionaryArray, some values may be filtered out entirely, so the dictionary may retain entries that are no longer referenced.

Describe the solution you'd like

  1. Add an optimize_dictionaries function which would return a (potentially new) array whose dictionary has the following properties:
    a) Every value in the dictionary is unique
    b) Every value in the dictionary is used at least once in the array's values

So given an input like

Array: "A, C, A, B, C, C, A"
Dictionary: {A, B, C, D, E, F, A, B, C}
Values: {0, 2, 0, 1, 2, 2, 0}

The output of optimize_dictionaries would be (note there is no duplication in the dictionary; see the sketch after this list):
Array: "A, C, A, B, C, C, A"
Dictionary: {A, B, C}
Values: {0, 2, 0, 1, 2, 2, 0}

  2. Add a function such as concat_and_optimize_batches that would do the array compaction as part of concatenation (and thus avoid creating an intermediate concat result which was then optimized)
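
A minimal sketch of the compaction in item 1, assuming plain vectors stand in for arrow's DictionaryArray; the function and variable names are illustrative, not a proposed API, and the example above is used as input:

```rust
// The "Values" line in the example above is the key array of the dictionary array.
use std::collections::{HashMap, HashSet};

fn compact_dictionary(values: &[&str], keys: &[usize]) -> (Vec<String>, Vec<usize>) {
    // Pass 1: find which dictionary slots are actually referenced by the keys.
    let used: HashSet<usize> = keys.iter().copied().collect();

    // Pass 2: build the new dictionary in the original order, dropping unused
    // slots and deduplicating repeated values, recording old key -> new key.
    let mut new_values: Vec<String> = Vec::new();
    let mut value_to_new: HashMap<&str, usize> = HashMap::new();
    let mut remap: HashMap<usize, usize> = HashMap::new();
    for (old_key, &value) in values.iter().enumerate() {
        if !used.contains(&old_key) {
            continue;
        }
        let new_key = *value_to_new.entry(value).or_insert_with(|| {
            new_values.push(value.to_string());
            new_values.len() - 1
        });
        remap.insert(old_key, new_key);
    }

    // Pass 3: rewrite the keys against the compacted dictionary.
    let new_keys = keys.iter().map(|k| remap[k]).collect();
    (new_values, new_keys)
}

fn main() {
    // The input from the example above.
    let dictionary = ["A", "B", "C", "D", "E", "F", "A", "B", "C"];
    let keys = [0, 2, 0, 1, 2, 2, 0];
    let (dict, keys) = compact_dictionary(&dictionary, &keys);
    // Matches the expected output: every value unique and used at least once.
    assert_eq!(dict, ["A", "B", "C"]);
    assert_eq!(keys, [0, 2, 0, 1, 2, 2, 0]);
}
```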

The solution described in #504 is designed to avoid copying dictionaries when they are identical (i.e. they literally point at the same array)

Describe alternatives you've considered
An alternative might be to extend concat_batches directly to always try to compact arrays. However, since this involves a tradeoff of more CPU for potentially less memory usage, I am not sure it is appropriate to always apply.

Additional context

@tustvold has implemented a version of this code in IOx: https://github.com/influxdata/influxdb_iox/pull/1830 which could be suitable for inclusion in arrow

See also #504.

alamb added the enhancement label on Jun 28, 2021
alamb commented Jun 28, 2021

FYI @jhorstmann, I don't know if this is similarly interesting to you, but we will likely not be able to ensure our dictionaries are always exactly the same.

alamb added the arrow label on Jun 28, 2021
@jhorstmann

1. b) Every value in the dictionary is used at least once in the array's values

A nice benefit of this is that a GROUP BY on that dictionary column afterwards would be very cheap, since it does not need another hashmap and could instead index directly into an array of accumulators with the keys. Not sure if that is the use case you are after or if this is more of a nice side effect.

Ensuring sorted dictionaries is something I'm definitely interested in; Field already has the dict_is_ordered flag, based on which a much faster implementation of the sort comparator or comparison kernel could be selected. I was thinking of a different implementation than using a BTreeSet, though. I only have a rough sketch, but the idea is to use sort_to_indices on the dictionary values and then somehow build a lookup table as a vector. With the sorted indices it should also be possible to build a lookup table for remapping duplicates.
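
A rough sketch of that sorted-indices idea, assuming a plain slice of values stands in for the dictionary and a stable sort stands in for sort_to_indices; all names here are illustrative only:

```rust
fn sort_and_remap(values: &[&str]) -> (Vec<String>, Vec<usize>) {
    // Sort the dictionary slots by value (a stand-in for sort_to_indices).
    let mut sorted: Vec<usize> = (0..values.len()).collect();
    sorted.sort_by_key(|&i| values[i]);

    // Walk the sorted slots once, emitting each distinct value exactly once
    // and filling an old-key -> new-key lookup table stored as a vector.
    let mut new_values: Vec<String> = Vec::new();
    let mut remap = vec![0usize; values.len()];
    for &old_key in &sorted {
        if new_values.last().map(String::as_str) != Some(values[old_key]) {
            new_values.push(values[old_key].to_string());
        }
        remap[old_key] = new_values.len() - 1; // duplicates collapse onto one slot
    }
    (new_values, remap)
}
```

The original keys could then be rewritten through the remap vector, and since the new dictionary comes out sorted, dict_is_ordered could in principle be set as well.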

@jorgecarleitao

Great issue description @alamb 🎩

I would do it in a separate kernel, so as not to break the principle that concatenating arrays is an O(N) operation, where N is the number of elements in all arrays (this is O(N log N)?)

ensure_sort: bool or something like that would be a nice argument for such a function.

In general, we have a small challenge in how we track dictionary metadata, though: our DataType::Dictionary does not hold dictionary metadata, which means that we must store it somewhere else. This makes it more cumbersome, as the function cannot leverage this information to e.g. avoid re-sorting an already sorted dictionary array without that other "dictionary metadata".

My feeling is that we should (backward-incompatibly) extend DataType::Dictionary(keys, values, metadata) where metadata is a struct containing the different dictionary metadata available in Field, but I am not 100% convinced about this.

I also thought about a more radical approach of removing DataType::Dictionary, since a Dictionary is not formally a DataType but an array encoding. With that said, it does have a different physical representation, so in this sense it is convenient to write it as a separate DataType that can be matched. The disadvantage is that we can't change an array's encoding without changing the logical type associated with it. This contrasts with parquet, where encodings and logical types are independent of each other.

alamb commented Jun 28, 2021

A nice benefit of this is that a GROUP BY on that dictionary column afterwards would be very cheap, since it does not need another hashmap and could instead index directly into an array of accumulators with the keys. Not sure if that is the use case you are after or if this is more of a nice side effect.

I think "side effect" :)

Ensuring sorted dictionaries is something I'm definitely interested in; Field already has the dict_is_ordered flag, based on which a much faster implementation of the sort comparator or comparison kernel could be selected.

Yes, we are definitely also interested in ensuring sorted dictionaries (we have an optimized physical representation that requires a sorted dictionary and today it simply resorts the incoming dictionary, unnecessarily)

alamb commented Jun 28, 2021

I also thought about a more radical approach of removing DataType::Dictionary, since a Dictionary is not formally a DataType but an array encoding.

@jorgecarleitao it certainly sounds interesting -- as long as there was some reasonable way to interact with the dictionary and its contents this sounds good to me.

@tustvold

So another option I've been thinking about is, instead of optimizing the RecordBatches after the fact, to optionally generate better RecordBatches in the first place. This is similar to what is proposed in #504, but perhaps more general.

The basic idea would be to extend MutableArrayData with a flag to recompute dictionaries on freeze; I already have a fairly good idea of how to implement this. Variants of the concat, take, etc. kernels could then make use of this. An optimize-batch function would effectively just be this variant of the take kernel.

Additionally DataFusion operators such as SortPreservingMerge that use MutableArrayData directly could also benefit from this.

I'd be happy to implement this if people are happy for me to.

alamb commented Jun 29, 2021

The basic idea would be to extend MutableArrayData with a flag to recompute dictionaries on freeze

@tustvold This sounds pretty interesting. Would the idea be to add potentially duplicated / unused values in MutableArrayData and then rewrite the data on freeze?

I wonder if there is any way to avoid collecting the duplicated values in the first place.

@tustvold

@tustvold This sounds pretty interesting. Would the idea be to add potentially duplicated / unused values in MutableArrayData and then rewrite the data on freeze?

The idea I had was to defer creation of the dictionary ArrayData until freeze, as opposed to in the constructor, but otherwise to keep the extend generation the same. Freeze would then scan through the generated keys in the _MutableArrayData to compute the values to include in the final dictionary, build this new dictionary, and then rewrite the keys. I realise as I write this that _MutableArrayData may not have the necessary API to actually mutate data in place, but I don't see a compelling reason that it couldn't?

@jorgecarleitao

Note that I had very limited knowledge about Arrow and Rust when I initially proposed MutableArrayData, especially the dictionary part, which was simply written to "work for dictionary arrays".

Its primary goal was simply to allow us to easily construct an array out of "slices" of other arrays, which I was observing to be a recurring pattern in filter, concat, sort-merge and joins.

_MutableArrayData is private and does not need to comply with the arrow spec up to "freeze". So, I think that we can use/ experiment with it as we see fit to extract performance :P

alamb commented Jun 29, 2021

I was probably overthinking, but I was imagining that we could reduce peak memory usage by doing something smarter than "concat and then cleanup"

However, given that "concat and then cleanup" would be an improvement over master, starting there and then making additional improvements as events / profiling warrant seems like a good idea to me.

tustvold commented Jun 29, 2021

@alamb what I'm proposing isn't "concat then cleanup". You build the keys array with the current dictionary encodings as if they had been concatenated together (but don't actually concatenate them). Then compute the new dictionary and re-encode the keys with this new dictionary. Is there some way to do better than this?
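
For illustration, here is a hedged sketch of that flow over plain vectors; every name is hypothetical, not the actual concat / MutableArrayData API:

```rust
use std::collections::HashMap;

fn concat_dictionaries(inputs: &[(&[&str], &[usize])]) -> (Vec<String>, Vec<usize>) {
    // Step 1: build the keys as if the input dictionaries had been concatenated,
    // by offsetting each input's keys; only references to the values are kept,
    // the underlying value data is never copied.
    let mut combined_keys = Vec::new();
    let mut slot_values: Vec<&str> = Vec::new(); // one entry per combined dictionary slot
    for (values, keys) in inputs {
        let offset = slot_values.len();
        combined_keys.extend(keys.iter().map(|k| k + offset));
        slot_values.extend(values.iter().copied());
    }

    // Step 2: compute the new dictionary from the referenced slots only and
    // re-encode the combined keys against it.
    let mut new_values: Vec<String> = Vec::new();
    let mut value_to_new: HashMap<&str, usize> = HashMap::new();
    let new_keys = combined_keys
        .iter()
        .map(|&slot| {
            let value = slot_values[slot];
            *value_to_new.entry(value).or_insert_with(|| {
                new_values.push(value.to_string());
                new_values.len() - 1
            })
        })
        .collect();
    (new_values, new_keys)
}
```

The underlying value data is never copied; only the keys, a vector of value references, and the compacted output dictionary are built, so there is no fully concatenated intermediate that later has to be cleaned up.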

alamb commented Jun 29, 2021

You build the keys array with the current dictionary encodings as if they had been concatenated together (but don't actually concatenate them). Then compute the new dictionary and re-encode the keys with this new dictionary

This sounds like a good plan to me

lquerel commented Dec 8, 2021

Another issue with the existing implementation is the DictionaryKeyOverflowError that is returned in situations where it is not reasonably expected, for example in the following scenario:

  • Let's imagine a dictionary column type is: DataType::Dictionary(Box::new(DataType::UInt8), Box::new(DataType::Utf8))
  • The dictionary represents an enumeration with 10 distinct values.
  • Because the dictionary columns are currently concatenated without deduplication, it becomes very easy to overflow the key type. In this example, concatenating 26 batches (each containing 10 rows, with each row holding a different value of the enum) returns a DictionaryKeyOverflowError (the arithmetic is spelled out in the sketch below).

This issue makes UInt8 dictionary keys unusable in any context where concatenation of batches can take place.
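
For concreteness, a small sketch of the overflow arithmetic described above (illustrative only, not arrow code):

```rust
// Without deduplication every concatenation appends the incoming batch's full
// dictionary, so the combined dictionary grows by 10 entries per batch even
// though only 10 distinct values ever exist.
fn main() {
    let distinct_values_per_batch: u32 = 10;
    let batches: u32 = 26;
    let combined_dictionary_len = distinct_values_per_batch * batches; // 260 entries
    let representable_by_u8_keys = u8::MAX as u32 + 1;                 // 256 keys
    assert!(combined_dictionary_len > representable_by_u8_keys);       // hence the overflow
}
```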

alamb commented Dec 9, 2021

@lquerel -- I agree that deduplication of dictionary keys is required (or at least doing so when it is needed)

tustvold commented Oct 6, 2022

I plan to address this as part of #1523

lquerel commented Sep 5, 2023 via email
