
Improve speed of writing string dictionaries to parquet by skipping a copy (#1764) #2322

Merged
merged 1 commit into apache:master on Aug 5, 2022

Conversation

@tustvold (Contributor) commented Aug 4, 2022

Draft, as this builds on #2136.

Which issue does this PR close?

Closes #1764

Rationale for this change

write_batch primitive/4096 values string dictionary                                                                            
                        time:   [281.80 us 281.91 us 282.03 us]
                        thrpt:  [169.67 MiB/s 169.74 MiB/s 169.81 MiB/s]
                 change:
                        time:   [-11.583% -11.483% -11.395%] (p = 0.00 < 0.05)
                        thrpt:  [+12.861% +12.973% +13.101%]
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

What changes are included in this PR?

This alters the parquet writer to not hydrate dictionaries when writing. There is still a potential optimisation here to memoize the dictionary keys as they are converted, instead of interning the same dictionary key repeatedly, but I need to have a think about how to expose this from ArrayAccessor.
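
As a rough illustration (a minimal sketch, not the PR's actual code, assuming the `TypedDictionaryArray` / `ArrayAccessor` machinery from #2136): rather than casting the dictionary to a flat `StringArray` before encoding, keys are resolved through the dictionary on the fly, so the values are never copied into an intermediate array.

```rust
use arrow::array::{Array, ArrayAccessor, DictionaryArray, StringArray};
use arrow::datatypes::Int32Type;

// Sketch: encode a string dictionary without hydrating it first
fn encode_without_hydrating(dict: &DictionaryArray<Int32Type>) {
    // TypedDictionaryArray implements ArrayAccessor, resolving each key
    // to its value with no intermediate copy of the values buffer
    let typed = dict.downcast_dict::<StringArray>().unwrap();
    for i in 0..typed.len() {
        if typed.is_valid(i) {
            let value: &str = typed.value(i);
            // ... intern `value` into the parquet dictionary page here ...
            let _ = value;
        }
    }
}
```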

Are there any user-facing changes?

No

@github-actions bot added the arrow (Changes to the arrow crate) and parquet (Changes to the parquet crate) labels Aug 4, 2022
@tustvold marked this pull request as ready for review August 5, 2022 10:31
@alamb changed the title from "Don't hydrate string dictionaries when writing to parquet (#1764)" to "Improve speed of writing string dictionaries to parquet by skipping a copy (#1764)" Aug 5, 2022
use crate::{data_type::*, file::writer::SerializedFileWriter};
use levels::{calculate_array_levels, LevelInfo};

mod byte_array;
mod levels;

/// An object-safe API for writing an [`ArrayRef`]
trait ArrayWriter {
@tustvold (Contributor, Author) commented:

I ended up implementing type erasure within the ByteArrayWriter, and so this indirection can be removed
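
A rough sketch of what "type erasure within the ByteArrayWriter" can mean (hypothetical shape, not the PR's actual code): the writer owns a boxed encoding function chosen at construction time, so no separate object-safe trait indirection is needed.

```rust
use arrow::array::ArrayRef;

// Hypothetical sketch: the concrete writer holds erased encoding logic,
// removing the need for an object-safe ArrayWriter trait.
struct ByteArrayWriter {
    // Chosen once, when the writer is constructed for a given array type
    encode: Box<dyn FnMut(&ArrayRef)>,
}

impl ByteArrayWriter {
    fn write(&mut self, array: &ArrayRef) {
        (self.encode)(array)
    }
}
```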

@alamb (Contributor) left a comment:

Looks good to me -- I am not super familiar with all the structs in this area of the code, but this looks like a beautiful way to use ArrayIter and connect up the existing pieces.

pub struct TypedDictionaryArray<'a, K: ArrowPrimitiveType, V> {
    /// The dictionary array
    dictionary: &'a DictionaryArray<K>,
    /// The values of the dictionary
    values: &'a V,
}

// Manually implement `Clone` to avoid `V: Clone` type constraint
A contributor commented:

it is strange that having a reference to &V would require V: Clone in order to #[derive(Clone)] 🤷

@tustvold (Contributor, Author) replied, referencing an RFC.

The contributor replied:

That RFC doesn't sound quite right (or at least it is overkill, like a 🔨 for swatting a 🪰). All that is needed in this case is to recognize that the struct only uses &V rather than V, rather than a way to provide generic arguments to macros 😱

Also, I am firmly of the belief that adding more generics is not the answer to most of life's problems 🤣 Maybe because my feeble mind can't handle the extra level of indirection.
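
For reference, the manual impl under discussion looks roughly like this (a sketch mirroring what #[derive(Clone)] would expand to, minus the spurious `V: Clone` bound; `&'a` references are Copy regardless of V):

```rust
// Manual Clone for the struct shown above: copying the two references
// requires no bound on V
impl<'a, K: ArrowPrimitiveType, V> Clone for TypedDictionaryArray<'a, K, V> {
    fn clone(&self) -> Self {
        Self {
            dictionary: self.dictionary,
            values: self.values,
        }
    }
}
```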

@@ -143,6 +143,17 @@ pub fn create_random_array(
})
.collect::<Result<Vec<(&str, ArrayRef)>>>()?,
)?),
d @ Dictionary(_, value_type)
if crate::compute::can_cast_types(value_type, d) =>
A contributor commented:

using cast is a neat trick here 👍
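
For context, a minimal sketch of the cast-based approach (hypothetical data; `cast` and `can_cast_types` are the real `arrow::compute` kernels): generate plain values first, then cast them to the requested dictionary type.

```rust
use std::sync::Arc;
use arrow::array::{Array, ArrayRef, StringArray};
use arrow::compute::{can_cast_types, cast};
use arrow::datatypes::DataType;

fn main() -> arrow::error::Result<()> {
    // Plain string values, generated however the data generator pleases
    let values: ArrayRef = Arc::new(StringArray::from(vec!["a", "b", "a", "c"]));

    // The requested output type: Dictionary<Int32, Utf8>
    let dict_type = DataType::Dictionary(
        Box::new(DataType::Int32),
        Box::new(DataType::Utf8),
    );

    // The guard used in the match arm above, then the cast itself
    assert!(can_cast_types(values.data_type(), &dict_type));
    let dict = cast(&values, &dict_type)?;
    assert_eq!(dict.data_type(), &dict_type);
    Ok(())
}
```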

($array:ident, $key:ident, $val:ident, $op:expr $(, $arg:expr)*) => {{
$op($array
.as_any()
.downcast_ref::<DictionaryArray<arrow::datatypes::$key>>()
A contributor commented:

When you say "could be made faster": this effectively creates something that iterates over strings, which means that to encode the column, the arrow dictionary index is used to find a string, which is then used to find the parquet dictionary index, which is then written.

It could potentially be faster if we skipped the string step in the middle and simply computed an arrow dictionary index --> parquet dictionary index mapping up front, then applied that mapping during writing.

(I think you said this in this PR's description, but I am restating it to confirm I understand what is happening)
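
A sketch of that proposed shortcut (hypothetical helper, not code from this PR): intern each distinct value once to build an arrow-key to parquet-key table, then translate the per-row keys with plain lookups.

```rust
// Hypothetical sketch of the follow-up optimisation: one interner call
// per distinct dictionary value rather than one per row
fn remap_keys(
    arrow_values: &[&str],                // distinct values in the arrow dictionary
    arrow_keys: &[i32],                   // one key per row
    intern: &mut dyn FnMut(&str) -> u32,  // returns the parquet dictionary index
) -> Vec<u32> {
    // Arrow key -> parquet key, computed up front
    let mapping: Vec<u32> = arrow_values.iter().copied().map(|v| intern(v)).collect();
    // Each row is then a cheap table lookup; no string comparison per row
    arrow_keys.iter().map(|&k| mapping[k as usize]).collect()
}
```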

@tustvold tustvold merged commit b8fd432 into apache:master Aug 5, 2022
@ursabot commented Aug 5, 2022

Benchmark runs are scheduled for baseline = 6859efa and contender = b8fd432. b8fd432 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Labels
arrow (Changes to the arrow crate), parquet (Changes to the parquet crate), performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Optimized Writing of Arrow Byte Array to Parquet #1764
3 participants