
Comparable Row Format #2593

Merged: 12 commits into apache:master from tustvold:row-format on Sep 10, 2022

Conversation

@tustvold (Contributor) commented Aug 26, 2022

Which issue does this PR close?

Part of #2677

Rationale for this change

I'm trying to improve the performance of the SortPreservingMerge operator in DataFusion, and one of the major bottlenecks is performing multi-column comparisons across RecordBatches. I started work on a JITed comparator, but not only was it rather complicated and hard to validate, the number of unpredictable branches in the generated code left me unsure the performance would justify the complexity.

Taking a step back, with some bit-twiddling it is possible to construct a row format that represents the values in a row in a manner that can be compared with memcmp. This will likely yield the fastest comparator possible, at the cost of the time and memory overheads of assembling the row format.

I still need to run benchmarks, but I'm fairly happy that the conversion code as implemented in this PR avoids bump-allocating and large amounts of dynamic dispatch, so I'm optimistic the performance will be reasonable.

It is possible this might also improve the performance of multi-column sorts.
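
To illustrate the general technique (my sketch, not the encoding implemented in this PR): encoding a signed integer as big-endian with its sign bit flipped makes byte-wise comparison agree with numeric comparison.

/// Sketch: encode an i32 so that comparing the encoded bytes with memcmp
/// (i.e. lexicographic byte order) agrees with comparing the integers.
fn encode_i32(v: i32) -> [u8; 4] {
    // Flipping the sign bit maps i32::MIN..=i32::MAX monotonically onto
    // u32::MIN..=u32::MAX; big-endian byte order then makes lexicographic
    // byte comparison match numeric comparison.
    ((v as u32) ^ 0x8000_0000).to_be_bytes()
}

fn main() {
    assert!(encode_i32(-5) < encode_i32(3));
    assert!(encode_i32(i32::MIN) < encode_i32(-1));
    assert!(encode_i32(0) < encode_i32(i32::MAX));
}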

What changes are included in this PR?

Are there any user-facing changes?

FYI @yjshen @alamb @crepererum

github-actions bot added the arrow (Changes to the arrow crate) label on Aug 26, 2022
@alamb (Contributor) commented Aug 29, 2022

Possibly related: http://wwwlgis.informatik.uni-kl.de/archiv/wwwdvs.informatik.uni-kl.de/courses/DBSREAL/SS2005/Vorlesungsunterlagen/Implementing_Sorting.pdf

See the "Normalized Keys" section in particular.

@alamb (Contributor) left a comment:

looking very cool

arrow/src/row/mod.rs (resolved)
arrow/src/row/mod.rs (resolved)
///
/// A null is encoded as a big endian encoded `0_u32`
///
/// A valid value is encoded as a big endian length, with the most significant bit set, followed
Comment from a contributor:

Likewise there are probably some corner cases here related to NULLS FIRST / NULLS LAST


impl RowBatch {
/// Create a [`RowBatch`] from the provided [`RecordBatch`]
pub fn new(batch: &RecordBatch) -> Self {
Comment from a contributor:

I recommend making this API look like

https://docs.rs/arrow/21.0.0/arrow/compute/kernels/sort/fn.lexsort_to_indices.html

Namely, rather than taking a RecordBatch as input, it would take columns: &[SortColumn], to be consistent with the sorting kernels.
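
For reference, the relevant signatures from the linked arrow 21.0 docs (SortColumn pairs the values with optional SortOptions such as descending and nulls_first):

pub struct SortColumn {
    pub values: ArrayRef,
    pub options: Option<SortOptions>,
}

pub fn lexsort_to_indices(
    columns: &[SortColumn],
    limit: Option<usize>,
) -> Result<UInt32Array>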

Reply from @tustvold (author):

Yeah, I was being lazy; supporting SortOptions is on my radar 👍

let buffer = vec![0_u8; cur_offset];

let mut out = Self {
buffer: buffer.into(),
Comment from a contributor:

I am just curious why this is a Box<[_]> rather than a Vec, given they are Vecs to start with?

Reply from @tustvold (author):

Makes it a bit clearer that it isn't being resized, maybe??

assert_eq!(*out.offsets.last().unwrap(), out.buffer.len());
out.offsets
.windows(2)
.for_each(|w| assert!(w[0] <= w[1], "offsets should be monotonic"));
Comment from a contributor:

Suggested change:
-    .for_each(|w| assert!(w[0] <= w[1], "offsets should be monotonic"));
+    .for_each(|w| assert!(w[0] < w[1], "offsets should be monotonic"));

They should be strictly increasing, right?

@alamb (Contributor) commented Aug 29, 2022

I am not sure if it is possible, but figuring out how to share the same RowFormat machinery with DataFusion -- https://github.com/apache/arrow-datafusion/blob/master/datafusion/row/src/lib.rs -- would be awesome.

Maybe the DataFusion one could wrap this one, or maybe this one could offer some generic framework that we could refactor DataFusion to use. I feel like the APIs are quite similar and with a bit more effort we could avoid having two copies of almost the same code 🤔

@tustvold (author) replied:

I am not sure if it is possible, but figuring how to share the same RowFormat machinery in DataFusion

My plan is to get this working well for SortPreservingMerge and then see about unifying DataFusion onto a single row format. I see no obvious reason this wouldn't be possible.

@yjshen (Member) left a comment:

Looks very cool.

To me, the comparable row format corresponds to the third variant, RawComparable, of the RowLayout enum: https://github.com/apache/arrow-datafusion/blob/master/datafusion/row/src/layout.rs#L74.

Each variant serves a different usage pattern: RawComparable requires extra byte flips or float encoding/decoding, WordAligned supports fast in-place updates, and Compact is memory-efficient for buffering large numbers of records in memory for pipeline breakers.

DuckDB's external sorting post (https://duckdb.org/2021/08/27/external-sorting.html) and its implementation (https://github.com/duckdb/duckdb/blob/master/src/common/sort/radix_sort.cpp#L274) are a nice read.

/// A null is encoded as a big endian encoded `0_u32`
///
/// A valid value is encoded as a big endian length, with the most significant bit set, followed
/// by the byte values
Comment from a member:

The current encoding seems flawed for comparing variable-length attributes, since it compares the length bytes before the content. A longer string therefore always compares greater than a shorter one, which is why the current assertions in test_variable_width pass:

assert!(&"foo" < &"he");
assert!(rows.row(3) > rows.row(1));
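
To spell out the inversion (my sketch of the encoding described in the quoted doc comment, not the PR's actual code):

fn encode(s: &str) -> Vec<u8> {
    // Big-endian length with the most significant bit set, then the bytes.
    let mut out = ((s.len() as u32) | 0x8000_0000).to_be_bytes().to_vec();
    out.extend_from_slice(s.as_bytes());
    out
}

fn main() {
    assert!("foo" < "he");                 // lexicographic string order
    assert!(encode("foo") > encode("he")); // encoded order is inverted:
    // encode("he")  = [0x80, 0, 0, 2, b'h', b'e']
    // encode("foo") = [0x80, 0, 0, 3, b'f', b'o', b'o'] -- length 3 > 2 decides
}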

Reply from @tustvold (author):

Darn, guess I need to do null termination instead

Comment from a contributor:

🤦 It sounded so good when we discussed this. However, this also means you need to escape NUL bytes within the string, otherwise the termination marker can be confused with actual string content.

Reply from @tustvold (author):

Yup - you will need some framing protocol, of which COBS is one possibility.
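
For context, a minimal sketch of standard COBS (Consistent Overhead Byte Stuffing), which removes 0x00 bytes from the payload so a 0x00 terminator becomes unambiguous. This only illustrates the framing idea; whether a given framing also preserves lexicographic order would still need to be verified separately.

/// Standard COBS encoding: each "code" byte gives 1 + the number of
/// non-zero bytes that follow it, so the encoded payload contains no 0x00.
fn cobs_encode(data: &[u8]) -> Vec<u8> {
    let mut out = vec![0u8]; // placeholder for the first code byte
    let mut code_idx = 0;    // position of the current code byte
    let mut code = 1u8;      // 1 + count of non-zero bytes in the current run
    for &b in data {
        if b != 0 {
            out.push(b);
            code += 1;
        }
        if b == 0 || code == 0xFF {
            out[code_idx] = code; // finalize the current run
            code_idx = out.len();
            out.push(0);          // placeholder for the next code byte
            code = 1;
        }
    }
    out[code_idx] = code;
    out.push(0); // terminator: guaranteed absent from the encoded payload
    out
}

fn main() {
    // [0x11, 0x00, 0x11] encodes to [0x02, 0x11, 0x02, 0x11, 0x00]
    assert_eq!(cobs_encode(&[0x11, 0x00, 0x11]), vec![0x02, 0x11, 0x02, 0x11, 0x00]);
}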

tustvold force-pushed the row-format branch 4 times, most recently from 1cca91e to a0acad3, on September 5, 2022
tustvold changed the title from "POC: Comparable Row Format" to "Comparable Row Format" on Sep 5, 2022
tustvold marked this pull request as ready for review on September 5, 2022
@tustvold (author) commented Sep 5, 2022

I think this is now ready for review. I intend to experiment with integration into DataFusion's SortPreservingMerge in the coming days and report back.

@alamb (Contributor) commented Sep 7, 2022

I plan to review this later today

@xudong963 (Member) left a comment:

Could the sort kernel in arrow-rs leverage the row format to improve performance?

@tustvold (author) commented Sep 8, 2022

sort in arrow-rs kernel may leverage the row format to improve performance?

Yes, switching lexsort to a radix sort on this format will likely yield significant returns.
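
Conceptually (a sketch with hypothetical types, not the kernel's actual code): once each row is a memcmp-comparable byte slice, a multi-column sort reduces to sorting byte slices, which is exactly what a radix sort can exploit.

fn lexsort_indices(rows: &[&[u8]]) -> Vec<usize> {
    let mut indices: Vec<usize> = (0..rows.len()).collect();
    // Shown as a comparison sort for brevity; the optimization discussed
    // here is to replace this with a radix sort over the same bytes.
    indices.sort_unstable_by(|&a, &b| rows[a].cmp(rows[b]));
    indices
}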

@tustvold (author) commented Sep 8, 2022

There is currently a bug concerning nulls in primitive dictionaries; I'm working on a fix.

@alamb (Contributor) commented Sep 8, 2022

Yes, switching lexsort to performing a radix sort on this format will likely yield significant returns.

For what it is worth, I think there will likely be cases where the cost of creating the row format (copying bytes around) -- both CPU and memory -- outweighs the sorting performance improvement, so we will have to think carefully about when to use one or the other.

For example, I suspect sorting a single UInt64 array will likely be faster without copying the data into a row format (which for a single column is essentially the same data).

@tustvold (author) commented Sep 8, 2022

I suspect sorting a single UInt64 array will likely work faster without copying the data into a row format (that is the same)

Agreed, we already have a special case for lexsort that falls back on the single-column sort kernel. I do think the performance of the row format will almost always outweigh the costs of the DynComparator approach, which is what is used for multi-column predicates, but we should measure this.
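
For context on the alternative being measured against (a sketch assuming arrow's DynComparator type, a boxed Fn(usize, usize) -> Ordering from the ord module): a multi-column comparison chains one dynamically dispatched comparator per sort column, with a data-dependent branch at every step.

use std::cmp::Ordering;

type DynComparator = Box<dyn Fn(usize, usize) -> Ordering + Send + Sync>;

fn compare_rows(comparators: &[DynComparator], left: usize, right: usize) -> Ordering {
    for cmp in comparators {
        match cmp(left, right) {
            Ordering::Equal => continue, // tie on this column: try the next
            ord => return ord,           // unpredictable branch per column
        }
    }
    Ordering::Equal
}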

@tustvold (author) commented Sep 8, 2022

Urgh... fixing the null handling for dictionaries is now causing other false positives, because LexicographicalComparator is wrong 😭

@alamb (Contributor) left a comment:

This code looks very cool. I have some questions about how the variable-length and dictionary encodings work, but I think this PR is quite close to mergeable.

😓 I also haven't finished reviewing the tests in row/mod.rs, as I found the comparison code hard to read as written. I left a comment about how that might be easier to verify.

Another random thought: we might put the row format behind a feature flag, or maybe this would be a good initial set of code to break out into its own crate (e.g. arrow-row or something)?

/// ## Reconstruction
///
/// Given a schema it would theoretically be possible to reconstruct the columnar data from
/// the row format, however, this is currently not supported. It is recommended that the row
Comment from a contributor:

Suggested change:
-/// the row format, however, this is currently not supported. It is recommended that the row
+/// the row format, however, this is currently not implemented. It is recommended that the row

}
}

/// Convert `cols` into [`Rows`]
Comment from a contributor:

Suggested change:
-/// Convert `cols` into [`Rows`]
+/// Convert [`ArrayRef`]s into [`Rows`]

arrow/src/row/mod.rs (outdated, resolved)
arrow/src/row/mod.rs (resolved)
arrow/src/row/mod.rs (outdated, resolved)

/// A byte array interner that generates normalized keys that are sorted with respect
/// to the interned values, e.g. `intern(a) < intern(b) => a < b`
#[derive(Debug, Default)]
Comment from a contributor:

It would help me review this code if the algorithm could be described, specifically the relationship between keys, values, and Bucket.

I especially find it confusing that keys and values are both interned

)
}
}
}
Comment from a contributor:

I recommend also testing interning the values in a single call as part of this test.

Ok(_) => unreachable!("value already exists"),
Err(idx) => {
let slot = &mut self.slots[idx];
// Must always spill final slot so can always create a value less than
Comment from a contributor:

I don't understand this comment

]
);

assert!(rows.row(3) < rows.row(6));
Comment from a contributor:

This test would be easier for me to verify if it sorted the rows based on values and then validated the new data order.

As in something like:

let sorted_indexes = rows.lex_sort_indices();
assert_eq!(sorted_indexes, vec![0, 3, 4, 1, 2]);

.collect()
}

fn generate_dictionary<K>(
Comment from a contributor:

Can we move these data generator functions into test_util perhaps?

Reply from @tustvold (author):

The "issue" is that the test_util variants are fixed seeds. I would rather keep it simple for now and potentially revisit if we can unify these as a follow up PR

@yjshen (Member) commented Sep 9, 2022

Great to see this happening!

I suggest we move the majority of the code in this PR to the DataFusion repo and keep only the API changes to the arrow sort compute kernel (the visibility changes) in arrow-rs. My suggestion is mainly twofold: we could ease development by iterating in a single repo, DataFusion, instead of depending on separate arrow-rs releases, and we could minimize the confusion of having two row modules in two repos.

After checking the usage of this comparable row format in apache/datafusion#3386, I still think it's valid for us to have three variants of the row format serving different purposes: one for storage efficiency, one for update efficiency, and one for sort efficiency. For example, if we used this comparable format for an aggregation buffer, we would need to repeatedly flip bytes back and forth on each cell update.

@tustvold (author) commented Sep 9, 2022

I suggest we move the majority of the code in this PR to the DataFusion repo

I covered this in the issue, but I want this code to live here so that it can be used within the arrow kernels, in particular lexsort, and benefit use cases outside DataFusion.

still valid to have three variants

I see no issue with DataFusion retaining its own row formats for these purposes. I'm personally unsure that these bit flips add up to a significant overhead, but that's a somewhat ancillary discussion to be had down the line. I want to provide a better story for multi-column operations within arrow-rs; which aspects of that DataFusion chooses to use is TBD 😅

@alamb (Contributor) left a comment:

The comments explaining how Bucket works really help. I think there are some outstanding small items from my perspective, but they could be done as follow-on PRs (or not at all).

Very nice work @tustvold

let to_write = &mut out.buffer[*offset..end_offset];

// Set validity
to_write[0] = 2;
Comment from a contributor:

It is still 2 -- I read the comments as saying the validity byte is 1. Sorry for my ignorance, but I don't understand this particular constant -- can you elaborate?

arrow/src/row/variable.rs (resolved)
arrow/src/row/mod.rs (outdated, resolved)
arrow/src/row/mod.rs (resolved)
/// the process for that bucket.
///
/// The final key will consist of the slot indexes visited incremented by 1,
/// with the final value incremented by 2, followed by a null terminator.
Comment from a contributor:

Ah, the null terminator idea is the key thing I was missing in my initial reading. This is a very clever idea 👍
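
A toy illustration of why the terminator matters (my example, not code from the PR): since every real path byte is incremented to be at least 1, the 0 terminator guarantees that a key which is a strict prefix of another sorts before any extension of it.

fn main() {
    // Path bytes are slot indexes + 1 (so never 0); 0 terminates the key.
    let shorter: &[u8] = &[3, 0];   // key that ends at this slot
    let longer: &[u8] = &[3, 5, 0]; // key that descends past the same slot
    // No real byte can sort below the 0 terminator, so the shorter key
    // compares less than every extension sharing its prefix.
    assert!(shorter < longer);
}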

fn test_interner() {
test_intern_values(&[8, 6, 5, 7]);

let mut values: Vec<_> = (0_u64..2000).collect();
Comment from a contributor:

I wonder if there is any value (as a separate integration test, perhaps) in inserting enough values to create at least two bucket levels (so a bucket has an entry with a child that itself has an entry with a child). Or maybe 2000 is already enough to do this.

@tustvold (author) commented:
The integration failure is unrelated, so I'm going to get this one in. Thank you all for the reviews; if I've missed something, please shout and I'll address it in a follow-on PR 😄

tustvold merged commit a1d24e4 into apache:master on Sep 10, 2022
@ursabot commented Sep 10, 2022

Benchmark runs are scheduled for baseline = 41e0187 and contender = a1d24e4. a1d24e4 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.

Conbench compare run links:
- ec2-t3-xlarge-us-east-2: Skipped ⚠️ (benchmarking of arrow-rs-commits is not supported)
- test-mac-arm: Skipped ⚠️ (benchmarking of arrow-rs-commits is not supported)
- ursa-i9-9960x: Skipped ⚠️ (benchmarking of arrow-rs-commits is not supported)
- ursa-thinkcentre-m75q: Skipped ⚠️ (benchmarking of arrow-rs-commits is not supported)

Supported benchmarks:
- ec2-t3-xlarge-us-east-2: Python, R (runs only benchmarks with cloud = True)
- test-mac-arm: C++, Python, R
- ursa-i9-9960x: Python, R, JavaScript
- ursa-thinkcentre-m75q: C++, Java
