[WEBSITE] Blog posts on multi-column sorting implementation #264

Merged
merged 19 commits on Nov 7, 2022

Contributor

@alamb alamb commented Oct 30, 2022

This Blog post describes the row format introduced in apache/arrow-rs#2593

@github-actions

Thanks for opening a pull request!

Could you open an issue for this pull request on JIRA?
https://issues.apache.org/jira/browse/ARROW

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@alamb alamb requested a review from tustvold October 30, 2022 12:32
Contributor

@tustvold tustvold left a comment

Done a first pass, looking good

│  "Bar"   │ ───────────────▶│ 01  │
└──────────┘                 └─────┘
┌──────────┐                 ┌─────┬─────┐
│"Fabulous"│ ───────────────▶│ 01  │ 02  │
Member

If "Bar" is 01 and "Fabulous" is 01 02, how do you distinguish between both when you encounter a 01 byte?

Contributor Author

@alamb alamb Nov 2, 2022

Since the rows are variable length, each also has a length.

Thus, in this case, since the lengths are different (and the length is stored along with the row), "Bar" ([01]) is shorter and thus sorts before "Fabulous" ([01, 02]).

Perhaps @tustvold can confirm

We should probably make it clearer in the text that the row format includes a length as well

Edit: I was incorrect

Contributor

@tustvold tustvold Nov 4, 2022

We don't store lengths along with the rows. Dictionary keys are stored null-terminated, which is how we are able to distinguish the two.
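
A minimal sketch of why a terminator is enough, not the arrow-rs implementation. It assumes, for illustration only, that the order-preserving key bytes never contain `0x00` and that a second single-byte column follows the key; `encode_key` and `encode_two_columns` are hypothetical helper names.

```rust
/// Append a 0x00 terminator to an order-preserving dictionary key.
/// Assumption for this sketch: key bytes themselves are never 0x00.
pub fn encode_key(key: &[u8]) -> Vec<u8> {
    debug_assert!(key.iter().all(|&b| b != 0x00));
    let mut out = key.to_vec();
    out.push(0x00);
    out
}

/// Build a two-column row: a dictionary key followed by one more byte.
/// `terminate` toggles the terminator so both layouts can be compared.
pub fn encode_two_columns(key: &[u8], second: u8, terminate: bool) -> Vec<u8> {
    let mut row = if terminate {
        encode_key(key)
    } else {
        key.to_vec()
    };
    row.push(second);
    row
}
```

With the terminator, the row ("Bar", 3) encodes as `[01, 00, 03]` and ("Fabulous", 0) as `[01, 02, 00, 00]`, so a plain byte-wise comparison orders "Bar" first. Without it, `[01, 03]` would compare greater than `[01, 02, 00]`, because the second column's byte would be compared against the tail of the longer key.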


One detail we have so far glossed over is how to support ascending and descending sorts (e.g. `ASC` or `DESC` in SQL). The Arrow Rust row format supports these options on a per-column basis by simply inverting the bytes of the encoded representation, except for the initial byte used for nullability encoding.

Similarly, supporting SQL-compatible sorting also requires a format that can specify the order of `NULL`s (before or after all non-`NULL` values). The row format supports this option by optionally encoding nulls as `0xFF` instead of `0x00` on a per-column basis.
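
The two options described above can be sketched for a single `u32` column as follows. This is an illustration under stated assumptions rather than the arrow-rs code: the local `SortOptions` struct and `encode_u32` function are hypothetical, valid values are assumed to use `0x01` as the nullability byte, and nulls are assumed to be padded with zeros.

```rust
/// Per-column ordering options (a local type defined for this sketch).
#[derive(Clone, Copy)]
pub struct SortOptions {
    pub descending: bool,
    pub nulls_first: bool,
}

/// Encode an optional u32 as 5 bytes: one nullability byte, then the value.
pub fn encode_u32(value: Option<u32>, opts: SortOptions) -> [u8; 5] {
    let mut out = [0u8; 5];
    match value {
        None => {
            // 0x00 sorts before the 0x01 valid marker, 0xFF sorts after it.
            out[0] = if opts.nulls_first { 0x00 } else { 0xFF };
        }
        Some(v) => {
            out[0] = 0x01;
            // Big-endian bytes compare in the same order as the integers.
            let mut bytes = v.to_be_bytes();
            if opts.descending {
                // Invert only the value bytes, never the nullability byte.
                for b in bytes.iter_mut() {
                    *b = !*b;
                }
            }
            out[1..].copy_from_slice(&bytes);
        }
    }
    out
}
```

Comparing the resulting arrays byte by byte then yields the requested order: with `descending: true` and `nulls_first: false`, `Some(256)` encodes below `Some(1)`, and `None` encodes above every valid value.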
Member

So you have to escape 00 and FF bytes in the input to make sure they aren't confused with NULLs, right?
Also, do you try to handle floating-point NaNs in a specific way?

Contributor Author

Perhaps @tustvold can weigh in here

Contributor

> So you have to escape 00 and FF bytes in the input to make sure they aren't confused with NULLs, right?

The encoding is designed in such a way that this isn't necessary: at no point is it ambiguous whether a byte is part of a sentinel (e.g. null) or value data.
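
One way such a property can arise, sketched for fixed-width `u8` columns only. The thread does not spell out the mechanism for every type, and variable-length data needs something different, so treat the layout below (a nullability byte followed by exactly one value byte) and the helper names as assumptions for illustration.

```rust
/// Encode each column as two bytes: [nullability byte, value byte].
pub fn encode_u8_columns(cols: &[Option<u8>]) -> Vec<u8> {
    let mut out = Vec::with_capacity(cols.len() * 2);
    for col in cols {
        match col {
            Some(v) => {
                out.push(0x01); // valid marker
                out.push(*v); // raw value, no escaping applied
            }
            None => {
                out.push(0x00); // null marker (nulls first)
                out.push(0x00); // padding keeps the width fixed
            }
        }
    }
    out
}

/// Decode by position: even offsets hold markers, odd offsets hold values.
pub fn decode_u8_columns(bytes: &[u8]) -> Vec<Option<u8>> {
    bytes
        .chunks_exact(2)
        .map(|pair| if pair[0] == 0x01 { Some(pair[1]) } else { None })
        .collect()
}
```

Because a byte's role is fixed by its position, a value of `0x00` or `0xFF` round-trips unchanged and is never mistaken for a null marker.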

> do you try to handle floating-point NaNs in a specific way?

NaNs are ordered according to the IEEE 754 (2008 revision) totalOrder predicate.
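
For readers unfamiliar with that ordering, here is a common bit trick that maps an `f64` to bytes whose byte-wise order matches it. This is a sketch of the general technique and is not claimed to be the exact arrow-rs code; `encode_f64` is a hypothetical name. Rust's standard library exposes the same ordering as `f64::total_cmp`.

```rust
/// Map an f64 to 8 bytes whose lexicographic order matches the
/// IEEE 754 totalOrder predicate.
pub fn encode_f64(v: f64) -> [u8; 8] {
    let bits = v.to_bits();
    // Negative values: flip every bit so larger magnitudes sort lower.
    // Non-negative values: flip only the sign bit so they sort above negatives.
    let mask = if bits >> 63 == 1 { u64::MAX } else { 1u64 << 63 };
    (bits ^ mask).to_be_bytes()
}
```

Under this mapping, negative NaNs sort below negative infinity, `-0.0` sorts below `+0.0`, and positive NaNs sort above positive infinity.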

Contributor Author

I added a sentence explaining this design in 7f89c31

alamb and others added 2 commits November 2, 2022 14:55
Co-authored-by: Paddy Horan <5733408+paddyhoran@users.noreply.github.com>
Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>
Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>
alamb and others added 3 commits November 4, 2022 11:36
Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>
Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>
Contributor Author

@alamb alamb left a comment

OK, I think this blog post is basically ready to go from my perspective. I'll aim to publish on Monday, Nov 7, unless people have other comments.

alamb and others added 2 commits November 5, 2022 05:48
Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>
Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>
@alamb alamb merged commit 4920b06 into apache:master Nov 7, 2022
@alamb alamb deleted the alamb/multi-column-sorts-part-1 branch November 7, 2022 12:18