Suggestion for stack size when calling par_sort_unstable #1146

Open
vigna opened this issue Mar 20, 2024 · 5 comments

vigna commented Mar 20, 2024

Is there any standard suggestion for the stack size when calling par_sort_unstable? We're sorting dozens of billions of elements and we would like to provide the user with some sound default.
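
For reference, rayon lets you set the worker-thread stack size when building the pool via ThreadPoolBuilder::stack_size, and then run the sort inside that pool. A minimal sketch (the 8 MiB value is only an example, not a recommendation):

use rayon::prelude::*;

fn main() {
    // Example only: give the worker threads an 8 MiB stack instead of the
    // default, then run the parallel sort inside that pool.
    let pool = rayon::ThreadPoolBuilder::new()
        .stack_size(8 * 1024 * 1024)
        .build()
        .expect("failed to build thread pool");

    let mut data: Vec<u64> = (0..1_000_000).rev().collect();
    pool.install(|| data.par_sort_unstable());
    assert!(data.windows(2).all(|w| w[0] <= w[1]));
}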


cuviper commented Mar 20, 2024

I don't think there can be a standard suggestion, because it depends on at least the number of elements and the number of threads in the pool. In theory the recursion depth is logarithmic, but work-stealing will lengthen that further.

I wonder if we need to be smarter about MAX_SEQUENTIAL here:

// If both partitions are up to this length, we continue sequentially. This number is as small
// as possible but so that the overhead of Rayon's task scheduling is still negligible.
const MAX_SEQUENTIAL: usize = 2000;

i.e. rather than just a fixed constant, maybe it should also consider num_elem / num_threads plus some additional factor to account for imbalance (a rough sketch follows below). Then it would use a higher limit when you're starting with billions of elements, and you wouldn't get so much stack use from work stealing.
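
Something along these lines (illustrative only, not rayon's actual code; the names MIN_SEQUENTIAL and IMBALANCE_FACTOR and their values are made up here):

// Illustrative sketch: derive the sequential cutoff from the input length
// and the pool width instead of using a fixed constant.
fn sequential_cutoff(num_elems: usize, num_threads: usize) -> usize {
    const MIN_SEQUENTIAL: usize = 2000; // the current fixed constant, kept as a floor
    const IMBALANCE_FACTOR: usize = 8;  // extra tasks so work stealing can balance load
    let target_tasks = num_threads.saturating_mul(IMBALANCE_FACTOR).max(1);
    (num_elems / target_tasks).max(MIN_SEQUENTIAL)
}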


Similarly, the stable sort has two constants, CHUNK_LENGTH while splitting and MAX_SEQUENTIAL while merging:

// The length of initial chunks. This number is as small as possible but so that the overhead
// of Rayon's task scheduling is still negligible.
const CHUNK_LENGTH: usize = 2000;

// Slices whose lengths sum up to this value are merged sequentially. This number is slightly
// larger than `CHUNK_LENGTH`, and the reason is that merging is faster than merge sorting, so
// merging needs a bit coarser granularity in order to hide the overhead of Rayon's task
// scheduling.
const MAX_SEQUENTIAL: usize = 5000;


vigna commented Mar 20, 2024

Well, what if I know the number of elements, their size, and the number of threads in the pool? Just a ballpark.


cuviper commented Mar 20, 2024

I don't know. You'll get a better answer if you run a few experiments yourself.


norabelrose commented Mar 25, 2024

I'm running into the same issue. I've found that at around 5–10 billion elements the default 2 MB stack seems to break, so right now I'm testing a heuristic of max(1, log2(N / 5e9)) * 2 MB (sketched below).
(This is sorting a memory-mapped slice of u64 on a machine with 48 physical cores.)
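
In code, that heuristic is roughly the following (a sketch under the assumptions above; the 5e9 threshold and 2 MiB base come from this comment, not from anything rayon defines):

// Sketch of the heuristic above: stack = max(1, log2(n / 5e9)) * 2 MiB.
fn stack_size_for(n: u64) -> usize {
    const BASE: usize = 2 * 1024 * 1024; // the 2 MiB default
    let factor = (n as f64 / 5e9).log2().max(1.0);
    (factor * BASE as f64) as usize
}

// Then build the pool with it, e.g.:
// rayon::ThreadPoolBuilder::new().stack_size(stack_size_for(n)).build()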

@norabelrose

Update: I actually ran into a bus error (core dumped) when using the logarithmic scaling. Linear scaling would sort of defeat the purpose of memory mapping the data, so I might try square root scaling.
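
For comparison, the square-root variant would only swap the growth function in the sketch above, e.g. let factor = (n as f64 / 5e9).sqrt().max(1.0); again, just an untested sketch.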
