
Add k_smallest_relaxed and variants #925

Open · wants to merge 2 commits into master
Conversation


@adamreichold adamreichold commented Apr 27, 2024

This implements the algorithm described in https://quickwit.io/blog/top-k-complexity, which consumes twice as much memory as the existing k_smallest algorithm but achieves time linear in the number of elements in the input.

I expect this to fail the MSRV job, as select_nth_unstable was stabilized in Rust 1.49 and is therefore not available in 1.43.1. I decided to propose it anyway, either for when an MSRV increase is planned or in case it is deemed sufficiently useful to justify one.
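For context, here is a minimal sketch of the buffered-selection approach the PR implements (the function name and exact structure are my own, not the PR's code): keep a buffer of up to 2k elements, and whenever it fills, use select_nth_unstable to retain only the k smallest.

```rust
// Sketch of the relaxed top-k approach (assumption: illustrative only, not
// the PR's actual implementation). Buffer up to 2k items; when full, an O(k)
// selection moves the k smallest to the front and the rest is discarded.
fn k_smallest_relaxed<T: Ord>(iter: impl IntoIterator<Item = T>, k: usize) -> Vec<T> {
    if k == 0 {
        return Vec::new();
    }
    let mut buf = Vec::with_capacity(2 * k);
    for item in iter {
        buf.push(item);
        if buf.len() == 2 * k {
            // After this call, buf[..k] holds the k smallest elements seen so far.
            buf.select_nth_unstable(k - 1);
            buf.truncate(k);
        }
    }
    // Final pass: sort the (at most 2k - 1) remaining candidates and keep k.
    buf.sort_unstable();
    buf.truncate(k);
    buf
}
```

Each element is pushed once and touched by a selection at most a constant number of times on average, which is where the linear overall runtime comes from.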


codecov bot commented Apr 27, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.52%. Comparing base (6814180) to head (d9453fa).
Report is 63 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #925      +/-   ##
==========================================
+ Coverage   94.38%   94.52%   +0.13%     
==========================================
  Files          48       48              
  Lines        6665     7045     +380     
==========================================
+ Hits         6291     6659     +368     
- Misses        374      386      +12     



@Philippe-Cholet Philippe-Cholet left a comment


I did not check the docs yet; I would if we decide to eventually add this.
As you mentioned, it breaks the MSRV a little, so it won't be added right now, but the MSRV should become 1.51 soon enough.
Good first iteration. Thanks for adding tests.

I appreciate the idea here, and it sure makes sense to use select_nth_unstable_by, which I was not familiar with.
The complexity sure seems interesting. However, I wonder what the performance difference with the un-relaxed versions is in practice. A comparative benchmark might be valuable; we don't have one yet, so I think it might help if you want to add one.

I note that we would have 12 methods... k_(smallest|largest)[_relaxed][_by[_key]] which is a lot!

src/k_smallest.rs (resolved)
src/k_smallest.rs (outdated, resolved)
src/lib.rs (outdated, resolved)

adamreichold commented Apr 27, 2024

I note that we would have 12 methods... k_(smallest|largest)[_relaxed][_by[_key]] which is a lot!

I agree, and if pressed to make a choice, I would argue that the relaxed variants would be a reasonable general choice, i.e. they could completely replace the heap-based versions.

For one, k is usually bounded by small fixed numbers, and I do not really see a lot of situations where I have the memory to compute the top 100 but not enough to invest in the top 200. And since the implementation can work directly with select_nth_unstable_by, the complications of having a custom heap implementation could be avoided.

However, making this choice does not appear obviously better than having twelve methods.

Alternatively, the combinatorics could be reined in by adding a "strategy-like" type parameter, e.g. Strict and Relaxed, to indicate the desired memory bound.

Finally, I am not sure the k_largest* methods really carry their weight now that the k_smallest_by and k_smallest_by_key methods exist: in the applications of these methods that I am aware of, adapting the way items are sorted is required in any case, and hence choosing ascending or descending order does not really increase call-site complexity.
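To illustrate the point about the k_largest* methods, here is a toy stand-in (a plain sort with a signature loosely modeled on k_smallest_by; hypothetical, not itertools' implementation): with a `_by` variant, "k largest" is simply "k smallest" under the reversed comparator.

```rust
use std::cmp::Ordering;

// Hypothetical stand-in for a k_smallest_by-style API (plain sort, O(n log n);
// only the call-site shape matters here, not the algorithm).
fn k_smallest_by<T>(
    items: impl IntoIterator<Item = T>,
    k: usize,
    mut cmp: impl FnMut(&T, &T) -> Ordering,
) -> Vec<T> {
    let mut v: Vec<T> = items.into_iter().collect();
    v.sort_by(&mut cmp);
    v.truncate(k);
    v
}
```

At the call site, descending order is just a flipped comparator, e.g. `k_smallest_by(items, k, |a, b| b.cmp(a))` yields the k largest elements, which is the argument that dedicated k_largest* methods add little.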


Philippe-Cholet commented Apr 27, 2024

@phimuemue @jswrenn I don't know if you would agree with the current PR, but having 12 similar methods seems a bit much to me. The k_smallest variants are in itertools 0.13.0, which is not released yet, so we could still change some things without breaking and avoid the 12 methods. We could have

iter.top_k(k).smallest() // redirected from the current k_smallest (that we could deprecate)
iter.top_k(k).smallest_by(f)
iter.top_k(k).smallest_by_key(key)
iter.top_k(k).largest()
iter.top_k(k).largest_by(f)
iter.top_k(k).largest_by_key(key)
And once MSRV >= 1.49, this PR would introduce:
iter.top_k(k).relaxed().smallest()
iter.top_k(k).relaxed().smallest_by(f)
iter.top_k(k).relaxed().smallest_by_key(key)
iter.top_k(k).relaxed().largest()
iter.top_k(k).relaxed().largest_by(f)
iter.top_k(k).relaxed().largest_by_key(key)
Implementation draft:
trait Itertools: Iterator {
    fn top_k(self, k: usize) -> Top<Self> { ... }
    #[deprecated(...)]
    fn k_smallest(self, k: usize) -> ... { self.top_k(k).smallest() }
}

pub struct GeneralTop<I, A> {
    iter: I,
    k: usize,
    marker: PhantomData<A>,
}
pub type Top<I> = GeneralTop<I, Unrelaxed>;
pub type TopRelaxed<I> = GeneralTop<I, Relaxed>;

impl<I: Iterator> Top<I> {
    pub fn relaxed(self) -> TopRelaxed<I> { ... }
}
impl<I: Iterator, A> GeneralTop<I, A> {
    pub fn smallest(self) -> ...
    pub fn smallest_by<F: FnMut...>(self, f: F) -> ...
    pub fn smallest_by_key<F: FnMut...>(self, key: F) -> ...
    pub fn largest(self) -> ...
    pub fn largest_by<F: FnMut...>(self, f: F) -> ...
    pub fn largest_by_key<F: FnMut...>(self, key: F) -> ...
}

The relaxed version would be added once the MSRV is high enough.

"Minimal allocation" vs "linear complexity" are both valuable tradeoffs here, but maybe it would be nice not to multiply methods.

EDIT: There are also the alternatives @adamreichold described above.
We could also temporarily withdraw the k_smallest variants from the next release and add them back together with the relaxed versions. That might be better if we want the relaxed versions to be the default ones and the heap-based versions the alternative.

iter.top_k(k) // relaxed
iter.top_k(k).min_alloc() // the currently pushed versions
// both would have the 6 methods  (smallest|largest)[_by[_key]]
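The strategy-marker idea sketched in the draft above can be made concrete. Here is a minimal compilable version (type and field names follow the draft; the method bodies are my own guesses, not the actual proposal): the memory/time tradeoff is selected by a zero-sized marker type parameter, so no runtime state is added.

```rust
use std::marker::PhantomData;

// Zero-sized strategy markers selecting the memory/time tradeoff at the type level.
struct Unrelaxed;
struct Relaxed;

// Generic adapter holding the iterator and k; `A` is the strategy marker.
struct GeneralTop<I, A> {
    iter: I,
    k: usize,
    marker: PhantomData<A>,
}

type Top<I> = GeneralTop<I, Unrelaxed>;
type TopRelaxed<I> = GeneralTop<I, Relaxed>;

impl<I: Iterator> Top<I> {
    // Switching strategies only rewrites the marker; iterator and k carry over.
    fn relaxed(self) -> TopRelaxed<I> {
        GeneralTop { iter: self.iter, k: self.k, marker: PhantomData }
    }
}
```

The `smallest`/`largest` family would then be implemented once on `GeneralTop<I, A>`, with the marker deciding which algorithm runs, which is one way to avoid twelve top-level methods.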

@adamreichold

Looking at the results of the benchmark pushed as part of the second commit, I would summarize them as: no difference in the ascending/best case, a small but consistent lead in the random/average case, and a large improvement in the descending/worst case.

[benchmark plots attached in the original comment]


phimuemue commented Apr 27, 2024

Hi there, nice algorithm. I'm unsure about what's best to do. And the algorithm raises the question of the scaling factor: could a capacity of 3k or 4k be even more performant?

As for the 12 methods: I think that, if we decide to keep all of them, we should just offer 12 methods right on Itertools. I still feel the same about our grouping_map thing: it adds an indirection and, from what I've seen so far, does not help composability or reusability (see our attempts to extend it to custom hashers). Sure, it occupies lots of space on docs.rs, but I do not see a real problem with that.

Sidenote: What's the unstable/stable behavior of the current implementation?

@adamreichold

Sidenote: What's the unstable/stable behavior of the current implementation?

It is unstable, at least because it calls into the unstable selection/sorting routines from std.

@adamreichold

And the algorithm begs the question of the scaling factor: Could a capacity of 3k or 4k even be more performant?

This was raised in the original issue, and I think I agree with the assessment there: further increasing the capacity is not too interesting, as it will only change the constant factor attached to the O(n) part of the runtime, i.e. I don't think it is worth complicating the API by exposing this choice to the calling code.
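For intuition, a back-of-envelope model of why the scaling factor only moves a constant (my own rough model, not from the PR or the issue): with a buffer of c·k, a selection pass costing on the order of c·k runs once every (c−1)·k newly pushed elements, so the amortized selection work per element is roughly c/(c−1).

```rust
// Rough amortized-cost model (assumption, not a measurement): an O(c*k)
// selection is triggered once every (c-1)*k pushes, charging about
// c / (c - 1) units of selection work to each element.
fn amortized_factor(c: f64) -> f64 {
    assert!(c > 1.0, "buffer must be strictly larger than k");
    c / (c - 1.0)
}
```

This gives 2 for c = 2, 1.5 for c = 3, and about 1.33 for c = 4: diminishing returns that come at the cost of proportionally more memory, which supports not exposing the factor in the API.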

@Philippe-Cholet

Ok about having 12 methods. The scaling factor 2 is fine.

TODO:

  • Close this OR add this as an alternative OR replace our heap-based versions with it.
    Personally, I would choose to keep this as an alternative, as both tradeoffs are valuable enough?
  • A bit unsure about the name: change "relaxed"?
    OR should we make the relaxed versions the default and give the heap-based ones the longer name instead? It might be a better choice perf-wise, but doing so would be a breaking change IMHO.
  • Review docs.
  • Wait a bit for MSRV >= 1.49 to merge this.

@adamreichold

A bit unsure about the name: change "relaxed"?

Note that I am very much not invested in that name; I made it up on the spot when I had to name the methods somehow to avoid conflicting with the existing ones. I think the only requirement is that the name should reflect the space-time trade-off somehow.


@Philippe-Cholet Philippe-Cholet left a comment


I've been thinking about changing "relaxed", but the only alternative I found is "linear", which I find somewhat worse (comparing the docs is necessary to understand either way), so I'm fine with "relaxed".
I finally reviewed the docs (one small change to make).
Then I would be fine with merging this once our MSRV is high enough (probably later this year).

src/lib.rs (outdated, resolved)
This implements the algorithm described in [1], which consumes twice as much
memory as the existing `k_smallest` algorithm but achieves time linear in the
number of elements in the input.

[1] https://quickwit.io/blog/top-k-complexity

@Philippe-Cholet Philippe-Cholet left a comment


Thanks!
Now we wait.
