Add into_grouping_map for efficient group-and-fold operations #465

SkiFire13 · 2020-07-22T13:13:31Z

Adds two functions on the Itertools trait, into_grouping_map and into_grouping_map_by.

into_grouping_map expects an iter of (K, V) where the K will be used as key and V as value.
into_grouping_map_by expect only an iter of V as values, the keys will be calculated using the provided functions.

Both of them return a GroupingMap, which is just a wrapper on an iterator. Since it introduces a lot of related methods I thought it would be better to separate them from the Itertools trait. This also prevents duplicating every method for the _by version.

All of these functions have in common the fact they perform efficient group-and-fold operations without allocating temporary vecs like you would normally do if you used into_group_map + into_iter + map + collect::<HashMap<_, _>>().

Here's the possible issues I can see, I would like to hear some feedback before trying to fix any of them:

name: I initially though of grouping_by which is the java/kotlin equivalent, but later changed it to into_grouping_map to match the already existing into_group_map and to differentiate from group_by;
fold_first: the equivalent function in the Itertools trait is fold1 but there's an unstable stdlib function that does the same thing and is called fold_first. I decided to be consistent with the stdlib;
minmax return type is the already existing MinMaxResult, but the NoElements variant is never returned. I didn't want to duplicate that struct thinking it could cause confusion (and if I did what name could I have chosen?);
sum and product: They dont' use the Sum and Product traits but instead require V: Add<V, Output=V> and V: Mul<V, Output=V>. They're pretty much a wrapper around fold_first. I don't really know if they're meaningful or if I should just remove them;
almost every closure takes as one of the parameters a reference to the key. It bloats them a bit, but could be useful if computing the key is a relatively expensive operation.
no scan function. Even though it is an iterator adapter I could sort of "collect" it into a Vec (more like extend) but I don't really see an use for this;
no integration tests for the _by and _by_key versions of min, max and minmax. To be fair I was a bit lazy, but I also couldn't find any integration test for the normal minmax_by and minmax_by_key so I though it was fine; added;
no benchmark (I don't think it is required but probably would be cool to compare this with into_group_map;
I didn't want to touch the rest of the library, but I guess we could implement into_group_map in terms of into_grouping_map?

Related issues: #457, #309
Related PR: #406 (in particular relates to the into_group_map_by_fold function that was proposed but later removed)

…g_map

src/grouping_map.rs

phimuemue · 2020-07-22T18:19:52Z

src/grouping_map.rs

+            let acc = destination_map.remove(&key);
+            if let Some(op_res) = operation(acc, &key, val) {
+                destination_map.insert(key, op_res);
+            }


Do we really need to remove and insert back into destination_map? Couldn't we facilitate the entry-API?

Using FnMut(Option<R>, ...) forces the caller to deal with Option, i.e. the caller has to deal with Some/None, essentially checking the same thing that is already checked in aggregate in terms of existence in destination_map. It may be ok, but we could avoid these checks in the callers by having two functions (one for the None-case, one for the Some-case).

Do we really need to remove and insert back into destination_map? Couldn't we facilitate the entry-API?

remove and insert are needed if we want to pass an owned accumulator to the closure, like the normal Iterator::fold. The alternative is passing a mutable reference but this is more restricting for the end user.
Maybe I should add a second aggregate function that provides mutable access instead? This would allow a more efficient implementation in some cases, like in GroupingMap::collect (currently it duplicates aggregate's code).

Using FnMut(Option<R>, ...) forces the caller to deal with Option, i.e. the caller has to deal with Some/None, essentially checking the same thing that is already checked in aggregate in terms of existence in destination_map. It may be ok, but we could avoid these checks in the callers by having two functions (one for the None-case, one for the Some-case).

Wouldn't this just move the check from the closure to the aggregate function?
Performance wise it should remain the same.
But what if the caller wants to mutate an external variable in both the case the map contains the value or not? It can't do that with 2 closures.

I was thinking about this, in particular since issue #467. I made a couple of basic benchmarks and the diffeference in speed is not negligible. For example on a simple collect of 1_000_000 i32 grouped by their reminder by 10 or 10_000, the entry version performed 2x better! But at what cost?

I had to add another closure for the init.

The init closure can't take a reference to the key as parameter because Entry::or_insert_with_key is unstable. Not sure but there may be a way to do it anyway.

I had to remove the key parameter from the other closure as well

I had to pass a mutable reference to the accumulator instead of an owned value because the Entry api gives me only that.

The problem is that the current approach move a lot the accumulator and makes two lookup in the map, one for remove and one for insert. Moving the accumulator is required if we don't want to restrict the API, so I wondered if I could use the OccupiedEntry and VacantEntry API to make only one lookup, but unfortunately currently it's impossible. If OccupiedEntry had a way to remove the entry, returning the value, but also returning the just freed VacantEntry it would be perfect, but there's no such API. I tried to imitate it using unsafe and MaybeUninit and this is the result:

https://gist.github.com/SkiFire13/12a61302e35ffe52c0337e0959fb15f6

(Edit: updated the gist with a version that uses less unsafe code and also reduces by half the number of writes to the map in case of an OccupiedEntry)

On the previous example it runs 1.6x faster than the current approach but 1.3 slower than the normal entry. On another example (count the number of occurences of a bunch of i32s) it performed 2.5x the current approach and the same as the normal entry. This could actually be used to unify the implementation with #468, however I really don't like that I had to use unsafe for this. There's also the problem that MaybeUninit wasn't stable in rust 1.32.0

Or maybe should I just add an aggregate_mut/aggregate_update (has anyone better names?) that provides a restricted but faster API using entry?

phimuemue · 2020-07-22T20:34:16Z

* `minmax` return type is the already existing `MinMaxResult`, but the `NoElements` variant is never returned. I didn't want to duplicate that struct thinking it could cause confusion (and if I did what name could I have chosen?);

I think you came to a sensible decision, even if it means that users have to consider the impossible NoElements if they match on the result.

I think that ideally minmax would return Option<MinMaxResult>, and MinMaxResult would not have the NoElements variant. This would clearly be a breaking change; I am not sure what the best option is here. @jswrenn @bluss Do we have precedence for breaking changes? Could we deprecate MinMaxResult in favor of a new MinMax (which has no empty-case)?

* no integration tests for the `_by` and `_by_key` versions of `min`, `max` and `minmax`.

In general, I think we should quickcheck-test that the new functions are equivalent to into_group_map with the equivalents done on the vectors manually.

* I didn't want to touch the rest of the library, but I guess we could implement `into_group_map` in terms of `into_grouping_map`?

If we decide to accept this, I would second this.

SkiFire13 · 2020-07-22T20:47:19Z

I think that ideally minmax would return Option<MinMaxResult>, and MinMaxResult would not have the NoElements variant. This would clearly be a breaking change; I am not sure what the best option is here. @jswrenn @bluss Do we have precedence for breaking changes? Could we deprecate MinMaxResult in favor of a new MinMax (which has no empty-case)?

Another idea I had was to add a new MinMax and provide methods to convert MinMaxResult<T> into Option<MinMax<T>> and MinMax<T> into MinMaxResult<T>. This would remove the need for the extra match but would introduce a bit of duplication. Bonus point: MinMax wouldn't need an into_option, it can be directly converted to a tuple.

In general, I think we should quickcheck-test that the new functions are equivalent to into_group_map with the equivalents done on the vectors manually.

Nice idea, I'm definitely gonna add this.

SkiFire13 · 2020-07-23T12:38:19Z

I was wondering in the issues section and notice #358 which made me realize minmax requires just PartialOrd. I want to point out that all the implementations of min, max and minmax for GroupingMap in this PR require Ord, just like the stdlib does for Iterator::{min, max}. This is inconsistent with the normal minmax but consistent within itself, and won't need a breaking change if normal minmax will ever be fixed.

phimuemue · 2020-08-10T19:01:17Z

#467 brought up this: Is the key==value situation (instead of having explicit key-value-pairs) prevalent enough to either make it the default, or at least support it in a convenient manner?

(As of now, we either have to specify a (K,V)-iterator or provide a function to compute the key.)

jswrenn · 2020-08-10T22:45:40Z

Sorry I haven't yet had time to thoroughly review all this. :(

Do we have precedence for breaking changes? Could we deprecate MinMaxResult in favor of a new MinMax (which has no empty-case)?

@phimuemue Itertools has deprecated/changed/removed plenty of things in the past. I'm fine with making breaking changes, especially if also we take the opportunity to finally remove some long-deprecated methods that conflict with Iterator. My impression is that itertools types are seldom in the public API of crates, so version bumps tend to rarely cause issues for end-users.

Adds two functions on the Itertools trait, into_grouping_map and into_grouping_map_by.

into_grouping_map expects an iter of (K, V) where the K will be used as key and V as value.
into_grouping_map_by expect only an iter of V as values, the keys will be calculated using the provided functions.

Both of them return a GroupingMap, which is just a wrapper on an iterator. Since it introduces a lot of related methods I thought it would be better to separate them from the Itertools trait. This also prevents duplicating every method for the _by version.

Possible alternative: defining a GroupingMap trait (or PairItertools trait?) for these methods. I don't yet have an opinion either way, but I'd like to make sure we consider this alternative.

sollyucko · 2020-08-11T01:00:40Z

Maybe change the GroupingMap methods to be more clear about the fact that they operate on the values for each key separately (perhaps ..._each)?

SkiFire13 · 2020-08-11T19:25:15Z

#467 brought up this: Is the key==value situation (instead of having explicit key-value-pairs) prevalent enough to either make it the default, or at least support it in a convenient manner?

(As of now, we either have to specify a (K,V)-iterator or provide a function to compute the key.)

I don't think this is really the case. It's more like the iterator items are the keys and the values are ingnored (they could be ()). The best solution to that issue using this API would be .map(|v| (v,())).into_grouping_map().count(). It's a bit ugly, mainly because the map is required for non-Copy items. Maybe I could add a shortcut for that .map(|v| (v,())).into_grouping_map()? I'm not sure, it feels weird. I feel like a separate function just for count would be better. In fact I think this API shines better with general folding operations. count (and collect) is a more specialized operation that can have its own optimizations if it was a separate function.

Little side note: the real case key == value is much more complex because of ownership. It would probably require tradeoffs in the API or an entirely separate API which I don't think would be worth it.

Possible alternative: defining a GroupingMap trait (or PairItertools trait?) for these methods. I don't yet have an opinion either way, but I'd like to make sure we consider this alternative.

Coupled with @sollyucko proposals it would be possible, but I see 2 problems:

we're back to using map as replacement for into_grouping_map_by.
even with more clearer method names it would feel confusing. Having a function like into_grouping_map_by that tells me we're entering a separate context is more explicit and less confusing. (At least to me, I'm probably biased)

Ps: I'm currently on vacation without my pc so I can't write any code. I'll think about all of this better when I'll get back at home.

Edit: I forgot to say that @sollyucko proposal by itself makes sense, but I personally can't decide between that and the current approach. Other opinions are welcome :)

…ertools::counts in rust-itertools#468

jswrenn · 2020-12-14T19:01:49Z

@SkiFire13 sorry for the delay on this. Do you feel like it's ready to merge? If so, I'll merge it!

SkiFire13 · 2020-12-15T09:17:08Z

There's still #465 (comment) which is a performance issue. I haven't decided yet whether:

Keep it as it is. Pros: simple, safe. Cons: slower than it could be
Add an aggregate variant which takes a Option<&mut R> instead of Option<R>. Pros: fast, safe. Cons: complex user interface, more limited than it could be
Use unsafe (example). Pros: simple interface, fast. Cons: potentially unsound (someone should definitely double-check if I messed up somewhere.

Note that if we go for the 1st or 3rd options we could replace them with OccupiedEntry::replace_entry_with from hashbrown when it will land on stable (currently it hasn't been proposed yet, not even in nightly, so it will be a long time until it's available in the MSVR)

So the final question here is: do you want to discard performance, complexity in user interface or no-unsafe? I feel like this is a pretty big question that needs to be discussed together before merging this PR.

jswrenn · 2020-12-18T18:22:26Z

Of those three options, I'm most inclined to keep things as-is and merge. Its performance issues can be eventually be resolved with replace_entry_with .

A fourth option: add an R: Default bound and create a temporary value to swap into the entry; e.g.:

    pub fn aggregate<FO, R>(self, mut operation: FO) -> HashMap<K, R>
    where
        R: Default,
        FO: FnMut(Option<R>, &K, V) -> Option<R>,
    {
        let mut destination_map = HashMap::new();

        for (key, val) in self.iter {
            let entry = destination_map.entry(key);
            match entry {
                Entry::Occupied(mut entry) => {
                    let acc = entry.insert(R::default());
                    if let Some(op_res) = operation(Some(acc), entry.key(), val) {
                        entry.insert(op_res);
                    }
                },
                Entry::Vacant(entry) => {
                    if let Some(op_res) = operation(None, entry.key(), val) {
                        entry.insert(op_res);
                    }
                },
            }
        }

        destination_map
    }

SkiFire13 · 2020-12-18T19:33:51Z

I feel like adding a R: Default bound would be too limiting. For example you wouldn't be able to use fold_first if the Item is a reference.

jswrenn · 2020-12-20T15:24:52Z

Great, let's merge this as-is, then.

bors r+

bors · 2020-12-20T15:27:55Z

Build succeeded:

bors build finished

SkiFire13 added 30 commits July 21, 2020 15:40

Added basic GroupingMap functionality

3c95392

Added GroupingMap doc

4a66faf

Added into_grouping_map function on the Itertools trait

2ae7a88

Remove useless mut modifier

abd3081

Added doc for GroupingMap::aggregate

d261067

Improved doc for GroupingMap::aggregate

c233eb7

Added doc for GroupingMap::fold

41125e9

Added doc for GroupingMap::fold_first

ff45dc2

Added doc for GroupingMap::collect

98b15bc

Added doc for GroupingMap::count

5333a61

Improved doc for GroupingMap::collect

4ee1eb2

Added doc for Itertools::into_grouping_map

3ca1441

Adjusted GroupingMap doc to be in accord with Itertools::into_groupin…

adc3edf

…g_map

Fixed formatting

47eefb6

Improved doc and its formatting for Itertools::into_grouping_map

c502930

Merge branch 'master' of https://github.com/SkiFire13/itertools

f3a8cc5

Give more priority to GroupingMap::aggregate not being suggested

a4cd90e

Fixed typo in GroupingMap::aggregate doc

e8c7f0b

Added doctest for GroupingMap::aggregate

702b1fb

Added doctest for GroupingMap::fold

d5fa93d

Improved doctest for GroupingMap::aggregate

b1db219

Improved doctest for GroupingMap::aggregate

315f4d0

Improved doctest for GroupingMap::fold

3c93bb5

Added doctest for GroupingMap::fold_first

1a7b7da

Added doctest for GroupingMap::collect

12d8841

Added doctest for GroupingMap::count

47d692a

Added Itertools::into_grouping_map_by

260c9d7

Migrated GroupingMap doctests to use Itertools::into_grouping_map_by

a151d8c

Formated where-clauses to be consistent with the rest of the repo

ec8f660

Removed unwanted doc(hidden)

03857f7

phimuemue reviewed Jul 22, 2020

View reviewed changes

src/grouping_map.rs Show resolved Hide resolved

phimuemue reviewed Jul 22, 2020

View reviewed changes

Rewrite MapForGrouping Iterator::next implementation using Option::map

4d7a2d0

SkiFire13 added 2 commits July 22, 2020 22:49

Remove useless test (same as correct_grouping_map_by_collect_modulo_key)

8622626

Added into_group_map-based checks and _by and _by_key variant quicktests

6ea43b8

jswrenn added this to the next milestone Aug 3, 2020

SkiFire13 added 2 commits August 4, 2020 10:38

Merge upstream

4b2a73e

Added Clone and Debug implementations for MapForGrouping and GroupingMap

c5bf4a1

phimuemue mentioned this pull request Aug 10, 2020

New function: Count number of occurences of each item in an iterator #467

Closed

SkiFire13 added 3 commits August 20, 2020 21:23

Merge upstream

0e5dc21

Removed GroupingMap::count and related tests, has been replaced by It…

8be9300

…ertools::counts in rust-itertools#468

Merged upstream

bb5f624

bors bot merged commit 8d27ab5 into rust-itertools:master Dec 20, 2020

SkiFire13 deleted the grouping_map branch January 20, 2021 18:05

dependabot bot mentioned this pull request Mar 9, 2021

Update itertools requirement from 0.9.0 to 0.10.0 hacspec/hacspec#81

Merged

This was referenced Feb 7, 2024

Kotlin like lazy Grouping API #457

Closed

Increase C# group_by #309

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add into_grouping_map for efficient group-and-fold operations #465

Add into_grouping_map for efficient group-and-fold operations #465

SkiFire13 commented Jul 22, 2020 •

edited

phimuemue Jul 22, 2020

phimuemue Jul 22, 2020

SkiFire13 Jul 22, 2020

SkiFire13 Aug 18, 2020 •

edited

phimuemue commented Jul 22, 2020

SkiFire13 commented Jul 22, 2020

SkiFire13 commented Jul 23, 2020

phimuemue commented Aug 10, 2020

jswrenn commented Aug 10, 2020

sollyucko commented Aug 11, 2020

SkiFire13 commented Aug 11, 2020 •

edited

jswrenn commented Dec 14, 2020

SkiFire13 commented Dec 15, 2020

jswrenn commented Dec 18, 2020

SkiFire13 commented Dec 18, 2020

jswrenn commented Dec 20, 2020

bors bot commented Dec 20, 2020

Add into_grouping_map for efficient group-and-fold operations #465

Add into_grouping_map for efficient group-and-fold operations #465

Conversation

SkiFire13 commented Jul 22, 2020 • edited

phimuemue Jul 22, 2020

Choose a reason for hiding this comment

phimuemue Jul 22, 2020

Choose a reason for hiding this comment

SkiFire13 Jul 22, 2020

Choose a reason for hiding this comment

SkiFire13 Aug 18, 2020 • edited

Choose a reason for hiding this comment

phimuemue commented Jul 22, 2020

SkiFire13 commented Jul 22, 2020

SkiFire13 commented Jul 23, 2020

phimuemue commented Aug 10, 2020

jswrenn commented Aug 10, 2020

sollyucko commented Aug 11, 2020

SkiFire13 commented Aug 11, 2020 • edited

jswrenn commented Dec 14, 2020

SkiFire13 commented Dec 15, 2020

jswrenn commented Dec 18, 2020

SkiFire13 commented Dec 18, 2020

jswrenn commented Dec 20, 2020

bors bot commented Dec 20, 2020

SkiFire13 commented Jul 22, 2020 •

edited

SkiFire13 Aug 18, 2020 •

edited

SkiFire13 commented Aug 11, 2020 •

edited