Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add into_grouping_map for efficient group-and-fold operations #465

Merged
merged 66 commits into from Dec 20, 2020

Conversation

SkiFire13
Copy link
Contributor

@SkiFire13 SkiFire13 commented Jul 22, 2020

Adds two functions on the Itertools trait, into_grouping_map and into_grouping_map_by.

into_grouping_map expects an iter of (K, V) where the K will be used as key and V as value.
into_grouping_map_by expect only an iter of V as values, the keys will be calculated using the provided functions.

Both of them return a GroupingMap, which is just a wrapper on an iterator. Since it introduces a lot of related methods I thought it would be better to separate them from the Itertools trait. This also prevents duplicating every method for the _by version.

All of these functions have in common the fact they perform efficient group-and-fold operations without allocating temporary vecs like you would normally do if you used into_group_map + into_iter + map + collect::<HashMap<_, _>>().

Here's the possible issues I can see, I would like to hear some feedback before trying to fix any of them:

  • name: I initially though of grouping_by which is the java/kotlin equivalent, but later changed it to into_grouping_map to match the already existing into_group_map and to differentiate from group_by;
  • fold_first: the equivalent function in the Itertools trait is fold1 but there's an unstable stdlib function that does the same thing and is called fold_first. I decided to be consistent with the stdlib;
  • minmax return type is the already existing MinMaxResult, but the NoElements variant is never returned. I didn't want to duplicate that struct thinking it could cause confusion (and if I did what name could I have chosen?);
  • sum and product: They dont' use the Sum and Product traits but instead require V: Add<V, Output=V> and V: Mul<V, Output=V>. They're pretty much a wrapper around fold_first. I don't really know if they're meaningful or if I should just remove them;
  • almost every closure takes as one of the parameters a reference to the key. It bloats them a bit, but could be useful if computing the key is a relatively expensive operation.
  • no scan function. Even though it is an iterator adapter I could sort of "collect" it into a Vec (more like extend) but I don't really see an use for this;
  • no integration tests for the _by and _by_key versions of min, max and minmax. To be fair I was a bit lazy, but I also couldn't find any integration test for the normal minmax_by and minmax_by_key so I though it was fine; added;
  • no benchmark (I don't think it is required but probably would be cool to compare this with into_group_map;
  • I didn't want to touch the rest of the library, but I guess we could implement into_group_map in terms of into_grouping_map?

Related issues: #457, #309
Related PR: #406 (in particular relates to the into_group_map_by_fold function that was proposed but later removed)

SkiFire13 added 30 commits July 21, 2020 15:40
src/grouping_map.rs Outdated Show resolved Hide resolved
Comment on lines +105 to +108
let acc = destination_map.remove(&key);
if let Some(op_res) = operation(acc, &key, val) {
destination_map.insert(key, op_res);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need to remove and insert back into destination_map? Couldn't we facilitate the entry-API?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using FnMut(Option<R>, ...) forces the caller to deal with Option, i.e. the caller has to deal with Some/None, essentially checking the same thing that is already checked in aggregate in terms of existence in destination_map. It may be ok, but we could avoid these checks in the callers by having two functions (one for the None-case, one for the Some-case).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need to remove and insert back into destination_map? Couldn't we facilitate the entry-API?

remove and insert are needed if we want to pass an owned accumulator to the closure, like the normal Iterator::fold. The alternative is passing a mutable reference but this is more restricting for the end user.
Maybe I should add a second aggregate function that provides mutable access instead? This would allow a more efficient implementation in some cases, like in GroupingMap::collect (currently it duplicates aggregate's code).

Using FnMut(Option<R>, ...) forces the caller to deal with Option, i.e. the caller has to deal with Some/None, essentially checking the same thing that is already checked in aggregate in terms of existence in destination_map. It may be ok, but we could avoid these checks in the callers by having two functions (one for the None-case, one for the Some-case).

Wouldn't this just move the check from the closure to the aggregate function?
Performance wise it should remain the same.
But what if the caller wants to mutate an external variable in both the case the map contains the value or not? It can't do that with 2 closures.

Copy link
Contributor Author

@SkiFire13 SkiFire13 Aug 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about this, in particular since issue #467. I made a couple of basic benchmarks and the diffeference in speed is not negligible. For example on a simple collect of 1_000_000 i32 grouped by their reminder by 10 or 10_000, the entry version performed 2x better! But at what cost?

  • I had to add another closure for the init.
  • The init closure can't take a reference to the key as parameter because Entry::or_insert_with_key is unstable. Not sure but there may be a way to do it anyway.
  • I had to remove the key parameter from the other closure as well
  • I had to pass a mutable reference to the accumulator instead of an owned value because the Entry api gives me only that.

The problem is that the current approach move a lot the accumulator and makes two lookup in the map, one for remove and one for insert. Moving the accumulator is required if we don't want to restrict the API, so I wondered if I could use the OccupiedEntry and VacantEntry API to make only one lookup, but unfortunately currently it's impossible. If OccupiedEntry had a way to remove the entry, returning the value, but also returning the just freed VacantEntry it would be perfect, but there's no such API. I tried to imitate it using unsafe and MaybeUninit and this is the result:

https://gist.github.com/SkiFire13/12a61302e35ffe52c0337e0959fb15f6

(Edit: updated the gist with a version that uses less unsafe code and also reduces by half the number of writes to the map in case of an OccupiedEntry)

On the previous example it runs 1.6x faster than the current approach but 1.3 slower than the normal entry. On another example (count the number of occurences of a bunch of i32s) it performed 2.5x the current approach and the same as the normal entry. This could actually be used to unify the implementation with #468, however I really don't like that I had to use unsafe for this. There's also the problem that MaybeUninit wasn't stable in rust 1.32.0

Or maybe should I just add an aggregate_mut/aggregate_update (has anyone better names?) that provides a restricted but faster API using entry?

@phimuemue
Copy link
Member

* `minmax` return type is the already existing `MinMaxResult`, but the `NoElements` variant is never returned. I didn't want to duplicate that struct thinking it could cause confusion (and if I did what name could I have chosen?);

I think you came to a sensible decision, even if it means that users have to consider the impossible NoElements if they match on the result.

I think that ideally minmax would return Option<MinMaxResult>, and MinMaxResult would not have the NoElements variant. This would clearly be a breaking change; I am not sure what the best option is here. @jswrenn @bluss Do we have precedence for breaking changes? Could we deprecate MinMaxResult in favor of a new MinMax (which has no empty-case)?

* no integration tests for the `_by` and `_by_key` versions of `min`, `max` and `minmax`. 

In general, I think we should quickcheck-test that the new functions are equivalent to into_group_map with the equivalents done on the vectors manually.

* I didn't want to touch the rest of the library, but I guess we could implement `into_group_map` in terms of `into_grouping_map`?

If we decide to accept this, I would second this.

@SkiFire13
Copy link
Contributor Author

I think that ideally minmax would return Option<MinMaxResult>, and MinMaxResult would not have the NoElements variant. This would clearly be a breaking change; I am not sure what the best option is here. @jswrenn @bluss Do we have precedence for breaking changes? Could we deprecate MinMaxResult in favor of a new MinMax (which has no empty-case)?

Another idea I had was to add a new MinMax and provide methods to convert MinMaxResult<T> into Option<MinMax<T>> and MinMax<T> into MinMaxResult<T>. This would remove the need for the extra match but would introduce a bit of duplication. Bonus point: MinMax wouldn't need an into_option, it can be directly converted to a tuple.

In general, I think we should quickcheck-test that the new functions are equivalent to into_group_map with the equivalents done on the vectors manually.

Nice idea, I'm definitely gonna add this.

@SkiFire13
Copy link
Contributor Author

I was wondering in the issues section and notice #358 which made me realize minmax requires just PartialOrd. I want to point out that all the implementations of min, max and minmax for GroupingMap in this PR require Ord, just like the stdlib does for Iterator::{min, max}. This is inconsistent with the normal minmax but consistent within itself, and won't need a breaking change if normal minmax will ever be fixed.

@jswrenn jswrenn added this to the next milestone Aug 3, 2020
@phimuemue
Copy link
Member

#467 brought up this: Is the key==value situation (instead of having explicit key-value-pairs) prevalent enough to either make it the default, or at least support it in a convenient manner?

(As of now, we either have to specify a (K,V)-iterator or provide a function to compute the key.)

@jswrenn
Copy link
Member

jswrenn commented Aug 10, 2020

Sorry I haven't yet had time to thoroughly review all this. :(


Do we have precedence for breaking changes? Could we deprecate MinMaxResult in favor of a new MinMax (which has no empty-case)?

@phimuemue Itertools has deprecated/changed/removed plenty of things in the past. I'm fine with making breaking changes, especially if also we take the opportunity to finally remove some long-deprecated methods that conflict with Iterator. My impression is that itertools types are seldom in the public API of crates, so version bumps tend to rarely cause issues for end-users.


Adds two functions on the Itertools trait, into_grouping_map and into_grouping_map_by.

into_grouping_map expects an iter of (K, V) where the K will be used as key and V as value.
into_grouping_map_by expect only an iter of V as values, the keys will be calculated using the provided functions.

Both of them return a GroupingMap, which is just a wrapper on an iterator. Since it introduces a lot of related methods I thought it would be better to separate them from the Itertools trait. This also prevents duplicating every method for the _by version.

Possible alternative: defining a GroupingMap trait (or PairItertools trait?) for these methods. I don't yet have an opinion either way, but I'd like to make sure we consider this alternative.

@sollyucko
Copy link

Maybe change the GroupingMap methods to be more clear about the fact that they operate on the values for each key separately (perhaps ..._each)?

@SkiFire13
Copy link
Contributor Author

SkiFire13 commented Aug 11, 2020

#467 brought up this: Is the key==value situation (instead of having explicit key-value-pairs) prevalent enough to either make it the default, or at least support it in a convenient manner?

(As of now, we either have to specify a (K,V)-iterator or provide a function to compute the key.)

I don't think this is really the case. It's more like the iterator items are the keys and the values are ingnored (they could be ()). The best solution to that issue using this API would be .map(|v| (v,())).into_grouping_map().count(). It's a bit ugly, mainly because the map is required for non-Copy items. Maybe I could add a shortcut for that .map(|v| (v,())).into_grouping_map()? I'm not sure, it feels weird. I feel like a separate function just for count would be better. In fact I think this API shines better with general folding operations. count (and collect) is a more specialized operation that can have its own optimizations if it was a separate function.

Little side note: the real case key == value is much more complex because of ownership. It would probably require tradeoffs in the API or an entirely separate API which I don't think would be worth it.

Possible alternative: defining a GroupingMap trait (or PairItertools trait?) for these methods. I don't yet have an opinion either way, but I'd like to make sure we consider this alternative.

Coupled with @sollyucko proposals it would be possible, but I see 2 problems:

  • we're back to using map as replacement for into_grouping_map_by.
  • even with more clearer method names it would feel confusing. Having a function like into_grouping_map_by that tells me we're entering a separate context is more explicit and less confusing. (At least to me, I'm probably biased)

Ps: I'm currently on vacation without my pc so I can't write any code. I'll think about all of this better when I'll get back at home.

Edit: I forgot to say that @sollyucko proposal by itself makes sense, but I personally can't decide between that and the current approach. Other opinions are welcome :)

@jswrenn
Copy link
Member

jswrenn commented Dec 14, 2020

@SkiFire13 sorry for the delay on this. Do you feel like it's ready to merge? If so, I'll merge it!

@SkiFire13
Copy link
Contributor Author

There's still #465 (comment) which is a performance issue. I haven't decided yet whether:

  • Keep it as it is. Pros: simple, safe. Cons: slower than it could be

  • Add an aggregate variant which takes a Option<&mut R> instead of Option<R>. Pros: fast, safe. Cons: complex user interface, more limited than it could be

  • Use unsafe (example). Pros: simple interface, fast. Cons: potentially unsound (someone should definitely double-check if I messed up somewhere.

Note that if we go for the 1st or 3rd options we could replace them with OccupiedEntry::replace_entry_with from hashbrown when it will land on stable (currently it hasn't been proposed yet, not even in nightly, so it will be a long time until it's available in the MSVR)

So the final question here is: do you want to discard performance, complexity in user interface or no-unsafe? I feel like this is a pretty big question that needs to be discussed together before merging this PR.

@jswrenn
Copy link
Member

jswrenn commented Dec 18, 2020

Of those three options, I'm most inclined to keep things as-is and merge. Its performance issues can be eventually be resolved with replace_entry_with .

A fourth option: add an R: Default bound and create a temporary value to swap into the entry; e.g.:

    pub fn aggregate<FO, R>(self, mut operation: FO) -> HashMap<K, R>
    where
        R: Default,
        FO: FnMut(Option<R>, &K, V) -> Option<R>,
    {
        let mut destination_map = HashMap::new();

        for (key, val) in self.iter {
            let entry = destination_map.entry(key);
            match entry {
                Entry::Occupied(mut entry) => {
                    let acc = entry.insert(R::default());
                    if let Some(op_res) = operation(Some(acc), entry.key(), val) {
                        entry.insert(op_res);
                    }
                },
                Entry::Vacant(entry) => {
                    if let Some(op_res) = operation(None, entry.key(), val) {
                        entry.insert(op_res);
                    }
                },
            }
        }

        destination_map
    }

@SkiFire13
Copy link
Contributor Author

I feel like adding a R: Default bound would be too limiting. For example you wouldn't be able to use fold_first if the Item is a reference.

@jswrenn
Copy link
Member

jswrenn commented Dec 20, 2020

Great, let's merge this as-is, then.

bors r+

@bors
Copy link
Contributor

bors bot commented Dec 20, 2020

Build succeeded:

@bors bors bot merged commit 8d27ab5 into rust-itertools:master Dec 20, 2020
@SkiFire13 SkiFire13 deleted the grouping_map branch January 20, 2021 18:05
This was referenced Feb 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants