Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kotlin like lazy Grouping API #457

Closed
SkiFire13 opened this issue Jul 2, 2020 · 4 comments
Closed

Kotlin like lazy Grouping API #457

SkiFire13 opened this issue Jul 2, 2020 · 4 comments

Comments

@SkiFire13
Copy link
Contributor

I'm opening this to request/discuss a possible Grouping API like Kotlin's Grouping/groupingBy API. See Kotlin's Grouping reference and its source code.

It would provide an easy way to perform efficient group-and-fold operations without allocating additional Vecs (unlike into_group_map), while also maintaning unique keys (unlike group_by). The disadvantage is that it can only support fold-like operations.

The implementation should be pretty simple, I've already written a pretty minimal working sample that includes:

  • a fold function, like a normal fold except it associate each key with the result of folding each element that maps to that key;
  • a fold_first function, like fold but without an initial value, instead the first element acts as the first value;
  • a count function that associate each key with the number of elements that maps to that key;
  • a aggregate function that provides the basic functionality we can use to write all the other functions

https://gist.github.com/SkiFire13/a0010ff658d905dcbc1e1f5f6ae910e0

The downsides of my implementation are:

  • It can only produce an HashMap. This could be solved with a Map trait but we're still waiting for GATs for that. It is required because I don't only extend the map, but also remove elements from it;
  • It doesn't provide a way to specify the hasher (it could but then when/if we will switch to a Map trait then it would break more things);
  • fold's initial value must be Clone but I can't see any way around that;
  • It could be confused with group_by although Kotlin has both groupBy and groupingBy and I haven't seen someone complaining;
  • It pretty much reuses Kotlin's docs and tests but I haven't checked which license the Kotlin stdlib uses;
  • It misses code examples from the docs;
  • fold_first is consistent with the stdlib corrispondent but inconsistent with Itertools' fold1.

I'm also thinking of other possible extensions with functions like:

  • other flavors of fold;
  • max/min/maxmin and friends;
  • sum but I found it would be pretty hard to use the Sum trait since it requires a whole iterator and Grouping can only fold.
@phimuemue
Copy link
Member

Hi there! How easy would it be to express these operations in terms of simpler iterator operations?

From what I've tried, there may be some value in these functions, but wouldn't it be easier (and more in line with existing methods) to generalize into_group_map to into_group_map_by (or similar) so that the caller is not forced to go with the Vecs?

I would imagine something where the caller could customize key, Vec::new, push in the following:

lookup.entry(key).or_insert(Vec::new()).push(val);

Of course, the new method would not require Item=(K, V) since it could compute they key (and, thus, the key type) itself.

@SkiFire13
Copy link
Contributor Author

SkiFire13 commented Jul 13, 2020

Yes, it would be possible. What you're proposing is essentially merging the GroupingByExt::grouping_by and Grouping::aggregate/fold/fold_first methods of my example, which is 100% possible from a technical point of view. This would also save a method call, but would keep the total arguments needed invariant, which could be become confusing with 3 arguments on a single call, 2 of which are closures.

The biggest problem of your approach IMO is that we would have to choose between implementing it like fold or fold_first/fold1. And what about other useful methods like count, max ecc ecc? I don't think we can find suitable names while maintaing them in line with the existing methods. And ever if we found them, we would pollute the Itertools trait with too many methods. We could require the end users to write them themself when needed, which I find non-optimal and also against the crate's idea of providing convenience methods.

However, I agree we could find a name more in line with the existing ones. into_group_map_by could be an option, but if we go for the separate methods way then it would conflict with the expectation that it directly produces a HashMap. I like the grouping_by name because it conveys the meaning that the second operation is done while grouping, and doesn't group right away like one would expect from a function that starts with into_group_.

Edit: formatting

bors bot added a commit that referenced this issue Dec 20, 2020
465: Add into_grouping_map for efficient group-and-fold operations r=jswrenn a=SkiFire13

Adds two functions on the `Itertools` trait, `into_grouping_map` and `into_grouping_map_by`. 

`into_grouping_map` expects an iter of `(K, V)` where the `K` will be used as key and `V` as value. 
`into_grouping_map_by` expect only an iter of `V` as values, the keys will be calculated using the provided functions. 

Both of them return a `GroupingMap`, which is just a wrapper on an iterator. Since it introduces a lot of related methods I thought it would be better to separate them from the `Itertools` trait. This also prevents duplicating every method for the `_by` version.

All of these functions have in common the fact they perform efficient group-and-fold operations without allocating temporary vecs like you would normally do if you used `into_group_map` + `into_iter` + `map` + `collect::<HashMap<_, _>>()`.

Here's the possible issues I can see, I would like to hear some feedback before trying to fix any of them:
- name: I initially though of `grouping_by` which is the java/kotlin equivalent, but later changed it to `into_grouping_map` to match the already existing `into_group_map` and to differentiate from `group_by`;
- `fold_first`: the equivalent function in the `Itertools` trait is `fold1` but there's an unstable stdlib function that does the same thing and is called `fold_first`. I decided to be consistent with the stdlib;
- `minmax` return type is the already existing `MinMaxResult`, but the `NoElements` variant is never returned. I didn't want to duplicate that struct thinking it could cause confusion (and if I did what name could I have chosen?);
- `sum` and `product`: They dont' use the `Sum` and `Product` traits but instead require `V: Add<V, Output=V>` and `V: Mul<V, Output=V>`. They're pretty much a wrapper around `fold_first`. I don't really know if they're meaningful or if I should just remove them;
- almost every closure takes as one of the parameters a reference to the key. It bloats them a bit, but could be useful if computing the key is a relatively expensive operation.
- no `scan` function. Even though it is an iterator adapter I could sort of "collect" it into a `Vec` (more like extend) but I don't really see an use for this;
- ~~no integration tests for the `_by` and `_by_key` versions of `min`, `max` and `minmax`. To be fair I was a bit lazy, but I also couldn't find any integration test for the normal `minmax_by` and `minmax_by_key` so I though it was fine;~~ added;
- no benchmark (I don't think it is required but probably would be cool to compare this with `into_group_map`;
- I didn't want to touch the rest of the library, but I guess we could implement `into_group_map` in terms of `into_grouping_map`?

Related issues: #457, #309
Related PR: #406 (in particular relates to the `into_group_map_by_fold` function that was proposed but later removed)

Co-authored-by: SkiFire13 <skifire13.1@gmail.com>
@Philippe-Cholet
Copy link
Member

@SkiFire13 I see this issue and I'm wondering if there is anything left to do after #465 or if I can close this.

@SkiFire13
Copy link
Contributor Author

I think the APIs introduced in #465 should be enough, if someone needs more they can always open a new issue. Sorry for forgetting to close the issue with that PR.

By the way, I wonder if #309 (the other issue I linked in that PR) is also solved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants