Feature request: add collector accumulator #353

Open
HDembinski opened this issue Apr 6, 2020 · 3 comments
Labels
enhancement New feature or request project idea Could be a fellow project Upstream Best addressed in Boost.Histogram

Comments

@HDembinski
Member

A simple and widely useful accumulator would be the Collector (name to be refined if needed). The accumulator holds a std::vector and appends sample values to the vector. Users should be able to view the contents as a numpy array.

Motivation

The usual accumulators were designed to have a very small state, so that one can have very many of them. This accumulator comes from the other end of the spectrum: it uses the maximum amount of storage to hold all samples that ended up in a certain bin. Having the full sample of values in each bin is useful in a variety of contexts: to do unbinned fits in each bin, to compute the median, or to compute a kernel density estimate.

Technical challenges?

Do we need a custom view again for the accumulator, or maybe just a buffer interface? It would be nice if a collector instance acted like a normal numpy array (at least read-only, possibly read-write), which means it should support slicing, masking, and advanced indexing, and it should be possible to pass it to numpy ufuncs etc.
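To make the idea concrete, here is a minimal pure-Python sketch of such an accumulator. The class name `Collector`, its `view()` method, and the unit-width binning are hypothetical choices for illustration only, not the boost-histogram API. `array('d')` stands in for the `std::vector<double>`, and a `memoryview` stands in for whatever buffer interface would eventually be exposed to numpy.

```python
from array import array
from statistics import median


class Collector:
    """Hypothetical per-bin accumulator that keeps every sample it receives."""

    def __init__(self):
        # Growable buffer of C doubles, standing in for std::vector<double>
        self._data = array("d")

    def __call__(self, x):
        # Called once per fill: append the sample instead of updating a sum
        self._data.append(x)

    def __len__(self):
        return len(self._data)

    def view(self):
        # Zero-copy, read-only view through the buffer protocol
        return memoryview(self._data).toreadonly()


# One collector per bin of a 1D histogram with unit-width bins on [0, 3)
bins = [Collector() for _ in range(3)]

for x in [0.1, 0.5, 2.2, 0.9, 2.8, 2.5]:
    bins[int(x)](x)  # int(x) is the bin index for this simple binning

sizes = [len(b) for b in bins]
medians = [median(b.view().tolist()) if len(b) else None for b in bins]
```

One thing this sketch hints at: in CPython an `array` cannot be resized while a `memoryview` of it is still alive, so a zero-copy view and continued filling do not mix freely. A real implementation would presumably face a similar question about view lifetime.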

@henryiii
Member

I think the return value from this would be exactly an Awkward Array, FYI. @jpivarski

@jpivarski
Member

Do I understand correctly that a Collector for a 2-axis histogram with 2 bins in the first axis and 3 bins in the second axis would collect data like

[
    [1.1, 2.2, 3.3],   # bin (0, 0)
    [],                # bin (0, 1)
    [4.4, 5.5]         # bin (0, 2)
],
[
    [6.6],             # bin (1, 0)
    [7.7, 8.8],        # bin (1, 1)
    [9.9]              # bin (1, 2)
]

That is, the data to be collected (for an unbinned fit, KDE, etc.) are variable-length lists of numbers only—no records or n-tuples of numbers—and that there's one per bin for some regular or sparse binning? If so, then at most what you need is a jagged array. That, by itself, is not too complicated and might fall on the "reimplement" side of the reimplement/dependency trade-off.
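As a rough illustration of how small the "reimplement" side could be, here is a sketch of the offsets-based jagged layout for the example data above. The names `content`, `offsets`, and `samples_in_bin` are made up for this sketch, and the row-major bin ordering is an assumption.

```python
# The example data: shape (2, 3), one variable-length list per bin
nested = [
    [[1.1, 2.2, 3.3], [], [4.4, 5.5]],
    [[6.6], [7.7, 8.8], [9.9]],
]
shape = (2, 3)

# Jagged layout: all samples in one flat buffer, plus one offset per bin edge
content = []
offsets = [0]
for row in nested:
    for samples in row:
        content.extend(samples)
        offsets.append(len(content))


def samples_in_bin(i, j):
    """Return the samples of bin (i, j), assuming row-major bin order."""
    k = i * shape[1] + j
    return content[offsets[k]:offsets[k + 1]]
```

An empty bin simply shows up as two equal consecutive offsets, so no special case is needed for it.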

If the objects in each of these bins are n-tuples of numbers, it's still feasible, but if they get any more general, then you might want to use Awkward Array as a dependency.

For presenting these in Python as sliceable objects, you might want to have a custom implementation in C++ and only wrap them as Awkward Arrays as an optional dependency in Python. That way, you get all the slicing/broadcasting/etc. logic without taking Awkward as a C++ dependency, which Boost.Histogram should not do (one of its selling points is lightweight dependencies, and I don't think it could be included in Boost with a non-Boost dependency, right?).

Another thing to consider: while filling it, it can't be a jagged array implemented with offsets. It needs to be an array of pointers to growable buffers (std::vector) and only later copied into an Awkward Array for slicing. There's a slight possibility that you could implement it as a starts/stops ListArray, preallocated with extra space between each sublist, but you would have to frequently move sublists to keep it defragmented. (It's not clear that constantly moving the sublists around to keep sublists from growing into each other would be better than a single copy at the end, but that's a question of algorithms.)
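The "growable buffers while filling, single copy at the end" option could look roughly like the sketch below. The function names `fill` and `finalize` and the bin count are hypothetical, and `array('d')` again stands in for `std::vector<double>`.

```python
from array import array
from itertools import accumulate

n_bins = 4
# Fill phase: one independent growable buffer per bin
buffers = [array("d") for _ in range(n_bins)]


def fill(bin_index, value):
    # Appending never touches the other bins' storage
    buffers[bin_index].append(value)


for bin_index, value in [(2, 0.5), (0, 1.5), (2, 2.5), (3, 3.5), (0, 4.5)]:
    fill(bin_index, value)


def finalize(buffers):
    """Copy all per-bin buffers once into a flat content array plus offsets."""
    offsets = [0, *accumulate(len(b) for b in buffers)]
    content = array("d")
    for b in buffers:
        content.extend(b)
    return content, offsets


content, offsets = finalize(buffers)
```

The copy in `finalize` is linear in the total number of samples and happens once, whereas an offsets-based array would need to shift everything after the target bin on each fill.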

@henryiii henryiii added the enhancement New feature or request label May 29, 2020
@HDembinski HDembinski self-assigned this Jun 22, 2020
@HDembinski
Member Author

HDembinski commented Jun 22, 2020

I started working on this, because I need it now for an analysis.

We can implement accumulators in boost-histogram that are not in boostorg/histogram, so in principle we are free to add third-party dependencies here. For boostorg/histogram, adding third-party dependencies would be an issue.

I realized that there are two collectors that seem useful. One just keeps a collection of weights per bin, so it would be a variable-length array of doubles in each bin (a std::vector in each cell in C++). That's actually what I need right now. This would be the collector that corresponds to weighted_sum.

The other collector would keep a variable-length array of two doubles, for a weight and a sample. That would be the collector that corresponds to weighted_mean.
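A rough sketch of the two variants, to show how each relates to its small-state counterpart. The class names and `fill` signatures are hypothetical, and the summaries computed at the end (sum of weights, sum of squared weights, weighted mean) are only meant to illustrate that the usual quantities can be recovered from the collected data.

```python
class WeightCollector:
    """Hypothetical collector that keeps every weight seen by a bin."""

    def __init__(self):
        self.weights = []

    def fill(self, weight=1.0):
        self.weights.append(weight)


class WeightedSampleCollector:
    """Hypothetical collector that keeps (weight, sample) pairs per bin."""

    def __init__(self):
        self.pairs = []

    def fill(self, sample, weight=1.0):
        self.pairs.append((weight, sample))


# Weights only: the sum and the sum of squares can be computed afterwards
wc = WeightCollector()
for w in [1.0, 2.0, 0.5]:
    wc.fill(w)
sum_w = sum(wc.weights)
sum_w2 = sum(w * w for w in wc.weights)

# Weight and sample: the weighted mean can be computed afterwards
wsc = WeightedSampleCollector()
wsc.fill(10.0, weight=1.0)
wsc.fill(20.0, weight=3.0)
total_w = sum(w for w, _ in wsc.pairs)
mean = sum(w * x for w, x in wsc.pairs) / total_w
```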

@henryiii henryiii added project idea Could be a fellow project Upstream Best addressed in Boost.Histogram labels Apr 15, 2022

3 participants