Feature request: add collector accumulator #353

HDembinski · 2020-04-06T08:05:19Z

A simple and widely useful accumulator would be the Collector (name to be refined if needed). The accumulator holds a std::vector and appends sample values to the vector. Users should be able to view the contents as a numpy array.

Motivation

The usual accumulators were designed to have a very small state, so that one can have very many of them. This accumulator comes from the other end of the spectrum: it uses the maximum amount of storage to hold all samples that ended up in a given bin. Having the full sample of values in each bin is useful in a variety of contexts, for example to do unbinned fits in each bin, to compute the median, or to compute a kernel density estimate.

Technical challenges?

Do we need a custom view again for the accumulator, or maybe just a buffer interface? It would be nice if a collector instance acted like a normal numpy array (at least read-only, possibly read-write), which means it should support slicing, masking, and advanced indexing, and it should be possible to pass it to numpy ufuncs, etc.

henryiii · 2020-05-27T17:46:08Z

I think the return from this would be exactly an Awkward Array, FYI. @jpivarski

jpivarski · 2020-05-27T18:08:22Z

Do I understand correctly that a Collector for a 2-axis histogram with 2 bins in the first axis and 3 bins in the second axis would collect data like

[
    [1.1, 2.2, 3.3],   # bin (0, 0)
    [],                # bin (0, 1)
    [4.4, 5.5]         # bin (0, 2)
],
[
    [6.6],             # bin (1, 0)
    [7.7, 8.8],        # bin (1, 1)
    [9.9]              # bin (1, 2)
]

That is, the data to be collected (for an unbinned fit, KDE, etc.) are variable-length lists of numbers only—no records or n-tuples of numbers—and that there's one per bin for some regular or sparse binning? If so, then at most what you need is a jagged array. That, by itself, is not too complicated and might fall on the "reimplement" side of the reimplement/dependency trade-off.
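A minimal sketch of such a jagged array for the 2 x 3 example above, assuming a flat `content` buffer plus an `offsets` array and row-major bin ordering. The names `content`, `offsets`, `shape`, and `bin_samples` are hypothetical and are not part of boost-histogram or Awkward Array:

```python
# Hypothetical flat layout for the 2 x 3 example (illustration only).
# All samples are stored back to back in one buffer ...
content = [1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9]
# ... and offsets[k]:offsets[k + 1] delimits the samples of flattened bin k.
offsets = [0, 3, 3, 5, 6, 8, 9]
shape = (2, 3)  # 2 bins on the first axis, 3 bins on the second


def bin_samples(i, j):
    """Return the samples collected in bin (i, j)."""
    k = i * shape[1] + j  # row-major flattening of the bin index
    return content[offsets[k]:offsets[k + 1]]


print(bin_samples(0, 0))  # samples of bin (0, 0)
print(bin_samples(0, 1))  # the empty bin (0, 1)
```

An empty bin is simply two equal consecutive offsets, so no per-bin allocation is needed once the data are in this form.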

If the objects in each of these bins are n-tuples of numbers, it's still feasible, but if they get any more general, then you might want to use Awkward Array as a dependency.

For presenting these in Python as sliceable objects, you might want to have a custom implementation in C++ and only wrap them as Awkward Arrays as an optional dependency in Python. That way, you get all the slicing/broadcasting/etc. logic without taking Awkward as a C++ dependency, which Boost Histogram should not do (one of its selling points is lightweight dependencies, and I don't think it could be included in Boost with a non-Boost dependency, right?).

Another thing to consider: while filling it, it can't be a jagged array implemented with offsets. It needs to be an array of pointers to growable buffers (std::vector) and only later copied into an Awkward Array for slicing. There's a slight possibility that you could implement it as a starts/stops ListArray, preallocated with extra space between each sublist, but you would have to frequently move sublists to keep it defragmented. (It's not clear that constantly moving the sublists around to keep sublists from growing into each other would be better than a single copy at the end, but that's a question of algorithms.)
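The "growable buffers while filling, single copy at the end" option could be sketched as follows. This is only an illustration in plain Python, with lists standing in for std::vector; the helper names `fill` and `pack` are hypothetical:

```python
from itertools import accumulate


def fill(buffers, bin_index, value):
    """Append one sample to the growable buffer of the given flattened bin."""
    buffers[bin_index].append(value)


def pack(buffers):
    """Copy all per-bin buffers once into an offsets/content jagged layout."""
    counts = [len(b) for b in buffers]
    offsets = [0, *accumulate(counts)]  # running total of the bin sizes
    content = [x for b in buffers for x in b]
    return offsets, content


# One independent growable buffer per bin (6 bins for the 2 x 3 example).
buffers = [[] for _ in range(6)]
samples = [(0, 1.1), (2, 4.4), (0, 2.2), (4, 7.7), (3, 6.6),
           (0, 3.3), (2, 5.5), (4, 8.8), (5, 9.9)]
for k, x in samples:
    fill(buffers, k, x)  # samples may arrive in any bin order

offsets, content = pack(buffers)
```

Filling never has to move data that belongs to other bins; the cost is the one extra copy in `pack` before the result can be sliced as a jagged array.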

HDembinski · 2020-06-22T16:13:57Z

I started working on this, because I need it now for an analysis.

We can implement accumulators in boost-histogram that are not in boostorg/histogram, so in principle we are free here to add third party dependencies. For boostorg/histogram, adding third-party dependencies would be an issue.

I realized that there are two collectors which seem useful. One just keeps a collection of weights per bin, so it would be a variable-length array of doubles in each bin (a std::vector in each cell in C++). That's actually what I need right now. This would be the collector that corresponds to the weighted_sum.

The other collector would keep a variable-length array of pairs of doubles, one for a weight and one for a sample. That would be the collector that corresponds to weighted_mean.
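A rough sketch of the two variants in plain Python, for illustration only. The class and method names are hypothetical and this is not the boost-histogram API; it assumes that the summaries one would want to recover are the sum of weights, the sum of squared weights, and the weighted mean of the samples:

```python
class WeightCollector:
    """Hypothetical collector that keeps every weight filled into a bin."""

    def __init__(self):
        self.weights = []

    def fill(self, weight=1.0):
        self.weights.append(weight)

    def value(self):
        # Sum of weights, recomputed from the stored weights.
        return sum(self.weights)

    def variance(self):
        # Sum of squared weights, recomputed from the stored weights.
        return sum(w * w for w in self.weights)


class WeightedSampleCollector:
    """Hypothetical collector that keeps every (weight, sample) pair."""

    def __init__(self):
        self.entries = []

    def fill(self, sample, weight=1.0):
        self.entries.append((weight, sample))

    def mean(self):
        # Weighted mean of the samples, recomputed from the stored pairs.
        total = sum(w for w, _ in self.entries)
        return sum(w * x for w, x in self.entries) / total


wc = WeightCollector()
for w in (1.0, 2.0, 0.5):
    wc.fill(w)

sc = WeightedSampleCollector()
sc.fill(2.0, weight=1.0)
sc.fill(4.0, weight=3.0)
```

Because the raw values are kept, other quantities such as a median or an unbinned fit can be computed later from the same stored data.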

henryiii added the "enhancement" (New feature or request) label May 29, 2020

HDembinski self-assigned this Jun 22, 2020

henryiii added the "project idea" (Could be a fellow project) and "Upstream" (Best addressed in Boost.Histogram) labels Apr 15, 2022
