Adds collector accumulator #378

Draft: wants to merge 4 commits into base: develop


Member

@HDembinski HDembinski commented Jun 22, 2020

  • C++ collector class for weights
  • Python wrapper for collector (halfway)
  • Tests
  • Docs

I don't know how to wrap this in an awkward array as a view. Without awkward, it would most naturally be represented as array((<shape of histogram>), dtype=object)
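For illustration, the object-array fallback could look roughly like the sketch below. It uses plain NumPy rather than the boost-histogram API, and the 2x3 shape and the per-bin Python lists are assumptions made up for the example.

```python
import numpy as np

# Hypothetical 2x3 histogram: each bin holds the list of weights collected in it.
shape = (2, 3)
collected = np.empty(shape, dtype=object)
for index in np.ndindex(shape):
    collected[index] = []  # one independent list per bin

# Simulate filling: append a weight to the bin it falls into.
collected[0, 1].append(0.5)
collected[0, 1].append(1.5)
collected[1, 2].append(2.0)

# Per-bin reductions need a Python-level loop over the object array.
counts = np.array([[len(bin_weights) for bin_weights in row] for row in collected])
```

Every access goes through a Python object, so nothing is vectorized; a view onto the C++ storage would avoid that cost.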

@HDembinski
Member Author

@henryiii To add a new accumulator, I currently have to change the code in many places. Not only do I need to register my accumulator in C++, but I also need to add it explicitly to src/boost_histogram/accumulators.py and src/boost_histogram/cpp/accumulators.py. It would be great to automate this. Adding things should be easy.

@HDembinski
Member Author

@henryiii mypy fails with a false positive. When are we dropping Python 2 support? It is hindering this patch.

@henryiii
Member

You can disable mypy with # type: ignore if you need to.
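A minimal sketch of what that looks like in practice. The function `parse_ints` is hypothetical and only stands in for whatever line mypy complains about; the signature uses a Python 2 compatible type comment.

```python
from typing import List


def parse_ints(items):
    # type: (List[str]) -> List[int]
    # The line above is a Python 2 compatible type comment that mypy reads.
    return [int(item) for item in items]


# Suppose mypy reports a false positive on the next line. The trailing comment
# silences mypy for this one line only and has no effect at runtime.
values = parse_ints(["1", "2"])  # type: ignore
```

The suppression is per line, so the rest of the file is still checked.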

But "Adding things should be easy." - we need to be careful - the procedure is clear and standard - if we automate too much, either with runtime magic (bad) or generation scripts (better), then that introduces more tooling to maintain, more special things unique to this one library only. Are we really planning for that many additions here? We have to recompile anyway, and we don't have this exposed as a public API for external extension modules, so keeping it a little repetitive but simple should benefit us in the long run.

Now if we come up with a way to add custom additions (which should be doable for storages), then we would benefit from a generation tool; that would be a public API and should be designed as such (and then used internally, too).

When are we dropping Python 2 support?

With Version 1.0, probably mid-Summer. However, it is acceptable to leave off some features as Python 3 only.

@HDembinski
Member Author

But "Adding things should be easy." - we need to be careful - the procedure is clear and standard - if we automate too much, either with runtime magic (bad) or generation scripts (better), then that introduces more tooling to maintain, more special things unique to this one library only. Are we really planning for that many additions here? We have to recompile anyway, and we don't have this exposed as a public API for external extension modules, so keeping it a little repetitive but simple should benefit us in the long run.

I can't follow your reasoning. The accumulators are a customization point, perhaps not for users but for us devs. When I add an accumulator, I don't want to manually change the code in several places.

Why not drop Python 2 support now? 1.0 seems arbitrary. It is either dropping Python 2 or I have to rewrite my code for this patch.

@HDembinski
Member Author

If you look into the code, you can see how I automated this.

@HDembinski
Member Author

Any repetition in code is bad, we want to be DRY.

@henryiii
Member

array((<shape of histogram>), dtype=object)

It's slow and ugly, but fine for a first run. We could easily add Awkward support later.

@henryiii
Member

Why not drop Python 2 support now? 1.0 seems arbitrary.

Randomly deciding that a feature patch should cause a major Python compatibility change is arbitrary. I have an outline and plan that has been announced and followed for about a year. Only the timing has been thrown off (mostly by COVID-19 creating an extra month of work for me). We need a roughly feature complete version (1.0), and then we can drop Python 2 support. That way, if we are picked up by experiment stacks that are stuck in Python 2, we can still be used, and we can back port fixes if needed. That's why I've put so much work into the Python 2 porting of a variety of features. If you want to wait until 1.0 is ready to merge this patch, though, that's fine with me. I can also help fix it in the near future.

@henryiii
Member

henryiii commented Jun 23, 2020

Any repetition in code is bad, we want to be DRY.

This is not an absolute rule, just a guiding principle. Also, we really aren't talking about code duplication, but rather the equivalent of definitions - it's a little irritating to list items in multiple places, but it provides static code analysis benefits - not just for MyPy, but also for code completion tools, Sphinx (which can't build the C++ code, so relies on the Python files only), and for human readers of the code. For a simplified and rather bad comparison, this is why from x import * is bad - you can't see where things come from without digging further, but if you list each item instead of *, you can trace down where things come from easily. Not an absolute rule either, but often, for mostly static code, being explicit helps tools and users down the line.

I'm not against code duplication, but I don't like additions that break static analysis.

@@ -1,15 +1,23 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import, division, print_function

from ._core.accumulators import Sum, Mean, WeightedSum, WeightedMean
Member

@henryiii henryiii Jun 23, 2020

This is taking a very simple static list, and doing run time manipulations with a function that has more lines than the code it replaces, breaking static analysis. We are also losing any ability to not follow the specific naming scheme in the future if something different is added.

If we add unit tests for each new type here, they will immediately break if a developer forgets to update this static list.
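A rough sketch of such a guard test. It assumes a stand-in module instead of the real compiled `_core`; `fake_core`, `EXPOSED`, and `missing_from_static_list` are hypothetical names used only for this example.

```python
import types

# Stand-in for the compiled extension module (the real one is built from C++).
fake_core = types.ModuleType("fake_core_accumulators")
for name in ("Sum", "Mean", "WeightedSum", "WeightedMean"):
    setattr(fake_core, name, type(name, (), {}))

# The static, explicitly maintained list on the Python side.
EXPOSED = ("Sum", "Mean", "WeightedSum", "WeightedMean")


def missing_from_static_list(core, exposed):
    """Return public classes defined in `core` that are not listed in `exposed`."""
    public = {
        name
        for name, obj in vars(core).items()
        if isinstance(obj, type) and not name.startswith("_")
    }
    return sorted(public - set(exposed))


# Everything compiled is listed, so the check passes.
assert missing_from_static_list(fake_core, EXPOSED) == []

# Adding a class on the compiled side without updating the list is detected.
fake_core.Collector = type("Collector", (), {})
forgotten = missing_from_static_list(fake_core, EXPOSED)
```

The list stays explicit for static tools, and a forgotten entry is caught by the test suite rather than by users.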

Member

Also, PyBind11 is anything but DRY...

Member

Remember, _core is monkey-patched for documentation, so everything in it should be explicitly imported. It is also ignored for static analysis, so there again, everything should be explicitly imported. Explicit is better than implicit.

Member

This (unrelated to list accumulators) change is also what is breaking Python 2!

Member Author

I don't know what pybind11 has to do with it; on the contrary, it is a good example of being DRY. Their docs even state that they strongly prefer minimal code to do the work. Minimal code equates to avoiding redundancy.

Member Author

Your counter arguments make no sense to me. The code is explicit, explicit in the forwarding and transformation rules. I don't have a problem with not being able to do static analysis here.

Member Author

If you have another solution that allows me to easily add an accumulator without changing the code in several places, then go ahead. For now this is better than it was before.

@HDembinski
Member Author

HDembinski commented Jun 30, 2020

Any repetition in code is bad, we want to be DRY.

This is not an absolute rule, just a guiding principle. Also, we really aren't talking about code duplication, but rather the equivalent of definitions - it's a little irritating to list items in multiple places, but it provides static code analysis benefits - not just for MyPy, but also for code completion tools, Sphinx (which can't build the C++ code, so relies on the Python files only), and for human readers of the code. For a simplified and rather bad comparison, this is why from x import * is bad - you can't see where things come from without digging further, but if you list each item instead of *, you can trace down where things come from easily. Not an absolute rule either, but often, for mostly static code, being explicit helps tools and users down the line.

I'm not against code duplication, but I don't like additions that break static analysis.

We have different priorities. I consider static analysis a minor priority, because it is really not that important in this library. A good design is one that requires changes in only one place to add a new accumulator. One of the core principles of boost::histogram is to make it easy to add new storages, axes, and accumulators. I want the same to be true for boost-histogram. We have rules for how the Pythonic names relate to the C++ names. These rules can be written in code.

Edit: To be precise, I want it to be easy to add accumulators, axes, and storages in C++. The wrapping to Python should work largely automatically, using TMP in C++ and dynamic processing on the Python side.
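As a sketch of what "rules written in code" could mean on the Python side (this is not the code in this PR): it assumes, purely for illustration, that compiled names are snake_case and Python names are CamelCase. `pythonic_name`, `forward_all`, and `fake_core` are hypothetical.

```python
import types


def pythonic_name(cpp_name):
    """Assumed rule for this sketch: snake_case -> CamelCase."""
    return "".join(part.capitalize() for part in cpp_name.split("_"))


def forward_all(core, namespace):
    """Copy every public class from `core` into `namespace` under its Pythonic name."""
    exported = []
    for cpp_name, obj in sorted(vars(core).items()):
        if isinstance(obj, type) and not cpp_name.startswith("_"):
            py_name = pythonic_name(cpp_name)
            namespace[py_name] = obj
            exported.append(py_name)
    return exported


# Stand-in for the compiled module; a new class needs no change on the Python side.
fake_core = types.ModuleType("fake_core")
for cpp_name in ("sum", "weighted_mean", "collector"):
    setattr(fake_core, cpp_name, type(cpp_name, (), {}))

namespace = {}
exported = forward_all(fake_core, namespace)
```

The trade-off discussed in this thread is visible here: a new accumulator appears without touching Python files, but the exported names exist only at runtime, so mypy, code completion, and Sphinx cannot see them.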

@HDembinski
Member Author

We need a roughly feature complete version (1.0), and then we can drop Python 2 support.

Why do we need that? Only because you wrote it in a plan?


2 participants