
ENH: Add extension point for dataset-loader associations #2246

Open · jnazaren wants to merge 6 commits into master from dataset_loader_associations
Conversation

jnazaren (Contributor)
Implements a possible solution along the lines of what @ssanderson suggested in #2199.

coveralls commented Jul 11, 2018


Coverage increased (+0.1%) to 86.042% when pulling 65f582f on dataset_loader_associations into f288356 on master.

@ssanderson (Contributor) left a comment

@jnazaren I haven't taken a careful look at this, but I figured I should post what I have so far here.

self._column_loaders.clear()


global_pipeline_dispatcher = PipelineDispatcher()

Do we need a global registry for this? In general, I think we should avoid the pattern of having a mutable global registry of stuff unless it's absolutely necessary.
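For comparison, a minimal sketch of the explicit alternative; PipelineDispatcher and pipeline_loader are from this PR, while the engine wiring in the comment is illustrative and not part of this diff:

from zipline.pipeline.data import USEquityPricing

# Each application builds its own dispatcher instead of mutating a
# module-level global registry.
dispatcher = PipelineDispatcher(
    dataset_loaders={USEquityPricing: pipeline_loader},
)

# The dispatcher is itself a callable from column -> loader, so it can
# be passed directly wherever a get_loader function is expected, e.g.:
# engine = SimplePipelineEngine(dispatcher, calendar, asset_finder)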

pl : PipelineLoader
    The PipelineLoader to register for the column or dataset columns.
"""
assert isinstance(pl, PipelineLoader)

Two notes on this:

  1. I don't think we need to be doing type checking on these inputs. A pipeline loader can be any object that implements load_adjusted_array with the correct signature. Forcing all loaders to be subclasses of a base class isn't necessary. If a user wants to validate that their class implements the correct methods, that might make sense, but I had actually forgotten that the PipelineLoader base class existed, and we definitely have loaders downstream that don't subclass it. (There's a duck-typed sketch at the end of this comment.)
  2. It's almost never a good idea to use assert for validating user input. assert statements get disabled if you run python with -O. They're meant for sanity checks in the internal logic of your program, not for verifying that a value received from a user matches some contract. Copy/pasting from a reply that I wrote elsewhere earlier this summer:

When Should You Use Assertions?

Generally speaking, you should use assertions for things that indicate that
there's a bug in your program or library that's severe enough that you'd prefer
for the program to crash immediately with a useful error rather than
continuing. In particular, you should only use assertions for things that you
have reason to believe to be true.

That generally means that you should use assertions for things like
preconditions and postconditions of functions that are internal to the library,
and that you shouldn't use assertions for things like validating user input
(whether or not the user passes you valid inputs isn't under your control, so
it can't be indicative of a bug in your part of the program).

A common scenario where assertions are useful is a user-facing API layer
that validates that input is well formed and then passes the known-good
input into internal library functions. You might include assertions in
those internal functions to ensure that we don't accidentally call them
on un-validated data, e.g.:

def user_facing_function(possibly_bad_data):
    good_data = validate(possibly_bad_data)
    return _internal_function(good_data)

def _internal_function(good_data):
    assert is_good(good_data), "good_data isn't good. Did you forget to validate?"
    # do stuff
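For reference, a minimal sketch of the duck-typed loader mentioned in note 1; the class name and body are hypothetical, and the load_adjusted_array signature is assumed from the existing loaders in this codebase:

class MyCustomLoader(object):
    # Hypothetical loader: no PipelineLoader subclass required. Any
    # object with a compatible load_adjusted_array method will do.

    def load_adjusted_array(self, columns, dates, assets, mask):
        # Should return a dict mapping each requested BoundColumn to an
        # AdjustedArray of data for the (dates x assets) grid.
        raise NotImplementedError()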

dataset_loaders : dict[DataSet -> PipelineLoader]
    Map from datasets to the pipeline loaders for those datasets.
"""
def __init__(self, column_loaders=None, dataset_loaders=None):

Is there a reason to segregate these into column_loaders and dataset_loaders? We could just accept loaders with a dictionary from either BoundColumn or DataSet -> loader.
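Something like this sketch, where register() dispatches on the key's type as it already does:

def __init__(self, loaders=None):
    # One mapping from BoundColumn or DataSet to loader.
    self._column_loaders = {}
    if loaders is not None:
        for data, loader in loaders.items():
            self.register(data, loader)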

# already registered, allowing users to register their own loaders
# early on in extensions
if isinstance(data, BoundColumn):
    if data not in self._column_loaders:

It seems dangerous to me to silently ignore conflicts if a loader gets registered for a dataset multiple times. I would expect this to be an error, and if not, I would have expected the later registration to win.

Independently of the above, dict.setdefault is a built-in way to write a key to a dictionary only if it's not present.
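E.g., a sketch of the erroring variant (the exact exception type is a suggestion):

if data in self._column_loaders:
    raise ValueError(
        "a loader is already registered for %s" % data,
    )
self._column_loaders[data] = pl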

self.register(dataset, pl)

def __call__(self, column):
    if column in self._column_loaders:

It's generally more idiomatic in python to write this as:

try:
    return self._column_loaders[column]
except KeyError:
    raise LookupError(...)

cf. https://blogs.msdn.microsoft.com/pythonengineering/2016/06/29/idiomatic-python-eafp-versus-lbyl/, for example.

self.column_loaders = mappingproxy(self._column_loaders)
if dataset_loaders is not None:
    for dataset, pl in dataset_loaders.items():
        self.register(dataset, pl)

I tend to try to avoid calling methods on self in constructors. Generally when editing a method, most people assume that the class is fully-constructed by the time the method is called (for example, they assume that all attributes will be set). Calling a method in a constructor can violate that assumption.
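One way to avoid that here (a sketch) is to expand the dataset registrations inline so __init__ only assigns attributes:

def __init__(self, column_loaders=None, dataset_loaders=None):
    self._column_loaders = dict(column_loaders or {})
    if dataset_loaders is not None:
        # Inline the work register() would do, so no method runs on a
        # partially-constructed instance. DataSet.columns is the set of
        # BoundColumns defined on the dataset.
        for dataset, pl in dataset_loaders.items():
            for column in dataset.columns:
                self._column_loaders[column] = pl
    self.column_loaders = mappingproxy(self._column_loaders)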

# we register our default loader last, after any loaders from users
# have been registered via extensions
register_pipeline_loader(USEquityPricing, pipeline_loader)

def choose_loader(column):

We don't need to create a new function here anymore. global_pipeline_dispatcher is already a callable with the right signature.
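I.e., the wiring can just pass the dispatcher through (a sketch; the surrounding engine setup is illustrative and not part of this diff):

# No choose_loader wrapper needed; the dispatcher maps column -> loader.
engine = SimplePipelineEngine(
    global_pipeline_dispatcher,
    calendar,
    asset_finder,
)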

cls.add_class_callback(clear_all_associations)

def test_load_not_registered(self):
    fake_col_instance = FakeColumn(

I haven't looked carefully at these tests yet, but in general, it's not expected that anyone should ever construct BoundColumn instances explicitly. The usual way to get a bound column is to do:

class SomeDataSet(DataSet):
    column = Column(...)

SomeDataSet.column  # This evaluates to a BoundColumn because Column implements the descriptor protocol.
