
ENH: Add extension point for dataset-loader associations #2246

Open · jnazaren wants to merge 6 commits into master from dataset_loader_associations
Conversation

jnazaren (Contributor)
Implements a possible solution along the lines of what @ssanderson suggested in #2199.

coveralls commented Jul 11, 2018


Coverage increased (+0.1%) to 86.042% when pulling 65f582f on dataset_loader_associations into f288356 on master.

@ssanderson (Contributor) left a comment

@jnazaren I haven't taken a careful look at this, but I figured I should post what I have so far here.

self._column_loaders.clear()


global_pipeline_dispatcher = PipelineDispatcher()

Do we need a global registry for this? In general, I think we should avoid the pattern of having a mutable global registry of stuff unless it's absolutely necessary.
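For comparison, a minimal sketch of the explicit alternative; PipelineDispatcher and pipeline_loader are from this PR, while the engine wiring in the comment is illustrative and not part of this diff:

from zipline.pipeline.data import USEquityPricing

# Each application builds its own dispatcher instead of mutating a
# module-level global registry.
dispatcher = PipelineDispatcher(
    dataset_loaders={USEquityPricing: pipeline_loader},
)

# The dispatcher is itself a callable from column -> loader, so it can
# be passed directly wherever a get_loader function is expected, e.g.:
# engine = SimplePipelineEngine(dispatcher, calendar, asset_finder)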

pl : PipelineLoader
    The PipelineLoader to register for the column or dataset columns.
"""
assert isinstance(pl, PipelineLoader)

Two notes on this:

  1. I don't think we need to be doing type checking on these inputs. A pipeline loader can be any object that implements load_adjusted_array with the correct signature. Forcing all loaders to be subclasses of a base class isn't necessary. If a user wants to validate that their class implements the correct methods, that might make sense, but I had actually forgotten that the PipelineLoader base class existed, and we definitely have loaders downstream that don't subclass it. (There's a duck-typed sketch at the end of this comment.)
  2. It's almost never a good idea to use assert for validating user input. assert statements get disabled if you run python with -O. They're meant for sanity checks in the internal logic of your program, not for verifying that a value received from a user matches some contract. Copy/pasting from a reply that I wrote elsewhere earlier this summer:

When Should You Use Assertions?

Generally speaking, you should use assertions for things that indicate that
there's a bug in your program or library that's severe enough that you'd prefer
for the program to crash immediately with a useful error rather than
continuing. In particular, you should only use assertions for things that you
have reason to believe to be true.

That generally means that you should use assertions for things like
preconditions and postconditions of functions that are internal to the library,
and that you shouldn't use assertions for things like validating user input
(whether or not the user passes you valid inputs isn't under your control, so
it can't be indicative of a bug in your part of the program).

A common scenario where assertions are useful is a user-facing API layer
that validates that input is well formed and then passes the known-good
input into internal library functions. You might include assertions in
those internal functions to ensure that we don't accidentally call them
on un-validated data, e.g.:

def user_facing_function(possibly_bad_data):
    good_data = validate(possibly_bad_data)
    return _internal_function(good_data)

def _internal_function(good_data):
    assert is_good(good_data), "good_data isn't good. Did you forget to validate?"
    # do stuff
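For reference, a minimal sketch of the duck-typed loader mentioned in note 1; the class name and body are hypothetical, and the load_adjusted_array signature is assumed from the existing loaders in this codebase:

class MyCustomLoader(object):
    # Hypothetical loader: no PipelineLoader subclass required. Any
    # object with a compatible load_adjusted_array method will do.

    def load_adjusted_array(self, columns, dates, assets, mask):
        # Should return a dict mapping each requested BoundColumn to an
        # AdjustedArray of data for the (dates x assets) grid.
        raise NotImplementedError()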

dataset_loaders : dict[DataSet -> PipelineLoader]
    Map from datasets to the pipeline loaders for those datasets.
"""
def __init__(self, column_loaders=None, dataset_loaders=None):

Is there a reason to segregate these into column_loaders and dataset_loaders? We could just accept loaders with a dictionary from either BoundColumn or DataSet -> loader.
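Something like this sketch, where register() dispatches on the key's type as it already does:

def __init__(self, loaders=None):
    # One mapping from BoundColumn or DataSet to loader.
    self._column_loaders = {}
    if loaders is not None:
        for data, loader in loaders.items():
            self.register(data, loader)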

# already registered, allowing users to register their own loaders
# early on in extensions
if isinstance(data, BoundColumn):
    if data not in self._column_loaders:

It seems dangerous to me to silently ignore conflicts if a loader gets registered for a dataset multiple times. I would expect this to be an error, and if not, I would have expected the later registration to win.

Independently of the above, dict.setdefault is a built-in way to write a key to a dictionary only if it's not present.
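E.g., a sketch of the erroring variant (the exact exception type is a suggestion):

if data in self._column_loaders:
    raise ValueError(
        "a loader is already registered for %s" % data,
    )
self._column_loaders[data] = pl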

self.register(dataset, pl)

def __call__(self, column):
    if column in self._column_loaders:

It's generally more idiomatic in python to write this as:

try:
    return self._column_loaders[column]
except KeyError:
    raise LookupError(...)

cf. https://blogs.msdn.microsoft.com/pythonengineering/2016/06/29/idiomatic-python-eafp-versus-lbyl/, for example.

self.column_loaders = mappingproxy(self._column_loaders)
if dataset_loaders is not None:
    for dataset, pl in dataset_loaders.items():
        self.register(dataset, pl)

I tend to try to avoid calling methods on self in constructors. Generally when editing a method, most people assume that the class is fully-constructed by the time the method is called (for example, they assume that all attributes will be set). Calling a method in a constructor can violate that assumption.
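One way to avoid that here (a sketch) is to expand the dataset registrations inline so __init__ only assigns attributes:

def __init__(self, column_loaders=None, dataset_loaders=None):
    self._column_loaders = dict(column_loaders or {})
    if dataset_loaders is not None:
        # Inline the work register() would do, so no method runs on a
        # partially-constructed instance. DataSet.columns is the set of
        # BoundColumns defined on the dataset.
        for dataset, pl in dataset_loaders.items():
            for column in dataset.columns:
                self._column_loaders[column] = pl
    self.column_loaders = mappingproxy(self._column_loaders)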

# we register our default loader last, after any loaders from users
# have been registered via extensions
register_pipeline_loader(USEquityPricing, pipeline_loader)

def choose_loader(column):

We don't need to create a new function here anymore. global_pipeline_dispatcher is already a callable with the right signature.
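I.e., the wiring can just pass the dispatcher through (a sketch; the surrounding engine setup is illustrative and not part of this diff):

# No choose_loader wrapper needed; the dispatcher maps column -> loader.
engine = SimplePipelineEngine(
    global_pipeline_dispatcher,
    calendar,
    asset_finder,
)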

cls.add_class_callback(clear_all_associations)

def test_load_not_registered(self):
    fake_col_instance = FakeColumn(

I haven't looked carefully at these tests yet, but in general, it's not expected that anyone should ever construct BoundColumn instances explicitly. The usual way to get a bound column is to do:

class SomeDataSet(DataSet):
    column = Column(...)

SomeDataSet.column  # This evaluates to a BoundColumn because Column implements the descriptor protocol.
