Add tests (and probably fix) writing partitioned Literal storage as Producer output #141

JacobHayes · 2021-12-08T06:19:28Z

Writing Producer output to Literal storage was recently added, with tests confirming non-partitioned use. Most of the logic should work for partitioned Producers (i.e., those implementing map), but it will probably need a couple of small fixes plus a test.

The first thing that comes to mind as likely to error is that the Producer's autogenerated input and output Artifacts will have type=Int(...) or type=List(...) instead of type=Collection(...). We'll have to see if there's a good way to know (or infer) whether a given input/output literal should be partitioned.

JacobHayes · 2022-01-06T16:09:17Z

Perhaps we can determine whether a Producer is mapped and if so, change the generated Literal output Artifacts to have type=Collection(...).

To determine whether a Producer is mapped, it should mostly be safe to check whether map is implemented (i.e., hasattr(cls, "map")). This may be a bit circular (we define the output artifacts first, then use that metadata to validate/generate map), but should be OK since we only auto-generate a map method for non-partitioned cases. We may want to set an attribute on the generated map method that we can check for when handling Producer subclassing (that way we don't see a base class's generated map and mistake the subclass for partitioned).
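
A rough sketch of that marker-attribute idea, using a standalone stand-in class rather than the real Producer. The names `_autogenerated`, `_make_default_map`, and `is_partitioned` are hypothetical, and the generated map body is only a placeholder:

```python
def _make_default_map():
    """Build the fallback map used for non-partitioned Producers (placeholder body)."""

    def generated_map(self):
        return {}

    # Marker so later checks can tell a generated map from a user-written one.
    generated_map._autogenerated = True
    return generated_map


def is_partitioned(cls):
    """True only if some class in the MRO defines a user-written map."""
    map_fn = getattr(cls, "map", None)
    return map_fn is not None and not getattr(map_fn, "_autogenerated", False)


class Producer:
    """Hypothetical stand-in for the real Producer base class."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # An inherited generated map carries the marker, so it does not count
        # as partitioned and the subclass simply gets a fresh generated map.
        if not is_partitioned(cls):
            cls.map = _make_default_map()


class Plain(Producer):
    pass


class Mapped(Producer):
    def map(self):
        return {}


class PlainChild(Plain):  # inherits only a generated map
    pass
```

With this, `is_partitioned(Mapped)` is true while `is_partitioned(PlainChild)` stays false, even though `hasattr(PlainChild, "map")` is true.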

--

One general type system caveat: the Collection logic currently assumes each "partition" of the collection is itself "list like" or "concatenatable" (e.g., pd.DataFrame or a database table). For example, Collection(element=Int64()) actually corresponds to a Python type hint of -> list[int] (def build(...): return [1]), but these Literal uses are more likely -> int (def build(...): return 1). Perhaps we add an extra flag to Collection, like scalar_partitions, to disambiguate?
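
To make the proposal concrete, here is a minimal sketch of how such a flag could change the per-partition Python type. `Int64` and `Collection` below are simplified stand-ins, not the library's classes, and `partition_python_type` is a hypothetical helper:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Int64:
    """Stand-in for the Int64 type."""


@dataclass(frozen=True)
class Collection:
    """Stand-in for Collection with the proposed (hypothetical) flag."""

    element: object
    scalar_partitions: bool = False


# Assumed mapping from element types to Python types, for illustration only.
_PYTHON_TYPES = {Int64: int}


def partition_python_type(collection):
    """Return the Python type a single partition's build output would have."""
    element_type = _PYTHON_TYPES[type(collection.element)]
    if collection.scalar_partitions:
        return element_type  # def build(...) -> int
    return list[element_type]  # def build(...) -> list[int]
```

Under these assumptions, `Collection(element=Int64())` maps to `list[int]` per partition, and `Collection(element=Int64(), scalar_partitions=True)` maps to `int`.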

This makes sense for the output (because we run build multiple times), but any input artifacts with scalar_partitions might need to accept either list[int] or int, depending on whether map defines a single partition or multiple partitions as the dependency (which, unfortunately, is not known or possible to validate until "runtime"). Perhaps we can limit this to "inputs are always lists" in the short term.
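
The short-term "inputs are always lists" rule could look roughly like the sketch below. The `stored` dict and the partition keys are made up for illustration, and `read_scalar_input` is a hypothetical helper, not an existing function:

```python
def read_scalar_input(stored, selected_keys):
    """Gather scalar partitions into a list, even when only one is selected.

    stored: dict mapping a (hypothetical) partition key to its scalar value.
    selected_keys: the partition keys that map assigned to this build call.
    """
    return [stored[key] for key in selected_keys]


stored = {"a": 1, "b": 2}
single = read_scalar_input(stored, ["a"])  # a one-element list, not a bare int
several = read_scalar_input(stored, ["a", "b"])
```

The consuming build would then always annotate the input as list[int], regardless of how many partitions map selects.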

JacobHayes added the bug Something isn't working label Dec 8, 2021
