Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add tests (and probably fix) writing *partitioned* Literal storage as Producer output #141

Open
JacobHayes opened this issue Dec 8, 2021 · 1 comment
Labels
bug Something isn't working

Comments

@JacobHayes
Copy link
Member

Writing Producer output to Literal storage was recently added with tests confirming non-partitioned use. Most of the logic should be good for use with partitioned Producers (ie: those implementing map), but probably will need a couple small fixes + a test.

The first thing that comes to mind as likely to error is that the Producer's autogenerated input and output Artifacts will have type=Int(...) or type=List(...) instead of type=Collection(...). We'll have to see if there's a good way to know (or infer) whether a given input/output literal should be partitioned.

@JacobHayes JacobHayes added the bug Something isn't working label Dec 8, 2021
@JacobHayes
Copy link
Member Author

Perhaps we can determine whether a Producer is mapped and if so, change the generated Literal output Artifacts to have type=Collection(...).

To determine if mapped, we should mostly be safe to just check whether map is implemented (ie: hasattr(cls, "map")). This may be a bit circular (we define the output artifacts first, then use that metadata to validate/generate map), but should be ok since we only auto-generate a map method for non-partitioned cases. We may want to set an attribute on the generated map method that we can check for to handle Producer subclassing (that way we don't just see a base class's generated map and then think it is partitioned).

--

One general type system caveat: the Collection logic currently assumes each "partition" of the collection is itself "list like" or "concatenatable" (eg: pd.DataFrame or database table). For example, Collection(element=Int64()) actually corresponds to a python type hint of -> list[int] (def build(...): return [1]), but these Literal uses are more likely -> int (def build(...): return 1). Perhaps we add an extra flag to Collection like scalar_partitions to disambiguate?

This makes sense for the output (because we run build multiple times), but any input artifacts with scalar_partitions might need to be flexible on list[int] or int dependent on whether map defines a single partition as the dependency or multiple (which is unfortunately, not known/possible to validate until "runtime"). Perhaps we can limit this to "inputs are always lists" in the short term.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant