Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrate ancillary classes to Pydantic Models #60

Merged
merged 7 commits into from Jul 15, 2021
Merged

Migrate ancillary classes to Pydantic Models #60

merged 7 commits into from Jul 15, 2021

Conversation

JacobHayes
Copy link
Member

@JacobHayes JacobHayes commented Jul 13, 2021

As an alternative to the serialization work in #39, we decided to port many of the core classes to Pydantic models. In addition to ~automatic serialization (at least to dict, each Backend will need to handle from there), Pydantic simplifies class definitions (akin to dataclasses), provides a standard validation mechanism, and improves error reporting.

I haven't yet handled Artifact, Producer, nor Graph, but I think those can be handled simply enough:

  • Artifact can move to a model; the "accumulation" of class and instance annotations/statistics can be done in @validator methods
  • Producer has a small "plan" laid out in code comments: add Producer inputs as standard model arguments with build and map arguments referencing a (non-proper) subset of those.
  • Graph can move to a model, maybe with a bit of work around the ArtifactBox (not sure how that'll serialize - may not be an issue)

Additionally for Artifact and Producer, we'll probably want separate "ArtifactPartition" and "ProducerRun" style sub-models respectively to track more granular details (partition keys, runtime metadata, etc) - but we'll punt on these for now.

@JacobHayes JacobHayes self-assigned this Jul 13, 2021
@codecov
Copy link

codecov bot commented Jul 13, 2021

Codecov Report

Merging #60 (23a46ca) into golden (c914ea8) will not change coverage.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff            @@
##            golden       #60   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           15        16    +1     
  Lines          567       583   +16     
  Branches        71        72    +1     
=========================================
+ Hits           567       583   +16     
Impacted Files Coverage Δ
src/arti/graphs/core.py 100.00% <ø> (ø)
src/arti/annotations/core.py 100.00% <100.00%> (ø)
src/arti/artifacts/core.py 100.00% <100.00%> (ø)
src/arti/fingerprints/core.py 100.00% <100.00%> (ø)
src/arti/formats/core.py 100.00% <100.00%> (ø)
src/arti/internal/models.py 100.00% <100.00%> (ø)
src/arti/producers/core.py 100.00% <100.00%> (ø)
src/arti/storage/core.py 100.00% <100.00%> (ø)
src/arti/types/core.py 100.00% <100.00%> (ø)
src/arti/versions/core.py 100.00% <100.00%> (ø)
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8cf97fa...23a46ca. Read the comment docs.

Copy link

@danielhfrank danielhfrank left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks neat so far! I think my main comment here is the one about abc.

One other syntax thing - I think I recall there being some way to make it possible to instantiate pydantic models using positional args instead of kwargs, I think if you used the pydantic.dataclass syntax. But then I believe there was a tradeoff about how you had to specify config options (like the extra thing you did). Try to play around with it! Not a necessity but might be nice if it starts to feel like a headache to kwarg everything

MODELS_REQUIRING_SUBCLASS = []


def requires_subclass(cls: type) -> type:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took me a little while to realize this was being used as a decorator, a comment would be nice!

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You know what, now that I'm seeing about what this mechanism is trying to accomplish, I think it'd be possible to do with abc's abstract classes. It's a bit more of a standard mechanism, have you gotten to look at it? I'm 95% sure that I have used it in combination with pydantic

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, good idea w/ abc! Two questions (may experiment in a bit):

  • the requires_subclass ones are intended to be "dynamic" (setting arbitrary attributes) so won't have anything to mark abstract here. 😅 Does subclassing ABC alone prevent direct instantiation or what do you think there?
  • do you remember getting a metaclass conflict w/ pydantic+ABC? Think I could resolve w/ a custom metaclass if so, but might be slightly messy

If ABCs don't work cleanly, another option I thought about was an _abstract class attr flagging whether to allow/disallow instantiation and adding an __init_subclass__ that sets to "allow" if not set explicitly on each subclass.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, they already thought of ABCMeta of course (their ModelMetaclass subclasses ABCMeta)! Now to figure out the abstract part 🔍

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, so as I'm looking at this stuff again, I'm reminded that the only way to define an "abstract class" is for it to have methods (or properties) that are annotated as abstract. And if I understand your question, you're saying that you're not sure what it is that should be marked abstract in some of these cases.

So in the case of Version, it's fairly clear that you can make fingerprint an abstract method and you're good to go. In some of the other cases, it's not as clear. When I'm fighting with a library like that, it is sometimes a sign that I could stand to rethink my strategy. Like, why does Annotation need to be abstract? Or what about Type? I wonder if it might be useful to use a Union instead of a dummy base class, so something like Type = Union[String, Int, Null], something like that.

Happy to talk in more detail, this stuff is really fun to play around with!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And if I understand your question, you're saying that you're not sure what it is that should be marked abstract in some of these cases.

Yeah, some tension is the ambiguous semantics for "abstract". With the current approach, one could trivially subclass any @requires_subclass class with no additional properties and instantiate - is that ok or should that still be "abstract"?

For Type or Annotation specifically, yes, subclassing alone may be enough: there is additional semantic meaning intrinsic in the subclass. For example, Geo(Type) or PII(Annotation) alone is meaningful (though of course, expanding with eg: srid: int or sensitive: bool wouldn't hurt).

So in the case of Version, it's fairly clear that you can make fingerprint an abstract method

Ah, good call. 👍

why does Annotation need to be abstract? Or what about Type? I wonder if it might be useful to use a Union instead of a dummy base class, so something like Type = Union[String, Int, Null], something like that.

Good question. 🤔 The intent on these two (and a lot of the other classes in the core.py modules) is that external modules (plugins, user code, etc) will define additional subclasses known only at runtime.

I think you're right on Union style behavior but because it's 1. dynamic (precluding Union) and 2. arbitrary data rather than abstract behavior (precluding ABC and Protocol), we may be forced out of the "happy path"...? The only other thing I can think of is (ab)using inheritance to create "dynamic" discriminated Unions (I want to check both isinstance(x, Type) and isinstance(x, Geo), depending on the context) similar to now.

--

Seems pydantic has an open issue about (static) discriminated unions (which would kinda be like discriminator='__class_key__' in this PR) and another one with a desire for dynamic unions (they went with a registry too, but using __init_subclass__ and missing the actual validator part).

Unless you can think of anything else, I will probably:

  • replace @register_subclass with an __abstract__: ClassVar[bool], add an __init_subclass__ defaulting __abstract__ = False, and change the validator to check for cls.__abstract__ (changes not necessary, I just like it a bit more)
  • add some good comments summarizing a handful of these things above the __abstract__ flag (hopefully a bit more concisely, now that you've helped clarify the intent)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(This is all veering into pure discussion territory; I can tell these are early strokes that will be revised, so don't consider any of this blocking)

OK, the bit about user-defined types is an interesting twist, I can see how that gets in the way implementation-wise of Unions. I still think there's a question here worth considering: what are the actual requirements for an Annotation? Or for a Type? Like, what does code that interacts with those objects expect them to be able to do? I think that trying to implement some functionality will likely be illuminating. If it's just that they need to be serializable, for example, then perhaps requiring that they be pydantic models is sufficient. Serialization of user-defined types presents its own set of challenges, which you can see succinctly described in the second issue that you linked. I worked on my own library that tried to deal with this sort of thing, happy to discuss further if/when that comes up!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Serializing arbitrary data is the main need within the framework (which pydantic covers, as you say), but they will be instantiated by users and most consuming code will be "purpose" built. Eg:

  • Types are used by the TypeSystems to convert arti Types <-> third party types
  • Annotations is largely for "data ops", but may influence (user) code "responding" to different subclasses specifically

Guess adding comments to Annotation, Type, etc would be useful sooner than later. 😁

Does this feel a bit tidier than the registry?


@validator("dt", always=True)
@classmethod
def _requires_timezone(cls, dt: datetime) -> datetime:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, now we're seeing where pydantic comes in handy!

@JacobHayes
Copy link
Member Author

Gonna go ahead and merge, but will circle back for Artifact, Producer, and Graph

@JacobHayes JacobHayes merged commit 3548f02 into golden Jul 15, 2021
@JacobHayes JacobHayes deleted the pydantic branch July 15, 2021 07:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants