Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

model: need to avoid a "pure" Union or "serialize" it with type info #333

Open
yarikoptic opened this issue Jan 20, 2021 · 4 comments
Open
Labels
schema Issues relating to metadata schema

Comments

@yarikoptic
Copy link
Member

this is to continue on our metadata meetup with @satra @jwodder @AlmightyYakob @waxlamp discussing the difficulties of establishing sensible while generic and scalable UI for (dandiset) metadata editing.

One of the problematic cases is the use of Union type/construct in the model, and then relying on pydantic to "do the right choice" of the underlying model while consuming data from serialized (yaml or json) data.
Here is the Unions we have ATM (0.10.0-40-gcee1cbd):

$> grep Union dandi/models.py | grep -v 'import'
    unitCode: Union[str, AnyUrl] = Field(None, nskey="schema")
    value: Union[str, bool, int, float, List[Union[str, bool, int, float]]] = Field(
    propertyID: Union[IdentifierType, AnyUrl, str] = Field(
Identifier = Union[AnyUrl, PropertyValue, str]
    repository: Union[str, AnyUrl] = Field(
    wasAssociatedWith: Optional[Union[Person, Organization, Software]] = Field(
    contributor: Optional[List[Union[Person, Organization]]] = Field(
    about: Optional[List[Union[Disorder, Anatomy, Identifier]]] = Field(
    studyTarget: Optional[List[Union[str, AnyUrl]]] = Field(
    wasGeneratedBy: Optional[Union[Activity, AnyUrl]] = Field(
    contributor: List[Union[Person, Organization]] = Field(
    doi: Optional[Union[str, AnyUrl]] = Field(
    encodingFormat: Union[str, AnyUrl] = Field(

The core assumption is that our types would have some attributes unique to them, so the choice of the type becomes unambiguous. It is not the case ATM (if ever could be achieved), and here are some thoughts/observations:

Order of types should not matter but it does, preventing round trip to .json and back

ATM pydantic would choose the first type (in the list within Union) which 'satisfies' the data. Unfortunately IMHO this is a mis-feature and ideally we should not rely on it. It makes it virtually impossible to define/use two types which have only a semantic difference by being different types and otherwise having the same attributes. E.g.,

  • we cannot place any two TypeModel subclasses which have no non-optional attributes into a Union (e.g. Anatomy and Disorder for about above; Disorder does have an attribute dxdate but it is optional), since then former would always take "precedence" when we just read in serialized data without any type annotation. So ATM we cannot do R/T on those into/from serialized form which has no type annotations. Yes, we could "improve" the schema models by making them more 'specialized' (magic word to mention here: "Ontologies") but IMHO it would be "artificial"/duplicate/fragile (depending on the case ;)), since semantically we already know that they are different types, since we defined them separately.
  • we have to "order" types from more specialized to a more generic one. E.g. in above we have a number of Union[str, AnyUrl]. We would need to swap the order because otherwise no R/T is possible since AnyUrl serializes just into a URL
quick demo
def test_yarik_assumptions():
    import json
    from dandi.models import DandiBaseModel, Union, Field, AnyUrl

    class WithUrl(DandiBaseModel):
        a: Union[str, AnyUrl] = Field(None, nskey="schema")

    url = AnyUrl("http://purl.obolibrary.org/obo/PATO_0000384",
                        scheme="http",
                        host="purl.obolibrary.org",
                        tld="org",
                        host_type="domain",
                        path="/obo/PATO_0000384")
    v1 = WithUrl(a=url)
    print(f'str: {v1}')
    print(f'repr: {v1!r}')
    v1_json = v1.json(exclude_unset=True, exclude_none=True)
    print(f'json: {v1_json}')
    v1_rt = WithUrl(**json.loads(v1_json))
    print(f'R/T repr: {v1_rt!r}')

results in

str: a=AnyUrl('http://purl.obolibrary.org/obo/PATO_0000384', scheme='http', host='purl.obolibrary.org', tld='org', host_type='domain', path='/obo/PATO_0000384')
repr: WithUrl(a=AnyUrl('http://purl.obolibrary.org/obo/PATO_0000384', scheme='http', host='purl.obolibrary.org', tld='org', host_type='domain', path='/obo/PATO_0000384'))
json: {"a": "http://purl.obolibrary.org/obo/PATO_0000384"}
R/T repr: WithUrl(a='http://purl.obolibrary.org/obo/PATO_0000384')
well we could (and probably should) swap and place `AnyUrl` first. But it would "introduce" a *super power* that any url-like string would become a AnyUrl upon R/T. Some might consider it a feature, I would not: I do not think we should "mutate" types upon a validator just stating that something doesn't look like something else.
  • we would need to add testing of our schema to ensure so it is unambigous and order does not matter (burden/fragility again in case of model extensions) or otherwise we would cause data types "mutation" upon migration between schemas
  • it makes it ambiguous to the user seeing serialized data on what type it is if multiple can "match" (think Organization vs Person if only name, url, email are provided) (see another example below).
  • ideally then we need to maintain "pydantic" code (not just json schema; and ideally with exact versions of pydantic used ;) ) for each version to be able to migrate serialized metadata while accounting for possibly different order effects across different versions of the models

So my question is ...

is there is some existing already type construct we could use instead of Union or in addition (alike a "decorator" for each value then which would disclose the "type") so that serialized data (given the schema) could be unambigously deserialized?

E.g. taking for example the top of https://github.com/dandi/dandi-api-datasets/blob/master/000004/dandiset.yaml (in current schema):

about:
- identifier: MTL
  name: Medial Temporal Lobe

which corresponds to

about: Optional[List[Union[Disorder, Anatomy, Identifier]]] = Field(

where it takes my unquestionably high expertise in the matter to say that it is an Anatomy and not e.g. Disorder... (Exercising on the case of the List[Union] not just pure Union but I think it should generalize). I could see a following serialization which would disambiguate

about:
- anatomy:
  - identifier: MTL
    name: Medial Temporal Lobe
about:
- identifier: MTL
  name: Medial Temporal Lobe
  _schemaType: anatomy

Both would make serialization only a bit heavier, but clear(er) to humans and computers (UI) and thus allow for R/T. The latter one is IMHO more "generic" if for every value participating in a Unioned type we export the _schemaType. That would then apply only to values which are part of the union, and make it easily "upgradeable": if we make some attribute from a single to Union type, it will just add that "protected" attribute to an existing record without changing actual "layout".

BUT it would make it quite ugly in case of that simple case of AnyUrl, str. My answer would be: that is a reasonable compromise to achieve unambiguous R/T at a minor cost of visual/human readability. But may be we could do better, and just establish a "programmable" rule that in the case of the first Type chosen in the Union, we omit explicit _schemaType? then it could be coded in any schema handling.

Yet another (wild idea) alternative is to forget about JSON and use its superset YAML for serialization, and then provide type annotations in the comments

about:
- identifier: MTL  # type: anatomy
  name: Medial Temporal Lobe

even though it would likely inflict more developer pains (in particular in JS world; and I have no clue if it would be feasible to achieve with pydantic ATM) while making it more user-friendly. That would also keep opportunities for later using additional features of YAML: anchors (&) and references (*) in case of those circular entities (not the topic of this "post") we touched upon.

yarikoptic added a commit that referenced this issue Jan 20, 2021
…generic

See #333 for more information

=== Do not change lines below ===
{
 "chain": [],
 "cmd": "sed -i -e 's/str, AnyUrl/AnyUrl, str/g' {inputs}",
 "exit": 0,
 "extra_inputs": [],
 "inputs": [
  "dandi/models.py"
 ],
 "outputs": [
  "dandi/models.py"
 ],
 "pwd": "."
}
^^^ Do not change lines above ^^^
@satra
Copy link
Member

satra commented Jan 21, 2021

thanks @yarikoptic for this detailed issue. there are a few additional things - this is true in json schema as well, not just pydantic, where you can specifiy union of allowed types in the schema. unfortunately there is no type information in the data itself. the validator then can determine if the data fits any of the schemas. one thing we can do is to add "type" (as is the case in jsonld) as a property in every object and make this a readonly value. here is the schema.org jsonld example for book with identifier.

{
 "@context": "https://schema.org/",
 "@type": "Book",
 "name": "Library linked data in the cloud : OCLC's experiments with new models of resource description",
 "author": "Carol Jean Godby",
 "isbn": "9781627052191"
 "identifier": {
 "@type": "PropertyValue",
   "propertyID": "OCoLC",
   "value":  "889647468"
   },
 "sameAs": "http://www.worldcat.org/oclc/889647468"
}

this will not address trivial types like AnyURL and str, but should handle the more complex types.

@waxlamp
Copy link
Member

waxlamp commented Jan 27, 2021

The relevant notion we are dancing around here is that of the tagged union. Unfortunately, it seems that pydantic explicitly doesn't yet support this. Here is someone's attempt at a custom validator for Pydantic that uses Literal values to act as the type tag.

I only have rather vague advice for this situation, which boils down to: let's keep things as simple as possible. In this case, simplicity meshes with explicitness (making it a twofer in terms of Python Zen), requiring something like a @type field uniformly on all models. It is explicit for obvious reasons, and simple because it makes type detection a uniform problem across all models.

It may feel wrong because we shouldn't have to do it, and it makes the schemata themselves bulkier, but IMO that would be made up for by forcing the extra complexity on the machine, and detaching it from our human selves (keeping in mind that we are trying, as a design goal, not to expose DANDI users to the raw schemata). I believe this would also allow an easier path to UI customization (should that be necessary).

@jjnesbitt
Copy link
Member

Just want to point out that it's important that if we are going to include a type tags on unions, it specifically needs to be e.g.

schemaKey: anatomy

in order for the client UI component to recognize it first hand. Otherwise, we'd still have to transform the schema on the client, which is not something we want to do.

@satra
Copy link
Member

satra commented Jan 27, 2021

@AlmightyYakob and @waxlamp - i did a bunch of work on this yesterday to better understand the issues and i'm getting closer to a newer model. indeed certain specific types of union are what is causing trouble. not all unions. and the other piece is the union in identifier. i hope to resolve both of these things today or by tomorrow.

@yarikoptic yarikoptic added the schema Issues relating to metadata schema label Apr 15, 2021
jwodder pushed a commit to dandi/dandi-schema that referenced this issue May 20, 2021
…generic

See dandi/dandi-cli#333 for more information

=== Do not change lines below ===
{
 "chain": [],
 "cmd": "sed -i -e 's/str, AnyUrl/AnyUrl, str/g' {inputs}",
 "exit": 0,
 "extra_inputs": [],
 "inputs": [
  "dandi/models.py"
 ],
 "outputs": [
  "dandi/models.py"
 ],
 "pwd": "."
}
^^^ Do not change lines above ^^^
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
schema Issues relating to metadata schema
Projects
None yet
Development

No branches or pull requests

4 participants