
feat: Add unique items validation to constrained lists #2618

Merged: 5 commits merged into pydantic:master on Dec 10, 2021

Conversation

@nuno-andre (Contributor) commented Apr 1, 2021

Change Summary

Add unique_items property and validation to ConstrainedList (and conlist).

Related issue number

fix #2011

Checklist

  • Unit tests for the changes exist
  • Tests pass on CI and coverage remains at 100%
  • Documentation reflects the changes where applicable
  • changes/<pull request or issue id>-<github username>.md file added describing change
    (see changes/README.md for details)
  • My PR is ready to review, please add a comment including the phrase "please review" to assign reviewers

Rationale

This proposal solves two related issues:

  • {"type": "array", "uniqueItems": true} schemas being interpreted as a Python set.
    JSON Schema specs define array as an ordered sequence of zero or more values and its instance equality as both are arrays, and have an equal value item-for-item. So both casting JSON arrays to Python sets and defining Python sets as JSON arrays in the schema can lead to validation errors and/or data corruption.
  • conset used instead of uniqueItems
    In addition to the ordering issue, ConstrainedSet casts arrays to Python sets with no previous validation of items uniqueness; which is fully consistent with the Python set behavior, but opposite to that of the schema with which it is equated.

A unique_items validation in ConstrainedList would solve both issues while keeping the current conset behavior untouched.
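
For illustration, a minimal usage sketch of the proposed API (the Playlist model and its field are made-up names):

from pydantic import BaseModel, conlist

class Playlist(BaseModel):
    # ordered list; duplicate items are rejected at validation time
    track_ids: conlist(int, unique_items=True)

Playlist(track_ids=[3, 1, 2])  # ok: order is preserved, unlike with conset
Playlist(track_ids=[3, 1, 1])  # raises ValidationError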

This PR is a POC and lacks proper tests and docs; it also does not replicate the property into other data models (FieldInfo?). If you consider this proposal relevant, I will be glad to complete it for evaluation.

To be evaluated

  • Implementation follows the specs, with one exception: the default value is None instead of False, so that "uniqueItems": false is not added to the schema unless explicitly stated (see the schema sketch after this list).

  • In the feature request the property name is unique. This PR named it unique_items for consistency with max_items and min_items.

  • Current error message is a placeholder. As the specs do not state which mathematical object corresponds to this schema, there's no clearly defined concept of multiplicity. E.g. in [1, 1, 2, 2, 2, 3] one could count: two items (1, 2), three items (1, 2, 2), or five items (1, 1, 2, 2, 2). I tend to see it as a set, so the first reading makes more sense to me.
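
A short sketch of the default-None schema behavior described above (the exact key layout of the emitted schema is an assumption):

from typing import Any
from pydantic import BaseModel, conlist

class Model(BaseModel):
    items: conlist(Any, unique_items=True)

print(Model.schema()['properties']['items'])
# the array schema includes 'uniqueItems': True; had unique_items been
# left unset (None), no 'uniqueItems' key would be emitted at all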

@samuelcolvin (Member) left a review

Seems like a good start, but needs documentation, fixes and some tweaks.

We need to decide (what do others think?) whether we should support uniqueness checks for things that can't be hashed but can be compared for equality - e.g. list, dict, set etc.

pydantic/types.py (review thread, outdated):
@classmethod
def unique_items_validator(cls, v: 'Optional[List[T]]') -> 'Optional[List[T]]':
    if cls.unique_items and v and len(set(v)) != len(v):
        raise errors.ListUniqueItemsError(not_unique=len(v) - len(set(v)))
    return v
@samuelcolvin (Member)

you can avoid duplicating the len(v) and len(set(v)) calls.
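
A minimal sketch of that suggestion (reusing the names from the snippet above; the local variable names are made up):

@classmethod
def unique_items_validator(cls, v: 'Optional[List[T]]') -> 'Optional[List[T]]':
    if cls.unique_items and v:
        num_items, num_unique = len(v), len(set(v))  # each computed once
        if num_unique != num_items:
            raise errors.ListUniqueItemsError(not_unique=num_items - num_unique)
    return v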

@samuelcolvin (Member)

What are we going to do for lists of unhashable types?

Maybe we just decide that is a limitation of this feature, but at the least we should document it.

@nuno-andre (Contributor, Author)

They are incidentally duplicated because the exception message is just a placeholder.

> Current error message is a placeholder. As the specs do not state which mathematical object corresponds to this schema, there's no clearly defined concept of multiplicity. E.g. in [1, 1, 2, 2, 2, 3] one could count: two items (1, 2), three items (1, 2, 2), or five items (1, 1, 2, 2, 2). I tend to see it as a set, so the first reading makes more sense to me.

I still don't know how many elements are supposed to be leftover if validation fails. Do you have an opinion on this? If not I can try to open an issue at the JSON Schema repo.
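
For concreteness, an illustrative aside (not part of the PR) computing the three readings:

from collections import Counter

v = [1, 1, 2, 2, 2, 3]
counts = Counter(v)
duplicated_values = [x for x, n in counts.items() if n > 1]    # [1, 2] -> "two items"
surplus_copies = sum(n - 1 for n in counts.values() if n > 1)  # 3 -> "three items" (1, 2, 2)
involved_items = [x for x in v if counts[x] > 1]               # [1, 1, 2, 2, 2] -> "five items"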

But in this case it would also be a micro-optimization: the second calls to len() only happen when the exception is raised, and avoiding them saves just two STORE_FAST and two LOAD_FAST instructions on a successful validation.


We can add a failover for unhashable types, e.g. something like the following (with the appropriate exceptions):

try:
    # fast path: hashable items can be checked via a set
    if v and len(set(v)) != len(v):
        raise ValueError
except TypeError:
    # set(v) raised TypeError, so items are unhashable: fall back to
    # an O(n^2) equality scan
    unique = []
    if len([unique.append(i) for i in v if i not in unique]) != len(v):
        raise ValueError from None

I can benchmark some options. IMHO, as in the previous case, I think it is worth optimizing the check at the expense of a more expensive validation error. What do you think?
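
For reference, a rough micro-benchmark sketch of the two checks (sizes and repetition counts are arbitrary):

import timeit

setup = "v = list(range(500))"
set_check = "len(set(v)) != len(v)"
scan_check = (
    "unique = []\n"
    "len([unique.append(i) for i in v if i not in unique]) != len(v)"
)
print(timeit.timeit(set_check, setup=setup, number=1000))   # fast: O(n) per check
print(timeit.timeit(scan_check, setup=setup, number=1000))  # much slower: O(n^2) per check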


@nuno-andre (Contributor, Author)

The failover for unhashable types seems too tricky for mypy 😅 Would you accept a type: ignore if it turns out to be the most performant option?

@nuno-andre (Contributor, Author)

There's currently no instance-based output structure for validation errors (only schema-based). So, if it's okay with you, I'll continue with a generic error message to move on to the unhashable types validation.

@samuelcolvin (Member)

please update.

@WilliamDEdwards commented Sep 9, 2021

Will this be merged without support for unhashable types (for now)? I'd love to use this with lists.

@nuno-andre (Contributor, Author)

please review!

@WilliamDEdwards it's already mergeable for hashable and unhashable types 😉

@nuno-andre nuno-andre changed the title Add unique items validation to constrained lists - POC Add unique items validation to constrained lists Sep 10, 2021
@nuno-andre nuno-andre changed the title Add unique items validation to constrained lists feat: Add unique items validation to constrained lists Sep 10, 2021
@WilliamDEdwards

> please review!
>
> @WilliamDEdwards it's already mergeable for hashable and unhashable types 😉

Ah, I should've read the full discussion. Can't wait :)

Commits:
  • add failover for unhashable types
  • check keyword value to call the validator
  • add some tests
@WilliamDEdwards

Does this PR need any work? I'd contribute, but it looks pretty finished.

@samuelcolvin (Member) left a review

a few comments, let me know what you think.

The "hash or not to hash" questions is a difficult one. I'm not sure what I think about it.

pydantic/types.py (review thread, outdated):

Excerpted context lines from pydantic/types.py:

def conlist(item_type: Type[T], *, min_items: int = None, max_items: int = None) -> Type[List[T]]:
if v and len([unique.append(i) for i in v if i not in unique]) != len(v):  # type: ignore
@samuelcolvin (Member)

modifying unique inside a comprehension is somewhat unorthodox; also (if we don't care about how many items are not unique, which will mostly be the case) we can do this faster by raising the error as soon as we find a duplicate.

Something like

for i, value in enumerate(v, start=1):
    # check each item against the remainder of the list; stop at the first duplicate
    if value in v[i:]:
        raise errors.ListUniqueItemsError()

I haven't done any profiling but I feel something like this should be quicker.

@nuno-andre (Contributor, Author)

I've just profiled it and you're totally right about performance. Also, the difference between the set and the loop checks is almost negligible (even when the duplicate is at the end of the iterable).

I also agree that most of the time we don't care about how many items aren't unique; furthermore, that calculation hurts performance. So I'm going to refactor this PR and push it again.
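
Presumably the refactor lands close to the suggested early-exit loop; a sketch of that shape (not a verbatim excerpt of the merged code):

@classmethod
def unique_items_validator(cls, v: 'List[T]') -> 'List[T]':
    if cls.unique_items:
        # early-exit scan; works for unhashable items too, since it
        # relies only on equality, not on hashing
        for i, value in enumerate(v, start=1):
            if value in v[i:]:
                raise errors.ListUniqueItemsError()
    return v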

@samuelcolvin (Member)

please update.

@samuelcolvin (Member)

coverage is failing.

@nuno-andre (Contributor, Author)

Yes, I'm sorry. I mixed up different checks.

But this issue in mypy.py is not related to my PR, is it?

@samuelcolvin samuelcolvin merged commit 91ecfd6 into pydantic:master Dec 10, 2021
@samuelcolvin (Member)

thanks so much.

@WilliamDEdwards

@samuelcolvin When are you planning to release a new version which includes this patch?

@nuno-andre nuno-andre deleted the conlist branch December 11, 2021 20:48
@WilliamDEdwards

> @samuelcolvin When are you planning to release a new version which includes this patch?

For future readers: this has been included in v1.9.0.

@WilliamDEdwards

I think this may not be working well with optional lists:

'NoneType' object is not iterable

Haven't done any further investigation yet, so I'm not 100% sure this is caused by unique_items.

@nuno-andre (Contributor, Author)

@WilliamDEdwards Please, could you provide a minimal reproducible example? This is currently working in 1.9.0:

from typing import Any, Optional
from pydantic import BaseModel, conlist

class Model(BaseModel):
    prop: Optional[conlist(Any, unique_items=True)]

m = Model()
print(m.prop)
#> None

@WilliamDEdwards

@nuno-andre Sorry for not being more specific yesterday. I hadn't been able to make an MRE, but wanted to report the issue (if it is an issue; I can't rule out PEBKAC).

Here's the MRE. It's worth noting that I use List with Field rather than conlist, because mypy doesn't seem to work well with con*.

Code:

from pydantic import BaseModel, Field
from typing import Optional, List

class TestSchema(BaseModel):
    test: Optional[List[str]] = Field(default=..., unique_items=True)

Example:

>>> TestSchema(test=None)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pydantic/main.py", line 331, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for TestSchema
test
  'NoneType' object is not iterable (type=type_error)

>>> TestSchema(test=['test'])
TestSchema(test=['test'])

@ChibuezeOnejeme commented May 18, 2022

Guys, I want to make my schema columns unique, with no duplicates:

class whiskey_dantic(BaseModel):
    name: str = Field(..., unique_items=True)
    price: str = Field(..., unique_items=True)

    class Config:
        orm_mode = True

I am getting this error:

ValueError: On field "name" the following field constraints are set but not enforced: unique_items.
For more details see https://pydantic-docs.helpmanual.io/usage/schema/#unenforced-field-constraints
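
For context, unique_items is only enforced on list-typed fields, which is what the error above is pointing at; a minimal sketch of usage where it is enforced (the model and field names are illustrative):

from typing import List
from pydantic import BaseModel, Field

class WhiskeyNames(BaseModel):
    # enforced here because the annotated type is a list
    names: List[str] = Field(..., unique_items=True)

WhiskeyNames(names=['a', 'b'])  # ok
WhiskeyNames(names=['a', 'a'])  # raises ValidationError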

@WilliamDEdwards

Hi @nuno-andre,

> @WilliamDEdwards Please, could you provide a minimal reproducible example? This is currently working in 1.9.0:
>
> from typing import Any, Optional
> from pydantic import BaseModel, conlist
>
> class Model(BaseModel):
>     prop: Optional[conlist(Any, unique_items=True)]
>
> m = Model()
> print(m.prop)
> #> None

The problem is in specifying None explicitly. Model() works, Model(prop=None) doesn't, with this error.

@nuno-andre (Contributor, Author) commented Oct 16, 2022

@WilliamDEdwards Thank you for raising this issue!

I've created the issue (#4626) and a PR (#4627).
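
The contents of #4627 aren't shown here, but given the traceback above, one plausible shape for the fix is a None guard in the validator (purely a guess, not the actual patch):

@classmethod
def unique_items_validator(cls, v: 'Optional[List[T]]') -> 'Optional[List[T]]':
    if v is not None and cls.unique_items:
        # skip the scan entirely for explicit None values
        for i, value in enumerate(v, start=1):
            if value in v[i:]:
                raise errors.ListUniqueItemsError()
    return v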

Development

Successfully merging this pull request may close these issues:

  • Sequence constraint to enforce only unique items (#2011)