feat: Add unique items validation to constrained lists #2618
Conversation
Seems like a good start, but needs documentation, fixes and some tweaks.

We need to decide (what do others think?) whether we should support uniqueness checks for things that can't be hashed but can be compared for equality - e.g. `list`, `dict`, `set`, etc.
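For context, a minimal illustration (not from the thread) of why a hash-based uniqueness check excludes those types: `set()` needs hashable items, while equality comparison does not.

```python
# Hashable items: a set-based length comparison detects duplicates.
v = [1, 2, 2]
print(len(set(v)) != len(v))  # True -> duplicates present

# Unhashable items: set() raises before any check can run, even though
# the values compare equal with ==.
try:
    set([{'a': 1}, {'a': 1}])
except TypeError as exc:
    print(exc)  # unhashable type: 'dict'
print({'a': 1} == {'a': 1})  # True
```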
pydantic/types.py
Outdated

```python
@classmethod
def unique_items_validator(cls, v: 'Optional[List[T]]') -> 'Optional[List[T]]':
    if cls.unique_items and v and len(set(v)) != len(v):
        raise errors.ListUniqueItemsError(not_unique=len(v) - len(set(v)))
```
you can avoid duplicating the `len(v)` and `len(set(v))` calls.
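A minimal sketch of that suggestion (my own; it assumes the diff's class attributes and error type, and that the validator returns `v` on success, which the excerpt above truncates):

```python
@classmethod
def unique_items_validator(cls, v: 'Optional[List[T]]') -> 'Optional[List[T]]':
    if cls.unique_items and v:
        # compute both lengths once and reuse them in the error message
        length, unique_length = len(v), len(set(v))
        if unique_length != length:
            raise errors.ListUniqueItemsError(not_unique=length - unique_length)
    return v
```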
What are we going to do for lists of unhashable types?
Maybe we just decide that is a limitation of this feature, but at the least we should document it.
They are incidentally duplicated because the exception message is just a placeholder: as the specs do not state which mathematical object corresponds to this schema, there's no clearly defined concept of multiplicity. E.g. in `[1, 1, 2, 2, 2, 3]` one could understand: two items (1, 2), three items (1, 2, 2), or five items (1, 1, 2, 2, 2). I tend to see it as a set, so the first reading makes more sense to me.
I still don't know how many elements are supposed to be leftover if validation fails. Do you have an opinion on this? If not I can try to open an issue at the JSON Schema repo.
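To make the three readings concrete, a small illustration (mine, not from the PR) with `collections.Counter`:

```python
from collections import Counter

v = [1, 1, 2, 2, 2, 3]
counts = Counter(v)  # {1: 2, 2: 3, 3: 1}

duplicated_values = [x for x, c in counts.items() if c > 1]  # [1, 2]          -> "two items"
surplus_copies = len(v) - len(counts)                        # 6 - 3 = 3       -> "three items"
members_of_duplicates = [x for x in v if counts[x] > 1]      # [1, 1, 2, 2, 2] -> "five items"
```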
But in this case it would also be a micro-optimization, since the second calls are only made on exceptions, and this would save two `STORE_FAST` and two `LOAD_FAST` in a successful validation.
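One way to inspect that claim (my own sketch; function names are mine) is to compare the bytecode of the two variants with `dis`. The hoisted version pays two extra `STORE_FAST`/`LOAD_FAST` pairs on the success path:

```python
import dis

def recomputed(v):
    # lengths recomputed only when the error is raised
    if len(set(v)) != len(v):
        raise ValueError(len(v) - len(set(v)))

def hoisted(v):
    # lengths stored in locals up front, loaded again for the comparison
    length, unique_length = len(v), len(set(v))
    if unique_length != length:
        raise ValueError(length - unique_length)

dis.dis(recomputed)
dis.dis(hoisted)  # shows the extra STORE_FAST/LOAD_FAST pairs
```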
We can add a failover for unhashable types, e.g. something like (with the appropriate exceptions):

```python
try:
    if v and len(set(v)) != len(v):
        raise ValueError
except TypeError:
    unique = list()
    if len([unique.append(i) for i in v if i not in unique]) != len(v):
        raise ValueError from None
```
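For illustration (my own snippet, not from the PR), the fallback path on unhashable items:

```python
v = [{'a': 1}, {'b': 2}, {'a': 1}]  # dicts are unhashable but support ==

# set(v) would raise TypeError here, so the equality-based path runs:
unique = []
[unique.append(i) for i in v if i not in unique]
print(len(unique), len(v))  # 2 3 -> duplicates detected without hashing
```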
I can benchmark some options. IMHO, as in the previous case, I think it is worth optimizing the check at the expense of a more expensive validation error. What do you think?
The failover for unhashable types seems too tricky for mypy 😅 Would you accept a `type: ignore` if it turns out to be the most performant?
There's currently no instance-based output structure for validation errors (only schema-based). So, if it's okay with you, I'll continue with a generic error message to move on to the unhashable types validation.
please update.
Will this be merged without support for unhashable types (for now)? I'd love to use this with lists.
please review! @WilliamDEdwards it's already mergeable for hashable and unhashable types 😉
Ah, I should've read the full discussion. Can't wait :)
- add failover for unhashable types
- check keyword value to call the validator
- add some tests
Does this PR need any work? I'd contribute, but it looks pretty finished.
a few comments, let me know what you think.

The "hash or not to hash" question is a difficult one. I'm not sure what I think about it.
pydantic/types.py
Outdated

```python
def conlist(item_type: Type[T], *, min_items: int = None, max_items: int = None) -> Type[List[T]]:

if v and len([unique.append(i) for i in v if i not in unique]) != len(v):  # type: ignore
```
modifying `unique` inside a comprehension is somewhat unorthodox; also (if we don't care about how many items are not unique, which will mostly be the case) we can do this faster by raising the error as soon as we find a duplicate. Something like:

```python
for i, value in enumerate(v, start=1):
    if value in v[i:]:
        raise errors.ListUniqueItemsError()
```

I haven't done any profiling but I feel something like this should be quicker.
I've just profiled it and you're totally right about performance. Also, the difference between the `set` and the loop checks is almost negligible (even when the duplicate is at the end of the iterable).

I also agree that most of the time we don't care about how many items aren't unique, and furthermore, this calculation reduces performance. So I'm going to refactor this PR and push it again.
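A rough way to reproduce that profiling (my own sketch; absolute numbers will vary by machine and by where the duplicate sits):

```python
import timeit

v = list(range(300)) + [0]  # duplicate at the very end: worst case for the loop

def set_check():
    return len(set(v)) != len(v)

def loop_check():
    for i, value in enumerate(v, start=1):
        if value in v[i:]:
            return True
    return False

print('set: ', timeit.timeit(set_check, number=1_000))
print('loop:', timeit.timeit(loop_check, number=1_000))
```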
please update.
coverage is failing.
Yes, I'm sorry. I mixed up different checks. But this issue in
thanks so much.
@samuelcolvin When are you planning to release a new version which includes this patch?
For future readers: this has been included in v1.9.0.
I think this may not be working well with optional lists:
Haven't done any further investigation yet, so not 100% sure this is because of
@WilliamDEdwards Please, could you provide a minimal reproducible example? This is currently working in 1.9.0:

```python
from typing import Any, Optional

from pydantic import BaseModel, conlist

class Model(BaseModel):
    prop: Optional[conlist(Any, unique_items=True)]

m = Model()
print(m.prop)
#> None
```
@nuno-andre Sorry for not being more specific yesterday. I hadn't been able to make an MRE, but wanted to report the issue (if it is an issue, I don't exclude PEBKAC). Here's the MRE. It's worth noting that I use

Code:

Example:
Guys, I want to make my schema columns unique, with no duplicates: `Class whiskey_dantic(BaseModel):` I am getting this error:
Hi @nuno-andre,
The problem is in specifying
@WilliamDEdwards Thank you for raising this issue!
Change Summary

Add `unique_items` property and validation to `ConstrainedList` (and `conlist`).

Related issue number

fix #2011

Checklist

- `changes/<pull request or issue id>-<github username>.md` file added describing change (see changes/README.md for details)

Rationale

This proposal solves two related issues:

1. `{"type": "array", "uniqueItems": true}` schemas being interpreted as a Python set.

   JSON Schema specs define `array` as an ordered sequence of zero or more values, and its instance equality as "both are arrays, and have an equal value item-for-item". So both casting JSON arrays to Python sets and defining Python sets as JSON arrays in the schema can lead to validation errors and/or data corruption.

2. `conset` used instead of `uniqueItems`.

   In addition to the ordering issue, `ConstrainedSet` casts arrays to Python sets with no previous validation of item uniqueness, which is fully consistent with the Python `set` behavior, but opposite to that of the schema with which it is equated.

A `unique_items` validation in `ConstrainedList` would solve both issues while keeping the current `conset` behavior untouched.

This PR is a POC and lacks the proper tests and docs, and also it does not replicate the property into other data models (`FieldInfo`?). If you consider this proposal relevant, I will be glad to complete it for evaluation.

To be evaluated

- Implementation follows the specs, with one exception: the default value is `None` instead of `False`, so that `"uniqueItems": false` will not be added to the schema unless explicitly stated.
- In the feature request the property name is `unique`. This PR named it `unique_items` for consistency with `max_items` and `min_items`.
- Current error message is a placeholder. As specs do not state which mathematical object corresponds to this schema, there's no clearly defined concept of multiplicity. E.g. in `[1, 1, 2, 2, 2, 3]` one could understand: two items (1, 2), three items (1, 2, 2), or five items (1, 1, 2, 2, 2). I tend to see it as a set, so the first statement makes more sense to me.
one could understand: two items (1, 2), three items (1, 2, 2), or five items (1, 1, 2, 2, 2). I tend to see it as a set, so the first statement makes more sense to me.