Enhanced `st.from_regex()` strategy with `alphabet=...` argument and filter-rewriting #3479

Zac-HD · 2022-10-17T02:23:49Z

Currently, the st.from_regex() strategy produces strings that might include any Unicode character (or byte, for bytestrings). However, this can be frustrating if you want strings restricted to some subset of codepoints, e.g. those in a particular encoding (#1664).

I therefore propose adding an alphabet=... argument, which accepts a collection of length-one strings, or a sampled_from() strategy of the same (or st.characters(), for Unicode strings). All generated characters must then be valid according to this set. If there are no matches for the pattern which also satisfy the alphabet restriction, an error should be raised; if satisfying it only requires dropping some arms of an alternation or subsets of character sets, that's OK.

(I considered allowing out-of-alphabet literals etc., but that would violate the invariant we need for good encoding support and also make filter-rewriting with regex intersection work differently. Better to be restrictive but consistent.)
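As a rough illustration of the invariant proposed above, here is a minimal standard-library sketch. The helper name `satisfies_alphabet` is hypothetical and is not part of Hypothesis; it only checks the property a generated string would need to have, namely that it matches the pattern and that every character comes from the alphabet.

```python
import re


def satisfies_alphabet(pattern: str, alphabet: str, candidate: str) -> bool:
    """Hypothetical helper: True if `candidate` matches `pattern`
    (search semantics) and uses only characters from `alphabet`."""
    allowed = set(alphabet)
    matches = re.search(pattern, candidate) is not None
    return matches and all(ch in allowed for ch in candidate)


ascii_lower = "abcdefghijklmnopqrstuvwxyz"

# Both strings match the pattern, but only the first stays inside the
# alphabet; the "é" member of the character set would have to be dropped.
print(satisfies_alphabet(r"^[a-zé]+$", ascii_lower, "cafe"))  # True
print(satisfies_alphabet(r"^[a-zé]+$", ascii_lower, "café"))  # False
```

A pattern such as `^é+$` would have no matches at all under this alphabet, which is the case where the proposal says an error should be raised.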

Once we've got that working, it should be feasible to complete the last #2701-style filter rewriting tricks:

- Support st.text()/st.binary() filtered with re.compile(...).search/match/fullmatch
- Support st.from_regex() with the same filters - note that this will require some upstream work in greenery.
See also "Use greenery and regex_transformer to merge pattern and patternProperties keywords" (python-jsonschema/hypothesis-jsonschema#85).
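To make the filter-rewriting idea concrete: a filter predicate that is a bound method of a compiled pattern carries enough information to recover the pattern itself. The sketch below uses only the standard library; `describe_regex_filter` is a hypothetical name, and this is an assumption about how such a predicate could be recognised, not a description of Hypothesis's actual implementation.

```python
import re

# Method names of re.Pattern that act as "does this string match?" predicates.
REWRITABLE_METHODS = {"search", "match", "fullmatch"}


def describe_regex_filter(predicate):
    """Hypothetical helper: return (pattern, method_name) if `predicate` is a
    bound search/match/fullmatch method of a compiled pattern, else None."""
    owner = getattr(predicate, "__self__", None)
    name = getattr(predicate, "__name__", None)
    if isinstance(owner, re.Pattern) and name in REWRITABLE_METHODS:
        return owner.pattern, name
    return None


print(describe_regex_filter(re.compile(r"\d+").fullmatch))  # ('\\d+', 'fullmatch')
print(describe_regex_filter(str.isdigit))                   # None
```

With the pattern and method name in hand, a filtered text strategy could in principle be replaced by a regex-based one instead of generating strings and rejecting most of them.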


mristin · 2022-10-17T06:11:39Z

@Zac-HD Thanks for this feature!

Here's a use case of ours, in case you ever wonder how it might be used: this is important when you want to test XML-JSON conversions for the same specs. XML does not support the whole character range of UTF-32 (for example,  is invalid), so the JSON schema always needs to include a second pattern for XML convertibility.
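For readers unfamiliar with the XML restriction, here is a small sketch of the kind of check involved. It follows the `Char` production of the XML 1.0 specification; the function names `is_xml10_char` and `xml10_safe` are illustrative only and do not come from this thread or from any library.

```python
def is_xml10_char(ch: str) -> bool:
    """True if `ch` is allowed by the XML 1.0 `Char` production:
    #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def xml10_safe(text: str) -> bool:
    """True if every character of `text` may appear in an XML 1.0 document."""
    return all(is_xml10_char(ch) for ch in text)


print(xml10_safe("hello\n"))   # True
print(xml10_safe("bad\x00"))   # False: NUL is not a valid XML character
```

An alphabet built from this predicate is the sort of restriction that would let the same regex-derived strings be valid in both the JSON and the XML representation.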

jenstroeger · 2023-01-09T23:06:02Z

I was looking into a related scenario where I wanted to include/exclude certain Unicode scripts and codepoints. Python’s own re module doesn’t support this, but the regex package does, e.g.

```pycon
>>> regex.match(r"\p{Script=Latin}+", "שלום")
>>> regex.match(r"\p{Script=Latin}+", "hello")
<regex.Match object; span=(0, 5), match='hello'>
>>> re.match(r"\p{Script=Latin}+", "hello")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  ...
re.error: bad escape \p at position 0
```

Would it be possible/make sense to build from_regex() using the regex package to get access to more goodness?

Zac-HD · 2023-01-09T23:39:40Z

We can't require the regex package (wouldn't work for e.g. CPython or PyPy devs), but I'd love to support it in e.g. hypothesis.extra.regex_package. Also unclear how much code we can share between them, but that's an implementation concern, and I think the user confusion would be manageable.

jenstroeger · 2023-01-09T23:58:58Z

> […] but I'd love to support it in e.g. hypothesis.extra.regex_package.

That’d be perfect 👍🏼

Zac-HD · 2023-09-15T09:15:54Z

Closing this issue as we now have alphabet= support, and filter-rewriting is only nice-to-have due to both limited use-cases and implementation complexity 🙂

Zac-HD added the enhancement (it's not broken, but we want it to be better) and performance (go faster! use less memory!) labels Oct 17, 2022

Zac-HD mentioned this issue Oct 30, 2022

Check for annotated-types constraints in st.from_type(Annotated[T, ...]) #3356

Closed

Zac-HD mentioned this issue Nov 8, 2022

I think we should remove Regex annotated-types/annotated-types#9

Closed

Zac-HD mentioned this issue Feb 20, 2023

New arguments to from_schema() to constrain generated strings python-jsonschema/hypothesis-jsonschema#101

Closed

Zac-HD mentioned this issue Sep 4, 2023

Add alphabet= argument to st.from_regex() #3730

Merged

Zac-HD closed this as completed Sep 15, 2023

Zac-HD mentioned this issue Nov 19, 2023

Filter-rewriting 2: rewrite harder #3795

Closed
