Enhanced st.from_regex() strategy with alphabet=... argument and filter-rewriting #3479

Closed
Zac-HD opened this issue Oct 17, 2022 · 5 comments
Labels
enhancement (it's not broken, but we want it to be better), performance (go faster! use less memory!)

Comments

@Zac-HD
Member

Zac-HD commented Oct 17, 2022

Currently, the st.from_regex() strategy produces strings which might include any Unicode character (or byte, for bytestrings). However, this can be frustrating if you want strings restricted to some subset of codepoints, e.g. those in a particular encoding (#1664).

I therefore propose adding an alphabet=... argument, which accepts a collection of length-one strings, or a sampled_from() strategy of the same (or st.characters(), for Unicode strings). All generated characters must then belong to this set. If no match for the pattern also satisfies the alphabet restriction, an error should be raised; if satisfying it only requires dropping some arms of an alternation or subsets of character sets, that's OK.

(I considered allowing out-of-alphabet literals etc., but that would violate the invariant we need for good encoding support and also make filter-rewriting with regex intersection work differently. Better to be restrictive but consistent.)
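
The intended semantics can be illustrated with a toy model. The sketch below is not Hypothesis code: restrict_charset and restrict_alternation are hypothetical helpers that work on plain Python strings and lists of literal arms rather than on a parsed regex. It only shows the two cases described above: narrowing a character set or dropping alternation arms is acceptable, while an empty result is an error.

```python
def restrict_charset(chars, alphabet):
    """Intersect a character set with the alphabet; fail if nothing is left."""
    kept = set(chars) & set(alphabet)
    if not kept:
        raise ValueError("no character in the set is allowed by the alphabet")
    return kept


def restrict_alternation(arms, alphabet):
    """Drop literal arms that use any out-of-alphabet character."""
    allowed = set(alphabet)
    kept = [arm for arm in arms if set(arm) <= allowed]
    if not kept:
        raise ValueError("no arm of the alternation fits the alphabet")
    return kept


ascii_lower = "abcdefghijklmnopqrstuvwxyz"
print(sorted(restrict_charset("aéz", ascii_lower)))                # ['a', 'z']
print(restrict_alternation(["cat", "naïve", "dog"], ascii_lower))  # ['cat', 'dog']
```

Under this model, restrict_alternation(["naïve"], ascii_lower) raises ValueError, which mirrors the "no matches, so raise an error" rule.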

Once we've got that working, it should be feasible to complete the last #2701-style filter rewriting tricks:

Zac-HD added the enhancement and performance labels on Oct 17, 2022
@mristin
Contributor

mristin commented Oct 17, 2022

@Zac-HD Thanks for this feature!

Here's a use case of ours, in case you ever wonder how it might be used: this matters when you want to test XML-to-JSON conversions for the same specs. XML does not support the whole character range of UTF-32 (for example, � is invalid), so the JSON schema always needs to include a second pattern for XML convertibility.
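
To make the XML restriction concrete, here is a small standard-library sketch of the character set allowed by the XML 1.0 "Char" production (tab, line feed, carriage return, and three codepoint ranges). The names XML_CHAR_CLASS, XML_TEXT and is_xml_text are illustrative only and do not come from the thread or from any schema mentioned in it.

```python
import re

# XML 1.0 "Char": #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
XML_CHAR_CLASS = "\t\n\r\u0020-\ud7ff\ue000-\ufffd\U00010000-\U0010ffff"
XML_TEXT = re.compile(f"[{XML_CHAR_CLASS}]*")


def is_xml_text(s: str) -> bool:
    """True if every character of `s` may appear in an XML 1.0 document."""
    return XML_TEXT.fullmatch(s) is not None


print(is_xml_text("hello\n"))  # True
print(is_xml_text("bad\x00"))  # False: NUL is not an XML character
```

A character set like this is the kind of restriction an alphabet could express once, instead of repeating it as a second pattern in every schema.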

@jenstroeger
Contributor

I was looking into a related scenario where I wanted to include/exclude certain Unicode scripts and codepoints. Python’s own re module doesn’t support this, but the regex package does, e.g.

>>> import re, regex
>>> regex.match(r"\p{Script=Latin}+", "שלום")
>>> regex.match(r"\p{Script=Latin}+", "hello")
<regex.Match object; span=(0, 5), match='hello'>
>>> re.match(r"\p{Script=Latin}+", "hello")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  ...
re.error: bad escape \p at position 0

Would it be possible/make sense to build from_regex() using the regex package to get access to more goodness?
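
Without the regex package, a rough standard-library approximation is to look at Unicode character names. The sketch below is an assumption-laden stand-in, not an equivalent of \p{Script=Latin}: looks_latin is a hypothetical helper that treats a character as Latin when it is alphabetic and its Unicode name contains "LATIN", which can disagree with the real Script property in edge cases.

```python
import unicodedata


def looks_latin(ch: str) -> bool:
    """Crude stand-in for \\p{Script=Latin}: alphabetic with LATIN in its name."""
    return ch.isalpha() and "LATIN" in unicodedata.name(ch, "")


print(all(looks_latin(c) for c in "hello"))  # True
print(any(looks_latin(c) for c in "שלום"))   # False
```

A set of characters selected by such a predicate could then serve as an explicit alphabet.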

@Zac-HD
Member Author

Zac-HD commented Jan 9, 2023

We can't require the regex package (that wouldn't work for e.g. CPython or PyPy devs), but I'd love to support it in e.g. hypothesis.extra.regex_package. It's also unclear how much code we could share between the two, but that's an implementation concern, and I think the user confusion would be manageable.

@jenstroeger
Contributor

[…] but I'd love to support it in e.g. hypothesis.extra.regex_package.

That’d be perfect 👍🏼

@Zac-HD
Member Author

Zac-HD commented Sep 15, 2023

Closing this issue as we now have alphabet= support, and filter-rewriting is only nice-to-have due to both limited use-cases and implementation complexity 🙂
