hypothesis.extra.numpy only generates strings of length at most one #2229

Closed
DRMacIver opened this issue Nov 25, 2019 · 5 comments · Fixed by #2245
Labels
legibility make errors helpful and Hypothesis grokable

Comments

@DRMacIver
Member

DRMacIver commented Nov 25, 2019

For reasons I have not fully determined, if you run the following:

HYPOTHESIS_DO_NOT_ESCALATE=true python -m pytest tests/numpy/test_gen_data.py -ktest_may_not_fill_with_non_nan_when_unique_is_set_and_type_is_not_number

You get the following error:

hypothesis.errors.HypothesisDeprecationWarning: Generated array element '\U000eb1120' from from_dtype(dtype('<U')) cannot be represented as dtype dtype('<U') - instead it becomes '\U000eb112' (type <class 'numpy.str_'>).  Consider using a more precise strategy, for example passing the `width` argument to `floats()`, as this will be an error in a future version.

The confusing part is not that this code fails with HYPOTHESIS_DO_NOT_ESCALATE set, but that it passes without it, because our code for this is all wrong.

The reason for this is that 'U' is something of a lie of a dtype. Consider the following code:

>>> import numpy as np
>>> x = np.zeros(shape=1, dtype='U')
>>> x
array([''], dtype='<U1')
>>> x[0] = 'foo'
>>> x
array(['f'], dtype='<U1')

The 'U' dtype is actually a family of dtypes, each of bounded width. When you create an array of unicode objects there's an implicit fixed-size limit on every element. As we create our arrays using np.zeros, all unicode we generate is implicitly truncated to elements of length at most one.

The same issue presumably exists with byte strings.
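
A quick way to check both claims, assuming nothing beyond NumPy itself: the unsized 'U' and 'S' dtypes report an item size of zero, np.zeros promotes them to width one, and assignment then truncates silently. The variable names here are illustrative only.

```python
import numpy as np

# Unsized string dtypes carry no storage width of their own.
unsized_unicode = np.dtype("U")
unsized_bytes = np.dtype("S")
print(unsized_unicode.itemsize, unsized_bytes.itemsize)

# np.zeros has to pick a concrete width, and it picks one character / one byte.
u = np.zeros(shape=1, dtype="U")
s = np.zeros(shape=1, dtype="S")
print(u.dtype, s.dtype)

# Assigning a longer value does not raise; it is cut down to fit.
u[0] = "foo"
s[0] = b"foo"
print(repr(u[0]), repr(s[0]))

# An explicitly sized dtype keeps the whole value.
wide = np.zeros(shape=1, dtype="U3")
wide[0] = "foo"
print(repr(wide[0]))
```

If this behaves as expected, byte strings are affected in exactly the same way as unicode strings.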

You can see this more directly by the fact that the following test passes but emits a pile of deprecation warnings:

from hypothesis import given
from hypothesis.extra.numpy import arrays


@given(arrays(shape=100, dtype='U'))
def test_short(x):
    # Passes: every generated element has been truncated to length <= 1.
    assert all(len(t) <= 1 for t in x)

@aarchiba
Contributor

It's also worth looking out for trouble with python2 versus python3 here.

@DRMacIver
Member Author

> It's also worth looking out for trouble with python2 versus python3 here.

True! I've only tested this on Python 3. Though given that we're in the dying days of Python 2 support, if it presents much trouble we may just want to wait on fixing this until January...

@Zac-HD
Member

Zac-HD commented Nov 25, 2019

Related to #2085... I'd probably just deprecate all usage of unsized string dtypes, have from_dtype treat unsized as size one, and be done with it.
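
As a rough sketch of what "treat unsized as size one" could look like, assuming a hypothetical helper called normalise_string_dtype (this name and the warning text are invented for illustration and are not Hypothesis's actual API or the fix that was eventually merged):

```python
import warnings

import numpy as np


def normalise_string_dtype(dtype):
    """Hypothetical helper: map unsized 'U'/'S' dtypes to width one."""
    dtype = np.dtype(dtype)
    # Unsized string dtypes are the ones with kind 'U' or 'S' and itemsize 0.
    if dtype.kind in ("U", "S") and dtype.itemsize == 0:
        warnings.warn(
            f"Unsized string dtype {dtype!r} is treated as width one; "
            f"pass an explicit width such as '{dtype.kind}1' instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return np.dtype(f"{dtype.kind}1")
    # Sized string dtypes and all non-string dtypes pass through unchanged.
    return dtype
```

Under this sketch, normalise_string_dtype('U') would warn and return the same '<U1' dtype that np.zeros already produces, while 'U5' or 'float64' would be returned untouched.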

Not sure how that's interacting with DO_NOT_ESCALATE though.

Zac-HD added the legibility (make errors helpful and Hypothesis grokable) label on Nov 25, 2019
@Zac-HD
Member

Zac-HD commented Nov 25, 2019

Alternatively we could add special handling for string arrays, to fill them differently, but I'd rather not.

@DRMacIver
Member Author

> Alternatively we could add special handling for string arrays, to fill them differently, but I'd rather not.

It wouldn't be super hard to do. We could generate string arrays as object arrays, then convert to the right dtype at the end of generation.
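
A minimal sketch of that idea, assuming a hypothetical helper named fill_unicode_array (not part of Hypothesis): stage the values in an object array, where nothing is truncated, and only convert to a unicode dtype once every element is known.

```python
import numpy as np


def fill_unicode_array(values):
    """Hypothetical helper: stage strings as objects, then convert."""
    # An object array stores references, so no element is truncated here.
    staging = np.empty(len(values), dtype=object)
    for i, value in enumerate(values):
        staging[i] = value
    # Casting to the unsized 'U' dtype lets NumPy choose a width that is
    # large enough for the longest element.
    return staging.astype("U")


result = fill_unicode_array(["foo", "a", ""])
print(result.dtype, result.tolist())
```

The trade-off is that the final width depends on the generated data rather than on the requested dtype, so a caller asking for a specific sized dtype would still need separate handling.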
