`hypothesis.extra.numpy` only generates strings of length at most one #2229

DRMacIver · 2019-11-25T15:11:34Z

For reasons I have not fully determined, if you run the following:

HYPOTHESIS_DO_NOT_ESCALATE=true python -m pytest tests/numpy/test_gen_data.py -ktest_may_not_fill_with_non_nan_when_unique_is_set_and_type_is_not_number

You get the following error:

hypothesis.errors.HypothesisDeprecationWarning: Generated array element '\U000eb1120' from from_dtype(dtype('<U')) cannot be represented as dtype dtype('<U') - instead it becomes '\U000eb112' (type <class 'numpy.str_'>).  Consider using a more precise strategy, for example passing the `width` argument to `floats()`, as this will be an error in a future version.

The confusing part is not that this code fails with HYPOTHESIS_DO_NOT_ESCALATE set, but that it doesn't fail without it, because our code for this is all wrong.

The reason for this is that 'U' is something of a lie of a dtype. Consider the following code:

>>> import numpy as np
>>> x = np.zeros(shape=1, dtype='U')
>>> x
array([''], dtype='<U1')
>>> x[0] = 'foo'
>>> x
array(['f'], dtype='<U1')

The 'U' dtype is actually a family of dtypes, each of bounded width. When you create an array of unicode objects there's an implicit fixed-size limit on every element. As we create our arrays using np.zeros, this results in all unicode we generate being implicitly truncated to elements of size one.

The same issue presumably exists with byte strings.
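A quick way to check the byte-string case is to repeat the experiment above with the unsized 'S' dtype. This is only a sketch of that check, not something taken from the Hypothesis code base, and the variable name `y` is just for illustration:

```python
import numpy as np

# Same experiment as above, but with the unsized bytes dtype 'S'.
# np.zeros has to pick a concrete item size, and it picks one byte.
y = np.zeros(shape=1, dtype='S')
print(y.dtype)

# Assigning a longer value does not raise; it is cut to fit the slot.
y[0] = b'foo'
print(y)
```

If the array ends up holding only the first byte, the same truncation applies to byte strings as to unicode.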

You can see this more directly by the fact that the following test passes but emits a pile of deprecation warnings:

from hypothesis import given
from hypothesis.extra.numpy import arrays

@given(arrays(shape=100, dtype='U'))
def test_short(x):
    assert all(len(t) <= 1 for t in x)


aarchiba · 2019-11-25T15:56:27Z

It's also worth looking out for trouble with python2 versus python3 here.

DRMacIver · 2019-11-25T16:08:49Z

> It's also worth looking out for trouble with python2 versus python3 here.

True! I've only tested this on Python 3. Though given that we're in the dying days of Python 2 support, if it presents much trouble we may just want to wait until January to fix this...

Zac-HD · 2019-11-25T20:51:46Z

Related to #2085... I'd probably just deprecate all usage of unsized string dtypes, have from_dtype treat unsized as size one, and be done with it.

Not sure how that's interacting with DO_NOT_ESCALATE though.

Zac-HD · 2019-11-25T20:53:42Z

Alternatively we could add special handling for string arrays, to fill them differently, but I'd rather not.

DRMacIver · 2019-11-25T21:41:29Z

> Alternatively we could add special handling for string arrays, to fill them differently, but I'd rather not.

It wouldn't be super hard to do. We could generate string arrays as object arrays, then convert to the right dtype at the end of generation.
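A minimal sketch of that idea, assuming plain NumPy and nothing from Hypothesis itself. The helper name `strings_to_array` and its `kind` parameter are hypothetical, made up for this illustration, and are not part of `hypothesis.extra.numpy`:

```python
import numpy as np


def strings_to_array(values, kind='U'):
    """Hypothetical helper: fill an object array, then convert at the end."""
    # An object array stores references, so no element is truncated here.
    obj = np.empty(len(values), dtype=object)
    obj[:] = values
    # Converting to the unsized dtype lets NumPy choose a width that is
    # large enough for the longest element.
    return obj.astype(kind)


arr = strings_to_array(['foo', 'ab', ''])
print(arr.dtype)
print(arr.tolist())
```

The point of the design is that the width is decided once, after all elements are known, instead of being fixed at one by np.zeros before anything has been generated.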

Zac-HD added the "legibility" label (make errors helpful and Hypothesis grokable) Nov 25, 2019

Zac-HD mentioned this issue Nov 30, 2019

Generate longer strings for unsized dtypes #2245

Merged

Zac-HD closed this as completed in #2245 Dec 1, 2019
