BUG: Fix timedelta*float and median/percentile/quantile NaT handling #21726

Closed · wants to merge 4 commits
28 changes: 16 additions & 12 deletions numpy/core/src/umath/loops.c.src
@@ -1124,15 +1124,17 @@ TIMEDELTA_md_m_multiply(char **args, npy_intp const *dimensions, npy_intp const
     BINARY_LOOP {
         const npy_timedelta in1 = *(npy_timedelta *)ip1;
         const double in2 = *(double *)ip2;
-        if (in1 == NPY_DATETIME_NAT) {
+        if (in1 == NPY_DATETIME_NAT || npy_isnan(in2)) {
             *((npy_timedelta *)op1) = NPY_DATETIME_NAT;
         }
         else {
-            double result = in1 * in2;
-            if (npy_isfinite(result)) {
-                *((npy_timedelta *)op1) = (npy_timedelta)result;
-            }
-            else {
+            /* `nearbyint` avoids warnings (should not matter here, though) */
+            double result = nearbyint(in1 * in2);
Contributor: Is `nearbyint` similar to Python's `round`?

Member (author): Roughly speaking, yes, although it rounds C-style. It is really the same thing as `np.rint`, except for whether a floating-point warning is set (which should not really matter here; if a warning were set, it would be the one we set anyway, but...).
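As a side note on the rounding question above, a quick illustrative sketch (not part of the patch): `np.rint` and Python 3's `round` both resolve .5 ties to the nearest even integer, matching C's `nearbyint` under the default rounding mode.

```python
import numpy as np

# Both np.rint and Python 3's round() use round-half-to-even ("banker's
# rounding") for .5 ties; the visible difference is that round() on a
# float returns an int, while np.rint returns a float.
for x in (0.5, 1.5, 2.5, -1.5):
    print(x, np.rint(x), round(x))
```

So for halfway cases the two agree; they only differ from C in return type, not in the chosen rounding.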

+            npy_timedelta int_res = (npy_timedelta)result;
+            *((npy_timedelta *)op1) = int_res;
+            if ((double)int_res != result || int_res == NPY_DATETIME_NAT) {
+                /* Conversion creates a new NaT from non-NaN/NaT input */
                 npy_set_floatstatus_invalid();
                 *((npy_timedelta *)op1) = NPY_DATETIME_NAT;
             }
         }
@@ -1145,15 +1147,17 @@ TIMEDELTA_dm_m_multiply(char **args, npy_intp const *dimensions, npy_intp const
     BINARY_LOOP {
         const double in1 = *(double *)ip1;
         const npy_timedelta in2 = *(npy_timedelta *)ip2;
-        if (in2 == NPY_DATETIME_NAT) {
+        if (in2 == NPY_DATETIME_NAT || npy_isnan(in1)) {
             *((npy_timedelta *)op1) = NPY_DATETIME_NAT;
         }
         else {
-            double result = in1 * in2;
-            if (npy_isfinite(result)) {
-                *((npy_timedelta *)op1) = (npy_timedelta)result;
-            }
-            else {
+            /* `nearbyint` avoids warnings (should not matter here, though) */
+            double result = nearbyint(in1 * in2);
Member (author): I am not sure about this, in particular whether it is actually enough.

The input is effectively int64, so by casting it to double we throw away precision. This means that:

In [6]: np.timedelta64(2**58 + 7, "us") - np.timedelta64(2**58 + 7, "us") * 1.
Out[6]: numpy.timedelta64(7,'us')

Whoops. long double could fix this where available (extended precision has a 64-bit mantissa), but it does not always exist...

So maybe expecting this rounding to be correct is too much, or maybe we should consider the above a serious bug?

Contributor:

> So maybe expecting this rounding to be correct is too much, or maybe we should consider the above a serious bug?

I think this should be considered a bug if and only if the analogous behavior with plain int64 is considered a bug:

td64 = np.timedelta64(2**58 + 7, "us")
val = td64.view("i8")

>>> val - int(val*1.0)
7

Member (author): Yes, it is the same thing. But int() always truncates too (i.e. no correct rounding), and there you call int() explicitly; here we call it internally, hidden from the user, which effectively breaks the time precision for very large values.

But yeah, maybe it is just what it is. As Chuck keeps saying, maybe we really need a high-precision floating-point time format (double-double, or whatever)...
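The precision loss discussed above can be reproduced directly. This is an illustrative sketch of the underlying issue, not part of the diff:

```python
import numpy as np

val = 2**58 + 7
# doubles in [2**58, 2**59) are spaced 2**6 = 64 apart, so converting
# this int64 value to double silently drops the trailing +7
assert float(val) == 2**58
assert val - int(float(val)) == 7

# the same loss occurs when a timedelta64 is multiplied by a float,
# because the ufunc loop casts the int64 ticks to double internally
td = np.timedelta64(val, "us")
print(td - td * 1.0)  # 7 microseconds
```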

+            npy_timedelta int_res = (npy_timedelta)result;
+            *((npy_timedelta *)op1) = int_res;
+            if ((double)int_res != result || int_res == NPY_DATETIME_NAT) {
+                /* Conversion creates a new NaT from non-NaN/NaT input */
                 npy_set_floatstatus_invalid();
                 *((npy_timedelta *)op1) = NPY_DATETIME_NAT;
             }
         }
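For context, a user-level sketch of the behavior these loops implement: a NaN factor has no meaningful integer result, so the product becomes NaT (as any recent NumPy already does, with or without this patch):

```python
import numpy as np

td = np.timedelta64(10, "s")
# multiplying a timedelta64 by NaN cannot produce a valid tick count,
# so the result is NaT; errstate silences the possible invalid warning
with np.errstate(invalid="ignore"):
    res = td * np.float64("nan")
print(res)  # NaT
```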
46 changes: 27 additions & 19 deletions numpy/lib/function_base.py
@@ -3836,8 +3836,10 @@ def _median(a, axis=None, out=None, overwrite_input=False):
         kth = [szh - 1, szh]
     else:
         kth = [(sz - 1) // 2]
-    # Check if the array contains any nan's
-    if np.issubdtype(a.dtype, np.inexact):
+
+    # Check if the array contains any nan's or NaT's (unordered values)
+    supports_nans = np.issubdtype(a.dtype, np.inexact) or a.dtype.kind == 'm'
Contributor: If you want to support dt64 here, something like:

if a.dtype.kind == "M":
    td_res = _median(a.view("m8"), ...)
    return td_res.view(a.dtype)

?

Member (author): Yes, we could. I won't make it a priority, though (i.e. we can follow up if we want these changes and allow median for datetimes). We could also replace the call to mean() with a calculation that does a + (b - a) / 2 (or * 0.5), for example.
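The suggested alternative to mean() can be sketched as follows (illustrative only; the scalars `a` and `b` here are hypothetical, not the PR's variables):

```python
import numpy as np

# Midpoint of two datetime64 values without calling mean(), which does
# not support datetime64: the difference is a timedelta64, which can be
# halved and added back onto the first value.
a = np.datetime64("2010-01-01")
b = np.datetime64("2012-01-01")
mid = a + (b - a) / 2
print(mid)  # 2011-01-01
```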

+    if supports_nans:
         kth.append(-1)

     if overwrite_input:
@@ -3869,7 +3871,7 @@
         # using out array if needed.
         rout = mean(part[indexer], axis=axis, out=out)
         # Check if the array contains any nan's
-        if np.issubdtype(a.dtype, np.inexact) and sz > 0:
+        if supports_nans and sz > 0:
             # If nans are possible, warn and replace by nans like mean would.
             rout = np.lib.utils._median_nancheck(part, rout, axis)

@@ -4655,11 +4657,16 @@ def _quantile(
     # --- Setup
     arr = np.asanyarray(arr)
    values_count = arr.shape[axis]
+
+    # Check if the array contains any nan's or NaT's (unordered values)
+    supports_nans = (
+        np.issubdtype(arr.dtype, np.inexact) or arr.dtype.kind in 'Mm')
+
     # The dimensions of `q` are prepended to the output shape, so we need the
     # axis being sampled from `arr` to be last.
-    DATA_AXIS = 0
-    if axis != DATA_AXIS: # But moveaxis is slow, so only call it if axis!=0.
-        arr = np.moveaxis(arr, axis, destination=DATA_AXIS)
-
+    if axis != 0: # But moveaxis is slow, so only call it if necessary.
+        arr = np.moveaxis(arr, axis, destination=0)
     # --- Computation of indexes
     # Index where to find the value in the sorted array.
     # Virtual because it is a floating point value, not an valid index.
@@ -4674,14 +4681,14 @@
     virtual_indexes = np.asanyarray(virtual_indexes)
     if np.issubdtype(virtual_indexes.dtype, np.integer):
         # No interpolation needed, take the points along axis
-        if np.issubdtype(arr.dtype, np.inexact):
+        if supports_nans:
             # may contain nan, which would sort to the end
             arr.partition(concatenate((virtual_indexes.ravel(), [-1])), axis=0)
-            slices_having_nans = np.isnan(arr[-1])
+            slices_having_nans = np.isnan(arr[-1, ...])
         else:
             # cannot contain nan
             arr.partition(virtual_indexes.ravel(), axis=0)
-            slices_having_nans = np.array(False, dtype=bool)
+            slices_having_nans = False
         result = take(arr, virtual_indexes, axis=0, out=out)
     else:
         previous_indexes, next_indexes = _get_indexes(arr,
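As background for the branch above: NaN (and NaT) sort past every ordinary value in NumPy, so adding the last index to the partition is enough to detect them afterwards with a single check. A minimal sketch, not part of the diff:

```python
import numpy as np

# NaN sorts past every finite value, so after partitioning on the last
# index it is guaranteed to sit at arr[-1] if present anywhere
arr = np.array([3.0, np.nan, 1.0, 2.0])
arr.partition([1, arr.size - 1])
assert np.isnan(arr[-1])

# NaT behaves the same for timedelta64 (NaT sorts to the end)
tds = np.arange(3, dtype="m8[s]")
tds[1] = "NaT"
tds.sort()
assert np.isnat(tds[-1])
```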
@@ -4693,16 +4700,14 @@
             previous_indexes.ravel(),
             next_indexes.ravel(),
             ))),
-            axis=DATA_AXIS)
-        if np.issubdtype(arr.dtype, np.inexact):
-            slices_having_nans = np.isnan(
-                take(arr, indices=-1, axis=DATA_AXIS)
-            )
+            axis=0)
+        if supports_nans:
+            slices_having_nans = np.isnan(arr[-1, ...])
         else:
-            slices_having_nans = None
+            slices_having_nans = False
         # --- Get values from indexes
-        previous = np.take(arr, previous_indexes, axis=DATA_AXIS)
-        next = np.take(arr, next_indexes, axis=DATA_AXIS)
+        previous = arr[previous_indexes]
+        next = arr[next_indexes]
     # --- Linear interpolation
     gamma = _get_gamma(virtual_indexes, previous_indexes, method)
     result_shape = virtual_indexes.shape + (1,) * (arr.ndim - 1)
@@ -4711,10 +4716,13 @@
                              next,
                              gamma,
                              out=out)
-    if np.any(slices_having_nans):
+
+    if slices_having_nans is not False and slices_having_nans.any():
         if result.ndim == 0 and out is None:
             # can't write to a scalar
-            result = arr.dtype.type(np.nan)
+            result = arr[-1]
+        elif result.dtype.kind in 'mM':
+            result[..., slices_having_nans] = "NaT"
         else:
             result[..., slices_having_nans] = np.nan
     return result
70 changes: 69 additions & 1 deletion numpy/lib/tests/test_function_base.py
@@ -2907,6 +2907,16 @@ def test_basic(self):
         assert_equal(np.percentile(x, 0), np.nan)
         assert_equal(np.percentile(x, 0, method='nearest'), np.nan)

+    def test_basic_times(self):
+        # Quantiles have more extensive tests (rounding may be problematic)
+        res = np.percentile(np.array(["2010", "2012"], dtype="M8[s]"), 50)
+        assert res.dtype == "M8[s]"
+        assert res == np.datetime64("2011", "s")
+
+        res = np.percentile(np.array([1, 3], dtype="m8[D]"), 50)
+        assert res.dtype == "m8[D]"
+        assert res == np.timedelta64(2, "D")
+
     def test_fraction(self):
         x = [Fraction(i, 2) for i in range(8)]
@@ -2995,6 +3005,28 @@ def test_linear_interpolation(self,
         np.testing.assert_equal(np.asarray(actual).dtype,
                                 np.dtype(expected_dtype))

+    @pytest.mark.parametrize("dtype", ["M8[s]", "m8[ns]"])
+    @pytest.mark.parametrize("method", [
+        'inverted_cdf', 'averaged_inverted_cdf', 'closest_observation',
+        'interpolated_inverted_cdf', 'hazen', 'weibull', 'linear',
+        'median_unbiased', 'normal_unbiased',
+        'nearest', 'lower', 'higher', 'midpoint'])
+    def test_linear_interpolation_times(self, method, dtype):
+        arr = np.zeros(1, dtype=dtype)
+        td_dtype = dtype.lower()  # use timedelta64 to add to the datetime64
+
+        float_arr = np.array([15.0, 20.0, 35.0, 40.0, 50.0]) * 1000
+        arr = arr + float_arr.astype(td_dtype)
+        actual = np.percentile(arr, 40.0, method=method)
+        assert actual.dtype == dtype
+
+        expected_float = np.percentile(float_arr, 40.0, method=method)
+        diff = abs(actual - expected_float.astype(dtype)).astype(np.int64)
+        if diff == 1:
+            # TODO: Is this an OK rounding error? Use xfail in case it is not.
+            pytest.xfail()
+        assert actual == expected_float.astype(dtype)
+
     TYPE_CODES = np.typecodes["AllInteger"] + np.typecodes["AllFloat"] + "O"

     @pytest.mark.parametrize("dtype", TYPE_CODES)
@@ -3390,6 +3422,24 @@ def test_nan_q(self):
         with pytest.raises(ValueError, match="Percentiles must be in"):
             np.percentile([1, 2, 3, 4.0], q)

+    @pytest.mark.parametrize("dtype", ["m8[D]", "M8[s]"])
+    @pytest.mark.parametrize("pos", [0, 23, 10])
+    def test_nat_basic(self, dtype, pos):
+        # NaT and NaN should behave the same, do basic tests for NaT:
+        a = np.arange(0, 24, dtype=dtype)
+        a[pos] = "NaT"
+        res = np.percentile(a, 30)
+        assert res.dtype == dtype
+        assert np.isnat(res)
+        res = np.percentile(a, [30, 60])
+        assert res.dtype == dtype
+        assert np.isnat(res).all()
+
+        a = np.arange(0, 24*3, dtype=dtype).reshape(-1, 3)
+        a[pos, 1] = "NaT"
+        res = np.percentile(a, 30, axis=0)
+        assert_array_equal(np.isnat(res), [False, True, False])
+
+
 class TestQuantile:
     # most of this is already tested by TestPercentile
@@ -3408,7 +3458,6 @@ def test_basic(self):
         assert_equal(np.quantile(x, 1), 3.5)
         assert_equal(np.quantile(x, 0.5), 1.75)

-    @pytest.mark.xfail(reason="See gh-19154")
     def test_correct_quantile_value(self):
         a = np.array([True])
         tf_quant = np.quantile(True, False)
@@ -3710,6 +3759,25 @@ def test_nan_behavior(self):
         b[2] = np.nan
         assert_equal(np.median(a, (0, 2)), b)

+    @pytest.mark.parametrize("dtype", ["m8[s]"])
+    @pytest.mark.parametrize("pos", [0, 23, 10])
+    def test_nat_behavior(self, dtype, pos):
+        # TODO: Median does not support Datetime, due to `mean`.
+        # NaT and NaN should behave the same, do basic tests for NaT.
+        a = np.arange(0, 24, dtype=dtype)
+        a[pos] = "NaT"
+        res = np.median(a)
+        assert res.dtype == dtype
+        assert np.isnat(res)
+        res = np.percentile(a, [30, 60])
+        assert res.dtype == dtype
+        assert np.isnat(res).all()
+
+        a = np.arange(0, 24*3, dtype=dtype).reshape(-1, 3)
+        a[pos, 1] = "NaT"
+        res = np.median(a, axis=0)
+        assert_array_equal(np.isnat(res), [False, True, False])
+
     def test_empty(self):
         # mean(empty array) emits two warnings: empty slice and divide by 0
         a = np.array([], dtype=float)
8 changes: 5 additions & 3 deletions numpy/lib/utils.py
@@ -1039,9 +1039,11 @@ def _median_nancheck(data, result, axis):
     # Without given output, it is possible that the current result is a
     # numpy scalar, which is not writeable. If so, just return nan.
     if isinstance(result, np.generic):
-        return data.dtype.type(np.nan)
-
-    result[n] = np.nan
+        return data.take(-1, axis=axis)
+    elif result.dtype.kind in 'mM':
+        result[n] = "NaT"
+    else:
+        result[n] = np.nan
     return result

def _opt_info():