New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Fix timedelta*float and median/percentile/quantile NaT handling #21726
Conversation
Median does not currently support Datetimes. This is because `mean` does not support datetimes. The reason for `mean` not supporting datetimes, is that it is (understandibly) written as `(a + b)/2`, but datetimes cannot be added so that we would have to use a different scheme, such as `a + (b - a) / 2`.
Unlike Median, the interpolation formula used by quantile/percentile does support Datetimes just fine, so this also adds basic tests for that. Due to the integral nature, times suffer round-off errors that seem not ideal. This is an issue, but distinct from the NaT one?
The test was fixed by numpygh-19857
This fixes both rounding to be correct and that NaT is propagated correctly (and used when any overflow occurs). Unfortunately, this makes the loop a whole lot slower...
32e3562
to
30c21ca
Compare
} | ||
else { | ||
/* `nearbyint` avoids warnings (should not matter here, though) */ | ||
double result = nearbyint(in1 * in2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure about this, also in that: Is this actually enough?
The input is effectively int64, so by casting it to double we throw away precision, this means that:
In [6]: np.timedelta64(2**58 + 7, "us") - np.timedelta64(2**58 + 7, "us") * 1.
Out[6]: numpy.timedelta64(7,'us')
woops. long double
could fix this when available (extended precision has 64bit mantissa), but that doesn't always exist...
So, maybe the whole expectation of this rounding correct is too much, or maybe we should consider the above a serious bug?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, maybe the whole expectation of this rounding correct is too much, or maybe we should consider the above a serious bug?
I think this should be considered a bug if and only if the analogous behavior in int64 is considered a bug:
td64 = np.timedelta64(2**58 + 7, "us")
val = td64.view("i8")
>>> val - int(val*1.0)
7
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is the same thing. But int()
always truncates too (i.e. no correct rounding). And you call int()
explicitly, here we call it internally hidden to the user and thus effectively breaking the time precision for very large values.
But yeah, maybe it is just what it is, as Chuck keeps saying, maybe we really need a high-precision floating point time format (double-double, or whatever)...
} | ||
else { | ||
/* `nearbyint` avoids warnings (should not matter here, though) */ | ||
double result = nearbyint(in1 * in2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is nearbyint
similar to python round
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
roughly speaking, yes. Of course it will round C-style. It is really the same thing as np.rint
, except that it should give a warning (which should not really matter. If it sets a warning it should be the one we set anyway, but...)
if np.issubdtype(a.dtype, np.inexact): | ||
|
||
# Check if the array contains any nan's or NaT's (unordered values) | ||
supports_nans = np.issubdtype(a.dtype, np.inexact) or a.dtype.kind == 'm' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you want to support dt64 here, something like:
if a.dtype.kind == "M":
td_res = _median(a.view("m8"), ...)
return td_res.view(a.dtype)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we could. I won't make it a priority though (i.e. we can follow up if we want these changes and allow median). We could also replace the call to mean()
with a calculation that does: a + (b-a) / 2
(or *0.5
) for example.
Might be nice to get back to but diverged a lot anyway, so closing for now. |
Currently has three changesets in there:
It implements correct rounding and NaT support in
timedelta64 * double
.Median does not currently support Datetimes. This is because
mean
does not support datetimes.The reason for
mean
not supporting datetimes, is that it is(understandibly) written as
(a + b)/2
, but datetimes cannotbe added so that we would have to use a different scheme, such as
a + (b - a) / 2
.However: Median does support timedelta, and this fixes NaT support there.
Unlike Median, the interpolation formula used by quantile/percentile
does support Datetimes just fine, so this also adds basic tests for
that.
I now suspect the rounding is correct here, with the first changes applies.
@jbrockmendel, this is pretty much ready I suspect, but I am honestly not sure about the changes, so half the reason was to ping someone with more time experience.
There would be two things to note:
timedelta64 * float
loop is of significantly slower with the added checks and correct rounding. (Not sure if this matters, but wanted to note it)timedelta64 * float
loop changes need a few tests for rounding mode,NaT
creation and floating point error reporting.I am happy to split this up My initial intention was to close gh-20376, gh-11627, and gh-11620. The added fix for the timedelta loops, fixes quantile for percentile a bit. But should likely be pulled out to not block the rest.
There should be better ways to do those NaN or NaT branches in median/quantile, but....
EDIT: I am not sure I will prioritize this without prodding, so if it stalls, don't hesitate to close.