Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Fix timedelta*float and median/percentile/quantile NaT handling #21726

Closed
wants to merge 4 commits into from

Conversation

seberg
Copy link
Member

@seberg seberg commented Jun 10, 2022

Currently has three changesets in there:

  1. It implements correct rounding and NaT support in timedelta64 * double.

  2. Median does not currently support Datetimes. This is because
    mean does not support datetimes.
    The reason for mean not supporting datetimes, is that it is
    (understandibly) written as (a + b)/2, but datetimes cannot
    be added so that we would have to use a different scheme, such as
    a + (b - a) / 2.

    However: Median does support timedelta, and this fixes NaT support there.

  3. Unlike Median, the interpolation formula used by quantile/percentile
    does support Datetimes just fine, so this also adds basic tests for
    that.
    I now suspect the rounding is correct here, with the first changes applies.

@jbrockmendel, this is pretty much ready I suspect, but I am honestly not sure about the changes, so half the reason was to ping someone with more time experience.
There would be two things to note:

  1. The timedelta64 * float loop is of significantly slower with the added checks and correct rounding. (Not sure if this matters, but wanted to note it)
  2. The timedelta64 * float loop changes need a few tests for rounding mode, NaT creation and floating point error reporting.

I am happy to split this up My initial intention was to close gh-20376, gh-11627, and gh-11620. The added fix for the timedelta loops, fixes quantile for percentile a bit. But should likely be pulled out to not block the rest.

There should be better ways to do those NaN or NaT branches in median/quantile, but....


EDIT: I am not sure I will prioritize this without prodding, so if it stalls, don't hesitate to close.

Median does not currently support Datetimes.  This is because
`mean` does not support datetimes.
The reason for `mean` not supporting datetimes, is that it is
(understandibly) written as `(a + b)/2`, but datetimes cannot
be added so that we would have to use a different scheme, such as
`a + (b - a) / 2`.
Unlike Median, the interpolation formula used by quantile/percentile
does support Datetimes just fine, so this also adds basic tests for
that.
Due to the integral nature, times suffer round-off errors that seem
not ideal.  This is an issue, but distinct from the NaT one?
This fixes both rounding to be correct and that NaT is propagated
correctly (and used when any overflow occurs).
Unfortunately, this makes the loop a whole lot slower...
}
else {
/* `nearbyint` avoids warnings (should not matter here, though) */
double result = nearbyint(in1 * in2);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure about this, also in that: Is this actually enough?

The input is effectively int64, so by casting it to double we throw away precision, this means that:

In [6]: np.timedelta64(2**58 + 7, "us") - np.timedelta64(2**58 + 7, "us") * 1.
Out[6]: numpy.timedelta64(7,'us')

woops. long double could fix this when available (extended precision has 64bit mantissa), but that doesn't always exist...

So, maybe the whole expectation of this rounding correct is too much, or maybe we should consider the above a serious bug?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, maybe the whole expectation of this rounding correct is too much, or maybe we should consider the above a serious bug?

I think this should be considered a bug if and only if the analogous behavior in int64 is considered a bug:

td64 = np.timedelta64(2**58 + 7, "us")
val = td64.view("i8")

>>> val - int(val*1.0)
7

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is the same thing. But int() always truncates too (i.e. no correct rounding). And you call int() explicitly, here we call it internally hidden to the user and thus effectively breaking the time precision for very large values.

But yeah, maybe it is just what it is, as Chuck keeps saying, maybe we really need a high-precision floating point time format (double-double, or whatever)...

}
else {
/* `nearbyint` avoids warnings (should not matter here, though) */
double result = nearbyint(in1 * in2);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is nearbyint similar to python round?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

roughly speaking, yes. Of course it will round C-style. It is really the same thing as np.rint, except that it should give a warning (which should not really matter. If it sets a warning it should be the one we set anyway, but...)

if np.issubdtype(a.dtype, np.inexact):

# Check if the array contains any nan's or NaT's (unordered values)
supports_nans = np.issubdtype(a.dtype, np.inexact) or a.dtype.kind == 'm'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you want to support dt64 here, something like:

if a.dtype.kind == "M":
     td_res = _median(a.view("m8"), ...)
     return td_res.view(a.dtype)

?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we could. I won't make it a priority though (i.e. we can follow up if we want these changes and allow median). We could also replace the call to mean() with a calculation that does: a + (b-a) / 2 (or *0.5) for example.

@seberg
Copy link
Member Author

seberg commented Feb 7, 2024

Might be nice to get back to but diverged a lot anyway, so closing for now.

@seberg seberg closed this Feb 7, 2024
@seberg seberg added the 64 - Good Idea Inactive PR with a good start or idea. Consider studying it if you are working on a related issue. label Feb 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
00 - Bug 64 - Good Idea Inactive PR with a good start or idea. Consider studying it if you are working on a related issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: NaT handling change 1.21.2 -> 1.21.3 triggering issue in np.median
2 participants