ENH: stats.ecdf: add `evaluate` and `plot` methods; restructure result #18212

mdhaber · 2023-03-29T06:44:20Z

Reference issue

What does this implement/fix?

This implements many of the changes to the ecdf result object suggested in #18136 (comment):

Replace EmpiricalDistributionFunction attribute points with attribute probabilities
Remove ECDFResult attribute x; instead use EmpiricalDistributionFunction attribute quantiles
Add methods evaluate and plot to EmpiricalDistributionFunction

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(3810954496107292580)
res = stats.ecdf(rng.normal(size=100))

# Here is much of what's new in one statement
np.testing.assert_allclose(res.sf.evaluate(res.sf.quantiles),
                           res.sf.probabilities)
# Similarly,
np.testing.assert_allclose(res.cdf.evaluate(res.cdf.quantiles),
                           res.cdf.probabilities)

low, high = res.sf.confidence_interval()
ax = plt.gca()
res.sf.plot(ax, color='C0')
low.plot(ax, color='C1')
high.plot(ax, color='C1')
plt.xlabel('quantile')
plt.ylabel('probability')
plt.title('Empirical survival function and 95% CI')
plt.show()

Was:

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(3810954496107292580)
res = stats.ecdf(rng.normal(size=100))

# Here is most of what's new in one line
np.testing.assert_allclose(res.sf.evaluate(res.sf.q), res.sf.p)
# Similarly,
np.testing.assert_allclose(res.cdf.evaluate(res.cdf.q), res.cdf.p)

# Also note that confidence interval bounds now have `evaluate` method
q = [-3] + list(res.sf.q) + [3]
low, high = res.sf.confidence_interval()
plt.step(q, res.sf.evaluate(q), color='C0', where='<removed>')  # correct value of `where` removed to make a point
plt.step(q, low.evaluate(q), color='C1', where='<removed>')
plt.step(q, high.evaluate(q), color='C1', where='<removed>')
plt.xlabel('quantile')
plt.ylabel('probability')
plt.title('Empirical survival function and 95% CI')
plt.show()

Additional information

Please review #18136 (comment) for justification.

It will probably be easiest to review commit by commit. See commit messages.

One thing that would be in scope that I didn't implement here is to rename ecdf to nonparametric_fit. The latter might not be a good name after all because that might be confused with nonparametric regression or spline fitting.

@lorentzenchr I thought you would be interested in this given your requests for ecdf to return a distribution object. After the distribution infrastructure overhaul is complete, perhaps we can make it easy to convert an ECDFResult into a new distribution object. In the meantime, I hope that the evaluate method of the sf and cdf attributes is a good start toward what you were after, and I'd be happy if someone else would like to add ppf, isf, and rvs attributes.

…onResult`s

tupui

Thanks Matt! I made some suggestions.

add ppf, isf, and rvs attributes

I would postpone it at least for ppf and rvs. Personally, I really don't like these names and would prefer to wait for us to discuss the new distribution API. (There are more than one articles online about people being confused about the naming 😅)

tupui · 2023-03-29T23:30:42Z

scipy/stats/_survival.py

+        low = EmpiricalDistributionFunction(self.q, np.clip(low, 0, 1),
+                                            None, None, self._kind)
+        high = EmpiricalDistributionFunction(self.q, np.clip(high, 0, 1),
+                                            None, None, self._kind)
+        return ConfidenceInterval(low, high)


I am not sure I understand why you would want to have the nesting. Why do you want the output of CI to have a full object?

See plotting code above, for example. It makes it nicer to be able to evaluate the CI bounds at arbitrary points. Without this, the code might look like:

# Also note that confidence interval bounds now have `evaluate` method q = [-3] + list(res.sf.q) + [3] low, high = res.sf.confidence_interval() plt.step(q, res.sf.evaluate(q), color='C0', where='<removed>') # removed to make a point plt.step(q, [1, list(low), low[-1]], color='C1', where='<removed>') plt.step(q, [1, list(high), high[-1]], color='C1', where='<removed>')

which just isn't quite as nice.

In general, I don't see why the need to evaluate the CI bounds at arbitrary points is much less than the need to evaluate the point estimate at arbitrary points. So if there needs to be a way to evaluate the point estimate of sf at arbitrary points (and this was requested by a few reviewers of gh-18035), why shouldn't its upper and lower bounds?

Likewise, I think a very simple plotting method would be useful for the point estimate because it saves the user from needing to remember and execute several steps:

Evaluate the function not only at all the unique times, but also at a time before the first sample time and after the last sample time. Those edge points are important to make the plot look right.

Use plt.step instead of plt.plot

Set appropriate options. To get a sense of how much convenience this adds, what value be passed for where? I removed it from the code above to make this more challenging, but don't cheat by looking at your email. If you think you know, would you have been sure if you hadn't seen me use it before?

The need to plot the CI bounds is just as great as the need to plot the point estimate, so if the point estimate should get a plot method, the CI bounds should get a plot method.

I would prefer if instead we would allow res.[sf,cdf].confidence_interval to get as input the new quantiles: confidence_interval(confidence_level, *, ..., quantiles). This looks natural to me and close to evaluate. Then we don't need the object nesting and in the plot function, we can just have an option to plot CI or not. Here it would bring a bit more "legitimacy" to add the plot function.

(About plot. In general I am not against that, we need to be careful about not doing too much though.)

tupui · 2023-03-29T23:32:20Z

scipy/stats/_survival.py

+    q : ndarray
+        The unique values of the sample from which the
+        `EmpiricalDistributionFunction` was estimated.
+    p : ndarray
+        The point estimates of the cumulative distribution function (CDF) or
+        its complement, the survival function (SF), corresponding with `q`.


For private attributes I don't really mind short names but I would prefer if we could use full names for public things. quantile and points (or estimate) would be clearer to me.

I can go for quantiles. I think the plural s is important to emphasize that this is several discrete points - the number of unique points in the sample.

Then the question would be what is the name of the output if "quantiles" is the input. "points" and "estimate" don't go particularly well with "quantiles". "percentiles" sounds nice, but I think probabilities is the most descriptive.

LMK if you want it changed to quantiles and probabilities.

I would go for quantiles and probabilities (totally agree for the plural, my mistake above!) Thanks 😃

OK, although by the same logic we'd write out cumulative_distribution_function and survival_function : / So I've changed it, but I lean toward q and p.

Arguable, I think it is important we discuss that for the new infra (and try things with "juniors" because a lot of the names we have are not intuitive if you've not been driving SciPy for a while nor are an expert.)

The thing which is making me tick the most with p and q is that they are single letter variables. It's too implicit to me and has a risk to be mistaken with an iterative variable in users' code.

mdhaber · 2023-03-30T00:33:14Z

I would postpone it at least for ppf and rvs.

That's fine, although I will not have a problem making icdf and ppf aliases for the same function. As much as I hesitate to to offer two ways of doing the same thing, I think it's more important to avoid cosmetic changes that will slow the adoption of the infrastructure.

From that perspective, I don't think that sf is perfect. To be domain-independent, I would recommend ccdf (complementary CDF).

tupui · 2023-03-30T00:44:00Z

I do prefer ccdf. It's more general than sf which implies a field: survival analysis. OC you can argue that ccdf is mostly useful in this context, but SciPy is supposed to err on the neutral side.

tupui · 2023-03-30T12:38:36Z

scipy/stats/_survival.py

+        try:
+            import matplotlib  # noqa
+        except ModuleNotFoundError as exc:
+            message = "matplotlib must be installed to use method `plot`."
+            raise ModuleNotFoundError(message) from exc
+
+        if ax is None:
+            import matplotlib.pyplot as plt
+            ax = plt.gca()


With have this code in a few places FitResult and also in scipy.spatial._plotutils._held_figure. We should have a single source of truth (and put this in _lib I suppose.)

scipy/stats/_survival.py

tupui · 2023-03-30T13:33:29Z

scipy/stats/_survival.py

+        low = EmpiricalDistributionFunction(self.q, np.clip(low, 0, 1),
+                                            None, None, self._kind)
+        high = EmpiricalDistributionFunction(self.q, np.clip(high, 0, 1),
+                                            None, None, self._kind)
+        return ConfidenceInterval(low, high)


I would prefer if instead we would allow res.[sf,cdf].confidence_interval to get as input the new quantiles: confidence_interval(confidence_level, *, ..., quantiles). This looks natural to me and close to evaluate. Then we don't need the object nesting and in the plot function, we can just have an option to plot CI or not. Here it would bring a bit more "legitimacy" to add the plot function.

(About plot. In general I am not against that, we need to be careful about not doing too much though.)

lorentzenchr

@mdhaber Thanks for your continued work on this. Yes, evaluate was what I was looking for.

lorentzenchr · 2023-04-01T07:27:47Z

scipy/stats/_survival.py

        sf : `~scipy.stats._result_classes.EmpiricalDistributionFunction`
            An object representing the empirical survival function.

        The `cdf` and `sf` attributes themselves have the following attributes.

-        points : ndarray
-            The point estimate of the CDF/SF at the values in `x`.
+        quantiles : ndarray


It might come as a surprise for a user that a cdf object has a quantiles attribute as in myecdf.cdf.quantiles. But you already discussed this "point" 😏

lorentzenchr · 2023-04-01T07:30:03Z

scipy/stats/_survival.py

-        points : ndarray
-            The point estimate of the CDF/SF at the values in `x`.
+        quantiles : ndarray
+            The unique values in the sample.


Suggested change

The unique values in the sample.

The unique values in the sample that defines the empirical CDF/SF.

mdhaber · 2023-04-01T17:56:22Z

Thanks @lorentzenchr. I'll incorporate your suggestion when I resolve merge conflicts.

Aside from that, does the interface make sense to you? There were some suggestions from @tupui above that I'd like your opinions about.

not allowing **kwargs in plot (ENH: stats.ecdf: add evaluate and plot methods; restructure result #18212 (comment))
passing quantiles into confidence_interval (ENH: stats.ecdf: add evaluate and plot methods; restructure result #18212 (comment))
sf vs ccdf (ENH: stats.ecdf: add evaluate and plot methods; restructure result #18212 (comment))

My thoughts on 1 and 2 are shown above, Regarding 3 I would prefer to leave sf for now since this is what users are currently familiar with.

Note that this interface was proposed after careful study of other existing interfaces (which you might want to skim) to organize a lot of information concisely while keeping our options open for future expansion (e.g. PPF-like method, RVS-like method).

Thanks for your help!

(@tirthasheshpatel I'd be interested in your thoughts, too.)

lorentzenchr · 2023-04-09T18:39:04Z

@mdhaber

Aside from that, does the interface make sense to you?

Overall, yes.

not allowing **kwargs in plot

I have no strong opinions here. What is the scipy standard approach? We could also wait for the users' feedback on this and start without it.

passing quantiles into confidence_interval

I personally would prefer to have confidence_interval similar to evaluate, perhaps with a default input of None meaning ecdf.quantiles.

sf vs ccdf

Actuaries also call it sf. I would stick with the scipy default for now. A new distribution API may change this - in the future. (Where can I find more info about that?)

mdhaber · 2023-04-09T19:11:06Z

I personally would prefer to have confidence_interval similar to evaluate, perhaps with a default input of None meaning ecdf.quantiles.

How would that affect typical use cases, like plotting?

Currently, plotting looks like:

data = rng.normal(size=100)
res = stats.ecdf(data)
low, high = res.sf.confidence_interval()
ax = plt.gca()
res.sf.plot(ax, color='C0')
low.plot(ax, color='C1')
high.plot(ax, color='C1')

With that change alone, it would become like:

data = rng.normal(size=100)
res = stats.ecdf(data)
q_min, qmax = res.sf.quantiles[0] - 1, res.sf.quantiles[-1] + 1
q = [[q_min] + list(res.sf.quantiles) + [qmax]]
low, high = res.sf.confidence_interval(quantiles=q)
ax = plt.gca()
res.sf.plot(ax, color='C0')  # only the point estimate has a plot method
plt.step()
plt.step(q, [[1] + list(low) + [low[-1]]], color='C1', where='post')
plt.step(q, [[1] + list(high) + [high[-1]]], color='C1', where='post')

Were there other associated changes to enable plotting confidence intervals? If added to the sf.plot method, would the user have any control over how the confidence intervals are plotted as they do now, or do they just get defaults?

I am also curious why a rich object (rather than array) is preferable for the point estimate but not for the confidence interval bounds? Once one learns how the object that represents the point estimate works, why is it uncomfortable to use the same interface to manipulate the objects that represent the confidence interval bounds?

Where can I find more info about that?

gh-15928

tupui · 2023-04-10T12:30:07Z

With my suggestion of having the CI function accept quantiles, you could call this bit in the general plot function res.sf.plot.

mdhaber · 2023-04-10T14:25:36Z

What is the interface of the new plot method?

lorentzenchr · 2023-04-10T14:32:51Z

@mdhaber I did not consider this complication. Not very user friendly, indeed.

IIUC, @tupui suggests in #18212 (comment) to have just one single plot function ecdf.(cdf/sf).plot(ci_level=None). As a user, I think I would prefer that.

If I plot several things in one figure, I often prefer to visualize uncertainty by transparent bands in the same color, e.g. fill_between(alpha=..).

mdhaber · 2023-04-10T14:37:48Z

Ok, we need a little more to make the suggestion concrete, though. Right now, all the options of each line can be controlled separately. Can this be done with the new method, and if so, what is the interface? (It can no longer just pass kwargs straight to Matplotlib.) If not, what can be controlled, and how?
(Since I don't have a vision for this other API, I don't think I should try to answer these questions myself.)

tupui · 2023-04-10T15:04:42Z

I think we should stay very lean on the figures we propose (i.e. maybe no fill_between otherwise I am tempted to propose a gradient colouring, as an attempt to show the distribution of error.) Hence we would only plot lines and not have options to control the style. Styling can be done after playing with the returned axes. Arguably harder, but what we propose is more a convenient/opinionated tool rather than something that could be tweak and out of the box produce paper/presentation quality figures.

What about this, I propose we go for now with something we like in terms of style. If we want, we could even make a community call to discuss good styling defaults, could be an interesting promotion exercise for the new feature. The outcome of that could be an opinionated style vetted by more people from the community.

mdhaber · 2023-04-10T16:55:51Z

Ah, yes, gradient might be nice. Making something like a transparency gradient is pretty easy, and the alpha level is quantitatively appropriate rather than linear. It's a matter of opinion whether the apparent "blurriness" is a good or bad thing.

ps = np.linspace(0.05, 0.95, 100)
for p in ps:
    low, high = res.sf.confidence_interval(confidence_level=p)
    low.plot(ax, color='C0', alpha=1-p)
    high.plot(ax, color='C0', alpha=1-p)

tupui · 2023-04-10T17:05:19Z

Ah, yes, gradient might be nice. Making something like a transparency gradient is pretty easy, and the alpha level is quantitatively appropriate rather than linear. It's a matter of opinion whether the apparent "blurriness" is a good or bad thing.

haha you did it. Sorry for dragging you in the rabbit hole 😅 I had some code to do something similar

https://github.com/tupui/batman/blob/559e1c4574865694e47f52bb202c560fc6252d2d/batman/visualization/uncertainty.py#L146-L160

This was making figures like that:

tirthasheshpatel · 2023-04-13T10:20:47Z

(@tirthasheshpatel I'd be interested in your thoughts, too.)

I see that most of the points have been discussed. Nevertheless, here's my opinion:

I like the matplotlib_kwargs name more.
I might have missed where this was discussed but why is confidence_interval not a method of the result object (i.e. res.confidence_interval()) itself? It makes more sense to have it there since other statistical tests also return an object with that method. It would also resolve the nested structure problem. Edit: I see that the cdf and sf both need this method. In that case, we can have confidence_interval take an extra parameter telling it for which method to compute the CI. Something like confidence_interval(kind: Literal['cdf'] | Literal['sf'], confidence_level=0.95, *, method='linear')
Given the current SciPy API for distributions, sf is much better.

tupui · 2023-04-13T10:45:01Z

Thanks @tirthasheshpatel 😃 FYI I opened a PR on Matt's fork to show my alternative proposal.

That was my initial proposal too. Would be fine with both though as sf/cdf are results objects. i.e. we could see the main result object as being an intermediary step.
Agreed for the current infra. I am trying to thing about the next one here 😉 e.g. I will want to die on the hill of random vs rvs. Same to not have ppf which is confusing for most newcomers, etc. etc.

mdhaber · 2023-04-14T21:36:31Z

@tirthasheshpatel Re: 2, see #18136 (comment) (which we seemed to agree on at the time). Please make a full proposal as I did there because it doesn't necessarily make sense to just change one thing.

tirthasheshpatel · 2023-04-14T22:34:51Z

If I understand #18136 (comment) correctly, we have confidence_interval as a method of the returned cdf and sf functions to:

reduce the complexity of the return object by having fewer attributes and
avoid adding more attributes when we add other distribution methods like isf since they could also have a confidence_interval method attached to them.

Right?

I was under the impression that we probably won't have confidence_interval for other distribution methods like ppf or isf which is why I proposed to add confidence_interval_cdf and confidence_interval_sf attributes to the result object itself (like lifelines' KaplanMeierFitter.fit) instead of having them as methods of the cdf and sf function objects. IMO, this doesn't make the result object any more complex; the current design just hides the extra complexity conveniently behind the cdf and sf objects. But if the plan is to return ppf and isf too in the future, then I agree that the current design would be better i.e. it would be better to have them as methods of the returned function objects instead of having it as an attribute/method of the return object itself.

For the other part of the discussion: allowing passing quantiles to the confidence_interval method, I don't know which use case would justify its addition. Maybe, we can keep it simple now and add it as the necessity comes up. But since I am not experienced in survival analysis, I don't have a strong opinion about this.

tirthasheshpatel · 2023-04-14T22:41:26Z

@tupui

Same to not have ppf which is confusing for most newcomers, etc. etc.

Agreed! But if we change now just for the ecdf function, it would come as a surprise to the user. This is why I think we should stay consistent now and change everything together once we have the new design for univariate distributions finalized. Also, if we add an alias now (#18212 (comment)) and decide on a different name later for the univariate distribution infra, we might not be able to remove/change the alias without breaking backward compatibility.

mdhaber · 2023-04-16T15:17:18Z

But if the plan is to return ppf and isf too in the future,

That is indeed something we'd like to be able to do. Statsmodels, for instance, returns that information, including confidence intervals for the ISF. See quantile and quantile_ci of SurvFunctionRight.

tupui

Hi all, I would like to move this forward 😃

I want to clarify something, I am +1 with the current proposal and I am only suggesting an alternative option (see on Matt's fork the PR I made to see how this would look like practically speaking.) I am not saying nor thinking that my option is better, just different 😃

@lorentzenchr @tirthasheshpatel what do you think? Is the current API working for you? If yes, I would suggest that we get this in and if wanted a follow up can be done. We could also continue discussing here after the PR is merged. Time is flying and we have roughly a month before the branching for the new release. I want to be sure that a version of this PR gets in. And I do think that the current proposal is a proper good API that we can work with.

What do you think?

tirthasheshpatel · 2023-04-28T20:10:18Z

what do you think? Is the current API working for you?

I am +1 too now. With more methods to join the ECDFResult object, I can see why @mdhaber's API is clearly better than the one I thought. Although I haven't reviewed the tests very closely, the implementation looks good to me. If the tests look good to you, feel free to merge @tupui!

lorentzenchr · 2023-04-29T07:25:36Z

Also very fine for me and as a user looking forward to have it soon.

tupui · 2023-05-01T15:51:13Z

All right! Thanks all 😃

mdhaber added 4 commits March 28, 2023 22:03

MAINT: stats.ecdf: replace attribute points with p

5240090

MAINT: stats.ecdf: replace attribute x with q

6dace64

ENH: stats.ecdf: add evaluate method

72a555b

ENH: stats.ecdf: make confidence interval bounds `EmpiricalDistributi…

a4b4d86

…onResult`s

mdhaber added scipy.stats enhancement A new feature or improvement labels Mar 29, 2023

mdhaber requested a review from tupui March 29, 2023 06:44

tupui reviewed Mar 29, 2023

View reviewed changes

mdhaber added 2 commits March 29, 2023 22:41

MAINT: stats.ecdf: rename probabilities, quantiles = p, q

9e9a255

ENH: stats.ecdf: add plot method

8af9d07

mdhaber changed the title ~~ENH: stats.ecdf: add evaluate method; restructure result~~ ENH: stats.ecdf: add evaluate and plot methods; restructure result Mar 30, 2023

DOC: stats.ecdf: correct documentation of probabilities, quantiles

8f40951

tupui reviewed Mar 30, 2023

View reviewed changes

lorentzenchr approved these changes Apr 1, 2023

View reviewed changes

mdhaber added 3 commits April 1, 2023 11:01

Merge remote-tracking branch 'upstream/main' into gh18136

2994a20

MAINT: stats.ecdf: implement suggestions; fix logrank plotting

8feae86

MAINT: stats.ecdf: kwargs -> matplotlib_kwargs

d6d208b

mdhaber requested review from lorentzenchr and tirthasheshpatel and removed request for lorentzenchr April 8, 2023 18:12

tupui approved these changes Apr 26, 2023

View reviewed changes

tupui added this to the 1.11.0 milestone Apr 26, 2023

tupui merged commit f797ac7 into scipy:main May 1, 2023

	The unique values in the sample.
	The unique values in the sample that defines the empirical CDF/SF.

ENH: stats.ecdf: add evaluate and plot methods; restructure result #18212

ENH: stats.ecdf: add evaluate and plot methods; restructure result #18212

Conversation

mdhaber commented Mar 29, 2023 • edited

Reference issue

What does this implement/fix?

Additional information

tupui left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mdhaber Mar 30, 2023 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mdhaber Mar 30, 2023 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mdhaber commented Mar 30, 2023 • edited

tupui commented Mar 30, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lorentzenchr left a comment

Choose a reason for hiding this comment

lorentzenchr Apr 1, 2023 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mdhaber commented Apr 1, 2023 • edited

lorentzenchr commented Apr 9, 2023

mdhaber commented Apr 9, 2023 • edited

tupui commented Apr 10, 2023

mdhaber commented Apr 10, 2023

lorentzenchr commented Apr 10, 2023

mdhaber commented Apr 10, 2023 • edited

tupui commented Apr 10, 2023

mdhaber commented Apr 10, 2023

tupui commented Apr 10, 2023 • edited

tirthasheshpatel commented Apr 13, 2023 • edited

tupui commented Apr 13, 2023

mdhaber commented Apr 14, 2023

tirthasheshpatel commented Apr 14, 2023 • edited

tirthasheshpatel commented Apr 14, 2023 • edited

mdhaber commented Apr 16, 2023

tupui left a comment

Choose a reason for hiding this comment

tirthasheshpatel commented Apr 28, 2023

lorentzenchr commented Apr 29, 2023

tupui commented May 1, 2023

ENH: stats.ecdf: add `evaluate` and `plot` methods; restructure result #18212

ENH: stats.ecdf: add `evaluate` and `plot` methods; restructure result #18212

mdhaber commented Mar 29, 2023 •

edited

mdhaber Mar 30, 2023 •

edited

mdhaber Mar 30, 2023 •

edited

mdhaber commented Mar 30, 2023 •

edited

lorentzenchr Apr 1, 2023 •

edited

mdhaber commented Apr 1, 2023 •

edited

mdhaber commented Apr 9, 2023 •

edited

mdhaber commented Apr 10, 2023 •

edited

tupui commented Apr 10, 2023 •

edited

tirthasheshpatel commented Apr 13, 2023 •

edited

tirthasheshpatel commented Apr 14, 2023 •

edited

tirthasheshpatel commented Apr 14, 2023 •

edited