Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: stats.ecdf: add evaluate and plot methods; restructure result #18212

Merged
merged 10 commits into from
May 1, 2023

Conversation

mdhaber
Copy link
Contributor

@mdhaber mdhaber commented Mar 29, 2023

Reference issue

gh-18136

What does this implement/fix?

This implements many of the changes to the ecdf result object suggested in #18136 (comment):

  • Replace EmpiricalDistributionFunction attribute points with attribute probabilities
  • Remove ECDFResult attribute x; instead use EmpiricalDistributionFunction attribute quantiles
  • Add methods evaluate and plot to EmpiricalDistributionFunction
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(3810954496107292580)
res = stats.ecdf(rng.normal(size=100))

# Here is much of what's new in one statement
np.testing.assert_allclose(res.sf.evaluate(res.sf.quantiles),
                           res.sf.probabilities)
# Similarly,
np.testing.assert_allclose(res.cdf.evaluate(res.cdf.quantiles),
                           res.cdf.probabilities)

low, high = res.sf.confidence_interval()
ax = plt.gca()
res.sf.plot(ax, color='C0')
low.plot(ax, color='C1')
high.plot(ax, color='C1')
plt.xlabel('quantile')
plt.ylabel('probability')
plt.title('Empirical survival function and 95% CI')
plt.show()

image

Was:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(3810954496107292580)
res = stats.ecdf(rng.normal(size=100))

# Here is most of what's new in one line
np.testing.assert_allclose(res.sf.evaluate(res.sf.q), res.sf.p)
# Similarly,
np.testing.assert_allclose(res.cdf.evaluate(res.cdf.q), res.cdf.p)

# Also note that confidence interval bounds now have `evaluate` method
q = [-3] + list(res.sf.q) + [3]
low, high = res.sf.confidence_interval()
plt.step(q, res.sf.evaluate(q), color='C0', where='<removed>')  # correct value of `where` removed to make a point
plt.step(q, low.evaluate(q), color='C1', where='<removed>')
plt.step(q, high.evaluate(q), color='C1', where='<removed>')
plt.xlabel('quantile')
plt.ylabel('probability')
plt.title('Empirical survival function and 95% CI')
plt.show()

Additional information

Please review #18136 (comment) for justification.

It will probably be easiest to review commit by commit. See commit messages.

One thing that would be in scope that I didn't implement here is to rename ecdf to nonparametric_fit. The latter might not be a good name after all because that might be confused with nonparametric regression or spline fitting.

@lorentzenchr I thought you would be interested in this given your requests for ecdf to return a distribution object. After the distribution infrastructure overhaul is complete, perhaps we can make it easy to convert an ECDFResult into a new distribution object. In the meantime, I hope that the evaluate method of the sf and cdf attributes is a good start toward what you were after, and I'd be happy if someone else would like to add ppf, isf, and rvs attributes.

@mdhaber mdhaber added scipy.stats enhancement A new feature or improvement labels Mar 29, 2023
@mdhaber mdhaber requested a review from tupui March 29, 2023 06:44
Copy link
Member

@tupui tupui left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Matt! I made some suggestions.

add ppf, isf, and rvs attributes

I would postpone it at least for ppf and rvs. Personally, I really don't like these names and would prefer to wait for us to discuss the new distribution API. (There are more than one articles online about people being confused about the naming 😅)

Comment on lines 130 to 134
low = EmpiricalDistributionFunction(self.q, np.clip(low, 0, 1),
None, None, self._kind)
high = EmpiricalDistributionFunction(self.q, np.clip(high, 0, 1),
None, None, self._kind)
return ConfidenceInterval(low, high)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I understand why you would want to have the nesting. Why do you want the output of CI to have a full object?

Copy link
Contributor Author

@mdhaber mdhaber Mar 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See plotting code above, for example. It makes it nicer to be able to evaluate the CI bounds at arbitrary points. Without this, the code might look like:

# Also note that confidence interval bounds now have `evaluate` method
q = [-3] + list(res.sf.q) + [3]
low, high = res.sf.confidence_interval()
plt.step(q, res.sf.evaluate(q), color='C0', where='<removed>')  # removed to make a point
plt.step(q, [1, list(low), low[-1]], color='C1', where='<removed>')
plt.step(q, [1, list(high), high[-1]], color='C1', where='<removed>')

which just isn't quite as nice.

In general, I don't see why the need to evaluate the CI bounds at arbitrary points is much less than the need to evaluate the point estimate at arbitrary points. So if there needs to be a way to evaluate the point estimate of sf at arbitrary points (and this was requested by a few reviewers of gh-18035), why shouldn't its upper and lower bounds?


Likewise, I think a very simple plotting method would be useful for the point estimate because it saves the user from needing to remember and execute several steps:

  • Evaluate the function not only at all the unique times, but also at a time before the first sample time and after the last sample time. Those edge points are important to make the plot look right.
  • Use plt.step instead of plt.plot
  • Set appropriate options. To get a sense of how much convenience this adds, what value be passed for where? I removed it from the code above to make this more challenging, but don't cheat by looking at your email. If you think you know, would you have been sure if you hadn't seen me use it before?

The need to plot the CI bounds is just as great as the need to plot the point estimate, so if the point estimate should get a plot method, the CI bounds should get a plot method.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer if instead we would allow res.[sf,cdf].confidence_interval to get as input the new quantiles: confidence_interval(confidence_level, *, ..., quantiles). This looks natural to me and close to evaluate. Then we don't need the object nesting and in the plot function, we can just have an option to plot CI or not. Here it would bring a bit more "legitimacy" to add the plot function.

(About plot. In general I am not against that, we need to be careful about not doing too much though.)

Comment on lines 18 to 23
q : ndarray
The unique values of the sample from which the
`EmpiricalDistributionFunction` was estimated.
p : ndarray
The point estimates of the cumulative distribution function (CDF) or
its complement, the survival function (SF), corresponding with `q`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For private attributes I don't really mind short names but I would prefer if we could use full names for public things. quantile and points (or estimate) would be clearer to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can go for quantiles. I think the plural s is important to emphasize that this is several discrete points - the number of unique points in the sample.

Then the question would be what is the name of the output if "quantiles" is the input. "points" and "estimate" don't go particularly well with "quantiles". "percentiles" sounds nice, but I think probabilities is the most descriptive.

LMK if you want it changed to quantiles and probabilities.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would go for quantiles and probabilities (totally agree for the plural, my mistake above!) Thanks 😃

Copy link
Contributor Author

@mdhaber mdhaber Mar 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, although by the same logic we'd write out cumulative_distribution_function and survival_function : / So I've changed it, but I lean toward q and p.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arguable, I think it is important we discuss that for the new infra (and try things with "juniors" because a lot of the names we have are not intuitive if you've not been driving SciPy for a while nor are an expert.)

The thing which is making me tick the most with p and q is that they are single letter variables. It's too implicit to me and has a risk to be mistaken with an iterative variable in users' code.

@mdhaber
Copy link
Contributor Author

mdhaber commented Mar 30, 2023

I would postpone it at least for ppf and rvs.

That's fine, although I will not have a problem making icdf and ppf aliases for the same function. As much as I hesitate to to offer two ways of doing the same thing, I think it's more important to avoid cosmetic changes that will slow the adoption of the infrastructure.

From that perspective, I don't think that sf is perfect. To be domain-independent, I would recommend ccdf (complementary CDF).

image

@tupui
Copy link
Member

tupui commented Mar 30, 2023

I do prefer ccdf. It's more general than sf which implies a field: survival analysis. OC you can argue that ccdf is mostly useful in this context, but SciPy is supposed to err on the neutral side.

@mdhaber mdhaber changed the title ENH: stats.ecdf: add evaluate method; restructure result ENH: stats.ecdf: add evaluate and plot methods; restructure result Mar 30, 2023
Comment on lines +85 to +93
try:
import matplotlib # noqa
except ModuleNotFoundError as exc:
message = "matplotlib must be installed to use method `plot`."
raise ModuleNotFoundError(message) from exc

if ax is None:
import matplotlib.pyplot as plt
ax = plt.gca()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With have this code in a few places FitResult and also in scipy.spatial._plotutils._held_figure. We should have a single source of truth (and put this in _lib I suppose.)

scipy/stats/_survival.py Outdated Show resolved Hide resolved
scipy/stats/_survival.py Outdated Show resolved Hide resolved
Comment on lines 130 to 134
low = EmpiricalDistributionFunction(self.q, np.clip(low, 0, 1),
None, None, self._kind)
high = EmpiricalDistributionFunction(self.q, np.clip(high, 0, 1),
None, None, self._kind)
return ConfidenceInterval(low, high)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer if instead we would allow res.[sf,cdf].confidence_interval to get as input the new quantiles: confidence_interval(confidence_level, *, ..., quantiles). This looks natural to me and close to evaluate. Then we don't need the object nesting and in the plot function, we can just have an option to plot CI or not. Here it would bring a bit more "legitimacy" to add the plot function.

(About plot. In general I am not against that, we need to be careful about not doing too much though.)

Copy link
Contributor

@lorentzenchr lorentzenchr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mdhaber Thanks for your continued work on this. Yes, evaluate was what I was looking for.

sf : `~scipy.stats._result_classes.EmpiricalDistributionFunction`
An object representing the empirical survival function.

The `cdf` and `sf` attributes themselves have the following attributes.

points : ndarray
The point estimate of the CDF/SF at the values in `x`.
quantiles : ndarray
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might come as a surprise for a user that a cdf object has a quantiles attribute as in myecdf.cdf.quantiles. But you already discussed this "point" 😏

points : ndarray
The point estimate of the CDF/SF at the values in `x`.
quantiles : ndarray
The unique values in the sample.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The unique values in the sample.
The unique values in the sample that defines the empirical CDF/SF.

@mdhaber
Copy link
Contributor Author

mdhaber commented Apr 1, 2023

Thanks @lorentzenchr. I'll incorporate your suggestion when I resolve merge conflicts.

Aside from that, does the interface make sense to you? There were some suggestions from @tupui above that I'd like your opinions about.

  1. not allowing **kwargs in plot (ENH: stats.ecdf: add evaluate and plot methods; restructure result #18212 (comment))
  2. passing quantiles into confidence_interval (ENH: stats.ecdf: add evaluate and plot methods; restructure result #18212 (comment))
  3. sf vs ccdf (ENH: stats.ecdf: add evaluate and plot methods; restructure result #18212 (comment))

My thoughts on 1 and 2 are shown above, Regarding 3 I would prefer to leave sf for now since this is what users are currently familiar with.

Note that this interface was proposed after careful study of other existing interfaces (which you might want to skim) to organize a lot of information concisely while keeping our options open for future expansion (e.g. PPF-like method, RVS-like method).

Thanks for your help!

(@tirthasheshpatel I'd be interested in your thoughts, too.)

@mdhaber mdhaber requested review from lorentzenchr and tirthasheshpatel and removed request for lorentzenchr April 8, 2023 18:12
@lorentzenchr
Copy link
Contributor

@mdhaber

Aside from that, does the interface make sense to you?

Overall, yes.

  1. not allowing **kwargs in plot

I have no strong opinions here. What is the scipy standard approach? We could also wait for the users' feedback on this and start without it.

  1. passing quantiles into confidence_interval

I personally would prefer to have confidence_interval similar to evaluate, perhaps with a default input of None meaning ecdf.quantiles.

  1. sf vs ccdf

Actuaries also call it sf. I would stick with the scipy default for now. A new distribution API may change this - in the future. (Where can I find more info about that?)

@mdhaber
Copy link
Contributor Author

mdhaber commented Apr 9, 2023

I personally would prefer to have confidence_interval similar to evaluate, perhaps with a default input of None meaning ecdf.quantiles.

How would that affect typical use cases, like plotting?

Currently, plotting looks like:

data = rng.normal(size=100)
res = stats.ecdf(data)
low, high = res.sf.confidence_interval()
ax = plt.gca()
res.sf.plot(ax, color='C0')
low.plot(ax, color='C1')
high.plot(ax, color='C1')

With that change alone, it would become like:

data = rng.normal(size=100)
res = stats.ecdf(data)
q_min, qmax = res.sf.quantiles[0] - 1, res.sf.quantiles[-1] + 1
q = [[q_min] + list(res.sf.quantiles) + [qmax]]
low, high = res.sf.confidence_interval(quantiles=q)
ax = plt.gca()
res.sf.plot(ax, color='C0')  # only the point estimate has a plot method
plt.step()
plt.step(q, [[1] + list(low) + [low[-1]]], color='C1', where='post')
plt.step(q, [[1] + list(high) + [high[-1]]], color='C1', where='post')

Were there other associated changes to enable plotting confidence intervals? If added to the sf.plot method, would the user have any control over how the confidence intervals are plotted as they do now, or do they just get defaults?

I am also curious why a rich object (rather than array) is preferable for the point estimate but not for the confidence interval bounds? Once one learns how the object that represents the point estimate works, why is it uncomfortable to use the same interface to manipulate the objects that represent the confidence interval bounds?

Where can I find more info about that?

gh-15928

@tupui
Copy link
Member

tupui commented Apr 10, 2023

With my suggestion of having the CI function accept quantiles, you could call this bit in the general plot function res.sf.plot.

@mdhaber
Copy link
Contributor Author

mdhaber commented Apr 10, 2023

What is the interface of the new plot method?

@lorentzenchr
Copy link
Contributor

@mdhaber I did not consider this complication. Not very user friendly, indeed.

IIUC, @tupui suggests in #18212 (comment) to have just one single plot function ecdf.(cdf/sf).plot(ci_level=None). As a user, I think I would prefer that.

If I plot several things in one figure, I often prefer to visualize uncertainty by transparent bands in the same color, e.g. fill_between(alpha=..).

@mdhaber
Copy link
Contributor Author

mdhaber commented Apr 10, 2023

Ok, we need a little more to make the suggestion concrete, though. Right now, all the options of each line can be controlled separately. Can this be done with the new method, and if so, what is the interface? (It can no longer just pass kwargs straight to Matplotlib.) If not, what can be controlled, and how?
(Since I don't have a vision for this other API, I don't think I should try to answer these questions myself.)

@tupui
Copy link
Member

tupui commented Apr 10, 2023

I think we should stay very lean on the figures we propose (i.e. maybe no fill_between otherwise I am tempted to propose a gradient colouring, as an attempt to show the distribution of error.) Hence we would only plot lines and not have options to control the style. Styling can be done after playing with the returned axes. Arguably harder, but what we propose is more a convenient/opinionated tool rather than something that could be tweak and out of the box produce paper/presentation quality figures.

What about this, I propose we go for now with something we like in terms of style. If we want, we could even make a community call to discuss good styling defaults, could be an interesting promotion exercise for the new feature. The outcome of that could be an opinionated style vetted by more people from the community.

@mdhaber
Copy link
Contributor Author

mdhaber commented Apr 10, 2023

Ah, yes, gradient might be nice. Making something like a transparency gradient is pretty easy, and the alpha level is quantitatively appropriate rather than linear. It's a matter of opinion whether the apparent "blurriness" is a good or bad thing.

image

ps = np.linspace(0.05, 0.95, 100)
for p in ps:
    low, high = res.sf.confidence_interval(confidence_level=p)
    low.plot(ax, color='C0', alpha=1-p)
    high.plot(ax, color='C0', alpha=1-p)

@tupui
Copy link
Member

tupui commented Apr 10, 2023

Ah, yes, gradient might be nice. Making something like a transparency gradient is pretty easy, and the alpha level is quantitatively appropriate rather than linear. It's a matter of opinion whether the apparent "blurriness" is a good or bad thing.

haha you did it. Sorry for dragging you in the rabbit hole 😅 I had some code to do something similar

https://github.com/tupui/batman/blob/559e1c4574865694e47f52bb202c560fc6252d2d/batman/visualization/uncertainty.py#L146-L160

This was making figures like that:
Screenshot 2023-04-10 at 19 06 12

@tirthasheshpatel
Copy link
Member

tirthasheshpatel commented Apr 13, 2023

(@tirthasheshpatel I'd be interested in your thoughts, too.)

I see that most of the points have been discussed. Nevertheless, here's my opinion:

  1. I like the matplotlib_kwargs name more.
  2. I might have missed where this was discussed but why is confidence_interval not a method of the result object (i.e. res.confidence_interval()) itself? It makes more sense to have it there since other statistical tests also return an object with that method. It would also resolve the nested structure problem. Edit: I see that the cdf and sf both need this method. In that case, we can have confidence_interval take an extra parameter telling it for which method to compute the CI. Something like confidence_interval(kind: Literal['cdf'] | Literal['sf'], confidence_level=0.95, *, method='linear')
  3. Given the current SciPy API for distributions, sf is much better.

@tupui
Copy link
Member

tupui commented Apr 13, 2023

Thanks @tirthasheshpatel 😃 FYI I opened a PR on Matt's fork to show my alternative proposal.

  1. That was my initial proposal too. Would be fine with both though as sf/cdf are results objects. i.e. we could see the main result object as being an intermediary step.

  2. Agreed for the current infra. I am trying to thing about the next one here 😉 e.g. I will want to die on the hill of random vs rvs. Same to not have ppf which is confusing for most newcomers, etc. etc.

@mdhaber
Copy link
Contributor Author

mdhaber commented Apr 14, 2023

@tirthasheshpatel Re: 2, see #18136 (comment) (which we seemed to agree on at the time). Please make a full proposal as I did there because it doesn't necessarily make sense to just change one thing.

@tirthasheshpatel
Copy link
Member

tirthasheshpatel commented Apr 14, 2023

If I understand #18136 (comment) correctly, we have confidence_interval as a method of the returned cdf and sf functions to:

  • reduce the complexity of the return object by having fewer attributes and
  • avoid adding more attributes when we add other distribution methods like isf since they could also have a confidence_interval method attached to them.

Right?

I was under the impression that we probably won't have confidence_interval for other distribution methods like ppf or isf which is why I proposed to add confidence_interval_cdf and confidence_interval_sf attributes to the result object itself (like lifelines' KaplanMeierFitter.fit) instead of having them as methods of the cdf and sf function objects. IMO, this doesn't make the result object any more complex; the current design just hides the extra complexity conveniently behind the cdf and sf objects. But if the plan is to return ppf and isf too in the future, then I agree that the current design would be better i.e. it would be better to have them as methods of the returned function objects instead of having it as an attribute/method of the return object itself.

For the other part of the discussion: allowing passing quantiles to the confidence_interval method, I don't know which use case would justify its addition. Maybe, we can keep it simple now and add it as the necessity comes up. But since I am not experienced in survival analysis, I don't have a strong opinion about this.

@tirthasheshpatel
Copy link
Member

tirthasheshpatel commented Apr 14, 2023

@tupui

Same to not have ppf which is confusing for most newcomers, etc. etc.

Agreed! But if we change now just for the ecdf function, it would come as a surprise to the user. This is why I think we should stay consistent now and change everything together once we have the new design for univariate distributions finalized. Also, if we add an alias now (#18212 (comment)) and decide on a different name later for the univariate distribution infra, we might not be able to remove/change the alias without breaking backward compatibility.

@mdhaber
Copy link
Contributor Author

mdhaber commented Apr 16, 2023

But if the plan is to return ppf and isf too in the future,

That is indeed something we'd like to be able to do. Statsmodels, for instance, returns that information, including confidence intervals for the ISF. See quantile and quantile_ci of SurvFunctionRight.

Copy link
Member

@tupui tupui left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi all, I would like to move this forward 😃

I want to clarify something, I am +1 with the current proposal and I am only suggesting an alternative option (see on Matt's fork the PR I made to see how this would look like practically speaking.) I am not saying nor thinking that my option is better, just different 😃

@lorentzenchr @tirthasheshpatel what do you think? Is the current API working for you? If yes, I would suggest that we get this in and if wanted a follow up can be done. We could also continue discussing here after the PR is merged. Time is flying and we have roughly a month before the branching for the new release. I want to be sure that a version of this PR gets in. And I do think that the current proposal is a proper good API that we can work with.

What do you think?

@tupui tupui added this to the 1.11.0 milestone Apr 26, 2023
@tirthasheshpatel
Copy link
Member

what do you think? Is the current API working for you?

I am +1 too now. With more methods to join the ECDFResult object, I can see why @mdhaber's API is clearly better than the one I thought. Although I haven't reviewed the tests very closely, the implementation looks good to me. If the tests look good to you, feel free to merge @tupui!

@lorentzenchr
Copy link
Contributor

Also very fine for me and as a user looking forward to have it soon.

@tupui tupui merged commit f797ac7 into scipy:main May 1, 2023
@tupui
Copy link
Member

tupui commented May 1, 2023

All right! Thanks all 😃

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement A new feature or improvement scipy.stats
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants