Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Area under the receiving operating characteristic curve (AUROC) calculation. #977

Open
stephenpardy opened this issue Sep 5, 2023 · 2 comments

Comments

@stephenpardy
Copy link

A common metric for classification model performance is the area under the receiving operating characteristic curve (AUROC, AUROCC, or often simply AUC) which shows the area of the curve created by plotting TPR and FPR against one another.
Wikipedia.

This metric is available in sklearn, but is currently missing from Dask-ml. One issue with computing this metric in Dask is that the standard implementation of AUROC requires searching over a sorted array and the sorting operations are often expensive in distributed computing.

I would be interested in a discussion of whether a naive version of this could be implemented using sorting and whether there are any known methods of computation that avoid the sort.

@TomAugspurger
Copy link
Member

TomAugspurger commented Sep 24, 2023 via email

@mrocklin
Copy link
Member

I don't know enough about this specific ML metric, but maybe if we can translate the problem to something more general then I can be of use.

For example, If someone had a distribution of values spread in an unsorted way across many machines and they wanted to plot an approximate sorted distribution of those points then there are a couple things they could do:

  1. The simplest would be to just sample. I suspect that it would be easy to get very good accuracy by sampling.

  2. But if for some reason they wanted to have all points influence the result then my recommendation would be to calculate a lot of approximate quantiles, for which there are good distributed algorithms. I suspect once one had, say, 100 percentiles (0%, 1%, 2%, ..., 99%, 100%) that one could take those values and reconstruct a distribution curve that was pretty close to full accuracy.

    Depending on the exact curve necessary one might need to invert the delta between neighboring percentiles or something similar. This might get mildly complex, but is the sort of thing one can probably do with ten minutes of thought and, importantly, without distributed systems experience.

I may be entirely on the wrong track through. @stephenpardy can you help by reducing your problem a little bit to general numbers / computational terms? I suspect that it'll be easier for folks to help in that case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants