BinaryPrecisionRecallCurve computes wrong value if used with logits, even though the docstring says this is supported #2329

Open
NoahAtKintsugi opened this issue Jan 29, 2024 · 1 comment · May be fixed by #1676
Labels
bug / fix (Something isn't working) · help wanted (Extra attention is needed) · v1.3.x

Comments


NoahAtKintsugi commented Jan 29, 2024

🐛 Bug

The object-oriented BinaryPrecisionRecallCurve can compute substantially incorrect values if logits are passed as predictions instead of probabilities, even though the docstring says this is supported. The underlying functional version, binary_precision_recall_curve, appears to work correctly in both cases. Both versions attempt to convert logits to probabilities by passing them through a sigmoid whenever any values fall outside the range [0, 1]. In the object-oriented case this condition is incorrectly checked independently for each batch, rather than once for the metric as a whole. Consequently, some batches may have a sigmoid applied to their scores while others do not, resulting in an incorrect curve for the dataset as a whole.
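
A minimal sketch of the per-batch auto-detection described above (a paraphrase of the heuristic, not the actual torchmetrics code; the helper name is illustrative):

import torch

def maybe_apply_sigmoid(preds: torch.Tensor) -> torch.Tensor:
    # Illustrative paraphrase: if any value falls outside [0, 1],
    # assume the whole tensor contains logits and apply a sigmoid.
    if (preds < 0).any() or (preds > 1).any():
        return preds.sigmoid()
    return preds

# Applied per batch, the decision can differ from batch to batch:
maybe_apply_sigmoid(torch.tensor([-1.0]))  # sigmoid applied -> tensor([0.2689])
maybe_apply_sigmoid(torch.tensor([0.0]))   # left unchanged  -> tensor([0.])
maybe_apply_sigmoid(torch.tensor([1.0]))   # left unchanged  -> tensor([1.])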

To Reproduce

import torch
import torchmetrics

b1 = torchmetrics.AUROC(task="binary")
b2 = torchmetrics.AUROC(task="binary")

score = torch.tensor([-1.0, 0.0, 1.0])
label = torch.tensor([1.0, 0.0, 1.0])

b1.update(score, label)  # pass all three score/label pairs in as a single batch

for s, l in zip(score, label):
    b2.update(torch.tensor([s]), torch.tensor([l]))  # pass score/label pairs in one at a time

assert b1.compute().item() == 0.5
assert b2.compute().item() == 1.0

Expected behavior

I would expect both AUCs to equal 0.5, as computed with scikit-learn using sklearn.metrics.roc_auc_score(label, score).
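
For reference, the same numbers through scikit-learn; roc_auc_score is rank-based, so raw logits and sigmoid-transformed probabilities give the same result:

from sklearn.metrics import roc_auc_score

# AUC on the raw scores from the reproduction above; a monotone transform
# such as sigmoid does not change the ranking, so this is 0.5 either way.
roc_auc_score([1.0, 0.0, 1.0], [-1.0, 0.0, 1.0])  # 0.5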

Environment

  • TorchMetrics version (and how you installed TM, e.g. conda, pip, build from source): 1.3.0.post0
  • Python & PyTorch Version (e.g., 1.0): 3.10.13
  • Any other relevant information such as OS (e.g., Linux): MacOS

Additional context

The bug is on line https://github.com/Lightning-AI/torchmetrics/blob/master/src/torchmetrics/classification/precision_recall_curve.py#L165 -- as it is currently written the function _binary_precision_recall_curve_format should only be called on the full dataset, not on individual batches. Otherwise the behavior is wrong if some batches have all scores in the range [0, 1] but other batches do not.
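
Concretely, if the per-batch check applies a sigmoid only to the batch containing -1.0, the streamed metric effectively accumulates a mix of transformed and raw scores. A rough reconstruction of those numbers (my own illustration, not the actual internal state layout) reproduces the spurious result:

import torch
from sklearn.metrics import roc_auc_score

# Batch [-1.0] gets a sigmoid (≈ 0.2689); batches [0.0] and [1.0] are left as-is.
mixed_scores = [torch.sigmoid(torch.tensor(-1.0)).item(), 0.0, 1.0]
labels = [1.0, 0.0, 1.0]
roc_auc_score(labels, mixed_scores)  # 1.0 -- matches b2.compute() above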

Some possible solutions:

1. Update the docs to not allow logits in the object-oriented interface, since the behavior is correct for probabilities.
2. Don't try to automatically infer whether to apply a sigmoid -- my preference, but it would be a breaking change.
3. Refactor _binary_precision_recall_curve_format and accept that if any values are found which require a sigmoid, then all values from past batches also need the sigmoid applied (this would be tricky when thresholds are specified, because the raw scores are not kept around).
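
In the meantime, a possible user-side workaround (my own sketch, not an official recommendation) is to apply the sigmoid explicitly before every update, so the auto-detection heuristic never triggers:

import torch
import torchmetrics

metric = torchmetrics.AUROC(task="binary")
score = torch.tensor([-1.0, 0.0, 1.0])
label = torch.tensor([1.0, 0.0, 1.0])

for s, l in zip(score, label):
    # Convert logits to probabilities up front; every batch is then in [0, 1],
    # so no batch gets a second, inconsistent transformation.
    metric.update(torch.sigmoid(s).unsqueeze(0), l.unsqueeze(0))

print(metric.compute().item())  # 0.5, matching the single-batch result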

NoahAtKintsugi added the bug / fix and help wanted labels on Jan 29, 2024

Hi! Thanks for your contribution, great first issue!

Borda added the v1.3.x label on Jan 29, 2024
SkafteNicki linked a pull request on Feb 4, 2024 that will close this issue