Retrieval Metrics GPU Memory Leak #2481

Open · astirn opened this issue Mar 29, 2024 · 4 comments
Labels: bug / fix (Something isn't working), help wanted (Extra attention is needed), v1.3.x

Comments

astirn commented Mar 29, 2024

🐛 Bug

Updating a RetrievalMetric in training and/or validation results in runaway GPU memory usage.

To Reproduce

In a LightningModule's def __init__(...), declare

metrics = tm.MetricCollection({
    'mrr': tm.RetrievalMRR()
})
self.train_retrieval_metrics = metrics.clone(prefix='train_')
self.val_retrieval_metrics = metrics.clone(prefix='val_')

Then in training_step(self, batch, batch_idx) and validation_step(self, batch, batch_idx), respectively call

self.train_retrieval_metrics(logits, targets, indexes)

and

self.val_retrieval_metrics(logits, targets, indexes)

where logits is a batch size x batch size matrix of logits. You can assume targets and indexes are correct since I am getting good MRR measurements.
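
For clarity, a consolidated sketch of the setup above follows (the class name and the _score_batch helper are illustrative placeholders, not code from the project):

import torchmetrics as tm
from pytorch_lightning import LightningModule


class RetrievalModule(LightningModule):
    def __init__(self):
        super().__init__()
        metrics = tm.MetricCollection({'mrr': tm.RetrievalMRR()})
        self.train_retrieval_metrics = metrics.clone(prefix='train_')
        self.val_retrieval_metrics = metrics.clone(prefix='val_')

    def training_step(self, batch, batch_idx):
        # hypothetical helper producing a batch_size x batch_size logits matrix
        logits, targets, indexes = self._score_batch(batch)
        self.train_retrieval_metrics(logits, targets, indexes)
        ...

    def validation_step(self, batch, batch_idx):
        logits, targets, indexes = self._score_batch(batch)
        self.val_retrieval_metrics(logits, targets, indexes)
        ...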

I believe I tracked down the problem. In retrieval/base.py, the RetrievalMetric class initializes

self.add_state("indexes", default=[], dist_reduce_fx=None)
self.add_state("preds", default=[], dist_reduce_fx=None)
self.add_state("target", default=[], dist_reduce_fx=None)

using Python lists. Because these defaults are lists, in the Metric class's reset(self) in metric.py

for attr, default in self._defaults.items():
    current_val = getattr(self, attr)
    if isinstance(default, Tensor):
        setattr(self, attr, default.detach().clone().to(current_val.device))
    else:
        setattr(self, attr, [])

the setattr(self, attr, default.detach().clone().to(current_val.device)) is never reached.
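
A quick, hedged way to observe how these list states behave outside of Lightning (shapes and values below are arbitrary):

import torch
import torchmetrics as tm

metric = tm.RetrievalMRR()
for batch_idx in range(5):
    preds = torch.rand(32)                                     # relevance scores
    target = torch.randint(0, 2, (32,))                        # binary relevance
    indexes = torch.full((32,), batch_idx, dtype=torch.long)   # query ids
    metric.update(preds, target, indexes)

print(len(metric.indexes))  # 5: one accumulated tensor per update() call
metric.reset()
print(len(metric.indexes))  # 0: reset() falls into the `else` branch above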

Expected behavior

GPU memory does not increase over time.

Environment

  • TorchMetrics version: 1.3.2
  • Python & PyTorch Version: 3.11.8 & 2.2.1

Additional context

Hi! Thanks for your contribution, great first issue!

lucadiliello (Contributor)

Hello @astirn. I think the behaviour is correct: retrieval metrics are designed to be valid only globally, i.e. computed over the whole dataset, so they accumulate results until the end of the epoch.

Please try the following in your LightningModule:

def training_step(self, batch, batch_idx):
    ...
    self.train_retrieval_metrics.update(logits, targets, indexes)

def on_train_epoch_end(self):
    metrics = self.train_retrieval_metrics.compute()

With this modification, you will get correct metrics and compute them only once, at the end of each training epoch. If training is long (e.g. pre-training), you may consider moving the accumulated results to CPU to save GPU memory: just add compute_on_cpu=True to the metric instantiation.
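
A hedged sketch of this suggestion inside a LightningModule (the log_dict and reset() calls are additions for illustration, not part of the comment above):

import torchmetrics as tm
from pytorch_lightning import LightningModule


class RetrievalModule(LightningModule):
    def __init__(self):
        super().__init__()
        # compute_on_cpu=True moves the accumulated states to CPU after each
        # update, so the GPU does not hold them for the whole epoch
        metrics = tm.MetricCollection({'mrr': tm.RetrievalMRR(compute_on_cpu=True)})
        self.train_retrieval_metrics = metrics.clone(prefix='train_')

    def training_step(self, batch, batch_idx):
        ...
        # accumulate only; no per-batch compute
        self.train_retrieval_metrics.update(logits, targets, indexes)

    def on_train_epoch_end(self):
        # compute once per epoch, log, then release the accumulated states
        epoch_metrics = self.train_retrieval_metrics.compute()
        self.log_dict(epoch_metrics)
        self.train_retrieval_metrics.reset()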

Another case is when your indexes do not span several batches (i.e. you have no overlapping queries between different batches). Then you can use the functional metrics directly:

from torchmetrics.functional import retrieval_reciprocal_rank

def training_step(self, batch, batch_idx):
    ...
    mrr = retrieval_reciprocal_rank(logits, targets)

Be aware that the latter example assumes indexes for each batch would be like [0, ..., 0] for batch_0, [1, ..., 1] for batch_1 and so on.
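
A small illustration of the indexes pattern described above (the helper name is hypothetical):

import torch

def batch_indexes(batch_idx: int, batch_size: int) -> torch.Tensor:
    # batch 0 -> tensor([0, 0, ..., 0]), batch 1 -> tensor([1, 1, ..., 1]), ...
    return torch.full((batch_size,), batch_idx, dtype=torch.long)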

astirn (Author) commented May 20, 2024

Thanks for your reply. I believe Lightning AI already makes this call. IIRC, the problem I had is that GPU memory usage increases over epochs, not just within an epoch as your reply suggests it should.

I wrote my own MRR code that solves this problem and have been using it since I filed this issue. Feel free to close this issue :)

lucadiliello (Contributor) commented May 21, 2024

Could you please share your implementation of MRR to understand whether there is a bug on our end?
