Calculating nDCG using GPU is 2x slower than CPU #2287
Comments
Thank you for bringing this up. Have you also observed it with other metrics than NDCG?
I only tested NDCG at the time I submitted the issue, but now I understand the cause. The inferior performance with GPU tensors results from the fact that the current implementation of NDCG does not utilize the parallel computation provided by the GPU: TorchMetrics NDCG only accepts 1D tensors as inputs. To verify this observation, I tried another metric, multilabel_precision. The results showed that calculation on GPU is faster than on CPU when there are hundreds of instances; however, when there is only one instance, calculation on CPU is faster than on GPU.

Script for the multilabel_precision performance test:

```python
import timeit

import torch
from torchmetrics.functional.classification import multilabel_precision

number = int(1e3)

# change 400 to 1 for comparison experiments
y_true = torch.randint(2, (400, 300))
y_pred = torch.randint(2, (400, 300))

# CPU tensors
target_cpu = y_true.clone().detach()
preds_cpu = y_pred.clone().detach()
assert target_cpu.device == torch.device("cpu")

def cpu():
    return multilabel_precision(preds_cpu, target_cpu, num_labels=300)

print(f'CPU tensor: {timeit.timeit("cpu()", setup="from __main__ import cpu", number=number):.4f}')

# GPU tensors
target_gpu = y_true.clone().detach().to(device="cuda")
preds_gpu = y_pred.clone().detach().to(device="cuda")
assert target_gpu.device == torch.device("cuda:0")

def gpu():
    return multilabel_precision(preds_gpu, target_gpu, num_labels=300)

print(f'GPU tensor: {timeit.timeit("gpu()", setup="from __main__ import gpu", number=number):.4f}')
```
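One caveat about timing GPU code this way: CUDA kernels launch asynchronously, so plain timeit can measure mostly launch overhead rather than kernel execution. A minimal sketch of a synchronized variant (the gpu_sync helper is my own illustration, not part of the original script):

```python
import timeit

import torch
from torchmetrics.functional.classification import multilabel_precision

preds_gpu = torch.randint(2, (400, 300), device="cuda")
target_gpu = torch.randint(2, (400, 300), device="cuda")

def gpu_sync():
    result = multilabel_precision(preds_gpu, target_gpu, num_labels=300)
    torch.cuda.synchronize()  # wait for all queued kernels to finish
    return result

torch.cuda.synchronize()  # drain any pending work before timing starts
print(f'GPU tensor (synced): {timeit.timeit(gpu_sync, number=1000):.4f}')
```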
Is there any particular reason that torchmetrics.NDCG only accepts a single instance instead of a batch? If not, I suggest that NDCG should accept batched inputs.
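For illustration, here is a minimal sketch of what a vectorized batch NDCG could look like in pure PyTorch; the batched_ndcg name and the linear-gain/log2-discount convention are my assumptions, not taken from the torchmetrics source:

```python
import torch

def batched_ndcg(preds: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """NDCG per row for (batch, list_size) inputs, with no Python loop."""
    k = preds.shape[-1]
    # order relevance labels by predicted score, descending
    order = preds.argsort(dim=-1, descending=True)
    gains = target.gather(-1, order).float()
    # positions 1..k get discounts log2(2)..log2(k + 1)
    discounts = torch.log2(torch.arange(2, k + 2, device=preds.device).float())
    dcg = (gains / discounts).sum(dim=-1)
    # ideal DCG: labels sorted by their own relevance
    ideal = target.sort(dim=-1, descending=True).values.float()
    idcg = (ideal / discounts).sum(dim=-1)
    return dcg / idcg.clamp(min=1e-12)

# every row is scored in a handful of batched kernel calls
preds = torch.rand(8000, 40)
target = torch.randint(2, (8000, 40))
print(batched_ndcg(preds, target).mean())
```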
@donglihe-hub thanks for reporting this issue. Sorry for the long reply time from my side. Looking at the code, it seems that the operation that takes up most of the computational time is torch.unique. I am not sure if we can actually optimize the code, or whether the operations used in the NDCG metric simply do not parallelize that well on GPU. I will try to investigate further.
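A quick way to sanity-check whether torch.unique is the bottleneck is to time it in isolation on both devices; a minimal sketch (illustrative, not from the torchmetrics codebase):

```python
import timeit

import torch

# query-group ids, shaped like the indexes a retrieval metric receives
indexes = torch.randint(1000, (8000 * 40,))

print(f"CPU unique: {timeit.timeit(lambda: torch.unique(indexes), number=100):.4f}")

if torch.cuda.is_available():
    indexes_gpu = indexes.to("cuda")

    def gpu_unique():
        torch.unique(indexes_gpu)
        torch.cuda.synchronize()  # include kernel execution, not just launch

    torch.cuda.synchronize()
    print(f"GPU unique: {timeit.timeit(gpu_unique, number=100):.4f}")
```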
Hi! I'm running into the same issue: the NDCG metric calculation takes so long that it becomes impractical to use while training. Calculating the NDCG metric at every step with a tensor of size around (8000, 40) [batch_size, list_size] takes 2 s to complete, which is far more than the model forward pass. After looking into the metric class implementation, I believe the cause is not the torch.unique function but a fundamental design flaw of RetrievalMetric: the class splits the input tensor by the indexes into a list of tensors and then iterates sequentially over that list, which is very slow when the number of query groups is high. The TensorFlow Ranking implementation of the NDCG metric with the same inputs takes only about 50 ms to complete.
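To make the described pattern concrete, here is an illustrative sketch (not the actual torchmetrics source) of a per-group loop like the one RetrievalMetric uses, where indexes assigns each prediction to its query group:

```python
import torch
from torchmetrics.functional import retrieval_normalized_dcg

def per_group_ndcg(preds, target, indexes):
    # one Python-level iteration (and one tiny kernel launch) per query
    # group -- this serial loop dominates the runtime when the number
    # of groups is large
    results = []
    for group in indexes.unique():
        mask = indexes == group
        results.append(retrieval_normalized_dcg(preds[mask], target[mask]))
    return torch.stack(results).mean()

preds = torch.rand(8000 * 40)
target = torch.randint(2, (8000 * 40,))
indexes = torch.arange(8000).repeat_interleave(40)  # 8000 groups of 40
print(per_group_ndcg(preds, target, indexes))
```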
🐛 Bug
Hi TorchMetrics Team,
In the following example, the nDCG calculation using GPU tensors takes about twice as long as using CPU tensors or NumPy arrays.
To Reproduce
The code was tested on both Google Colab and a Slurm cluster.
Code sample
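(The original code sample is not reproduced here. Below is a minimal sketch of the kind of benchmark described, modeled on the multilabel_precision script in the comments above and assuming torchmetrics.functional.retrieval_normalized_dcg with 1D inputs:)

```python
import timeit

import torch
from torchmetrics.functional import retrieval_normalized_dcg

number = int(1e3)
y_pred = torch.rand(300)
y_true = torch.randint(2, (300,))

# CPU tensors
preds_cpu, target_cpu = y_pred.clone(), y_true.clone()

def cpu():
    return retrieval_normalized_dcg(preds_cpu, target_cpu)

print(f'CPU tensor: {timeit.timeit(cpu, number=number):.4f}')

# GPU tensors
preds_gpu, target_gpu = y_pred.to("cuda"), y_true.to("cuda")

def gpu():
    return retrieval_normalized_dcg(preds_gpu, target_gpu)

print(f'GPU tensor: {timeit.timeit(gpu, number=number):.4f}')
```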
Results:
I also tested the code on the Slurm cluster I'm currently using; the GPU there is an A100.
Expected behavior
Calculation using GPU tensors should be, if not faster, at least close in speed to using CPU tensors.
Environment
TorchMetrics version (and how you installed TM, e.g. conda, pip, build from source): 1.2.1 (pip)

Additional context