
Calculations nDCG using GPU are 2x slower than CPU #2287

Open
donglihe-hub opened this issue Dec 29, 2023 · 4 comments
Labels
bug / fix, help wanted, v1.2.x

Comments


donglihe-hub commented Dec 29, 2023

🐛 Bug

Hi TorchMetrics Team,

In the following example, nDCG calculation with GPU tensors takes about twice as long as with CPU tensors or NumPy arrays.

To Reproduce

The code was tested on both Google Colab and a Slurm cluster.

Code sample
import timeit

import numpy as np
import torch
from sklearn.metrics import ndcg_score
from torchmetrics.functional.retrieval import retrieval_normalized_dcg

# p and t are examples given by both sklearn and torchmetrics
p = [.1, .2, .3, 4, 70] * 100
t = [10, 0, 0, 1, 5] * 100

number = int(1e4)

# 1. benchmark: NumPy array
preds = np.asarray([p])
target = np.asarray([t])

def a():
    return ndcg_score(target, preds)

print(f'numpy array: {timeit.timeit("a()", setup="from __main__ import a", number=number):.4f}')

# 2. benchmark: CPU tensor
preds_cpu = torch.tensor(p)
target_cpu = torch.tensor(t)

assert preds_cpu.device == torch.device("cpu")

def b():
    retrieval_normalized_dcg(preds_cpu, target_cpu)

print(f'CPU tensor: {timeit.timeit("b()", setup="from __main__ import b", number=number):.4f}')

# 3. benchmark: GPU tensor
preds_gpu = torch.tensor(p, device="cuda")
target_gpu = torch.tensor(t, device="cuda")

assert preds_gpu.device == torch.device("cuda:0")

def c():
    retrieval_normalized_dcg(preds_gpu, target_gpu)

print(f'GPU tensor: {timeit.timeit("c()", setup="from __main__ import c", number=number):.4f}')

Results:

# Tesla T4
numpy array: 6.4896
CPU tensor: 5.8501
GPU tensor: 10.4120

I also tested the code on the Slurm cluster I'm currently using, where the GPU is an A100.

numpy array: 3.8700
CPU tensor: 2.9305
GPU tensor: 7.7575
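
For completeness, a variant of the GPU benchmark that synchronizes the device around the timed call would rule out asynchronous-launch effects (torch.cuda.synchronize is standard PyTorch; the helper name c_synced is only for this sketch, appended to the script above):

# Variant of c() that synchronizes before and after the call, so
# asynchronous CUDA kernel launches cannot skew the measurement.
def c_synced():
    torch.cuda.synchronize()
    retrieval_normalized_dcg(preds_gpu, target_gpu)
    torch.cuda.synchronize()

print(f'GPU tensor (synced): {timeit.timeit("c_synced()", setup="from __main__ import c_synced", number=number):.4f}')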

Expected behavior

The performance of the calculation with GPU tensors should, if not superior, at least be close to that with CPU tensors.

Environment

  • TorchMetrics version (and how you installed TM, e.g. conda, pip, build from source): 1.2.1 (pip)
  • Python & PyTorch Version (e.g., 1.0): Python 3.10.12 and 3.10.13, Torch 2.1.0 and 2.1.1
  • Any other relevant information such as OS (e.g., Linux): Ubuntu 22.04.3 LTS and Linux 5.4.204-ql-generic-12.0-19 x86_64

@donglihe-hub added the bug / fix and help wanted labels Dec 29, 2023
@donglihe-hub changed the title from "Why calculations using GPU Tensors are Slower than CPU Tensors?" to "Why Calculations using GPU Tensors are Slower than CPU Tensors?" Dec 29, 2023
@donglihe-hub changed the title from "Why Calculations using GPU Tensors are Slower than CPU Tensors?" to "Calculations using GPU Tensors are 2 Times Slower than CPU Tensors?" Dec 29, 2023
@donglihe-hub changed the title from "Calculations using GPU Tensors are 2 Times Slower than CPU Tensors?" to "Calculations using GPU Tensors are 2 Times Slower than CPU Tensors" Dec 29, 2023
@Borda added the v1.2.x label Dec 31, 2023
Borda (Member) commented Dec 31, 2023

> nDCG calculation with GPU tensors takes about twice as long as with CPU tensors or NumPy arrays

Thank you for bringing this up. Have you also observed it with other metrics than NDCG?

donglihe-hub (Author) commented Jan 1, 2024

> Thank you for bringing this up. Have you also observed it with other metrics than NDCG?

I had only tested NDCG at the time of submitting the issue, but now I understand the cause.

The inferior GPU performance results from the fact that the current NDCG implementation does not exploit the parallel computation a GPU provides: TorchMetrics NDCG only accepts 1D tensors as input.

To support this observation, I tried another metric, multilabel_precision. The results show that calculation on GPU is faster than on CPU when there are hundreds of instances; with only one instance, however, the CPU is faster.

Scripts for multilabel_precision performance test

import timeit

import torch
from torchmetrics.functional.classification import multilabel_precision

number = int(1e3)

# change 400 to 1 for comparison experiments
y_true = torch.randint(2, (400, 300))
y_pred = torch.randint(2, (400, 300))

# CPU tensor
target_cpu = y_true.clone().detach()
preds_cpu = y_pred.clone().detach()

assert target_cpu.device == torch.device("cpu")

def cpu():
    return multilabel_precision(preds_cpu, target_cpu, num_labels=300)

print(f'CPU tensor: {timeit.timeit("cpu()", setup="from __main__ import cpu", number=number):.4f}')

# GPU tensor
target_gpu = y_true.clone().detach().to(device="cuda")
preds_gpu = y_pred.clone().detach().to(device="cuda")

assert target_gpu.device == torch.device("cuda:0")

def gpu():
    return multilabel_precision(preds_gpu, target_gpu, num_labels=300)

print(f'GPU tensor: {timeit.timeit("gpu()", setup="from __main__ import gpu", number=number):.4f}')

Results:

# 400 instances
CPU tensor: 3.6518
GPU tensor: 0.8089

# 1 instance
CPU tensor: 0.1848
GPU tensor: 0.6217

Is there a specific reason that torchmetrics NDCG only accepts a single instance instead of a batch? If not, I suggest that NDCG accept batched inputs, as sketched below.
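
For reference, a batched nDCG needs only sort/gather operations along the last dimension, which parallelize well on GPU. A minimal sketch (batched_ndcg is a hypothetical helper; it ignores top-k truncation and the tie handling sklearn applies, so it is not numerically identical to retrieval_normalized_dcg in every case):

import torch

def batched_ndcg(preds: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Vectorized nDCG over the last dim of (batch, n_items) tensors."""
    n = preds.shape[-1]
    # Discount 1 / log2(rank + 1) for ranks 1..n
    discount = 1.0 / torch.log2(torch.arange(n, device=preds.device, dtype=torch.float) + 2.0)
    # Gains ordered by predicted score, descending
    order = preds.argsort(dim=-1, descending=True)
    dcg = (target.gather(-1, order) * discount).sum(-1)
    # Ideal DCG: gains sorted in descending order
    idcg = (target.sort(dim=-1, descending=True).values * discount).sum(-1)
    return dcg / idcg.clamp(min=1e-12)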

@Borda changed the title from "Calculations using GPU Tensors are 2 Times Slower than CPU Tensors" to "Calculations nDCG using GPU are 2x slower than CPU" Jan 1, 2024
SkafteNicki (Member) commented

@donglihe-hub thanks for reporting this issue, and sorry for the long reply time from my side.
I have been looking at the implementation of our metric for a while now, and it is not correct that it fails to use parallel computation on the GPU. Just because the input is 1D does not mean the computations cannot be parallelized.
For example, a simple sum
[screenshot: timing torch.sum on the same data as a 1D vs. a 2D tensor]
is equally fast regardless of whether the input is a 1D or a 2D tensor.
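
(Roughly that experiment, reconstructed as a sketch; torch.utils.benchmark.Timer synchronizes CUDA so the timing is fair:)

import torch
from torch.utils.benchmark import Timer

# The same one million elements viewed as 1D and as 2D: the sum kernel
# parallelizes over elements either way.
x1d = torch.randn(1_000_000, device="cuda")
x2d = x1d.reshape(1000, 1000)

for name, x in [("1d", x1d), ("2d", x2d)]:
    print(name, Timer(stmt="x.sum()", globals={"x": x}).timeit(1000))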

Looking at the code, the operation that takes up most of the computational time seems to be torch.unique, used here. From small experiments, this operation alone appears to be the bottleneck:
[screenshot: timing torch.unique on CPU vs. GPU tensors]
the Torch GPU implementation is ~15x slower than the CPU one for large arrays.
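
(A sketch approximating that experiment; exact ratios depend on hardware and input size:)

import torch
from torch.utils.benchmark import Timer

# torch.unique on CPU vs. GPU for the 500-element input from this issue.
# On CUDA, unique typically sorts and must resolve the output size, which
# is costly relative to the tiny amount of useful work.
x_cpu = torch.randint(0, 10, (500,))
x_gpu = x_cpu.cuda()

for name, x in [("cpu", x_cpu), ("gpu", x_gpu)]:
    print(name, Timer(stmt="torch.unique(x)", globals={"torch": torch, "x": x}).timeit(1000))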

I am not sure whether we can actually optimize the code, or whether the operations used in the nDCG metric simply do not parallelize well on GPU. I will try to investigate further.

hengdashi commented

Hi!

I'm running into the same issue: the nDCG metric calculation takes so long that it becomes impractical to use during training. Computing the nDCG metric at every step with a tensor of size around (8000, 40) [batch_size, list_size] takes about 2 s, far longer than the model's forward pass.

After looking into the metric class implementation, I believe the cause is not the torch.unique function but a fundamental design flaw in RetrievalMetric: the class splits the input tensor by its indexes into a list of tensors and iterates over that list sequentially, which is very slow when the number of query groups is high.
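
Schematically, the pattern looks like this (a simplified sketch, not the literal TorchMetrics code):

import torch

# Group preds/target by query index, then score each group in a Python
# loop. With thousands of query groups, the per-group overhead and kernel
# launches dominate the runtime, however fast each metric call is.
def per_query_mean(metric_fn, indexes, preds, target):
    groups = [torch.where(indexes == i)[0] for i in torch.unique(indexes)]
    return torch.stack([metric_fn(preds[g], target[g]) for g in groups]).mean()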

The TensorFlow Ranking implementation of the nDCG metric takes only about 50 ms on the same inputs.
