Metric not moved to device and invalids the cpu-gpu offloading when combining with DeepSpeed #2473

qingquansong · 2024-03-26T04:28:03Z

🐛 Bug

version 1.3.1

This looks similar to "Metric not moved to device" (#531). It occurs when running the following code:

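from lightning import LightningModule
from torch.nn import ModuleDict
from torchmetrics import AUROC, Accuracy, MetricCollection
from torchmetrics.wrappers import ClasswiseWrapper


class MyModel(LightningModule):
    def __init__(self):
        super().__init__()  # required before registering submodules
        # Metrics are registered as submodules so that they follow the model's device.
        self.metrics = ModuleDict(
            {
                "train_metric": MetricCollection(
                    {
                        "train_accuracy_micro": Accuracy(
                            task="multiclass", num_classes=3, average="micro"
                        )
                    }
                ),
                "val_metric": MetricCollection(
                    {
                        "val_accuracy_micro": Accuracy(
                            task="multiclass", num_classes=3, average="micro"
                        ),
                        "val_auroc": ClasswiseWrapper(
                            AUROC(task="multiclass", num_classes=3, average=None),
                            labels=["1", "2", "3"],
                        ),
                    }
                ),
            }
        )

    def forward(self, input):
        # Compare the module's device with the device of one registered metric.
        print(f"self.device: {self.device}")
        print(f"metric device: {self.metrics['train_metric']['train_accuracy_micro'].device}")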

I got:

self.device: cuda:0
self.accuracy.device: cpu
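To see which registered submodules report a different device than the model, a small diagnostic along these lines may help. This is only a sketch: `find_device_mismatches` is a hypothetical helper name, and it assumes the model provides `named_modules()` (as `torch.nn.Module` does) and that metrics expose a `.device` attribute (as in the print statement above).

```python
def find_device_mismatches(model, expected_device):
    """Return (name, device) pairs for submodules whose device differs.

    Assumption: submodules that track a device expose it as `.device`;
    submodules without that attribute are skipped.
    """
    mismatches = []
    for name, module in model.named_modules():
        device = getattr(module, "device", None)
        if device is None:
            continue
        # Compare string forms so that "cuda:0" matches a device object
        # printing as "cuda:0". Note that "cuda" and "cuda:0" would not match.
        if str(device) != str(expected_device):
            mismatches.append((name, str(device)))
    return mismatches
```

Called as `find_device_mismatches(self, self.device)` inside `forward`, it would list every submodule that is not on the model's device instead of checking one metric by hand.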

When running with the DeepSpeed strategy, I also get the following warning:

Invalidate trace cache @ step 327: expected module 365, but got module 365

This also seems to slow down DeepSpeed evaluation. I tried both one GPU and multiple GPUs with the config below, and both show the same warning.

The DeepSpeed config is:

{
    "bf16": {
        "enabled": "auto"
    },
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "zero_hpz_partition_size": 4,
        "zero_quantized_gradients": true,
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "steps_per_print": 2000,
    "train_micro_batch_size_per_gpu": 16,
    "wall_clock_breakdown": false
}
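One way to check whether the warning depends on CPU offloading is to run the same job with a variant of this config that has the two offload sections removed. The sketch below is a debugging aid, not a fix. `without_offload` is a hypothetical helper name, and it assumes that leaving out `offload_param` and `offload_optimizer` keeps parameters and optimizer state on the GPU.

```python
import copy


def without_offload(config):
    """Return a copy of a DeepSpeed config dict with CPU offloading removed.

    The input dict is left unchanged. Assumption: omitting the two offload
    sections disables offloading for ZeRO.
    """
    variant = copy.deepcopy(config)
    zero = variant.get("zero_optimization", {})
    zero.pop("offload_param", None)
    zero.pop("offload_optimizer", None)
    return variant


# Small stand-in for the config above, reduced to the relevant keys.
example = {
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    }
}
print(without_offload(example))
```

If the warning disappears with the variant config, that would point to the interaction between offloading and the metrics rather than to the metrics alone.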

The Trainer is created via:

import lightning as L
from lightning.pytorch.strategies import DeepSpeedStrategy

trainer = L.Trainer(
    # Hardware setup
    devices=self.num_gpus_per_node,
    num_nodes=self.num_nodes,
    accelerator="gpu",
    # Training configuration
    # self.args.deepspeed is the path to the JSON config above
    strategy=DeepSpeedStrategy(config=self.args.deepspeed),
)

trainer.model is the model containing the metrics above

Expected behavior

The metric should be on cuda:0, and no warning such as the following should appear:
Invalidate trace cache @ step 327: expected module 365, but got module 365
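Until the metrics follow the model automatically, one possible workaround is to move the metric container to the model's device explicitly. The sketch below is untested with DeepSpeed. `move_metrics_to_model_device` is a hypothetical helper name, and it assumes the model exposes `.device` (as `LightningModule` does) and that the container supports `.to(device)` (as `torch.nn.Module` does).

```python
def move_metrics_to_model_device(model, metrics_attr="metrics"):
    """Move the metric container stored on `model` to `model.device`.

    `metrics_attr` is the attribute name holding the metrics; "metrics"
    matches the ModuleDict in the model above. Returns the target device.
    """
    target = model.device
    getattr(model, metrics_attr).to(target)
    return target
```

It could be called from a Lightning hook such as `on_fit_start`, for example `move_metrics_to_model_device(self)`. Whether this also avoids the trace cache warning has not been verified here.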

Environment

TorchMetrics version (and how you installed TM, e.g. conda, pip, build from source): 1.3.1
Lightning version: 2.2.1
Python & PyTorch Version (e.g., 1.0): 3.10.2 & .2.1.2.1+gita8e7c98
Any other relevant information such as OS (e.g., Linux):

NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"


github-actions · 2024-03-26T04:28:24Z

Hi! Thanks for your contribution! Great first issue!

qingquansong added the "bug / fix" and "help wanted" labels Mar 26, 2024

Borda added the v1.3.x label Mar 28, 2024
