How to train in a multi-node environment? #18423
Unanswered
rahaazad2 asked this question in DDP / multi-GPU / multi-node
I want to train a pytorch-lightning model on a cluster of 6 nodes (1 GPU per node). Here's the training code (abridged):

```python
import argparse
import json
import os
import pytorch_lightning as pl
import src.data_loaders as module_data
import torch
from pytorch_lightning.callbacks import ModelCheckpoint
from src.utils import get_model_and_tokenizer
from torch.nn import functional as F
from torch.utils.data import DataLoader


class MyClassifier(pl.LightningModule):
    def __init__(self, config):
        super().__init__()
        self.save_hyperparameters()
        self.num_classes = config["arch"]["args"]["num_classes"]
        self.model_args = config["arch"]["args"]
        self.model, self.tokenizer = get_model_and_tokenizer(**self.model_args)
        self.bias_loss = False


def cli_main():
    pl.seed_everything(1234)
    # ... data loading, Trainer construction, and trainer.fit() are omitted here


if __name__ == "__main__":
    cli_main()
```

It works fine on a single node with 4 GPUs, but in the multi-node setting there seems to be no difference from single-node training. Specifically, the logs of the nodes show that RANK is set correctly (RANK 0 for the master and RANK 1 to 5 for the workers). However, there are two issues: …

I run the code using this command:

```
python train.py --config PATH
```
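For context, a minimal sketch of how a multi-node run is usually configured in pytorch-lightning is shown below. The `num_nodes`, `devices`, and `strategy` values are assumptions matching the 6-node, 1-GPU-per-node cluster described above; they are not taken from the original post, which omits the `Trainer` setup.

```python
import pytorch_lightning as pl


def cli_main():
    pl.seed_everything(1234)
    # model = MyClassifier(config)  # hypothetical: built from the parsed config

    trainer = pl.Trainer(
        accelerator="gpu",
        devices=1,      # GPUs per node (assumed: 1 GPU on each of the 6 nodes)
        num_nodes=6,    # total number of nodes; defaults to 1 if not set,
                        # in which case every node trains independently
        strategy="ddp",
    )
    # trainer.fit(model, train_dataloader, val_dataloader)
```

If `num_nodes` is left at its default of 1, each node simply runs its own single-node job, which would match the "no difference from single-node" symptom. Note also that outside a SLURM cluster Lightning does not start the other nodes for you: the same `python train.py --config PATH` command has to be launched once per node, with `MASTER_ADDR`, `MASTER_PORT`, and `NODE_RANK` set in each node's environment (e.g. `NODE_RANK=0` on the master and `NODE_RANK=1` through `5` on the workers). On a SLURM cluster, launching with `srun` lets Lightning read this information from the SLURM environment automatically.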
Replies: 2 comments

- hi @rahaazad2 👋! are you changing the …

- I had a similar problem, which was caused by an interaction between argparse and the way PyTorch Lightning launches processes for multi-GPU runs. To solve it, I adjusted the parsing: …
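The comment above is truncated, but a common workaround for argparse clashing with Lightning's multi-GPU launching is to make the parser tolerate arguments it does not recognize, since re-launched worker processes can receive a command line the parser was not written for. The sketch below is an assumed reconstruction of that kind of fix, not the commenter's actual code:

```python
import argparse


def parse_config():
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", type=str, required=True)
    # parse_known_args() returns (namespace, leftover) and, unlike parse_args(),
    # does not abort on arguments it does not recognize.
    args, _unknown = parser.parse_known_args()
    return args
```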