
Metrics not being logged properly on remote GPU #2479

Open

aaronwtr opened this issue Mar 27, 2024 · 4 comments
Labels
bug / fix · help wanted · v1.3.x

Comments


aaronwtr commented Mar 27, 2024

🐛 Bug

I have written some PyTorch Lightning code that I am evaluating with torchmetrics. Locally, on my CPU, everything works fine. However, when I move my script to a cluster and try to run it on a single GPU, I run into problems. Specifically, it seems like my error metrics are not being calculated properly (see the attached screenshots). What could be the cause here?

To Reproduce

I am loading my data with the LinkNeighborLoader provided by PyG. I suspect the observed behaviour might be partially attributable to the way LinkNeighborLoader sends (or doesn't send?) the data to the correct device. Torchmetrics should handle device placement out of the box, but it might be that LinkNeighborLoader doesn't integrate properly here.
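One way to verify this (a hypothetical sketch, not part of my actual code) would be to print the devices seen inside the LightningModule, e.g. in a batch hook. Lightning moves the torchmetrics objects together with the module, so everything should report the same device:

    def on_validation_batch_start(self, batch, batch_idx, dataloader_idx=0):
        # Hypothetical sanity check: compare the device of the batch produced by
        # LinkNeighborLoader with the device of the metrics and the module itself.
        if batch_idx == 0:
            print('batch.x device:         ', batch.x.device)
            print('batch.edge_label device:', batch.edge_label.device)
            print('self.val_mcc device:    ', self.val_mcc.device)
            print('module device:          ', self.device)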

I load my data as follows:

class HGSLNetDataModule(pl.LightningDataModule):
    def __init__(self, train_graph, test_graph, config):
        super().__init__()
        self.config = config
        self.train_graph = train_graph
        self.val_graph = None
        self.test_graph = test_graph
        self.scaler = None

    def setup(self, stage=None):
        if stage == 'fit' or stage is None:
            self.train_graph, self.val_graph, _ = self.link_split_transform()

            if os.path.exists('cache/quantile_scaler.pkl'):
                with open('cache/quantile_scaler.pkl', 'rb') as f:
                    self.scaler = pkl.load(f)
                self.train_graph.x = self.scaler.transform(self.train_graph.x)
            else:
                self.scaler = QuantileTransformer()
                self.train_graph.x = self.scaler.fit_transform(self.train_graph.x)
                with open('cache/quantile_scaler.pkl', 'wb') as f:
                    pkl.dump(self.scaler, f)
            self.val_graph.x = self.scaler.transform(self.val_graph.x)
            self.train_graph.x = torch.from_numpy(self.train_graph.x).to(torch.float32)
            self.val_graph.x = torch.from_numpy(self.val_graph.x).to(torch.float32)

            self.train_edge_labels, self.train_edge_label_index = (self.train_graph.edge_label,
                                                                   self.train_graph.edge_label_index)
            self.val_edge_labels, self.val_edge_label_index = self.val_graph.edge_label, self.val_graph.edge_label_index
        elif stage == 'test':
            self.test_graph = self.test_graph
            with open('cache/quantile_scaler.pkl', 'rb') as f:
                self.scaler = pkl.load(f)
            self.test_graph.x = self.scaler.transform(self.test_graph.x)
            self.test_graph.x = torch.from_numpy(self.test_graph.x).to(torch.float32)

            self.test_edge_labels, self.test_edge_label_index = (self.test_graph.edge_label,
                                                                 self.test_graph.edge_index)

    def train_dataloader(self):
        return LinkNeighborLoader(
            self.train_graph,
            batch_size=self.config['batch_size'],
            edge_label=self.train_edge_labels,
            edge_label_index=self.train_edge_label_index,
            num_neighbors=[20, 10],
            pin_memory=True,
            shuffle=True
        )
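
For reference, the val loader mirrors the train loader (a sketch; my actual code only differs in minor details). It uses the validation labels prepared in setup() and disables shuffling:

    def val_dataloader(self):
        # Same construction as train_dataloader, but over the validation graph
        # and without shuffling.
        return LinkNeighborLoader(
            self.val_graph,
            batch_size=self.config['batch_size'],
            edge_label=self.val_edge_labels,
            edge_label_index=self.val_edge_label_index,
            num_neighbors=[20, 10],
            pin_memory=True,
            shuffle=False
        )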

Similar loaders are used for my val and test data. Then, I initialise my metrics in the LightningModule where I define the training and testing logic:

class HGSLNet(pl.LightningModule):
    def __init__(self, num_layers, hidden_channels, num_heads, config):
        super().__init__()
        self.config = config

        self.model = GATModel(num_layers, hidden_channels, num_heads)

        self.best_avg_mcc = 0
        self.best_avg_acc = 0

        self.val_mcc = MatthewsCorrCoef(task='binary')
        self.val_acc = Accuracy(task='binary')

        self.test_mcc = MatthewsCorrCoef(task='binary')
        self.test_acc = Accuracy(task='binary')

E.g., in my validation step, I log the metrics as follows:

    def validation_step(self, batch, batch_idx):
        x, edge_label_index, y = batch.x, batch.edge_label_index, batch.edge_label
        logit, proba, pred = self(x, edge_label_index)
        _y = y.float()

        self.val_batch_size = batch.edge_label_index.size(1)

        loss = F.binary_cross_entropy_with_logits(logit, _y)

        val_mcc = self.val_mcc(pred, y)
        val_acc = self.val_acc(pred, y)

        self.log('val_loss', loss, on_step=False, on_epoch=True, batch_size=self.val_batch_size)
        self.log('val_mcc', val_mcc, on_step=False, on_epoch=True, batch_size=self.val_batch_size)
        self.log('val_acc', val_acc, on_step=False, on_epoch=True, batch_size=self.val_batch_size)

        return loss

The logic is similar for my training and testing steps (a sketch of the training step is included below for reference). This leads to the loss curves shown in the attached screenshots, where green is the run on the HPC cluster GPU and red is the local run on CPU.
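
For reference, a training step along the same lines (a sketch; my actual code may differ slightly) only logs the loss, since I only track MCC/accuracy for validation and test:

    def training_step(self, batch, batch_idx):
        # Same unpacking and loss computation as in validation_step,
        # but only the loss is logged.
        x, edge_label_index, y = batch.x, batch.edge_label_index, batch.edge_label
        logit, proba, pred = self(x, edge_label_index)
        loss = F.binary_cross_entropy_with_logits(logit, y.float())
        self.log('train_loss', loss, on_step=False, on_epoch=True,
                 batch_size=batch.edge_label_index.size(1))
        return loss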

Expected behavior

The metrics should be logged properly at each epoch.

Environment

Python environment:

  • Python==3.10.7
  • PyTorch==2.2.0+cu118
  • cudnn==8.9.4
  • cuda==12.2
  • gcc==12.1.0
  • lightning==2.1.3
  • torchmetrics==1.3.0.post0
  • torch_geometric==2.4.0
  • pyg-lib==0.4.0

OS:

  • CentOS Linux 7 (Core)

Additional context

Loss curves:
[Two screenshots (2024-03-27) comparing the logged curves: green is the HPC cluster GPU run, red is the local CPU run.]

aaronwtr added the bug / fix and help wanted labels on Mar 27, 2024

Hi! Thanks for your contribution, great first issue!

aaronwtr (Author) commented:

@rusty1s @jxtngx @Borda

Borda added the v1.3.x label on Mar 28, 2024
SkafteNicki (Member) commented:

Hi @aaronwtr, thanks for raising this issue.
Have you solved the issue, or does it still persist? I think a bit more information is needed here. The logging behavior should really not change when you move from one device to another (even, as in this case, to another computer system). I wonder whether it is the logging that is going wrong here or the actual training, e.g. if you just print the metric values to the terminal, are they still constant?
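For example, something along these lines (just a sketch) bypasses the loggers completely:

    def on_validation_epoch_end(self):
        # Print the epoch-level aggregates directly instead of relying on self.log.
        print('epoch', self.current_epoch,
              'val_mcc:', self.val_mcc.compute().item(),
              'val_acc:', self.val_acc.compute().item())
        # Reset manually: since validation_step logs the computed tensors rather
        # than the metric objects, Lightning does not reset the metric state itself.
        self.val_mcc.reset()
        self.val_acc.reset()

If those printed values look reasonable on the GPU run, the problem is more likely on the logging side.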

SkafteNicki (Member) commented:

Additionally, if you can somehow provide a fully reproducible example, that would be nice.
