[Bug] Unexpected error when upgrading DGL version from 1.1.3 to 2.1.0 #7333

Closed

jalencato opened this issue Apr 20, 2024 · 15 comments · Fixed by #7409
@jalencato

🐛 Bug

When switching the DGL version from 1.1.3 to 2.1.0, I hit the following problem:

2024-04-20T00:46:10.970Z	File "/graphstorm/python/graphstorm/model/embed.py", line 401, in forward
2024-04-20T00:46:10.970Z	emb = self.sparse_embeds[ntype](input_nodes[ntype], device)
2024-04-20T00:46:10.970Z	File "/usr/local/lib/python3.8/dist-packages/dgl/distributed/nn/pytorch/sparse_emb.py", line 112, in __call__
2024-04-20T00:46:10.970Z	emb = self._tensor[idx].to(device, non_blocking=True)
2024-04-20T00:46:10.970Z	File "/usr/local/lib/python3.8/dist-packages/dgl/distributed/dist_tensor.py", line 205, in __getitem__
2024-04-20T00:46:10.970Z	return self.kvstore.pull(name=self._name, id_tensor=idx)
2024-04-20T00:46:10.970Z	File "/usr/local/lib/python3.8/dist-packages/dgl/distributed/kvstore.py", line 1453, in pull
2024-04-20T00:46:10.970Z	part_id = self._part_policy[name].to_partid(id_tensor)
2024-04-20T00:46:10.970Z	File "/usr/local/lib/python3.8/dist-packages/dgl/distributed/graph_partition_book.py", line 1096, in to_partid
2024-04-20T00:46:10.970Z	return self._partition_book.nid2partid(id_tensor, self.type_name)
2024-04-20T00:46:10.970Z	File "/usr/local/lib/python3.8/dist-packages/dgl/distributed/graph_partition_book.py", line 789, in nid2partid
2024-04-20T00:46:10.971Z	nids = nids.numpy()
2024-04-20T00:46:10.971Z	TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.

My code snippet looks like this:

self._sparse_embeds[ntype] = DistEmbedding(g.number_of_nodes(ntype),
                                           self.embed_size,
                                           embed_name + '_' + ntype,
                                           init_emb,
                                           part_policy=part_policy)
......

if len(input_nodes[ntype]) == 0:
    dtype = self.sparse_embeds[ntype].weight.dtype
    embs[ntype] = th.zeros((0, self.sparse_embeds[ntype].embedding_dim),
                           device=device, dtype=dtype)
    continue
# input_nodes[ntype] is a CUDA tensor here, which is what triggers the traceback above
emb = self.sparse_embeds[ntype](input_nodes[ntype], device)
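For context, the TypeError at the bottom of the traceback is plain PyTorch behaviour: .numpy() only works on CPU tensors, and nid2partid calls nids.numpy() on whatever index tensor it receives. A minimal sketch of the failure and the host-side copy it asks for, independent of DGL:

import torch as th

ids = th.arange(4, device="cuda:0")  # node IDs living on the GPU, as in the traceback above
# ids.numpy()                        # raises: can't convert cuda:0 device type tensor to numpy
ids_cpu = ids.cpu()                  # copy to host memory first ...
print(ids_cpu.numpy())               # ... then the numpy conversion succeeds

A possible stopgap on the caller side (an assumption about the call site, not the upstream fix) would be to pass input_nodes[ntype].cpu() into self.sparse_embeds[ntype](...), since the lookup already copies the fetched rows to device afterwards.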

To Reproduce

Steps to reproduce the behavior:

We are getting this error when using GraphStorm:

python3 /graphstorm/tools/gen_ogb_dataset.py --savepath /tmp/ogbn-arxiv-nc/ --retain-original-features true

python3 /graphstorm/tools/partition_graph.py --dataset ogbn-arxiv \
                                            --filepath /tmp/ogbn-arxiv-nc/ \
                                            --num-parts 1 \
                                            --num-trainers-per-machine 4 \
                                            --output /tmp/ogbn_arxiv_nc_train_val_1p_4t

python3 -m graphstorm.run.gs_node_classification \
       --workspace /tmp/ogbn-arxiv-nc \
       --num-trainers 1 \
       --num-servers 1 \
       --num-samplers 0 \
       --part-config /tmp/ogbn_arxiv_nc_train_val_1p_4t/ogbn-arxiv.json \
       --ip-config  /tmp/ip_list.txt \
       --ssh-port 22 \
       --cf /graphstorm/training_scripts/gsgnn_np/arxiv_nc.yaml \
       --save-perf-results-path /tmp/ogbn-arxiv-nc/models

Expected behavior

When running on DGL 1.1.3 I did not have this problem.

Environment

  • DGL Version (e.g., 1.0): 2.1
  • Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3): Pytorch 2.1.0 + CUDA 12.1
  • OS (e.g., Linux): Linux
  • How you installed DGL (conda, pip, source): pip
  • Build command you used (if compiling from source):
  • Python version: 3.8
  • CUDA/cuDNN version (if applicable): 12
  • GPU models and configuration (e.g. V100): I am running on AWS G4 instance
  • Any other relevant information:

Additional context

@jalencato jalencato changed the title When switching to DGL 2.1.0 + Pytorch 2.1.0 on CUDA 12.0 [Bug] Unexpected error when upgrading DGL version from 1.1.3 to 2.1.0 Apr 20, 2024
@thvasilo
Contributor

@Rhett-Ying could you take a look here? I see TODOs listed in this code to replace the numpy operations with torch ones.

@Rhett-Ying
Collaborator

I fixed the bug in a6505e8, but it is not merged into DGL 2.1. For now it is only on the master branch, so you could try the latest DGL nightly build.

The fix will be included in the next release, DGL 2.2, which should be out in early May.

@Rhett-Ying Rhett-Ying self-assigned this Apr 23, 2024
@Rhett-Ying
Collaborator

The fix is included in the latest release, DGL 2.2.1. Please try it.

@thvasilo
Contributor

Hi @Rhett-Ying, I tried reproducing the example that @jalencato listed and got:

Traceback (most recent call last):
  File "/graphstorm/python/graphstorm/run/gsgnn_np/gsgnn_np.py", line 190, in <module>
    main(gs_args)
  File "/graphstorm/python/graphstorm/run/gsgnn_np/gsgnn_np.py", line 143, in main
    trainer.fit(train_loader=dataloader, val_loader=val_dataloader,
  File "/graphstorm/python/graphstorm/trainer/np_trainer.py", line 189, in fit
    self.optimizer.step()
  File "/graphstorm/python/graphstorm/model/gnn.py", line 119, in step
    optimizer.step()
  File "/opt/gs-venv/lib/python3.9/site-packages/dgl/distributed/optim/pytorch/sparse_optim.py", line 355, in step
    alltoall(
  File "/opt/gs-venv/lib/python3.9/site-packages/dgl/distributed/optim/pytorch/utils.py", line 88, in alltoall
    alltoall_cpu(
  File "/opt/gs-venv/lib/python3.9/site-packages/dgl/distributed/optim/pytorch/utils.py", line 26, in alltoall_cpu
    dist.scatter(
  File "/opt/gs-venv/lib/python3.9/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/opt/gs-venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 3174, in scatter
    work = default_pg.scatter(output_tensors, input_tensors, opts)
RuntimeError: ProcessGroupGloo::scatter: invalid tensor type at index 0 (expected TensorOptions(dtype=long int, device=cuda:0, layout=Strided, requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)), got TensorOptions(dtype=long int, device=cpu, layout=Strided, requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))

This is on dgl==2.2.1+cu121 and Torch 2.1 running inside the GraphStorm container on g5 instances.

@Rhett-Ying Rhett-Ying reopened this May 14, 2024
@Rhett-Ying
Collaborator

@thvasilo are you running on GPU with a backend other than nccl?

@thvasilo
Contributor

Correct

@Rhett-Ying
Collaborator

Why not use nccl as the backend? Does it work well?

@Rhett-Ying
Collaborator

This seems to be a new bug, and I don't know why it's triggered now.

@thvasilo
Contributor

I will try with nccl; so far we've only used gloo in GraphStorm, AFAIK. Is there a reason to avoid nccl, @classicsong?

@Rhett-Ying
Collaborator

Could you try to figure out why the tensors below are on different devices? Both of them are supposed to be on CPU. This is the direct cause of the crash, right?

gather_list,
idx_split_size,
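
For reference, a hypothetical way to confirm the mismatch is a debug print dropped right next to those two lines in sparse_optim.py; idx_split_size and gather_list are the names quoted above, and the exact location may differ across versions:

# Hypothetical debug print to place just before the collective call.
print("idx_split_size:", idx_split_size.device)
print("gather_list:", [t.device for t in gather_list])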

@Rhett-Ying
Collaborator

Rhett-Ying commented May 14, 2024

DGL master (almost the same as 2.2.1) + torch 2.1.0+cu121

I tried running https://github.com/dmlc/dgl/blob/master/examples/distributed/rgcn/node_classification.py with --num_gpus 4 --sparse-embedding --dgl-sparse --backend gloo, which uses dgl.distributed.optim.SparseAdam (the class that crashed in your case), and it works well with the gloo backend. Please note that the example uses nccl for GPU training, so manually modifying the code to use gloo is required.

If I use nccl, it crashes with the error below:

File "/home/ubuntu/workspace/dgl_1/python/dgl/distributed/optim/pytorch/utils.py", line 86, in alltoall                         
            th.distributed.all_to_all(output_tensor_list, input_tensor_list)th.distributed.all_to_all(output_tensor_list, input_te
nsor_list)th.distributed.all_to_all(output_tensor_list, input_tensor_list) 
No backend type associated with device type cpu

This seems to make sense, as we support alltoall_cpu() only?
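
A minimal sketch of that constraint, assuming two or more ranks launched with torchrun and using all_reduce as a stand-in for the real all-to-all:

import torch as th
import torch.distributed as dist

# Switch backend to "nccl" to see the failure mode quoted above.
dist.init_process_group(backend="gloo")

t = th.zeros(4, dtype=th.int64)  # CPU tensor, like the ones in the sparse optimizer path
dist.all_reduce(t)               # gloo: fine on CPU tensors
                                 # nccl: raises "No backend type associated with device type cpu"
                                 #       unless the tensor is moved to the GPU first
dist.destroy_process_group()

gloo does not implement all_to_all at all, which is presumably why DGL routes non-nccl backends through the scatter-based alltoall_cpu seen in the tracebacks above.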

@Rhett-Ying
Collaborator

I've reproduced it and found the commit to blame: 5dfaf99

This commit adds device to gather_list and idx_split_size, but it didn't account for the fact that alltoall supports CPU only when the backend is not nccl. And device is overridden by device = grads.device in

This change was merged after DGL 1.1.3, which is why we hit the issue in DGL 2.2.1.

In short, the direct cause is that previously the tensors were always on CPU, so it worked well with gloo. But now the tensors are on GPU, while the underlying alltoall call supports CPU tensors only if the backend is gloo.
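
For illustration, a hedged sketch of the reconciliation this diagnosis points to (not the actual patch in #7409): let nccl take the CUDA path directly, and stage tensors on the CPU around the gloo scatter-based emulation, copying results back afterwards. The helper name and structure are illustrative; the scatter loop mirrors alltoall_cpu from the tracebacks above.

import torch as th
import torch.distributed as dist

def alltoall_any_device(output_list, input_list):
    # Illustrative helper, not DGL's patch: rank r sends input_list[j] to rank j
    # and receives from rank j into output_list[j], wherever the tensors live.
    if dist.get_backend() == "nccl":
        dist.all_to_all(output_list, input_list)  # nccl works on CUDA tensors directly
        return
    rank, world_size = dist.get_rank(), dist.get_world_size()
    cpu_in = [t.cpu() for t in input_list]                      # stage inputs on the CPU
    cpu_out = [th.empty_like(t, device="cpu") for t in output_list]
    for src in range(world_size):
        # Emulate all_to_all with one scatter per source rank, as alltoall_cpu does.
        dist.scatter(cpu_out[src], cpu_in if rank == src else None, src=src)
    for dst, staged in zip(output_list, cpu_out):
        dst.copy_(staged)                                       # copy back to the original device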

@thvasilo
Contributor

Hi @Rhett-Ying, I ran the repro example that @jalencato posted with the code from #7409, and it works fine now. I think we can close this once that PR is merged.

@Rhett-Ying
Collaborator

@thvasilo could you run more examples to make sure that PR does not trigger any other issues?

@thvasilo
Contributor

If we merge this, we can run our automated integration tests with the daily pip build; I don't have the bandwidth to run more manual tests on the PR code.
