
Issue with StreamingDataset when not using all GPUs on host. #91

Closed
gkroiz opened this issue Apr 3, 2024 · 6 comments · Fixed by #95
Labels
bug Something isn't working help wanted Extra attention is needed

Comments

@gkroiz (Contributor) commented Apr 3, 2024

🐛 Bug

When initializing a StreamingDataset object, _DistributedEnv.detect() is called (code), and the detect() function contains a world_size check (code). This check fails in my use case because I am not using all GPUs on the host: world_size is set to 6 but torch.cuda.device_count() returns 8, and since 6 % 8 != 0 the error is raised.

Perhaps this check failure is the intended behavior, but I do not know the litdata repository well enough to understand why the code should raise an error when world_size % device_count != 0. I would imagine that torch runs various checks when setting up a distributed environment, so this check should not be needed unless it serves a particular purpose for StreamingDataset or another object. Any insight on (1) whether this check is needed and (2) why it is needed would be great!
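For reference, a rough sketch of the failing check as I understand it from the description above (function and variable names are approximate, not the actual litdata source):

```python
import os


def detect_distributed_env(device_count: int):
    """Sketch of a detect()-style check: world_size must be a
    multiple of the number of GPUs visible on this host."""
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    global_rank = int(os.environ.get("GLOBAL_RANK", 0))

    # The check in question: fails when fewer GPUs are used than
    # are visible, e.g. world_size=6 with device_count=8.
    if world_size > 1 and device_count > 0 and world_size % device_count != 0:
        raise RuntimeError(
            f"world_size={world_size} is not divisible by "
            f"device_count={device_count}"
        )

    # Node count is inferred from the divisibility assumption.
    num_nodes = max(1, world_size // device_count) if device_count else 1
    return world_size, global_rank, num_nodes
```

With WORLD_SIZE=6 on a host where torch.cuda.device_count() is 8, the modulo check fails and the error described above is raised; with WORLD_SIZE=16 and 8 GPUs per host, it infers 2 nodes.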

To Reproduce

This issue came up when experimenting with LitGPT. I don't think this section or the following sections are necessary, so I will leave them blank for now. Please let me know if any of these sections would help and I can add more detail.

Code sample

Not needed at the moment.

Expected behavior

Not needed at the moment.

Environment

Not needed at the moment.

Additional context

@gkroiz gkroiz added bug Something isn't working help wanted Extra attention is needed labels Apr 3, 2024

github-actions bot commented Apr 3, 2024

Hi! Thanks for your contribution, great first issue!

@tchaton (Collaborator) commented Apr 4, 2024

Hey @gkroiz, do you want to submit a fix for it?

@gkroiz (Contributor, Author) commented Apr 4, 2024

I realized that when using _DistributedEnv.detect() in a multi-node environment where you are not using all available GPUs, it is not possible (at least with torch) to determine the number of nodes. I will create a PR that mentions this.

@gkroiz (Contributor, Author) commented Apr 4, 2024

After putting some more thought here, a few things:

  1. It makes sense that _DistributedEnv.detect() produces the error I was seeing, since the detect() function does not have the information it needs from pytorch's APIs to determine the node count. For example, if my world_size is 4 (meaning I'm using 4 GPUs) but torch.cuda.device_count()=8, how can pytorch tell whether there is 1 node with 4 idle GPUs versus 4 nodes each using only 1 GPU? There is too much ambiguity here, so it is good for the code to raise an error. Perhaps leaving this as is makes the most sense, or perhaps adding a few lines of additional comments.

  2. Currently, users of StreamingDataset automatically get _DistributedEnv.detect() (ref) unless they edit the source code. Maybe it makes sense to let users provide a _DistributedEnv object when initializing a StreamingDataset, for cases where _DistributedEnv.detect() is insufficient, such as multi-CPU setups or when not all available GPUs are used.
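The ambiguity in point 1 can be made concrete with a small sketch (this helper is hypothetical, written only to illustrate the point):

```python
def consistent_topologies(world_size: int, device_count: int):
    """Enumerate every (num_nodes, gpus_per_node) layout consistent
    with a given world_size when gpus_per_node may be anywhere from
    1 up to the number of GPUs visible on a host."""
    layouts = []
    for gpus_per_node in range(1, device_count + 1):
        if world_size % gpus_per_node == 0:
            layouts.append((world_size // gpus_per_node, gpus_per_node))
    return layouts


# world_size=4 with 8 visible GPUs admits (4, 1), (2, 2), and (1, 4):
# the node count cannot be recovered from world_size alone.
print(consistent_topologies(4, 8))
```

Since several layouts fit the same world_size, no local check can disambiguate them without extra information from the user or the launcher.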

What are your thoughts here, @tchaton?

@tchaton (Collaborator) commented Apr 4, 2024

@gkroiz Lots of great points. Maybe we could have the following behaviour to unblock you.

By default, we raise an error because we don't have enough information. However, if the user provides extra environment variables (https://github.com/Lightning-AI/pytorch-lightning/tree/master/src/lightning/fabric/plugins/environments), we could accept the execution.

Passing a _DistributedEnv is also a good option.
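A minimal sketch of the environment-variable fallback (the NNODES variable name and the function are assumptions for illustration, not an established litdata or Lightning convention):

```python
import os


def num_nodes_from_env(default: int = 1) -> int:
    """Trust an explicit node count from the environment when it
    cannot be inferred from world_size and device_count alone.
    NNODES is a hypothetical variable name used for this sketch."""
    value = os.environ.get("NNODES")
    if value is not None:
        return int(value)
    # Without user-provided information, fall back to a default
    # (or, as discussed above, raise an error instead).
    return default
```

The idea is simply that the launcher or user, who knows the real topology, supplies the missing fact, so detect() no longer needs to guess.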

Do you want to give either of them a try?

@gkroiz (Contributor, Author) commented Apr 5, 2024

Yes, I can put together a PR with one of the solutions when I get the time.
