High memory usage with datasets (specifically when multi procs are used) #1498

Open
albertz opened this issue Jan 18, 2024 · 0 comments

albertz commented Jan 18, 2024

For single-GPU training, without PyTorch DataLoader multiprocessing and without MultiProcDataset, the memory usage of the dataset is maybe not too much of a problem. However, it is not uncommon to have one or more of the following:

  • Distributed multi-GPU training, i.e. multiple worker processes, each holding its own instance of the dataset in memory.
  • PyTorch DataLoader multiprocessing. The dataset then lives in the main proc (although it is freed there if the dataset supports finish_epoch with free_resources, see also MultiProcDataset, high memory usage #1443) and additionally in every DataLoader worker.
  • MultiProcDataset, which holds the dataset in each of its workers.
  • Multiple datasets: usually train and dev, maybe also devtrain.

When several of these are used together, the number of dataset instances in memory is multiplied by quite a high factor; see the rough illustration below.
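For illustration (purely hypothetical numbers, not the exact breakdown of the 34 instances mentioned below), the multiplication could look like this:

```python
# Hypothetical setup: 2 GPU workers (distributed training), a DataLoader
# main proc plus 2 DataLoader workers each, MultiProcDataset with 2 workers
# per dataset, and 3 datasets (train, dev, devtrain).
num_gpu_workers = 2
procs_per_dataloader = 1 + 2  # main proc + DataLoader workers
multiproc_workers = 2         # MultiProcDataset workers per dataset
num_datasets = 3              # train, dev, devtrain

num_instances = num_gpu_workers * procs_per_dataloader * multiproc_workers * num_datasets
print(num_instances)  # 36 dataset instances in memory for this hypothetical setup
```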

In my case, I even have all three together. This leads to 34 instances (see #1443 (comment)).

See the related issue #1443, which is specifically about MultiProcDataset.

This issue is to discuss potential further solutions to the problem. Those solutions probably involve a new type of dataset with only minimal memory requirements, which e.g. mmaps the data or shares the memory across processes somehow. Probably we would use some existing library which does this for us. I'm not sure to what extent HDF or Apache Arrow or similar already provide exactly that.
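To sketch what the mmap direction could look like (a minimal, hypothetical example, not an existing RETURNN dataset; the file names, layout and class here are made up): the data would be converted once into a flat binary file plus an offsets index, and every process, whether a DataLoader worker, a MultiProcDataset worker or a distributed-training rank, only maps that file, so the physical pages are shared via the OS page cache instead of being copied per process.

```python
import numpy as np

# Hypothetical on-disk layout: one flat float32 feature matrix plus an
# int64 offsets array with one entry per sequence boundary (num_seqs + 1).
FEATURES_PATH = "features.f32"  # hypothetical file name
OFFSETS_PATH = "offsets.i64"    # hypothetical file name
FEATURE_DIM = 80                # hypothetical feature dimension


class MmapSeqData:
    """Read-only sequence access backed by np.memmap (sketch).

    Every process (DataLoader worker, MultiProcDataset worker, distributed
    rank) creates its own instance, but the underlying file pages live in
    the OS page cache and are shared, so the data is not duplicated per proc.
    """

    def __init__(self):
        # np.memmap only maps the files; nothing is loaded into RAM here.
        self.offsets = np.memmap(OFFSETS_PATH, dtype=np.int64, mode="r")
        feats = np.memmap(FEATURES_PATH, dtype=np.float32, mode="r")
        self.features = feats.reshape(-1, FEATURE_DIM)

    def __len__(self):
        return len(self.offsets) - 1

    def __getitem__(self, seq_idx):
        start, end = int(self.offsets[seq_idx]), int(self.offsets[seq_idx + 1])
        # Slicing the memmap gives a view; copy only the frames actually needed.
        return np.array(self.features[start:end])
```

Apache Arrow (e.g. pyarrow.memory_map together with the IPC file format) and HDF5 provide similar memory-mapped or chunked on-demand access, so presumably we would not need to implement this by hand.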

But this is also an ongoing discussion for PyTorch and other frameworks, e.g. see the discussions here:
pytorch/pytorch#13246
pytorch/pytorch#101699
