
How to set dshm size for training? #1044

Open
Andrew-Su-0718 opened this issue Feb 29, 2024 · 1 comment
Andrew-Su-0718 commented Feb 29, 2024

When I submit a pytorchjob with arena, I couldn't find any parameter related to the shared-memory size, which is very important for PyTorch training.

The size is fixed to 2Gi.

...
    - mountPath: /dev/shm
      name: dshm
...
...
  - emptyDir:
      medium: Memory
      sizeLimit: 2Gi
    name: dshm
...

Does anyone know how to set the dshm size?

Andrew-Su-0718 (Author) commented:

OK, I found a workaround. In the file /charts/pytorchjob/values.yaml, change:

shmSize: 2Gi

to

shmSize: 64Gi # or any value you want
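For reference, this mechanism is not arena-specific: in any Kubernetes pod, /dev/shm can be enlarged by mounting a memory-backed emptyDir volume over it, and the chart's shmSize value ends up in the emptyDir sizeLimit field shown in the generated spec above. A minimal standalone sketch (the pod name, container name, and image are illustrative, not from the issue):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pytorch-train            # illustrative name
spec:
  containers:
    - name: trainer              # illustrative name
      image: pytorch/pytorch     # illustrative image
      command: ["sleep", "infinity"]
      volumeMounts:
        - mountPath: /dev/shm    # replace the default 64Mi /dev/shm
          name: dshm
  volumes:
    - name: dshm
      emptyDir:
        medium: Memory           # back the volume with tmpfs (RAM)
        sizeLimit: 64Gi          # shared-memory ceiling for the container
```

Note that a Memory-medium emptyDir is tmpfs, so data written to /dev/shm counts against the container's memory limit; size the pod's memory requests/limits accordingly.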
