disk space usage problem #258

Open
showkeyjar opened this issue Jan 13, 2023 · 11 comments

Comments

@showkeyjar

I found a problem:

When I use xgboost_ray to train multiple models on Linux, the "/tmp/ray/" dir keeps growing.

If the training data is large, the system runs out of disk space quickly.

I tried to fix it with "rm -rf /tmp/ray/", but then the training process got stuck in an endless loop, waiting for a Ray actor forever.

I guess "import xgboost_ray" does some initialization for Ray,

so I added "import importlib" and tried "importlib.reload('xgboost_ray')", but it did not work.

Please look into this issue.

@rkooo567

cc @matthewdeng what's the best way to debug object store memory usage for xgboost on ray?

@showkeyjar I think your workload has high object store usage, which triggers spilling: https://docs.ray.io/en/master/ray-core/objects/object-spilling.html

When your disk usage keeps increasing, what's the output of ray memory --stats-only?

@matthewdeng
Contributor

@showkeyjar do you have a repro for this? How much training data are you loading and how much disk space are you seeing consumed?

@Yard1
Member

Yard1 commented Jan 13, 2023

Are you using Ray Datasets? There's an issue with xgboost-ray we are working on currently that causes the data to be loaded in a suboptimal manner, causing too much object store usage.

@showkeyjar
Author

showkeyjar commented Jan 16, 2023

Thanks for all your advice.

@rkooo567 ray memory --stats-only cannot find any Ray instance:
ConnectionError: Could not find any running Ray instance. Please specify the one to connect to by setting the --address flag or RAY_ADDRESS environment variable.

@matthewdeng Training data size: 1395642, boosting rounds: 20, disk usage: 15G.

The training code is here: https://github.com/showkeyjar/mymodel/blob/main/train_model_ray.py

@Yard1 No, I use a pandas DataFrame converted to a Ray dataset.

@showkeyjar
Author

showkeyjar commented Jan 16, 2023

I alleviated the problem by using a shell for-loop script to call the Python training code, but I still don't know why a Python for loop causes the disk usage to increase.

I'm sure that the disk usage increase happens in the /tmp/ray/ dir.
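
The shell-loop workaround can also be driven from Python by starting every training run in its own interpreter. This is only a minimal sketch: `run_training_jobs` and its arguments are hypothetical names, and it assumes the training script can be run on its own from the command line.

```python
import subprocess
import sys


def run_training_jobs(script_path, job_args, python=sys.executable):
    """Run one training job per fresh Python process, sequentially.

    Returns the list of exit codes, one per job.
    """
    exit_codes = []
    for args in job_args:
        # Each job gets its own interpreter, so any Ray state it creates
        # ends when that process exits.
        completed = subprocess.run([python, script_path, *args], check=False)
        exit_codes.append(completed.returncode)
    return exit_codes
```

For example, `run_training_jobs("train_model_ray.py", [["model_a"], ["model_b"]])` would run the script twice, where the per-job arguments are placeholders for whatever the script actually accepts.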

@Yard1
Member

Yard1 commented Jan 16, 2023

Ray is using a mechanism called object spilling, where objects that cannot fit into the memory object store are instead put on disk. Can you run the ray memory --stats-only command in a separate terminal window while the xgboost-ray training is in progress?
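
If spilling does turn out to be the cause, the spill files can be pointed at a larger disk. The sketch below follows the pattern from the object spilling docs linked earlier in this thread; the helper names are made up for illustration, and the directory you pass in is your own choice.

```python
import json


def spilling_system_config(directory_path):
    """Build the `_system_config` dict that points Ray's spill files at a directory."""
    return {
        "object_spilling_config": json.dumps(
            {"type": "filesystem", "params": {"directory_path": directory_path}}
        )
    }


def init_ray_with_spill_dir(directory_path):
    """Start Ray with spilled objects written under `directory_path`."""
    import ray  # imported lazily so the helper above works without Ray installed

    return ray.init(_system_config=spilling_system_config(directory_path))
```

Call `init_ray_with_spill_dir(...)` once, before any xgboost-ray training starts, so that the setting applies to the Ray instance the training uses.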

Also, are you running this on a single machine, or multiple machines?

@showkeyjar
Author

@Yard1

======== Object references status: 2023-01-16 15:19:13.215008 ========
--- Aggregate object store stats across all nodes ---
Plasma memory usage 67279 MiB, 40 objects, 62.69% full, 43.41% needed
Objects consumed by Ray tasks: 67281 MiB.

@showkeyjar
Author

I'm so depressed that this issue has not been solved yet, but I found some new information:

  1. Ray stores its temp files in the /tmp/ray/session_{datetime}_XXXX_XXXX/ dir.
    If we could get the Ray session dir, we could remove the temp files when xgb_ray training has finished.
  2. Ray accepts a _temp_dir argument at init, but it still has a bug.
    If that bug is fixed, we could specify another temp dir when training a model.

Hope this helps.
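
A rough sketch of the two ideas combined: give Ray a private temp dir and delete it once training is done. It assumes `ray.init(_temp_dir=...)` behaves as intended, which the comment above reports is not yet the case, and `scratch_ray_dir`, `train_with_private_temp_dir` and `train_fn` are hypothetical names.

```python
import contextlib
import shutil
import tempfile


@contextlib.contextmanager
def scratch_ray_dir(prefix="ray_tmp_"):
    """Yield a fresh directory and delete it, with its contents, on exit."""
    path = tempfile.mkdtemp(prefix=prefix)
    try:
        yield path
    finally:
        shutil.rmtree(path, ignore_errors=True)


def train_with_private_temp_dir(train_fn):
    """Run `train_fn` with Ray's temp files in a directory removed afterwards."""
    import ray  # lazy import: only needed when training actually runs

    with scratch_ray_dir() as temp_dir:
        ray.init(_temp_dir=temp_dir)
        try:
            return train_fn()
        finally:
            # Stop Ray before the directory is deleted, so no live process
            # is still writing into it.
            ray.shutdown()
```

Shutting Ray down before removing the directory matters: deleting the files of a running session is what left the training waiting for an actor forever in the original report.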

@rkooo567

Based on your output ^, it looks like spilling isn't actually happening. I guess most of the disk usage is from Ray logs?
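
To repeat this check over several runs, the stats text can be parsed with a small helper. This is a sketch only: `summarize_object_store` is a hypothetical name, and it assumes that spilling, when it occurs, shows up as a line beginning with "Spilled", which should be verified against your Ray version.

```python
import re

PLASMA_RE = re.compile(
    r"Plasma memory usage (?P<mib>\d+) MiB, (?P<objects>\d+) objects, "
    r"(?P<full>[\d.]+)% full"
)


def summarize_object_store(stats_text):
    """Extract plasma usage and whether any 'Spilled' line is present."""
    match = PLASMA_RE.search(stats_text)
    if match is None:
        raise ValueError("no 'Plasma memory usage' line found")
    # Assumption: spilling, when it happens, is reported on a line that
    # starts with "Spilled".
    spilled = any(
        line.strip().startswith("Spilled") for line in stats_text.splitlines()
    )
    return {
        "plasma_mib": int(match.group("mib")),
        "objects": int(match.group("objects")),
        "percent_full": float(match.group("full")),
        "spilled": spilled,
    }
```

Fed the output posted above, this reports 67279 MiB in 40 objects at 62.69% full, with no spill line.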

@rkooo567

Is it correct the disk usage is mostly from /tmp/ray/session_latest/logs/?

@showkeyjar
Author

Is it correct the disk usage is mostly from /tmp/ray/session_latest/logs/?

Yes, it creates a link from /tmp/ray/session_latest/ to /tmp/ray/session_{datetime}_XXXX_XXXX/.
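
To confirm which part of the session dir takes the space, the size of each subdirectory can be measured. A minimal sketch using only the standard library; the function names are made up for illustration.

```python
import os


def dir_size_bytes(path):
    """Total size of regular files under `path`, without following symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):
                try:
                    total += os.path.getsize(file_path)
                except OSError:
                    pass  # file vanished or is unreadable; skip it
    return total


def size_by_subdir(session_dir):
    """Map each immediate subdirectory of `session_dir` to its size in bytes."""
    sizes = {}
    with os.scandir(session_dir) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                sizes[entry.name] = dir_size_bytes(entry.path)
    return sizes
```

Running `size_by_subdir("/tmp/ray/session_latest")` while training is in progress should show whether `logs` dominates the total.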
