[QST] Reset GPU to release memory resources #5889
hey @m946107011, great question, and thanks for providing what code you could. While we usually ask for some data (even fake data is fine), I will take a shot and assume your for loop captures the entirety of your code.
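Taking that shot: a minimal per-iteration cleanup sketch, not tested against your data. The key idea is that the `Client` and `LocalCUDACluster` you create inside the loop should be closed before the next iteration, and RMM reset, so device memory is actually released rather than held by the old cluster:

```python
import gc

import rmm
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

for file in files:  # `files` as in your loop
    cluster = LocalCUDACluster()
    client = Client(cluster)
    try:
        # ... load data and train one model here ...
        pass
    finally:
        # Drop references to device-backed objects, then tear the cluster down
        # so its workers (and their GPU allocations) go away.
        client.close()
        cluster.close()
        gc.collect()
        # Reset RMM's allocator state in the client process as well, in case
        # a pool was created there (e.g. via rmm.DeviceBuffer).
        rmm.reinitialize()
```

This pattern avoids relying on `del` alone, which only removes the local name and does not force the workers to free memory.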
Separately, while it sounds like you're using a single A100 40GB: if you are on a multi-GPU setup, below is some code you can play with that uses UCX, a faster interconnect than the default TCP. It might help with performance, though tuning will most likely be required for the best results.
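A sketch of that UCX-enabled cluster (the parameter values here are placeholders; whether NVLink or InfiniBand can be enabled depends on your hardware, and the pool size needs tuning for your GPUs):

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# Multi-GPU cluster using UCX instead of the default TCP protocol.
cluster = LocalCUDACluster(
    protocol="ucx",            # use the UCX interconnect
    enable_tcp_over_ucx=True,  # allow TCP transport over UCX
    enable_nvlink=True,        # only useful if GPUs are NVLink-connected
    enable_infiniband=False,   # set True on InfiniBand-equipped machines
    rmm_pool_size="20GB",      # per-GPU RMM memory pool; tune for your card
)
client = Client(cluster)
```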
Also, in terms of formatting (I know it's your first question), you may want to edit the post a bit to make the code one solid block; I don't have edit access. You can use triple-backtick code fences for that.
Hi:

```
KeyError                                  Traceback (most recent call last)
File ~/anaconda3/envs/rapids/lib/python3.9/site-packages/distributed/utils.py:1940, in wait_for(fut, timeout)
File ~/anaconda3/envs/rapids/lib/python3.9/asyncio/tasks.py:442, in wait_for(fut, timeout, loop)
File ~/anaconda3/envs/rapids/lib/python3.9/site-packages/distributed/scheduler.py:4039, in Scheduler.start_unsafe(self)
File ~/anaconda3/envs/rapids/lib/python3.9/site-packages/distributed/core.py:859, in Server.listen(self, port_or_addr, allow_offload, **kwargs)
File ~/anaconda3/envs/rapids/lib/python3.9/site-packages/distributed/comm/core.py:256, in Listener.__await__.<locals>._()
File ~/anaconda3/envs/rapids/lib/python3.9/site-packages/distributed/comm/ucx.py:527, in UCXListener.start(self)
File ~/anaconda3/envs/rapids/lib/python3.9/site-packages/distributed/comm/ucx.py:158, in init_once()
File ~/anaconda3/envs/rapids/lib/python3.9/site-packages/ucp/core.py:938, in init(options, env_takes_precedence, blocking_progress_mode)
File ~/anaconda3/envs/rapids/lib/python3.9/site-packages/ucp/core.py:223, in ApplicationContext.__init__(self, config_dict, blocking_progress_mode)
File ucp/_libs/ucx_context.pyx:78, in ucp._libs.ucx_api.UCXContext.__init__()
KeyError: 'TLS'

The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
File ~/anaconda3/envs/rapids/lib/python3.9/site-packages/distributed/core.py:672, in Server.start(self)
RuntimeError: Scheduler failed to start.

The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
Cell In[1], line 56, in MTForestNet_Multiprocess(folder_name, seed, problem_mode, main_perform, save_file_name, fast)
File ~/anaconda3/envs/rapids/lib/python3.9/site-packages/dask_cuda/local_cuda_cluster.py:352, in LocalCUDACluster.__init__(self, CUDA_VISIBLE_DEVICES, n_workers, threads_per_worker, memory_limit, device_memory_limit, data, local_directory, shared_filesystem, protocol, enable_tcp_over_ucx, enable_infiniband, enable_nvlink, enable_rdmacm, rmm_pool_size, rmm_maximum_pool_size, rmm_managed_memory, rmm_async, rmm_release_threshold, rmm_log_directory, rmm_track_allocations, jit_unspill, log_spilling, worker_class, pre_import, **kwargs)
File ~/anaconda3/envs/rapids/lib/python3.9/site-packages/distributed/deploy/local.py:253, in LocalCluster.__init__(self, name, n_workers, threads_per_worker, processes, loop, start, host, ip, scheduler_port, silence_logs, dashboard_address, worker_dashboard_address, diagnostics_port, services, worker_services, service_kwargs, asynchronous, security, protocol, blocked_handlers, interface, worker_class, scheduler_kwargs, scheduler_sync_interval, **worker_kwargs)
File ~/anaconda3/envs/rapids/lib/python3.9/site-packages/distributed/deploy/spec.py:284, in SpecCluster.__init__(self, workers, scheduler, worker, asynchronous, loop, security, silence_logs, name, shutdown_on_close, scheduler_sync_interval)
File ~/anaconda3/envs/rapids/lib/python3.9/site-packages/distributed/utils.py:358, in SyncMethodMixin.sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
File ~/anaconda3/envs/rapids/lib/python3.9/site-packages/distributed/utils.py:434, in sync(loop, func, callback_timeout, *args, **kwargs)
File ~/anaconda3/envs/rapids/lib/python3.9/site-packages/distributed/utils.py:408, in sync.<locals>.f()
File ~/anaconda3/envs/rapids/lib/python3.9/site-packages/tornado/gen.py:767, in Runner.run(self)
File ~/anaconda3/envs/rapids/lib/python3.9/site-packages/distributed/deploy/spec.py:335, in SpecCluster._start(self)
RuntimeError: Cluster failed to start: Scheduler failed to start.
```
Hello,

I'm currently using cuML and cuDF to train multiple models on an A100 40GB GPU with 96GB of system RAM (RAPIDS 24.04, CUDA 11.4).

Despite attempts to manage GPU memory by deleting variables with del, calling gc.collect(), and using RMM methods, I consistently encounter out-of-memory (OOM) errors after training several models.

Could someone advise on how to reset the GPU after training each model to prevent OOM errors?

Thank you very much.

PS: This is my first time submitting a question, so I apologize for the messy layout.
```python
import os
import random
import shutil
import time
import gc
import pickle
import warnings

import numpy as np
import pandas as pd
from scipy import stats
from sklearn import metrics, model_selection
from tqdm import tqdm

import cudf
import cuml
import cupy as cp  # assuming the original "import cuda as cp" meant CuPy
import dask_cudf
import rmm
from cuml import RandomForestClassifier as cuRF
from cuml import datasets
from cuml.dask.common import utils as dask_utils
from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster

for file in tqdm(files):
    cluster = LocalCUDACluster()
    c = Client(cluster)
    workers = c.has_what().keys()
    dbuf = rmm.DeviceBuffer(size=1024440)
    start = time.time()
    file_path = f"{folder_name}/{file}"
    data = dask_cudf.read_csv(file_path, chunksize="1GB")
    task = data.columns[0]  # the label column is the first column
    data = data.dropna(subset=[task])
    locals()['Task_' + str(task)] = data
    locals()['Task_' + str(task)] = locals()['Task_' + str(task)].dropna(subset=[task])
    locals()['Task_' + str(task) + '_x'] = locals()['Task_' + str(task)].drop(columns=[task])
    locals()['Task_' + str(task) + '_y'] = locals()['Task_' + str(task)][task]
    locals()['Task_' + str(task) + '_x'] = locals()['Task_' + str(task) + '_x'].astype(np.float32).compute().to_numpy()
    locals()['Task_' + str(task) + '_y'] = locals()['Task_' + str(task) + '_y'].astype(np.int32).compute().to_numpy()
```
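As a side note on the loop above: assigning through `locals()` at function scope is not guaranteed to create usable variables in CPython, and it invites the `Task_`/`Task` key typos seen in the original. A plain dict keyed by task name is safer. A pure-Python sketch with made-up data standing in for the cuDF/Dask frames:

```python
# Keep per-task data under dict keys instead of dynamically named variables.
tasks = {}

def add_task(name, rows):
    # First column is the label (like data.columns[0] above); the rest are features.
    xs = [r[1:] for r in rows]  # feature columns
    ys = [r[0] for r in rows]   # label column
    tasks[name] = {"x": xs, "y": ys}

add_task("A", [[1, 0.5, 0.7], [0, 0.1, 0.2]])
print(tasks["A"]["y"])  # [1, 0]
```

One dict lookup (`tasks[task]["x"]`) replaces every `locals()['Task_' + str(task) + '_x']`, and a typo in the key raises a clear `KeyError` instead of silently creating a new entry.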