Specifying Worker Listen Port #1253

Open
otavioon opened this issue Sep 29, 2023 · 2 comments

Comments

@otavioon

Greetings!

I am having trouble specifying the port a worker listens on. With plain Dask Distributed and dask-worker (no GPUs), I can set it with the --worker-port parameter. With dask-cuda-worker (version 23.10.0), however, I cannot find an equivalent option; the only related one is --host.
So when I run the following command: CUDA_VISIBLE_DEVICES=0 dask-cuda-worker --scheduler-file scheduler.json --host 127.0.0.1:12345, it fails with the following error:

warnings.warn(f'''
2023-09-29 13:39:00,329 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dask-scratch-space/worker-bpnddwo9', purging
2023-09-29 13:39:00,337 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2023-09-29 13:39:00,337 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2023-09-29 13:39:00,338 - distributed.worker - ERROR - Failed to log closing event
Traceback (most recent call last):
  File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 616, in start
    await wait_for(self.start_unsafe(), timeout=timeout)
  File "/home/user/.local/lib/python3.10/site-packages/distributed/utils.py", line 1920, in wait_for
    return await asyncio.wait_for(fut, timeout)
  File "/usr/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
    return await fut
  File "/home/user/.local/lib/python3.10/site-packages/distributed/worker.py", line 1391, in start_unsafe
    await self.listen(start_address, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 810, in listen
    listener = await listen(
  File "/home/user/.local/lib/python3.10/site-packages/distributed/comm/core.py", line 256, in _
    await self.start()
  File "/home/user/.local/lib/python3.10/site-packages/distributed/comm/tcp.py", line 573, in start
    sockets = netutil.bind_sockets(
  File "/usr/local/lib/python3.10/dist-packages/tornado/netutil.py", line 161, in bind_sockets
    sock.bind(sockaddr)
OSError: [Errno 98] Address already in use

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/user/.local/lib/python3.10/site-packages/distributed/worker.py", line 1540, in close
    self.log_event(self.address, {"action": "closing-worker", "reason": reason})
  File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 723, in address
    raise ValueError("cannot get address of non-running Server")
ValueError: cannot get address of non-running Server
2023-09-29 13:39:00,340 - distributed.worker - INFO - Stopping worker. Reason: failure-to-start-<class 'OSError'>
2023-09-29 13:39:00,340 - distributed.worker - INFO - Closed worker has not yet started: Status.init
2023-09-29 13:39:00,341 - distributed.nanny - ERROR - Failed to start worker
Traceback (most recent call last):
  File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 616, in start
    await wait_for(self.start_unsafe(), timeout=timeout)
  File "/home/user/.local/lib/python3.10/site-packages/distributed/utils.py", line 1920, in wait_for
    return await asyncio.wait_for(fut, timeout)
  File "/usr/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
    return await fut
  File "/home/user/.local/lib/python3.10/site-packages/distributed/worker.py", line 1391, in start_unsafe
    await self.listen(start_address, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 810, in listen
    listener = await listen(
  File "/home/user/.local/lib/python3.10/site-packages/distributed/comm/core.py", line 256, in _
    await self.start()
  File "/home/user/.local/lib/python3.10/site-packages/distributed/comm/tcp.py", line 573, in start
    sockets = netutil.bind_sockets(
  File "/usr/local/lib/python3.10/dist-packages/tornado/netutil.py", line 161, in bind_sockets
    sock.bind(sockaddr)
OSError: [Errno 98] Address already in use

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/user/.local/lib/python3.10/site-packages/distributed/nanny.py", line 953, in run
    async with worker:
  File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 630, in __aenter__
    await self
  File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 624, in start
    raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Worker failed to start.
2023-09-29 13:39:00,386 - distributed.nanny - ERROR - Failed to start process
Traceback (most recent call last):
  File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 616, in start
    await wait_for(self.start_unsafe(), timeout=timeout)
  File "/home/user/.local/lib/python3.10/site-packages/distributed/utils.py", line 1920, in wait_for
    return await asyncio.wait_for(fut, timeout)
  File "/usr/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
    return await fut
  File "/home/user/.local/lib/python3.10/site-packages/distributed/worker.py", line 1391, in start_unsafe
    await self.listen(start_address, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 810, in listen
    listener = await listen(
  File "/home/user/.local/lib/python3.10/site-packages/distributed/comm/core.py", line 256, in _
    await self.start()
  File "/home/user/.local/lib/python3.10/site-packages/distributed/comm/tcp.py", line 573, in start
    sockets = netutil.bind_sockets(
  File "/usr/local/lib/python3.10/dist-packages/tornado/netutil.py", line 161, in bind_sockets
    sock.bind(sockaddr)
OSError: [Errno 98] Address already in use

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/user/.local/lib/python3.10/site-packages/distributed/nanny.py", line 448, in instantiate
    result = await self.process.start()
  File "/home/user/.local/lib/python3.10/site-packages/distributed/nanny.py", line 748, in start
    msg = await self._wait_until_connected(uid)
  File "/home/user/.local/lib/python3.10/site-packages/distributed/nanny.py", line 889, in _wait_until_connected
    raise msg["exception"]
  File "/home/user/.local/lib/python3.10/site-packages/distributed/nanny.py", line 953, in run
    async with worker:
  File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 630, in __aenter__
    await self
  File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 624, in start
    raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Worker failed to start.
2023-09-29 13:39:00,391 - distributed.nanny - INFO - Closing Nanny at 'tcp://127.0.0.1:12345'. Reason: nanny-instantiate-failed
2023-09-29 13:39:00,391 - distributed.nanny - INFO - Nanny asking worker to close. Reason: nanny-instantiate-failed
2023-09-29 13:39:00,406 - distributed.nanny - INFO - Worker process 15064 was killed by signal 15
Traceback (most recent call last):
  File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 616, in start
    await wait_for(self.start_unsafe(), timeout=timeout)
  File "/home/user/.local/lib/python3.10/site-packages/distributed/utils.py", line 1920, in wait_for
    return await asyncio.wait_for(fut, timeout)
  File "/usr/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
    return await fut
  File "/home/user/.local/lib/python3.10/site-packages/distributed/worker.py", line 1391, in start_unsafe
    await self.listen(start_address, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 810, in listen
    listener = await listen(
  File "/home/user/.local/lib/python3.10/site-packages/distributed/comm/core.py", line 256, in _
    await self.start()
  File "/home/user/.local/lib/python3.10/site-packages/distributed/comm/tcp.py", line 573, in start
    sockets = netutil.bind_sockets(
  File "/usr/local/lib/python3.10/dist-packages/tornado/netutil.py", line 161, in bind_sockets
    sock.bind(sockaddr)
OSError: [Errno 98] Address already in use

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 616, in start
    await wait_for(self.start_unsafe(), timeout=timeout)
  File "/home/user/.local/lib/python3.10/site-packages/distributed/utils.py", line 1920, in wait_for
    return await asyncio.wait_for(fut, timeout)
  File "/usr/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
    return await fut
  File "/home/user/.local/lib/python3.10/site-packages/distributed/nanny.py", line 362, in start_unsafe
    response = await self.instantiate()
  File "/home/user/.local/lib/python3.10/site-packages/distributed/nanny.py", line 448, in instantiate
    result = await self.process.start()
  File "/home/user/.local/lib/python3.10/site-packages/distributed/nanny.py", line 748, in start
    msg = await self._wait_until_connected(uid)
  File "/home/user/.local/lib/python3.10/site-packages/distributed/nanny.py", line 889, in _wait_until_connected
    raise msg["exception"]
  File "/home/user/.local/lib/python3.10/site-packages/distributed/nanny.py", line 953, in run
    async with worker:
  File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 630, in __aenter__
    await self
  File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 624, in start
    raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Worker failed to start.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/user/.local/bin/dask-cuda-worker", line 8, in <module>
    sys.exit(worker())
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/dask_cuda/cli.py", line 442, in worker
    loop.run_sync(run)
  File "/usr/local/lib/python3.10/dist-packages/tornado/ioloop.py", line 530, in run_sync
    return future_cell[0].result()
  File "/home/user/.local/lib/python3.10/site-packages/dask_cuda/cli.py", line 434, in run
    await worker
  File "/home/user/.local/lib/python3.10/site-packages/dask_cuda/cuda_worker.py", line 244, in _wait
    await asyncio.gather(*self.nannies)
  File "/usr/lib/python3.10/asyncio/tasks.py", line 650, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 624, in start
    raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Nanny failed to start.

Without the --host parameter, everything works as expected, but I then cannot specify the desired port. Is there a way to do this?
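
The traceback above bottoms out in sock.bind raising OSError: [Errno 98] Address already in use, and the log shows the Nanny closing at tcp://127.0.0.1:12345. One plausible reading (an assumption, not confirmed in this thread) is that the nanny and its worker process both tried to bind the host:port passed via --host. The sketch below reproduces only the generic OS-level error with two plain sockets; the helper names try_bind and second_bind_errno are made up for illustration, and nothing here calls dask or dask-cuda.

```python
import errno
import socket
from typing import Optional


def try_bind(host: str, port: int) -> socket.socket:
    """Bind and listen on host:port, closing the socket if binding fails."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen(1)
    except OSError:
        sock.close()
        raise
    return sock


def second_bind_errno(host: str = "127.0.0.1") -> Optional[int]:
    """Return the errno raised when a second listener binds an occupied port."""
    # Port 0 asks the OS for any free port, so the sketch does not depend
    # on a specific port such as 12345 being available.
    first = try_bind(host, 0)
    port = first.getsockname()[1]
    try:
        second = try_bind(host, port)
    except OSError as exc:
        return exc.errno
    else:
        second.close()
        return None
    finally:
        first.close()


if __name__ == "__main__":
    code = second_bind_errno()
    # errno.EADDRINUSE is 98 on Linux, matching the log in this issue.
    print(code == errno.EADDRINUSE)
```

If that reading is right, any option that hands the same fixed port to two listeners would fail the same way, which is why a separate port option for the worker would be needed.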

@pentschev
Member

IIRC, --host is only meant to set the IP address to bind to, so passing a port along with it will indeed not work. I guess --worker-port was simply never needed and thus never added; there's no technical reason it's missing.

If --worker-port is important for your use case, would you care to submit a pull request adding it?
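
To make the host/port separation concrete, here is a rough sketch of a CLI that keeps --host address-only and takes the port through a separate --worker-port option. This is not dask-cuda's code: the traceback shows the real CLI goes through click, while this uses the standard library's argparse to stay dependency-free, and the program name hypothetical-worker, the function resolve_listen_address, and the defaults are all assumptions for illustration.

```python
import argparse
from typing import List, Tuple


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical CLI; option names mirror those discussed in this issue.
    parser = argparse.ArgumentParser(prog="hypothetical-worker")
    parser.add_argument(
        "--host", default=None, help="IP address or hostname only, no port"
    )
    parser.add_argument(
        "--worker-port", type=int, default=0, help="listen port; 0 lets the OS choose"
    )
    return parser


def resolve_listen_address(argv: List[str]) -> Tuple[str, int]:
    """Return (host, port), rejecting a port smuggled into --host."""
    parser = build_parser()
    args = parser.parse_args(argv)
    host = args.host or "127.0.0.1"  # assumed default for this sketch
    # Simplistic check: assumes IPv4 addresses or hostnames, not IPv6 literals.
    if ":" in host:
        parser.error("--host takes an address without a port; use --worker-port")
    if not 0 <= args.worker_port <= 65535:
        parser.error("--worker-port must be between 0 and 65535")
    return host, args.worker_port


if __name__ == "__main__":
    print(resolve_listen_address(["--host", "127.0.0.1", "--worker-port", "12345"]))
```

A real implementation would also have to decide how one port setting maps onto several workers when more than one GPU is visible; that question is left open here.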

@otavioon
Author

otavioon commented Oct 6, 2023

Hello,

Sorry for the delay, and thanks for your reply, @pentschev. I will submit a PR adding these options.
