I followed the command-line instructions here; the results are below.
# Python script run on the scheduler container
from dask.distributed import Client, SSHCluster

def task(name):
    print(f'task-{name}')
    return f'task-{name}'

if __name__ == '__main__':
    print("- Distributed scheduler:", end='')
    with SSHCluster(['localhost', '123.456.78.911'],
                    connect_options=[{'known_hosts': None, 'password': 'vfroot'},
                                     {'known_hosts': None, 'password': 'vfroot', 'port': 20022}],
                    ) as cluster, Client(cluster) as client:
        print(client)
- Distributed scheduler:distributed.deploy.ssh - INFO - distributed.scheduler - INFO - -----------------------------------------------
distributed.deploy.ssh - INFO - distributed.scheduler - INFO - -----------------------------------------------
distributed.deploy.ssh - INFO - distributed.scheduler - INFO - Clear task state
distributed.deploy.ssh - INFO - distributed.scheduler - INFO - Scheduler at: tcp://172.17.0.2:8786
distributed.deploy.ssh - INFO - Usage: dask_worker.py [OPTIONS] [SCHEDULER] [PRELOAD_ARGV]...
Task exception was never retrieved
future: <Task finished name='Task-16' coro=<_wrap_awaitable() done, defined at /opt/conda/envs/rapids/lib/python3.8/asyncio/tasks.py:688> exception=Exception('Worker failed to start')>
Traceback (most recent call last):
File "/opt/conda/envs/rapids/lib/python3.8/asyncio/tasks.py", line 695, in _wrap_awaitable
return (yield from awaitable.__await__())
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/deploy/spec.py", line 67, in _
await self.start()
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/deploy/ssh.py", line 130, in start
raise Exception("Worker failed to start")
Exception: Worker failed to start
distributed.deploy.ssh - INFO - Usage: dask_worker.py [OPTIONS] [SCHEDULER] [PRELOAD_ARGV]...
Traceback (most recent call last):
File "/root/project/parallel_example/SSHClient.py", line 14, in <module>
with SSHCluster(['localhost', '123.456.78.911'],
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/deploy/ssh.py", line 368, in SSHCluster
return SpecCluster(workers, scheduler, name="SSHCluster", **kwargs)
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/deploy/spec.py", line 284, in __init__
self.sync(self._correct_state)
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/deploy/cluster.py", line 214, in sync
return sync(self.loop, func, *args, **kwargs)
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/utils.py", line 326, in sync
raise exc.with_traceback(tb)
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/utils.py", line 309, in f
result[0] = yield future
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
value = future.result()
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/deploy/spec.py", line 371, in _correct_state_internal
await w # for tornado gen.coroutine support
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/deploy/spec.py", line 67, in _
await self.start()
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/deploy/ssh.py", line 130, in start
raise Exception("Worker failed to start")
Exception: Worker failed to start
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/deploy/spec.py", line 671, in close_clusters
cluster.close(timeout=10)
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/deploy/cluster.py", line 135, in close
return self.sync(self._close, callback_timeout=timeout)
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/deploy/cluster.py", line 214, in sync
return sync(self.loop, func, *args, **kwargs)
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/utils.py", line 326, in sync
raise exc.with_traceback(tb)
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/utils.py", line 309, in f
result[0] = yield future
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
value = future.result()
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/deploy/spec.py", line 437, in _close
assert w.status == Status.closed, w.status
AssertionError: Status.created
Process finished with exit code 1
What's the right way to establish the connection between scheduler and worker?
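One detail worth noting in the log above: the scheduler reports it is listening at tcp://172.17.0.2:8786, which looks like a Docker bridge address, and a worker container in a different network typically cannot route to it. As a quick sanity check (a sketch I am adding for illustration, not part of the original run), the worker host's TCP reachability to the scheduler endpoint can be probed with the standard library; the address and port below are the ones from the log and are assumptions to be adjusted:

```python
import socket

def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Run this on the worker container against the address the scheduler
# advertised (tcp://172.17.0.2:8786 in the log). If it returns False,
# the worker cannot reach the scheduler and a routable address or a
# forwarded port is needed:
#   can_connect("172.17.0.2", 8786)
```

If the probe fails, the fix is usually to make the scheduler listen on (or advertise) an address the worker can actually route to, rather than the container-internal one.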
I want to cluster two containers on nodes in different networks (the IP addresses below are placeholders).
Container 1 (scheduler)
  ssh: 123.456.78.910:20022
  scheduler port: 123.456.78.910:28786
Container 2 (worker)
  ssh: 123.456.78.911:20022
  worker port: 123.456.78.911:28786
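Given that layout, one way to spell out the per-host SSH settings is to build `connect_options` as an explicit list, one dict per host in the same order as the host list (each dict is forwarded to the SSH connection). This is a sketch under the assumptions in the post (placeholder IPs, password `vfroot`, SSH on port 20022 for both hosts, scheduler port 28786); it is not a confirmed fix:

```python
# One options dict per host, in the same order as the hosts list:
# first the scheduler host, then the worker host.
connect_options = [
    {"known_hosts": None, "password": "vfroot", "port": 20022},  # 123.456.78.910
    {"known_hosts": None, "password": "vfroot", "port": 20022},  # 123.456.78.911
]

# Hypothetical usage (requires dask.distributed and working SSH access):
# from dask.distributed import SSHCluster
# cluster = SSHCluster(
#     ["123.456.78.910", "123.456.78.911"],
#     connect_options=connect_options,
#     scheduler_options={"port": 28786},  # pin the scheduler to the forwarded port
# )
```

Pinning the scheduler port via `scheduler_options` keeps it on the port that is actually forwarded between the two networks, instead of the default 8786.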