🐛[bug] Jupyter lab / Tensorboard stuck at Waiting for ... #8198

Open
jokokojote opened this issue Oct 19, 2023 · 14 comments

@jokokojote

Describe the bug

I am not sure whether this is a bug or whether I missed a basic configuration step, but I checked the docs several times and found nothing about it:

JupyterLab and TensorBoard are stuck at "Waiting for ..." even though the containers start successfully and the logs show no errors.

Tried with 0.26.1, 0.26.0, 0.25.1 and 0.21.2 on MacOS, Ubuntu and Windows.

TensorBoard 0.26.1 logs:

<info>    [2023-10-19 10:42:13] || INFO: Scheduling TensorBoard (endlessly-proper-calf) (id: 7b13e6db-b8fb-4deb-9951-eac908d7e2b1.1)
<info>    [2023-10-19 10:42:13] || INFO: TensorBoard (endlessly-proper-calf) was assigned to an agent
<info>    [2023-10-19 10:42:13] [c91e95c9] image already found, skipping pull phase: docker.io/determinedai/environments:py-3.8-pytorch-1.12-tf-2.11-cpu-2b7e2a1
<info>    [2023-10-19 10:42:13] [c91e95c9] copying files to container: /
<info>    [2023-10-19 10:42:13] [c91e95c9] copying files to container: /run/determined
<info>    [2023-10-19 10:42:13] [c91e95c9] copying files to container: /
<info>    [2023-10-19 10:42:13] [c91e95c9] copying files to container: /
<info>    [2023-10-19 10:42:13] [c91e95c9] copying files to container: /
<info>    [2023-10-19 10:42:13] [c91e95c9] Resources for TensorBoard (endlessly-proper-calf) have started
<>        [2023-10-19 10:42:15] [c91e95c9] Requirement already satisfied: tensorboard in /opt/conda/lib/python3.8/site-packages (2.11.2)
<>        [2023-10-19 10:42:15] [c91e95c9] Collecting tensorboard-plugin-profile
<>        [2023-10-19 10:42:15] [c91e95c9]   Obtaining dependency information for tensorboard-plugin-profile from https://files.pythonhosted.org/packages/ce/38/4ea8ac39967d381539b27cf1c3689012fda7b74b22dcf85f000ab003e6bc/tensorboard_plugin_profile-2.14.0-py3-none-any.whl.metadata
<>        [2023-10-19 10:42:15] [c91e95c9]   Downloading tensorboard_plugin_profile-2.14.0-py3-none-any.whl.metadata (1.0 kB)
<>        [2023-10-19 10:42:15] [c91e95c9] Requirement already satisfied: absl-py>=0.4 in /opt/conda/lib/python3.8/site-packages (from tensorboard) (1.4.0)
<>        [2023-10-19 10:42:15] [c91e95c9] Requirement already satisfied: grpcio>=1.24.3 in /opt/conda/lib/python3.8/site-packages (from tensorboard) (1.56.2)
<>        [2023-10-19 10:42:15] [c91e95c9] Requirement already satisfied: google-auth<3,>=1.6.3 in /opt/conda/lib/python3.8/site-packages (from tensorboard) (2.22.0)
<>        [2023-10-19 10:42:15] [c91e95c9] Requirement already satisfied: google-auth-oauthlib<0.5,>=0.4.1 in /opt/conda/lib/python3.8/site-packages (from tensorboard) (0.4.6)
<>        [2023-10-19 10:42:15] [c91e95c9] Requirement already satisfied: markdown>=2.6.8 in /opt/conda/lib/python3.8/site-packages (from tensorboard) (3.4.4)
<>        [2023-10-19 10:42:15] [c91e95c9] Requirement already satisfied: numpy>=1.12.0 in /opt/conda/lib/python3.8/site-packages (from tensorboard) (1.24.4)
<>        [2023-10-19 10:42:15] [c91e95c9] Requirement already satisfied: protobuf<4,>=3.9.2 in /opt/conda/lib/python3.8/site-packages (from tensorboard) (3.20.3)
<>        [2023-10-19 10:42:15] [c91e95c9] Requirement already satisfied: requests<3,>=2.21.0 in /opt/conda/lib/python3.8/site-packages (from tensorboard) (2.31.0)
<>        [2023-10-19 10:42:15] [c91e95c9] Requirement already satisfied: setuptools>=41.0.0 in /opt/conda/lib/python3.8/site-packages (from tensorboard) (68.0.0)
<>        [2023-10-19 10:42:15] [c91e95c9] Requirement already satisfied: tensorboard-data-server<0.7.0,>=0.6.0 in /opt/conda/lib/python3.8/site-packages (from tensorboard) (0.6.1)
<>        [2023-10-19 10:42:15] [c91e95c9] Requirement already satisfied: tensorboard-plugin-wit>=1.6.0 in /opt/conda/lib/python3.8/site-packages (from tensorboard) (1.8.1)
<>        [2023-10-19 10:42:15] [c91e95c9] Requirement already satisfied: werkzeug>=1.0.1 in /opt/conda/lib/python3.8/site-packages (from tensorboard) (2.3.6)
<>        [2023-10-19 10:42:15] [c91e95c9] Requirement already satisfied: wheel>=0.26 in /opt/conda/lib/python3.8/site-packages (from tensorboard) (0.38.4)
<>        [2023-10-19 10:42:15] [c91e95c9] Collecting gviz-api>=1.9.0 (from tensorboard-plugin-profile)
<>        [2023-10-19 10:42:15] [c91e95c9]   Downloading gviz_api-1.10.0-py2.py3-none-any.whl (13 kB)
<>        [2023-10-19 10:42:15] [c91e95c9] Requirement already satisfied: six>=1.10.0 in /opt/conda/lib/python3.8/site-packages (from tensorboard-plugin-profile) (1.16.0)
<>        [2023-10-19 10:42:15] [c91e95c9] Requirement already satisfied: cachetools<6.0,>=2.0.0 in /opt/conda/lib/python3.8/site-packages (from google-auth<3,>=1.6.3->tensorboard) (5.3.1)
<>        [2023-10-19 10:42:15] [c91e95c9] Requirement already satisfied: pyasn1-modules>=0.2.1 in /opt/conda/lib/python3.8/site-packages (from google-auth<3,>=1.6.3->tensorboard) (0.3.0)
<>        [2023-10-19 10:42:15] [c91e95c9] Requirement already satisfied: rsa<5,>=3.1.4 in /opt/conda/lib/python3.8/site-packages (from google-auth<3,>=1.6.3->tensorboard) (4.9)
<>        [2023-10-19 10:42:15] [c91e95c9] Requirement already satisfied: urllib3<2.0 in /opt/conda/lib/python3.8/site-packages (from google-auth<3,>=1.6.3->tensorboard) (1.26.16)
<>        [2023-10-19 10:42:15] [c91e95c9] Requirement already satisfied: requests-oauthlib>=0.7.0 in /opt/conda/lib/python3.8/site-packages (from google-auth-oauthlib<0.5,>=0.4.1->tensorboard) (1.3.1)
<>        [2023-10-19 10:42:15] [c91e95c9] Requirement already satisfied: importlib-metadata>=4.4 in /opt/conda/lib/python3.8/site-packages (from markdown>=2.6.8->tensorboard) (6.8.0)
<>        [2023-10-19 10:42:15] [c91e95c9] Requirement already satisfied: charset-normalizer<4,>=2 in /opt/conda/lib/python3.8/site-packages (from requests<3,>=2.21.0->tensorboard) (2.0.4)
<>        [2023-10-19 10:42:15] [c91e95c9] Requirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.8/site-packages (from requests<3,>=2.21.0->tensorboard) (3.4)
<>        [2023-10-19 10:42:15] [c91e95c9] Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.8/site-packages (from requests<3,>=2.21.0->tensorboard) (2023.7.22)
<>        [2023-10-19 10:42:15] [c91e95c9] Requirement already satisfied: MarkupSafe>=2.1.1 in /opt/conda/lib/python3.8/site-packages (from werkzeug>=1.0.1->tensorboard) (2.1.3)
<>        [2023-10-19 10:42:15] [c91e95c9] Requirement already satisfied: zipp>=0.5 in /opt/conda/lib/python3.8/site-packages (from importlib-metadata>=4.4->markdown>=2.6.8->tensorboard) (3.16.2)
<>        [2023-10-19 10:42:15] [c91e95c9] Requirement already satisfied: pyasn1<0.6.0,>=0.4.6 in /opt/conda/lib/python3.8/site-packages (from pyasn1-modules>=0.2.1->google-auth<3,>=1.6.3->tensorboard) (0.5.0)
<>        [2023-10-19 10:42:15] [c91e95c9] Requirement already satisfied: oauthlib>=3.0.0 in /opt/conda/lib/python3.8/site-packages (from requests-oauthlib>=0.7.0->google-auth-oauthlib<0.5,>=0.4.1->tensorboard) (3.2.2)
<>        [2023-10-19 10:42:15] [c91e95c9] Downloading tensorboard_plugin_profile-2.14.0-py3-none-any.whl (5.6 MB)
<>        [2023-10-19 10:42:15] [c91e95c9]    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.6/5.6 MB 35.6 MB/s eta 0:00:00
<>        [2023-10-19 10:42:16] [c91e95c9] Installing collected packages: gviz-api, tensorboard-plugin-profile
<>        [2023-10-19 10:42:16] [c91e95c9] Successfully installed gviz-api-1.10.0 tensorboard-plugin-profile-2.14.0
<warning> [2023-10-19 10:42:16] [c91e95c9] Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
<>        [2023-10-19 10:42:16] [c91e95c9] + test -f startup-hook.sh
<>        [2023-10-19 10:42:16] [c91e95c9] + set +x
<warning> [2023-10-19 10:42:17] [c91e95c9] [47] determined.exec.tensorboard: Tensorboard not responding to HTTP: HTTPConnectionPool(host='localhost', port=2794): Max retries exceeded with url: /proxy/7b13e6db-b8fb-4deb-9951-eac908d7e2b1 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0xffffaef892e0>: Failed to establish a new connection: [Errno 111] Connection refused'))
<>        [2023-10-19 10:42:18] [c91e95c9] I1019 10:42:18.253016 281472123515360 plugin.py:429] Monitor runs begin
<>        [2023-10-19 10:42:18] [c91e95c9] TensorBoard 2.11.2 at http://6a8630b8c8ff:2794/proxy/7b13e6db-b8fb-4deb-9951-eac908d7e2b1/ (Press CTRL+C to quit)
<>        [2023-10-19 10:42:18] [c91e95c9] W1019 10:42:18.620665 281472968290784 security_validator.py:46] In 3.0, this warning will become an error:
<>        [2023-10-19 10:42:18] [c91e95c9] Content-Type is required on a Response
<>        [2023-10-19 10:42:18] [c91e95c9] W1019 10:42:18.620800 281472968290784 security_validator.py:46] In 3.0, this warning will become an error:
<>        [2023-10-19 10:42:18] [c91e95c9] X-Content-Type-Options is required to be "nosniff"
<info>    [2023-10-19 10:42:25] || INFO: Service of TensorBoard (endlessly-proper-calf) is available
<>        [2023-10-19 10:42:25] [c91e95c9] TensorBoard contains metrics

Jupyter 0.26.1 logs:

<info>    [2023-10-19 10:38:20] || INFO: Scheduling JupyterLab (obviously-viable-tuna) (id: 9bbf0f82-e819-4468-b210-f1efba9a6a40.1)
<info>    [2023-10-19 10:38:20] || INFO: JupyterLab (obviously-viable-tuna) was assigned to an agent
<info>    [2023-10-19 10:38:20] [a994dc0f] image already found, skipping pull phase: docker.io/determinedai/environments:py-3.8-pytorch-1.12-tf-2.11-cpu-2b7e2a1
<info>    [2023-10-19 10:38:20] [a994dc0f] copying files to container: /
<info>    [2023-10-19 10:38:20] [a994dc0f] copying files to container: /run/determined
<info>    [2023-10-19 10:38:20] [a994dc0f] copying files to container: /
<info>    [2023-10-19 10:38:20] [a994dc0f] copying files to container: /
<info>    [2023-10-19 10:38:20] [a994dc0f] copying files to container: /
<info>    [2023-10-19 10:38:21] [a994dc0f] Resources for JupyterLab (obviously-viable-tuna) have started
<info>    [2023-10-19 10:38:22] [a994dc0f] [31] root: detected 0 gpus (nvidia-smi not found)
<info>    [2023-10-19 10:38:22] [a994dc0f] [31] root: rocm-smi not found
<info>    [2023-10-19 10:38:22] [a994dc0f] [31] root: Running task container on agent_id=determined-agent-0, hostname=fb416e0c6861 with visible GPUs []
<info>    [2023-10-19 10:38:22] [a994dc0f] [31] root: detected 0 gpu processes (nvidia-smi not found)
<>        [2023-10-19 10:38:22] [a994dc0f] + test -f startup-hook.sh
<>        [2023-10-19 10:38:22] [a994dc0f] + set +x
<warning> [2023-10-19 10:38:22] [a994dc0f] [ServerApp] ServerApp.token config is deprecated in 2.0. Use IdentityProvider.token.
<info>    [2023-10-19 10:38:22] [a994dc0f] [ServerApp] Package jupyterlab took 0.0000s to import
<info>    [2023-10-19 10:38:22] [a994dc0f] [ServerApp] Package jupyter_archive took 0.0007s to import
<info>    [2023-10-19 10:38:22] [a994dc0f] [ServerApp] Package jupyter_server_terminals took 0.0032s to import
<info>    [2023-10-19 10:38:22] [a994dc0f] [ServerApp] Package nbclassic took 0.0000s to import
<warning> [2023-10-19 10:38:22] [a994dc0f] [ServerApp] A `_jupyter_server_extension_points` function was not found in nbclassic. Instead, a `_jupyter_server_extension_paths` function was found and will be used for now. This function name will be deprecated in future releases of Jupyter Server.
<info>    [2023-10-19 10:38:22] [a994dc0f] [ServerApp] Package notebook_shim took 0.0000s to import
<warning> [2023-10-19 10:38:22] [a994dc0f] [ServerApp] A `_jupyter_server_extension_points` function was not found in notebook_shim. Instead, a `_jupyter_server_extension_paths` function was found and will be used for now. This function name will be deprecated in future releases of Jupyter Server.
<info>    [2023-10-19 10:38:22] [a994dc0f] [ServerApp] jupyter_archive | extension was successfully linked.
<info>    [2023-10-19 10:38:22] [a994dc0f] [ServerApp] jupyter_server_terminals | extension was successfully linked.
<info>    [2023-10-19 10:38:22] [a994dc0f] [ServerApp] jupyterlab | extension was successfully linked.
<info>    [2023-10-19 10:38:22] [a994dc0f] [ServerApp] nbclassic | extension was successfully linked.
<info>    [2023-10-19 10:38:22] [a994dc0f] [ServerApp] Writing Jupyter server cookie secret to /run/determined/jupyter/runtime/jupyter_cookie_secret
<info>    [2023-10-19 10:38:23] [a994dc0f] [ServerApp] notebook_shim | extension was successfully linked.
<warning> [2023-10-19 10:38:23] [a994dc0f] [ServerApp] All authentication is disabled.  Anyone who can connect to this server will be able to run code.
<info>    [2023-10-19 10:38:23] [a994dc0f] [ServerApp] notebook_shim | extension was successfully loaded.
<info>    [2023-10-19 10:38:23] [a994dc0f] [ServerApp] jupyter_archive | extension was successfully loaded.
<info>    [2023-10-19 10:38:23] [a994dc0f] [ServerApp] jupyter_server_terminals | extension was successfully loaded.
<info>    [2023-10-19 10:38:23] [a994dc0f] [LabApp] JupyterLab extension loaded from /opt/conda/lib/python3.8/site-packages/jupyterlab
<info>    [2023-10-19 10:38:23] [a994dc0f] [LabApp] JupyterLab application directory is /opt/conda/share/jupyter/lab
<info>    [2023-10-19 10:38:23] [a994dc0f] [ServerApp] jupyterlab | extension was successfully loaded.
<>        [2023-10-19 10:38:23] [a994dc0f]
<>        [2023-10-19 10:38:23] [a994dc0f]   _   _          _      _
<>        [2023-10-19 10:38:23] [a994dc0f]  | | | |_ __  __| |__ _| |_ ___
<>        [2023-10-19 10:38:23] [a994dc0f]  | |_| | '_ \/ _` / _` |  _/ -_)
<>        [2023-10-19 10:38:23] [a994dc0f]   \___/| .__/\__,_\__,_|\__\___|
<>        [2023-10-19 10:38:23] [a994dc0f]        |_|
<>        [2023-10-19 10:38:23] [a994dc0f]
<>        [2023-10-19 10:38:23] [a994dc0f] Read the migration plan to Notebook 7 to learn about the new features and the actions to take if you are using extensions.
<>        [2023-10-19 10:38:23] [a994dc0f]
<>        [2023-10-19 10:38:23] [a994dc0f] https://jupyter-notebook.readthedocs.io/en/latest/migrate_to_notebook7.html
<>        [2023-10-19 10:38:23] [a994dc0f]
<>        [2023-10-19 10:38:23] [a994dc0f] Please note that updating to Notebook 7 might break some of your extensions.
<>        [2023-10-19 10:38:23] [a994dc0f]
<info>    [2023-10-19 10:38:23] [a994dc0f] [ServerApp] nbclassic | extension was successfully loaded.
<info>    [2023-10-19 10:38:23] [a994dc0f] [ServerApp] Serving notebooks from local directory: /run/determined/workdir
<info>    [2023-10-19 10:38:23] [a994dc0f] [ServerApp] Jupyter Server 2.7.0 is running at:
<info>    [2023-10-19 10:38:23] [a994dc0f] [ServerApp] https://fb416e0c6861:2979/proxy/9bbf0f82-e819-4468-b210-f1efba9a6a40/lab
<info>    [2023-10-19 10:38:23] [a994dc0f] [ServerApp]     https://127.0.0.1:2979/proxy/9bbf0f82-e819-4468-b210-f1efba9a6a40/lab
<info>    [2023-10-19 10:38:23] [a994dc0f] [ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
<info>    [2023-10-19 10:38:23] || INFO: Service of JupyterLab (obviously-viable-tuna) is available

Jupyter 0.21.2 logs:

<[none]> [2023-10-19 10:00:10] [d26f2f1a] + test -f startup-hook.sh
<[none]> [2023-10-19 10:00:10] [d26f2f1a] + set +x
<[warning]> [2023-10-19 10:00:10] [d26f2f1a] [ServerApp] ServerApp.token config is deprecated in 2.0. Use IdentityProvider.token.
<[info]> [2023-10-19 10:00:10] [d26f2f1a] [ServerApp] Package jupyterlab took 0.0000s to import
<[info]> [2023-10-19 10:00:10] [d26f2f1a] [ServerApp] Package jupyter_archive took 0.0009s to import
<[info]> [2023-10-19 10:00:10] [d26f2f1a] [ServerApp] Package jupyter_server_terminals took 0.0021s to import
<[info]> [2023-10-19 10:00:10] [d26f2f1a] [ServerApp] Package nbclassic took 0.0000s to import
<[warning]> [2023-10-19 10:00:10] [d26f2f1a] [ServerApp] A `_jupyter_server_extension_points` function was not found in nbclassic. Instead, a `_jupyter_server_extension_paths` function was found and will be used for now. This function name will be deprecated in future releases of Jupyter Server.
<[info]> [2023-10-19 10:00:10] [d26f2f1a] [ServerApp] Package notebook_shim took 0.0000s to import
<[warning]> [2023-10-19 10:00:10] [d26f2f1a] [ServerApp] A `_jupyter_server_extension_points` function was not found in notebook_shim. Instead, a `_jupyter_server_extension_paths` function was found and will be used for now. This function name will be deprecated in future releases of Jupyter Server.
<[info]> [2023-10-19 10:00:10] [d26f2f1a] [ServerApp] jupyter_archive | extension was successfully linked.
<[info]> [2023-10-19 10:00:10] [d26f2f1a] [ServerApp] jupyter_server_terminals | extension was successfully linked.
<[info]> [2023-10-19 10:00:10] [d26f2f1a] [ServerApp] jupyterlab | extension was successfully linked.
<[info]> [2023-10-19 10:00:10] [d26f2f1a] [ServerApp] nbclassic | extension was successfully linked.
<[info]> [2023-10-19 10:00:10] [d26f2f1a] [ServerApp] Writing Jupyter server cookie secret to /run/determined/jupyter/runtime/jupyter_cookie_secret
<[info]> [2023-10-19 10:00:11] [d26f2f1a] [ServerApp] notebook_shim | extension was successfully linked.
<[warning]> [2023-10-19 10:00:11] [d26f2f1a] [ServerApp] All authentication is disabled.  Anyone who can connect to this server will be able to run code.
<[info]> [2023-10-19 10:00:11] [d26f2f1a] [ServerApp] notebook_shim | extension was successfully loaded.
<[info]> [2023-10-19 10:00:11] [d26f2f1a] [ServerApp] jupyter_archive | extension was successfully loaded.
<[info]> [2023-10-19 10:00:11] [d26f2f1a] [ServerApp] jupyter_server_terminals | extension was successfully loaded.
<[info]> [2023-10-19 10:00:11] [d26f2f1a] [LabApp] JupyterLab extension loaded from /opt/conda/lib/python3.8/site-packages/jupyterlab
<[info]> [2023-10-19 10:00:11] [d26f2f1a] [LabApp] JupyterLab application directory is /opt/conda/share/jupyter/lab
<[info]> [2023-10-19 10:00:11] [d26f2f1a] [ServerApp] jupyterlab | extension was successfully loaded.
<[none]> [2023-10-19 10:00:11] [d26f2f1a]
<[none]> [2023-10-19 10:00:11] [d26f2f1a]   _   _          _      _
<[none]> [2023-10-19 10:00:11] [d26f2f1a]  | | | |_ __  __| |__ _| |_ ___
<[none]> [2023-10-19 10:00:11] [d26f2f1a]  | |_| | '_ \/ _` / _` |  _/ -_)
<[none]> [2023-10-19 10:00:11] [d26f2f1a]   \___/| .__/\__,_\__,_|\__\___|
<[none]> [2023-10-19 10:00:11] [d26f2f1a]        |_|
<[none]> [2023-10-19 10:00:11] [d26f2f1a]
<[none]> [2023-10-19 10:00:11] [d26f2f1a] Read the migration plan to Notebook 7 to learn about the new features and the actions to take if you are using extensions.
<[none]> [2023-10-19 10:00:11] [d26f2f1a]
<[none]> [2023-10-19 10:00:11] [d26f2f1a] https://jupyter-notebook.readthedocs.io/en/latest/migrate_to_notebook7.html
<[none]> [2023-10-19 10:00:11] [d26f2f1a]
<[none]> [2023-10-19 10:00:11] [d26f2f1a] Please note that updating to Notebook 7 might break some of your extensions.
<[none]> [2023-10-19 10:00:11] [d26f2f1a]
<[info]> [2023-10-19 10:00:11] [d26f2f1a] [ServerApp] nbclassic | extension was successfully loaded.
<[info]> [2023-10-19 10:00:11] [d26f2f1a] [ServerApp] Serving notebooks from local directory: /run/determined/workdir
<[info]> [2023-10-19 10:00:11] [d26f2f1a] [ServerApp] Jupyter Server 2.5.0 is running at:
<[info]> [2023-10-19 10:00:11] [d26f2f1a] [ServerApp] http://3ac0d5f930e8:2953/proxy/436ef4a6-82ba-486c-a2ac-ab6275ac5201/lab
<[info]> [2023-10-19 10:00:11] [d26f2f1a] [ServerApp]     http://127.0.0.1:2953/proxy/436ef4a6-82ba-486c-a2ac-ab6275ac5201/lab
<[info]> [2023-10-19 10:00:11] [d26f2f1a] [ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
<[info]> [2023-10-19 10:00:11] || INFO: Service of JupyterLab (loudly-live-anteater) is available
<[info]> [2023-10-19 10:00:11] [d26f2f1a] [ServerApp] Generating new user for token-authenticated request: 68ba0977dac945d2a93d52fc20fdca5f
<[info]> [2023-10-19 10:00:11] [d26f2f1a] [TerminalsExtensionApp] Generating new user for token-authenticated request: 1fe38c213e764f139f64538898b019c0
<[info]> [2023-10-19 10:00:11] [d26f2f1a] [ServerApp] Generating new user for token-authenticated request: 483fc63dd2544a9ea5f3b05803bcb614
<[info]> [2023-10-19 10:00:41] [d26f2f1a] [ServerApp] Generating new user for token-authenticated request: 093fca866e1e4317a3146f50dc0352b2
<[info]> [2023-10-19 10:00:41] [d26f2f1a] [TerminalsExtensionApp] Generating new user for token-authenticated request: a2760c21f3784de88690351d4ce060de
<[info]> [2023-10-19 10:00:41] [d26f2f1a] [ServerApp] Generating new user for token-authenticated request: 9924f8c5c89a4803b4e7f03ef9b035af

Reproduction Steps

  1. Run a fresh local cluster, e.g. with det deploy local cluster-up --no-gpu
  2.a. Open the UI: Tasks -> launch Jupyter
  OR
  2.b.1. Run an experiment, e.g. gan_mnist_pytorch, with det experiment create const.yaml .
  2.b.2. Open the UI, open the experiment, open TensorBoard

Expected Behavior

The UI for JupyterLab / TensorBoard should open after a short wait, or at least show a meaningful error message.

Screenshot

Jupyter_0 26 1
Tensorboard_0 26 1
Jupyter_0 21 1

Environment

  • OS: MacOS 13.5.2, Windows 11, Ubuntu
  • Browser: chrome 118.0.5993.70, Firefox 118.0.2
  • Version: 0.26.1, 0.26.0, 0.25.1 and 0.21.2 (at least)
  • Docker Engine: 24.0.5

Additional Context

No response

@jokokojote jokokojote added the bug label Oct 19, 2023
@KevinHubert-Dev

I was able to reproduce this problem on my local machine.
Using:

  • Firefox 118.0.2
  • Windows 11
  • Docker Desktop 4.24.1
  • WSL Ubuntu 22 - Docker controlled by Docker Desktop on the host system

@ioga
Contributor

ioga commented Oct 19, 2023

Hello, the most common cause of this symptom is that the master container cannot connect to the task (notebook/TensorBoard) container to proxy the incoming request. This is often caused by firewalls or other network setup issues.

OS: MacOS 13.5.2, Windows 11, Ubuntu

It's peculiar that you see the issue on three different OSes, and that you both see it. Are you working with one shared deployment, or have you each deployed Determined separately on all of these OSes?

Can you tell me more about how you deployed Determined and which guide you followed? (Is it det deploy local or something else?)

If you happen to share a corporate firewall / proxy setup, I'd recommend temporarily disabling it to see if that helps.
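
As a rough illustration of the connectivity check described above, the following Python sketch tries a plain TCP connection and reports whether the port is open, actively refused, or silently dropped (a timeout, which is what a firewall often produces). The function name probe and its return values are hypothetical and not part of Determined; the host and port have to be taken from your own deployment.

```python
import socket


def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt a TCP connection and classify the outcome.

    Returns "open", "refused", "timeout", or "error: <details>".
    This is a hypothetical helper for illustration only.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused"
    except (socket.timeout, TimeoutError):
        # No answer at all; packets are likely dropped or misrouted.
        return "timeout"
    except OSError as exc:
        # Anything else (unreachable network, DNS failure, ...).
        return f"error: {exc}"
```

Running this from inside the master container against the address of the task container would distinguish "nothing is listening" from "traffic never arrives", assuming a Python interpreter is available there.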

@KevinHubert-Dev

KevinHubert-Dev commented Oct 19, 2023

I tried it on my private machine:
Docker Desktop was in use with "Use the WSL 2 based engine" activated, so I did everything with sudo in WSL (Ubuntu).
On my local network I only have the Windows firewall active and no additional antivirus programs. The Windows firewall only asked for permission to let Docker Desktop access the internet, and so far no container has had problems.

I have no proxy activated.

I started the Determined cluster using
det deploy local cluster-up

and also started an agent by running the Docker container
docker run \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v "$PWD"/agent.yaml:/etc/determined/agent.yaml \
    determinedai/determined-agent:VERSION
which worked fine.

When I start JupyterLab from the UI, I see the container start and all pip packages being downloaded, but it hangs once "Running" is shown on the screen, as shown by @jokokojote.

I'm root on my machine too.

@jokokojote
Author

jokokojote commented Oct 19, 2023

Hello,

I tested it on different machines with different setups to isolate and understand the problem.

At first, I did try it on an Ubuntu machine inside a corporate network and ran Determined using the master, agent and db Docker containers directly (and passed proxy environment variables to the containers). The core functionality, such as experiment initialization, (GPU) training, tuning, etc., worked like a charm; Jupyter and TensorBoard did not, yielding the same logs I added to the issue description. Firewall or proxy settings could indeed be the issue here, although I do not understand why the agent itself worked and no errors were shown in the TensorBoard and Jupyter logs.

Since Jupyter and TensorBoard did not work on this machine, I tried it on my corporate laptop (Mac) outside the corporate network and set up Determined with just det deploy local cluster-up --no-gpu. Same result: core functions worked w/o any problems, Jupyter and TensorBoard did not.

Then I asked @KevinHubert-Dev to try it at home on a private machine and private network, and he got the same results, as he described.

@ioga
Contributor

ioga commented Oct 19, 2023

It is highly unusual to see this happen on so many different setups. I'll need your help debugging it.

When you start a notebook, there should be a "registering service" log line in the master logs, e.g.

INFO[2023-10-19T13:24:31-07:00] registering service: b8f0beb4-0c7a-4b70-b4a3-6bdcee294de1 (https://127.0.0.1:32903)  component=proxy

You can docker exec -it <master container name> /bin/bash into the master container and try to curl --insecure <service url>, e.g. curl --insecure https://127.0.0.1:32903 in this case. This should simulate what the master does. If it works, that's weird. If it does not, we need to debug why. If you can't see this log line in the logs at all, please share the master logs.
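
To make that step less manual, here is a minimal Python sketch that pulls the service ID and URL out of a "registering service" line and builds the matching curl invocation. The regular expression is an assumption based only on the single example line quoted above, and find_service and curl_command are hypothetical helper names, not part of Determined.

```python
import re
from typing import List, Optional, Tuple

# Pattern inferred from the one example log line in this thread (an assumption).
SERVICE_RE = re.compile(
    r"registering service: (?P<id>[0-9a-f-]+) \((?P<url>https?://[^)\s]+)\)"
)


def find_service(line: str) -> Optional[Tuple[str, str]]:
    """Return (service_id, url) if the line registers a service, else None."""
    match = SERVICE_RE.search(line)
    if match is None:
        return None
    return match.group("id"), match.group("url")


def curl_command(url: str) -> List[str]:
    """Build the curl call suggested above; --insecure skips TLS verification."""
    return ["curl", "--insecure", url]
```

Feeding the master log through find_service line by line yields the URL to test; the resulting command list is meant to be run inside the master container, for example via subprocess.run.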

@jokokojote
Author

jokokojote commented Oct 20, 2023

I did what you suggested on my corporate laptop in my private network:

Start up with:

det deploy local cluster-up --no-gpu
Removing network determined_default
Creating network determined_default...
Creating determined_determined-db_1...
Waiting for determined_determined-db_1...
Creating determined_determined-master_1...
Waiting for master instance to be available....
Starting determined-agent-0

Master logs:

2023-10-20 11:07:56 INFO[2023-10-20T09:07:56Z] master configuration: {"config_file":"","log":{"level":"info","color":true},"db":{"user":"postgres","password":"********","migrations":"file:///usr/share/determined/master/static/migrations","host":"determined-db","port":"5432","name":"determined","ssl_mode":"disable","ssl_root_cert":""},"tensorboard_timeout":300,"notebook_timeout":null,"security":{"default_task":{"id":0,"user_id":0,"user":"root","uid":0,"group":"root","gid":0},"tls":{"cert":"","key":""},"ssh":{"rsa_key_size":1024},"authz":{"type":"basic","fallback":"basic","rbac_ui_enabled":null,"_strict_ntsc_enabled":false,"workspace_creator_assign_role":{"enabled":true,"role_id":2},"strict_job_queue_control":false}},"checkpoint_storage":{"host_path":"/Users/fero/Library/Application Support/determined","propagation":null,"save_experiment_best":0,"save_trial_best":1,"save_trial_latest":1,"storage_path":null,"type":"shared_fs"},"task_container_defaults":{"shm_size_bytes":4294967296,"network_mode":"bridge","cpu_pod_spec":null,"gpu_pod_spec":null,"add_capabilities":null,"drop_capabilities":null,"devices":null,"bind_mounts":null,"work_dir":null,"slurm":{},"pbs":{},"kubernetes":null},"port":8080,"root":"/usr/share/determined/master","telemetry":{"enabled":true,"segment_master_key":"********","otel_enabled":false,"otel_endpoint":"localhost:4317","segment_webui_key":"********","cluster_id":""},"enable_cors":false,"launch_error":true,"cluster_name":"","logging":{"type":"default"},"observability":{"enable_prometheus":false},"cache":{"cache_dir":"/var/cache/determined"},"webhooks":{"base_url":"","signing_key":"3dfc80a6eab1"},"feature_switches":[],"resource_manager":{"client_ca":"","default_aux_resource_pool":"default","default_compute_resource_pool":"default","no_default_resource_pools":false,"require_authentication":false,"scheduler":{"allow_heterogeneous_fits":false,"fitting_policy":"best","type":"fair_share"},"type":"agent"},"resource_pools":[{"pool_name":"default","description":"","provider":null,"max_aux_containers_per_agent":100,"task_container_defaults":null,"agent_reattach_enabled":false,"agent_reconnect_wait":"25s","kubernetes_namespace":""}],"__internal":{"audit_logging_enabled":false,"external_sessions":{"login_uri":"","logout_uri":"","jwt_key":""}}}
2023-10-20 11:07:56 INFO[2023-10-20T09:07:56Z] Determined master 0.26.1 (built with go1.21.0) 
2023-10-20 11:07:56 INFO[2023-10-20T09:07:56Z] connecting to database determined-db:5432    
2023-10-20 11:07:56 INFO[2023-10-20T09:07:56Z] running DB migrations from file:///usr/share/determined/master/static/migrations; this might take a while... 
2023-10-20 11:07:56 INFO[2023-10-20T09:07:56Z] migrated from 0 to 20231006193809            
2023-10-20 11:07:56 INFO[2023-10-20T09:07:56Z] DB migrations completed                      
2023-10-20 11:07:56 INFO[2023-10-20T09:07:56Z] deleting all snapshots for terminal state experiments 
2023-10-20 11:07:56 INFO[2023-10-20T09:07:56Z] Generating a new CA certificate and key      
2023-10-20 11:07:57 INFO[2023-10-20T09:07:57Z] Saved certificate and key to DB              
2023-10-20 11:07:57 INFO[2023-10-20T09:07:57Z] Generating a new certificate and key for master 
2023-10-20 11:07:57 INFO[2023-10-20T09:07:57Z] Saved certificate and key to DB              
2023-10-20 11:07:57 INFO[2023-10-20T09:07:57Z] creating resource pool: default               actor-local-addr=agentRM actor-system=master go-type=agentResourceManager
2023-10-20 11:07:57 INFO[2023-10-20T09:07:57Z] pool default using global scheduling config   actor-local-addr=agentRM actor-system=master go-type=agentResourceManager
2023-10-20 11:07:57 INFO[2023-10-20T09:07:57Z] not enabling provisioner for resource pool: default  actor-local-addr=default actor-system=master go-type=resourcePool resource-pool=default
2023-10-20 11:07:57 INFO[2023-10-20T09:07:57Z] scheduling next resource allocation aggregation in 14h53m2s at 2023-10-21 00:01:00 +0000 UTC  actor-local-addr=allocation-aggregator actor-system=master go-type=allocationAggregator
2023-10-20 11:07:57 INFO[2023-10-20T09:07:57Z] telemetry reporting is enabled; run with --telemetry-enabled=false to disable  component=telemetry
2023-10-20 11:07:57 INFO[2023-10-20T09:07:57Z] accepting incoming connections on port 8080  
2023-10-20 11:08:07 INFO[2023-10-20T09:08:07Z] resource pool is empty; using default resource pool: default  actor-local-addr=agents actor-system=master go-type=agents
2023-10-20 11:08:07 INFO[2023-10-20T09:08:07Z] agent connected ip: 172.18.0.1 resource pool: default slots: 1  actor-local-addr=determined-agent-0 actor-system=master go-type=agent
2023-10-20 11:08:07 INFO[2023-10-20T09:08:07Z] adding device: cpu0 ( x 6 cores) on determined-agent-0  actor-local-addr=determined-agent-0 actor-system=master go-type=agent
2023-10-20 11:08:07 INFO[2023-10-20T09:08:07Z] adding agent: determined-agent-0              actor-local-addr=default actor-system=master agent-id=determined-agent-0 go-type=resourcePool resource-pool=default
2023-10-20 11:09:14 INFO[2023-10-20T09:09:14Z] resources are requested by JupyterLab (duly-strong-piglet) (Allocation ID: c293a8c8-31c6-4d83-a4ac-70a40e5c057b.1)  actor-local-addr=default actor-system=master allocation-id=c293a8c8-31c6-4d83-a4ac-70a40e5c057b.1 go-type=resourcePool resource-pool=default restore=false restoring=false
2023-10-20 11:09:14 INFO[2023-10-20T09:09:14Z] allocated resources to JupyterLab (duly-strong-piglet)  actor-local-addr=default actor-system=master go-type=resourcePool resource-pool=default
2023-10-20 11:09:14 INFO[2023-10-20T09:09:14Z] 1 resources allocated                         job-id=2765d3da-4e08-4494-ae50-a0d359dff301 restore=false task-id=c293a8c8-31c6-4d83-a4ac-70a40e5c057b task-type=NOTEBOOK
2023-10-20 11:09:14 INFO[2023-10-20T09:09:14Z] starting container                            actor-local-addr=determined-agent-0 actor-system=master allocation-id=c293a8c8-31c6-4d83-a4ac-70a40e5c057b.1 container-id=3223bcc3-4c4b-4294-b95e-78268f622808 go-type=agent job-id=2765d3da-4e08-4494-ae50-a0d359dff301 slots=1 task-id=c293a8c8-31c6-4d83-a4ac-70a40e5c057b task-type=NOTEBOOK
2023-10-20 11:12:53 INFO[2023-10-20T09:12:53Z] registering service: c293a8c8-31c6-4d83-a4ac-70a40e5c057b (https://172.18.0.1:32768)  component=proxy
2023-10-20 11:15:14 2023/10/20 09:15:14 http: proxy error: dial tcp 172.18.0.1:32768: connect: connection timed out

Curl inside master gets timeout:

curl --insecure https://172.18.0.1:32768
curl: (28) Failed to connect to 172.18.0.1 port 32768 after 130208 ms: Connection timed out
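
A shorter connect timeout makes this check quicker to repeat while debugging. The following is only a sketch, not part of the original report: the function name `probe` and its 5-second default are assumptions, and it relies on standard curl options only.

```shell
# probe HOST PORT [TIMEOUT_SECONDS]
# Prints "reachable" if an HTTPS request to HOST:PORT completes, else "unreachable".
probe() {
    probe_host=$1
    probe_port=$2
    probe_timeout=${3:-5}
    # --connect-timeout caps the connection phase, so a silently dropped
    # connection fails in seconds instead of the roughly 130 s seen above.
    # --noproxy '*' bypasses any http_proxy/https_proxy variables so that
    # the direct path to the container port is what gets tested.
    if curl --insecure --silent --output /dev/null --noproxy '*' \
            --connect-timeout "$probe_timeout" \
            "https://${probe_host}:${probe_port}"; then
        echo "reachable"
    else
        echo "unreachable"
    fi
}
```

Running `probe 172.18.0.1 32768` inside the master container, and again with the other candidate addresses, should narrow down which address, if any, the master can actually reach.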

Jupyter container logs:

2023-10-20 11:12:55 WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
2023-10-20 11:12:55 INFO: [31] root: detected 0 gpus (nvidia-smi not found)
2023-10-20 11:12:55 INFO: [31] root: rocm-smi not found
2023-10-20 11:12:55 INFO: [31] root: Running task container on agent_id=determined-agent-0, hostname=f9941fc1ee0b with visible GPUs []
2023-10-20 11:12:55 INFO: [31] root: detected 0 gpu processes (nvidia-smi not found)
2023-10-20 11:12:55 + test -f startup-hook.sh
2023-10-20 11:12:55 + set +x
2023-10-20 11:12:56 WARNING: [ServerApp] ServerApp.token config is deprecated in 2.0. Use IdentityProvider.token.
2023-10-20 11:12:56 INFO: [ServerApp] Package jupyterlab took 0.0000s to import
2023-10-20 11:12:56 INFO: [ServerApp] Package jupyter_archive took 0.0008s to import
2023-10-20 11:12:56 INFO: [ServerApp] Package jupyter_server_terminals took 0.0025s to import
2023-10-20 11:12:56 INFO: [ServerApp] Package nbclassic took 0.0000s to import
2023-10-20 11:12:56 WARNING: [ServerApp] A `_jupyter_server_extension_points` function was not found in nbclassic. Instead, a `_jupyter_server_extension_paths` function was found and will be used for now. This function name will be deprecated in future releases of Jupyter Server.
2023-10-20 11:12:56 INFO: [ServerApp] Package notebook_shim took 0.0000s to import
2023-10-20 11:12:56 WARNING: [ServerApp] A `_jupyter_server_extension_points` function was not found in notebook_shim. Instead, a `_jupyter_server_extension_paths` function was found and will be used for now. This function name will be deprecated in future releases of Jupyter Server.
2023-10-20 11:12:56 INFO: [ServerApp] jupyter_archive | extension was successfully linked.
2023-10-20 11:12:56 INFO: [ServerApp] jupyter_server_terminals | extension was successfully linked.
2023-10-20 11:12:56 INFO: [ServerApp] jupyterlab | extension was successfully linked.
2023-10-20 11:12:56 INFO: [ServerApp] nbclassic | extension was successfully linked.
2023-10-20 11:12:56 INFO: [ServerApp] Writing Jupyter server cookie secret to /run/determined/jupyter/runtime/jupyter_cookie_secret
2023-10-20 11:12:56 INFO: [ServerApp] notebook_shim | extension was successfully linked.
2023-10-20 11:12:56 WARNING: [ServerApp] All authentication is disabled.  Anyone who can connect to this server will be able to run code.
2023-10-20 11:12:56 INFO: [ServerApp] notebook_shim | extension was successfully loaded.
2023-10-20 11:12:56 INFO: [ServerApp] jupyter_archive | extension was successfully loaded.
2023-10-20 11:12:56 INFO: [ServerApp] jupyter_server_terminals | extension was successfully loaded.
2023-10-20 11:12:56 INFO: [LabApp] JupyterLab extension loaded from /opt/conda/lib/python3.8/site-packages/jupyterlab
2023-10-20 11:12:56 INFO: [LabApp] JupyterLab application directory is /opt/conda/share/jupyter/lab
2023-10-20 11:12:56 INFO: [ServerApp] jupyterlab | extension was successfully loaded.
2023-10-20 11:12:56 
2023-10-20 11:12:56   _   _          _      _
2023-10-20 11:12:56  | | | |_ __  __| |__ _| |_ ___
2023-10-20 11:12:56  | |_| | '_ \/ _` / _` |  _/ -_)
2023-10-20 11:12:56   \___/| .__/\__,_\__,_|\__\___|
2023-10-20 11:12:56        |_|
2023-10-20 11:12:56                                                                            
2023-10-20 11:12:56 Read the migration plan to Notebook 7 to learn about the new features and the actions to take if you are using extensions.
2023-10-20 11:12:56 
2023-10-20 11:12:56 https://jupyter-notebook.readthedocs.io/en/latest/migrate_to_notebook7.html
2023-10-20 11:12:56 
2023-10-20 11:12:56 Please note that updating to Notebook 7 might break some of your extensions.
2023-10-20 11:12:56 
2023-10-20 11:12:56 INFO: [ServerApp] nbclassic | extension was successfully loaded.
2023-10-20 11:12:56 INFO: [ServerApp] Serving notebooks from local directory: /run/determined/workdir
2023-10-20 11:12:56 INFO: [ServerApp] Jupyter Server 2.7.0 is running at:
2023-10-20 11:12:56 INFO: [ServerApp] https://f9941fc1ee0b:3085/proxy/c293a8c8-31c6-4d83-a4ac-70a40e5c057b/lab
2023-10-20 11:12:56 INFO: [ServerApp]     https://127.0.0.1:3085/proxy/c293a8c8-31c6-4d83-a4ac-70a40e5c057b/lab
2023-10-20 11:12:56 INFO: [ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).

Docker containers running:

CONTAINER ID   IMAGE                                                               COMMAND                  CREATED          STATUS                    PORTS                     NAMES
f9941fc1ee0b   determinedai/environments:py-3.8-pytorch-1.12-tf-2.11-cpu-2b7e2a1   "/run/determined/jup…"   8 minutes ago    Up 8 minutes              0.0.0.0:32768->3085/tcp   infallible_grothendieck
034d5a640a92   determinedai/determined-agent:0.26.1                                "/run/determined/wor…"   13 minutes ago   Up 13 minutes                                       determined-agent-0
415688c85e56   determinedai/determined-master:0.26.1                               "/usr/bin/determined…"   13 minutes ago   Up 13 minutes             0.0.0.0:8080->8080/tcp    determined_determined-master_1
ca39ae6cde8a   postgres:10.14                                                      "docker-entrypoint.s…"   14 minutes ago   Up 14 minutes (healthy)   5432/tcp                  determined_determined-db_1

Agent logs:

2023-10-20 11:08:07 WARN[2023-10-20T09:08:07Z] no configuration file at /etc/determined/agent.yaml, skipping 
2023-10-20 11:08:07 INFO[2023-10-20T09:08:07Z] agent configuration: {"config_file":"","master_host":"host.docker.internal","master_port":8080,"agent_id":"determined-agent-0","artificial_slots":0,"slot_type":"auto","container_master_host":"","container_master_port":0,"label":"","resource_pool":"","api_enabled":false,"bind_ip":"0.0.0.0","bind_port":9090,"visible_gpus":"","tls":false,"cert_file":"","key_file":"","http_proxy":"","https_proxy":"","ftp_proxy":"","no_proxy":"","security":{"tls":{"enabled":false,"skip_verify":false,"master_cert":"","master_cert_name":"","client_cert":"","client_key":""}},"fluent":{"image":"","port":0,"container_name":""},"container_auto_remove_disabled":false,"agent_reconnect_attempts":5,"agent_reconnect_backoff":5,"hooks":{"on_connection_lost":null},"container_runtime":"docker","image_root":"","singularity_options":{"allow_network_creation":false},"podman_options":{"allow_network_creation":false},"debug":false} 
2023-10-20 11:08:07 TRAC[2023-10-20T09:08:07Z] starting main agent process                  
2023-10-20 11:08:07 TRAC[2023-10-20T09:08:07Z] connecting to master                          component=agent
2023-10-20 11:08:07 INFO[2023-10-20T09:08:07Z] connecting to master at: ws://host.docker.internal:8080/agents?id=determined-agent-0&version=0.26.1&resource_pool=&reconnect=false&hostname=docker-desktop  component=agent
2023-10-20 11:08:07 TRAC[2023-10-20T09:08:07Z] reading master set agent options message      component=agent
2023-10-20 11:08:07 TRAC[2023-10-20T09:08:07Z] running socket read loop                      component=websocket name=determined-agent-0 remote-addr="192.168.65.254:8080"
2023-10-20 11:08:07 TRAC[2023-10-20T09:08:07Z] running socket write loop                     component=websocket name=determined-agent-0 remote-addr="192.168.65.254:8080"
2023-10-20 11:08:07 TRAC[2023-10-20T09:08:07Z] detecting devices                             component=agent
2023-10-20 11:08:07 INFO[2023-10-20T09:08:07Z] detected compute devices:                    
2023-10-20 11:08:07 INFO[2023-10-20T09:08:07Z]  cpu0 ( x 6 cores)                           
2023-10-20 11:08:07 TRAC[2023-10-20T09:08:07Z] setting up docker runtime                     component=agent
2023-10-20 11:08:07 INFO[2023-10-20T09:08:07Z] couldn't process ~/.docker/config.json can't read Docker config: open /root/.docker/config.json: no such file or directory  component=docker-client
2023-10-20 11:08:07 INFO[2023-10-20T09:08:07Z] can't find any docker credential stores, continuing without them  component=docker-client
2023-10-20 11:08:07 INFO[2023-10-20T09:08:07Z] can't find any auths in ~/.docker/config.json, continuing without them  component=docker-client
2023-10-20 11:08:07 TRAC[2023-10-20T09:08:07Z] setting up container manager                  component=agent
2023-10-20 11:08:07 TRAC[2023-10-20T09:08:07Z] reattaching containers                        component=agent
2023-10-20 11:08:07 DEBU[2023-10-20T09:08:07Z] reattachContainers: expected survivors: []    component=container-manager
2023-10-20 11:08:07 DEBU[2023-10-20T09:08:07Z] reattachContainers: running containers: []    component=container-manager
2023-10-20 11:08:07 TRAC[2023-10-20T09:08:07Z] iterating expected survivors and seeing if they were found  component=container-manager
2023-10-20 11:08:07 TRAC[2023-10-20T09:08:07Z] sending SIGKILL to running containers that were not reattached  component=container-manager
2023-10-20 11:08:07 TRAC[2023-10-20T09:08:07Z] writing agent started message                 component=agent
2023-10-20 11:08:07 TRAC[2023-10-20T09:08:07Z] watching for ws requests and system events    component=agent
2023-10-20 11:09:14 TRAC[2023-10-20T09:09:14Z] starting container 3223bcc3-4c4b-4294-b95e-78268f622808  component=container-manager
2023-10-20 11:09:14 TRAC[2023-10-20T09:09:14Z] starting container launch                     component=container cproto-id=3223bcc3-4c4b-4294-b95e-78268f622808
2023-10-20 11:09:14 TRAC[2023-10-20T09:09:14Z] kicking off goroutine shim SIGKILL to cancellations, until we have launched  component=container cproto-id=3223bcc3-4c4b-4294-b95e-78268f622808
2023-10-20 11:09:14 TRAC[2023-10-20T09:09:14Z] kicking off goroutine to launch the container  component=container cproto-id=3223bcc3-4c4b-4294-b95e-78268f622808
2023-10-20 11:09:14 TRAC[2023-10-20T09:09:14Z] waiting for launch to complete                component=container cproto-id=3223bcc3-4c4b-4294-b95e-78268f622808
2023-10-20 11:09:14 TRAC[2023-10-20T09:09:14Z] pulling image                                 component=container cproto-id=3223bcc3-4c4b-4294-b95e-78268f622808
2023-10-20 11:09:14 INFO[2023-10-20T09:09:14Z] transitioning state from ASSIGNED to PULLING  component=container cproto-id=3223bcc3-4c4b-4294-b95e-78268f622808 stop="<nil>"
2023-10-20 11:12:53 TRAC[2023-10-20T09:12:53Z] creating container, copying files, etc        component=container cproto-id=3223bcc3-4c4b-4294-b95e-78268f622808
2023-10-20 11:12:53 INFO[2023-10-20T09:12:53Z] transitioning state from PULLING to STARTING  component=container cproto-id=3223bcc3-4c4b-4294-b95e-78268f622808 stop="<nil>"
2023-10-20 11:12:53 TRAC[2023-10-20T09:12:53Z] starting container                            component=container cproto-id=3223bcc3-4c4b-4294-b95e-78268f622808 docker-id=f9941fc1ee0b0941ed492c3b8818dca67c92227d0d90a4bb75e20050f5b58306
2023-10-20 11:12:53 TRAC[2023-10-20T09:12:53Z] signal-to-context shimmer exited              component=container cproto-id=3223bcc3-4c4b-4294-b95e-78268f622808
2023-10-20 11:12:53 TRAC[2023-10-20T09:12:53Z] transitioning to running state                component=container cproto-id=3223bcc3-4c4b-4294-b95e-78268f622808
2023-10-20 11:12:53 INFO[2023-10-20T09:12:53Z] transitioning state from STARTING to RUNNING  component=container cproto-id=3223bcc3-4c4b-4294-b95e-78268f622808 stop="<nil>"
2023-10-20 11:12:53 TRAC[2023-10-20T09:12:53Z] in monitoring loop                            component=container cproto-id=3223bcc3-4c4b-4294-b95e-78268f622808

@ioga
Contributor

ioga commented Oct 20, 2023

Curl inside master gets timeout:

do you have any insight why this does not work?

@jokokojote
Author

Verbose mode did not yield any more information using curl:

# curl --insecure https://172.18.0.1:32768 -v
*   Trying 172.18.0.1:32768...
* connect to 172.18.0.1 port 32768 failed: Connection timed out
* Failed to connect to 172.18.0.1 port 32768 after 128437 ms: Connection timed out
* Closing connection 0
curl: (28) Failed to connect to 172.18.0.1 port 32768 after 128437 ms: Connection timed out

I am not a docker expert, so maybe this is not relevant, but I was wondering why in your example https://127.0.0.1:32903 a localhost address was used while in my master logs 172.18.0.1 occurs. I suspected this to be linked to the determined_default network which is set up when running det deploy local cluster-up --no-gpu:

Removing network determined_default
**Creating network determined_default...**
Creating determined_determined-db_1...
Waiting for determined_determined-db_1...
Creating determined_determined-master_1...
Waiting for master instance to be available....
Starting determined-agent-0 

Containers running after trying to run jupyter:

fero@BLN-FERO1OSX ~ % docker ps                    
CONTAINER ID   IMAGE                                                               COMMAND                  CREATED          STATUS                    PORTS                     NAMES
581f2aafaaea   determinedai/environments:py-3.8-pytorch-1.12-tf-2.11-cpu-2b7e2a1   "/run/determined/jup…"   14 minutes ago   Up 14 minutes             0.0.0.0:32768->3134/tcp   youthful_stonebraker
72d5c4237bde   determinedai/determined-agent:0.26.1                                "/run/determined/wor…"   19 minutes ago   Up 19 minutes                                       determined-agent-0
17ff592e096e   determinedai/determined-master:0.26.1                               "/usr/bin/determined…"   19 minutes ago   Up 19 minutes             0.0.0.0:8080->8080/tcp    determined_determined-master_1
23646c77853a   postgres:10.14                                                      "docker-entrypoint.s…"   20 minutes ago   Up 20 minutes (healthy)   5432/tcp                  determined_determined-db_1

Inspecting the Docker networks showed that only the db and master containers are in the determined_default network. I don't know if this is intended.

docker network inspect determined_default
[
    {
        "Name": "determined_default",
        "Id": "7a594f724022b0f7da4ea03a1eec6afe9d60c73c9986427ba224f3fcd84562bc",
        "Created": "2023-10-23T08:49:10.04962575Z",
        "Scope": "local",
        "Driver": "bridge",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "172.18.0.0/16",
                    "Gateway": "172.18.0.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": true,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {
            "17ff592e096e476ec25d14da167f017f45bcd3dec2d95a65136f3ca88dfb7196": {
                "Name": "determined_determined-master_1",
                "EndpointID": "652f56185779ad583c450d9194c0fc197ec9d614c62ceb2e70f93e78a18a837b",
                "MacAddress": "02:42:ac:12:00:03",
                "IPv4Address": "172.18.0.3/16",
                "IPv6Address": ""
            },
            "23646c77853ad8521941175c6270f63961a90469875ec851befb498be75cf2cf": {
                "Name": "determined_determined-db_1",
                "EndpointID": "dc19e0e4d78e86f3c7e3a365618e19ca04c1c354bb78a9293e184ab679c586cb",
                "MacAddress": "02:42:ac:12:00:02",
                "IPv4Address": "172.18.0.2/16",
                "IPv6Address": ""
            }
        },
        "Options": {},
        "Labels": {}
    }
]

Agent is in host network mode:

fero@BLN-FERO1OSX ~ % docker network inspect host              
[
    {
        "Name": "host",
        "Id": "3f050a8b78973aafc4140e1e99e6c72d6a61f1a5c7653a64e00598379171be1f",
        "Created": "2023-09-01T09:04:53.983683458Z",
        "Scope": "local",
        "Driver": "host",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": []
        },
        "Internal": false,
        "Attachable": false,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {
            "72d5c4237bdea4076bc5683fe54bc85c2b20e0005d20b525ca3c3415de5b1601": {
                "Name": "determined-agent-0",
                "EndpointID": "457d6e7cdbf24cde2480a51af4af186e8d91c59f766d67a6d2d2db79f588eee2",
                "MacAddress": "",
                "IPv4Address": "",
                "IPv6Address": ""
            }
        },
        "Options": {},
        "Labels": {}
    }
]

Jupyter container is in bridge mode:

 fero@BLN-FERO1OSX ~ % docker network inspect bridge
[
    {
        "Name": "bridge",
        "Id": "6e350adf502865d9a91cb8664b6912dd990799de33313ef2368f267e475164b2",
        "Created": "2023-10-23T08:49:09.70243675Z",
        "Scope": "local",
        "Driver": "bridge",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "172.17.0.0/16",
                    "Gateway": "172.17.0.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": false,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {
            "581f2aafaaea9f2c3eb71428c7b4e8574a79cbe646923a0e4384c2a80d5d2c1e": {
                "Name": "youthful_stonebraker",
                "EndpointID": "85bc23b70833f03aa9c67135a4a255daff41b28702995304101c60b39843c2dd",
                "MacAddress": "02:42:ac:11:00:02",
                "IPv4Address": "172.17.0.2/16",
                "IPv6Address": ""
            }
        },
        "Options": {
            "com.docker.network.bridge.default_bridge": "true",
            "com.docker.network.bridge.enable_icc": "true",
            "com.docker.network.bridge.enable_ip_masquerade": "true",
            "com.docker.network.bridge.host_binding_ipv4": "0.0.0.0",
            "com.docker.network.bridge.name": "docker0",
            "com.docker.network.driver.mtu": "65535"
        },
        "Labels": {}
    }
]
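
The three `docker network inspect` calls above can be condensed into a single listing of the network mode of every running container. This is a sketch under assumptions: the function name `list_network_modes` is made up for illustration, and it reads the `HostConfig.NetworkMode` field that `docker inspect` reports for each container.

```shell
# list_network_modes
# Prints "<container name> <network mode>" for every running container.
list_network_modes() {
    running_ids=$(docker ps --quiet)
    if [ -z "$running_ids" ]; then
        echo "no running containers" >&2
        return 1
    fi
    # $running_ids is intentionally unquoted so that each container ID
    # is passed to docker inspect as a separate argument.
    docker inspect --format '{{.Name}} {{.HostConfig.NetworkMode}}' $running_ids
}
```

Containers that report different network modes, as the master, agent and Jupyter containers do here, cannot necessarily reach each other through a bridge gateway address such as 172.18.0.1.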

@ioga
Contributor

ioga commented Oct 23, 2023

I was able to repro the issue with det deploy local on macos, works fine on ubuntu, will investigate more.

as a temporary workaround, I can suggest installing master and agent using linux packages or homebrew which should address that problem by not having master wrapped in docker.

@ioga
Contributor

ioga commented Oct 23, 2023

@jokokojote did you do your last test on macos? or on ubuntu?

@jokokojote
Author

Last test was on macOS.

On Ubuntu I started it with:

# Start Postgres container
docker run \
    --name determined-db \
    --network host \
    -p 5432:5432 \
    -v determined_db:/var/lib/postgresql/data \
    -e POSTGRES_DB=determined \
    -e POSTGRES_PASSWORD="postgres" \
    -d \
    postgres:10

# Start Determined master node container
docker run \
    --name determined-master \
    --network host \
    -e DET_DB_HOST=localhost \
    -e DET_DB_NAME=determined \
    -e DET_DB_PORT=5432 \
    -e DET_DB_USER=postgres \
    -e DET_DB_PASSWORD="postgres" \
    -e http_proxy=http://10.56.130.176:3128 \
    -e https_proxy=http://10.56.130.176:3128 \
    -e no_proxy=localhost,127.0.0.1,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16 \
    -d \
    determinedai/determined-master:0.26.1

# Start Determined agent node container
docker run \
    --name determined-agent \
    --network host \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -e DET_MASTER_HOST=localhost \
    -e DET_MASTER_PORT=8080 \
    -e http_proxy=http://10.56.130.176:3128 \
    -e https_proxy=http://10.56.130.176:3128 \
    -e no_proxy=localhost,127.0.0.1,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16 \
    --gpus all \
    -d \
    determinedai/determined-agent:0.26.1

That was after I had run det deploy local once and had the same problem with Jupyter and TensorBoard not working.

@ioga
Contributor

ioga commented Oct 24, 2023

so the ubuntu setup has the proxy configuration. this often causes problems.

you'd need to set up task_container_defaults->environment_variables in the master config to also pass the proxy variables. this configuration cannot be passed through docker run -e; you'd need to create and mount a config file instead.

otherwise, master and agent have this config, but the spawned containers don't.
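
A rough sketch of what such a config file could look like, generated from the shell. The function name `write_master_config` is hypothetical, and the layout (a list of `NAME=VALUE` strings under `task_container_defaults.environment_variables`) is an assumption that should be checked against the master configuration reference for the version in use.

```shell
# write_master_config FILE PROXY_URL NO_PROXY_LIST
# Writes a minimal master config that passes proxy variables to task containers.
# Assumed format: environment_variables is a list of NAME=VALUE strings.
write_master_config() {
    cfg_file=$1
    cfg_proxy=$2
    cfg_no_proxy=$3
    cat > "$cfg_file" <<EOF
task_container_defaults:
  environment_variables:
    - "http_proxy=${cfg_proxy}"
    - "https_proxy=${cfg_proxy}"
    - "no_proxy=${cfg_no_proxy}"
EOF
}
```

With the values from the commands above, this would be called as `write_master_config master.yaml http://10.56.130.176:3128 localhost,127.0.0.1,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16`. The file would then be mounted into the master container, for example with `-v "$PWD"/master.yaml:/etc/determined/master.yaml` added to the `docker run` command, presumably alongside the existing `-e` options.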

@Readon

Readon commented Apr 18, 2024

I ran into almost the same problem. If the master is running on a separate server, would it be possible to access the registered address 172.18.0.1? That address is only reachable from within Docker, not LAN-wide.
Is it possible to set the IP address used for service registration through the agent.yaml config file?

@ioga
Contributor

ioga commented Apr 18, 2024

I ran into almost the same problem. If the master is running on a separate server, would it be possible to access the registered address 172.18.0.1? That address is only reachable from within Docker, not LAN-wide. Is it possible to set the IP address used for service registration through the agent.yaml config file?

Sorry, nothing comes to mind. If complex bridge networking is causing issues, you can try switching to host mode networking.

Setting up local k8s clusters is also much easier nowadays, so that's another path to consider if you don't want to maintain a raw docker setup.
