docker service list fails with "rpc error: code = Unavailable desc = transport is closing" #37849

Closed
fidesmo-mrc opened this issue Sep 14, 2018 · 5 comments

@fidesmo-mrc

Hi,

We have a physical Docker swarm with 3 managers: one of them (sys2) was updated to 18.06.1-ce, and the others (sys1 and db1) are still on 17.12.0-ce.

everything works ok, except the docker service list command from the updated node:

$ docker service list
Error response from daemon: rpc error: code = Unavailable desc = transport is closing

when this happens docker events shows:

2018-09-14T15:41:38.226952663+02:00 node update rgitjtk69u1yy2sxgiesthsz8 (name=sys2)
2018-09-14T15:41:38.634034525+02:00 node update rgitjtk69u1yy2sxgiesthsz8 (name=sys2)

Also, all connections from sys2 to the cluster (docker service logs ...) are interrupted.

Steps to reproduce the issue:

  1. update 1 node to 18.06.1-ce
  2. docker service list
  • I'm unable to reproduce the issue in my test lab

Describe the results you received:
Error response from daemon: rpc error: code = Unavailable desc = transport is closing

Describe the results you expected:
the list of services

Output of docker version:

sys2 $ docker version
Client:
 Version:           18.06.1-ce
 API version:       1.38
 Go version:        go1.10.3
 Git commit:        e68fc7a
 Built:             Tue Aug 21 17:23:03 2018
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          18.06.1-ce
  API version:      1.38 (minimum version 1.12)
  Go version:       go1.10.3
  Git commit:       e68fc7a
  Built:            Tue Aug 21 17:25:29 2018
  OS/Arch:          linux/amd64
  Experimental:     true

sys1 $ docker version
Client:
 Version:       17.12.0-ce
 API version:   1.35
 Go version:    go1.9.2
 Git commit:    c97c6d6
 Built: Wed Dec 27 20:10:14 2017
 OS/Arch:       linux/amd64

Server:
 Engine:
  Version:      17.12.0-ce
  API version:  1.35 (minimum version 1.12)
  Go version:   go1.9.2
  Git commit:   c97c6d6
  Built:        Wed Dec 27 20:12:46 2017
  OS/Arch:      linux/amd64
  Experimental: true

Output of docker info:

Containers: 9                                                                                                                                                                                 
 Running: 9                                                                                                                                                                                   
 Paused: 0
 Stopped: 0
Images: 75
Server Version: 18.06.1-ce
Storage Driver: btrfs
 Build Version: Btrfs v4.9.1
 Library Version: 102
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host ipvlan macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
 NodeID: rgitjtk69u1yy2sxgiesthsz8
 Is Manager: true
 ClusterID: wz3b9ns93up5pnowvjfr6vfjd
 Managers: 3
 Nodes: 8
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
 Autolock Managers: false
 Root Rotation In Progress: false
 Node Address: 10.0.100.20
 Manager Addresses:
  10.0.100.10:2377
  10.0.100.20:2377
  10.0.20.50:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 468a545b9edcd5932818eb9de8e72413e616e86e
runc version: 69663f0bd4b60df09991c08812a60108003fa340
init version: fec3683                                                         
Security Options:
 seccomp                          
  Profile: default
Kernel Version: 3.10.0-862.11.6.el7.x86_64
Operating System: Red Hat Enterprise Linux
OSType: linux
Architecture: x86_64
CPUs: 8                          
Total Memory: 31.22GiB
Name: sys2                
ID: RCXJ:7MQ4:HPU7:TBKC:U3XW:G7WQ:4CVG:QJ6U:N4UG:YBMJ:CHBX:ROSZ
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 170       
 Goroutines: 281  
 System Time: 2018-09-14T15:47:22.656928039+02:00
 EventsListeners: 10
Registry: https://index.docker.io/v1/
Labels:                          
Experimental: true        
Insecure Registries:
 127.0.0.0/8      
Live Restore Enabled: false
@fidesmo-mrc
Author

hi, fixed:

We reduced the number of old tasks (by lowering RestartAttempts and task-history-limit) until the response to api/tasks was under 1 MB:

sudo curl --unix-socket /run/docker.sock -X GET http://api/tasks -o tasks.json; du -hs tasks.json
948K    tasks.json
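
The size check above can be repeated while pruning old tasks. What follows is a minimal sketch, not an official tool: the function name check_tasks_size is hypothetical, and the 1 MiB default is only the threshold observed in this report, not a documented Docker limit.

```shell
#!/bin/sh
# Sketch: report whether a saved /tasks response is below a size threshold.
# The default of 1048576 bytes (1 MiB) is the value that worked for the
# reporter in this thread; it is an assumption, not a documented limit.
check_tasks_size() {
    file=$1
    limit_bytes=${2:-1048576}
    # Fail with status 2 if the file cannot be read.
    size_bytes=$(wc -c < "$file") || return 2
    # Some wc implementations pad the count with spaces; normalize it.
    size_bytes=$((size_bytes))
    if [ "$size_bytes" -lt "$limit_bytes" ]; then
        echo "ok: $size_bytes bytes (limit $limit_bytes)"
    else
        echo "too large: $size_bytes bytes (limit $limit_bytes)"
        return 1
    fi
}

# Possible usage, reusing the curl command from this comment:
#   sudo curl --unix-socket /run/docker.sock -X GET http://api/tasks -o tasks.json
#   check_tasks_size tasks.json
```

Task history retention can be lowered with docker swarm update --task-history-limit, and the per-service restart limit with docker service update --restart-max-attempts; which values are low enough depends on the cluster.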

thanks!

@thaJeztah
Member

Possibly related to #37997 (and #38123, #38103)

@ifkite

ifkite commented Feb 3, 2021

OS: CentOS 7
My k8s node goes down every time I run a job in a pod.
When I logged in to the node, I found that dockerd had been terminated (in fact, it was killed).

Command run on the node to get the dockerd log: sudo journalctl -u docker.service
The log I got:
level=error msg="failed to get event" error="rpc error: code = Unavailable desc = transport is closing" module=libcontainerd namespace=moby
level=warning msg="grpc: addrConn.createTransport failed to connect to {unix:///run/containerd/containerd.sock 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused\". Reconnecting..." module=grpc
level=error msg="failed to get event" error="rpc error: code = Unavailable desc = transport is closing" module=libcontainerd namespace=plugins.moby
level=info msg="Processing signal 'terminated'"
level=warning msg="grpc: addrConn.createTransport failed to connect to {unix:///run/containerd/containerd.sock 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused\". Reconnecting..." module=grpc
Command to check the system log: grep "Out of memory" /var/log/messages

My guess is that a task running in the container caused the problem above: the app consumed a lot of memory and triggered an OOM. The task was a training job running in a container, and it was killed.

Solution: optimize the task to consume less memory, or run it on another node with enough memory.
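
To see which processes the OOM killer actually chose, the grep above can be extended to count victims by name. This is a rough sketch under assumptions: the function name oom_victims is hypothetical, and the pattern assumes kernel lines shaped like "Out of memory: Kill process 1234 (name) ..." or "Out of memory: Killed process 1234 (name) ...", which may differ on other kernels.

```shell
#!/bin/sh
# Sketch: count OOM-killer victims by process name in a syslog-style file.
# Assumed line shape: "... Out of memory: Kill process <pid> (<name>) ..."
# (also matches the "Killed process" wording). Lines that do not match
# the assumed shape are silently skipped.
oom_victims() {
    logfile=${1:-/var/log/messages}
    grep 'Out of memory' "$logfile" \
        | sed -n 's/.*Out of memory: Kill[a-z]* process [0-9]* (\([^)]*\)).*/\1/p' \
        | sort | uniq -c | sort -rn
}

# Possible usage (reading /var/log/messages usually requires root):
#   oom_victims /var/log/messages
```

If dockerd or containerd shows up in the output, the daemon itself was killed; if only the workload's process shows up, the daemon may have gone down for a different reason.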

@shivam2202

Same issue with Docker version 20.10.7, build 20.10.7-0ubuntu1~18.04.1. Any clues on this?

Aug 06 06:19:37 ismv-appthl5 dockerd[12066]: time="2021-07-21T06:19:37.276314116+03:00" level=error msg="failed to get event" error="rpc error: code = Unavailable desc = transport is closing" module=libcontainerd namespace=moby
Aug 06 06:19:37 ismv-appthl5 dockerd[12066]: time="2021-07-21T06:19:37.276339636+03:00" level=error msg="failed to get event" error="rpc error: code = Unavailable desc = transport is closing" module=libcontainerd namespace=plugins.moby

@thaJeztah
Member

@shivam2202 it's a different error in your case. It's also an "rpc" error, but it occurs because docker failed to connect to containerd (libcontainerd is the containerd client in the docker daemon). It could be that containerd crashed or that something ran out of resources. The error itself is not very informative (it just means the gRPC connection with containerd was closed), so I'd recommend checking whether other log entries around that timeframe give more details that help you find the cause. Also be sure to check the logs for containerd itself and the system logs.
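
One way to look at the dockerd and containerd logs side by side for a given time window is sketched below. It assumes a systemd host where the units are named docker.service and containerd.service; the function name logs_around and the DRY_RUN switch are hypothetical helpers for illustration.

```shell
#!/bin/sh
# Sketch: show dockerd and containerd journal entries in a time window.
# The unit names are assumptions; adjust them to your distribution.
logs_around() {
    start=$1
    end=$2
    set -- journalctl --no-pager -u docker.service -u containerd.service \
        --since "$start" --until "$end"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        # Print the command instead of running it (quoting is not preserved).
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Possible usage, with example timestamps around the failure:
#   logs_around '2021-07-21 06:15:00' '2021-07-21 06:25:00'
```

Interleaving both units in one view makes it easier to tell whether containerd stopped before dockerd reported the closed transport.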
