kubernetes-e2e-gci-gke-slow: broken test run #34665

Closed
k8s-github-robot opened this issue Oct 13, 2016 · 110 comments

Labels
kind/flake Categorizes issue or PR as related to a flaky test. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now.

Comments

@k8s-github-robot

Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/710/

Run so broken it didn't make JUnit output!

k8s-github-robot added kind/flake Categorizes issue or PR as related to a flaky test. priority/backlog Higher priority than priority/awaiting-more-evidence. area/test-infra labels Oct 13, 2016
@k8s-github-robot

Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/711/

Run so broken it didn't make JUnit output!

@k8s-github-robot

Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/714/

Run so broken it didn't make JUnit output!

k8s-github-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/backlog Higher priority than priority/awaiting-more-evidence. labels Oct 13, 2016
fejta assigned apelisse and unassigned fejta Oct 13, 2016
@fejta (Contributor) commented Oct 13, 2016

Not test-infra

@k8s-github-robot

Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/722/

Run so broken it didn't make JUnit output!

k8s-github-robot added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Oct 13, 2016
@k8s-github-robot

Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/740/

Run so broken it didn't make JUnit output!

@k8s-github-robot

Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/776/

Run so broken it didn't make JUnit output!

@k8s-github-robot

Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/781/

Run so broken it didn't make JUnit output!

@k8s-github-robot

Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/799/

Run so broken it didn't make JUnit output!

@k8s-github-robot

[FLAKE-PING] @apelisse

This flaky-test issue would love to have more attention.

@k8s-github-robot

Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/801/

Run so broken it didn't make JUnit output!

@k8s-github-robot

Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/813/

Run so broken it didn't make JUnit output!

@k8s-github-robot

Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/816/

Run so broken it didn't make JUnit output!

@k8s-github-robot

Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/837/

Run so broken it didn't make JUnit output!

@k8s-github-robot

Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/843/

Run so broken it didn't make JUnit output!

@k8s-github-robot

Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/853/

Run so broken it didn't make JUnit output!

@k8s-github-robot

[FLAKE-PING] @apelisse

This flaky-test issue would love to have more attention.

@k8s-github-robot

Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/856/

Run so broken it didn't make JUnit output!

@k8s-github-robot

Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/866/

Run so broken it didn't make JUnit output!

@k8s-github-robot

Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/872/

Run so broken it didn't make JUnit output!

@j3ffml (Contributor) commented Nov 3, 2016

All cluster resources were brought up, but the cluster API is reporting that only 2 nodes out of 3 have registered.

There's a good chance that's the same issue with missing image layers that was causing failures on 1.4.5 kubelet. cc @dchen1107

@bprashanth (Contributor) commented Nov 3, 2016

Hmm the node unhealthy issue is basically not debuggable without a live repro, no logs in artifacts: http://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/1391 ?

(I took the links Saad posted and replaced the prefix with http://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow, unless we have a new situation for log aggregation)

@saad-ali (Member) commented Nov 3, 2016

Hmm the node unhealthy issue is basically not debuggable without a live repro, no logs in artifacts: http://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/1391 ?

I'm seeing "found 3 nodes out of expected 4. Your cluster may not be fully functional" in the kubernetes-e2e-gci-gce test suite failing now; any chance that is related?

There's a good chance that's the same issue with missing image layers that was causing failures on 1.4.5 kubelet. cc @dchen1107

@dchen1107 @yujuhong Can you help confirm?

@saad-ali (Member) commented Nov 3, 2016

Also potentially related, kubernetes-kubemark-500-gce tests are running into "Detected 6 ready nodes, found 6 nodes out of expected 7. Your cluster may not be fully functional.":

I'm going to see if I can hunt down a PR that is responsible and revert it.

@bprashanth (Contributor)

Fwiw, from those logs (#34665 (comment)), the node is definitely created by the MIG (managed instance group), but we see no Kubernetes node object. That probably means the VM didn't come up, kept rebooting, was disconnected from the master, or the kubelet had some startup issue. The first three sound like general GCE bugs. Importantly, we don't see a NotReady node.
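
A quick way to confirm that split on a live repro (a sketch only; the instance group and zone names below are placeholders, not values from this run) is to compare what the MIG created against what the apiserver has registered:

# VMs the managed instance group actually created (placeholder group/zone).
gcloud compute instance-groups managed list-instances e2e-test-minion-group --zone us-central1-b

# Node objects the cluster API knows about; a VM that is RUNNING above but
# missing here matches the "created by the MIG, but no node object" symptom.
kubectl get nodes -o wide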

@k8s-github-robot

Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/1396/

Run so broken it didn't make JUnit output!

@saad-ali (Member) commented Nov 3, 2016

continuous-e2e-docker-validation-gci also hit it ("Detected 3 ready nodes, found 3 nodes out of expected 4. Your cluster may not be fully functional."):

Fwiw from those logs (#34665 (comment)), the node is definitely created by the MIG, but we see no kubernetes node object. Probably means the vm didn't come up, kept rebooting, was disconnected from master, or kubelet had some startup issue. First 3 sound like general gce bugs. Importantly, we don't see a NotReady node.

@kubernetes/test-infra-admins Is there a good way to identify whether this is a GCE issue or a k8s one? Whenever this issue is hit, the logs are not copied from the node machines.

@saad-ali (Member) commented Nov 3, 2016

1.4 tests do not seem to be hitting this ("Your cluster may not be fully functional"); only master tests are, which would indicate that the problem is with k8s, not GCE.

@saad-ali (Member) commented Nov 3, 2016

CC @thockin @matchstick

The submit queue has been red most of the day due to this issue. I'm trying to cut the 1.5 branch and I can't find a green build after Nov 1. If we are not able to resolve this by tomorrow morning we will convene a "war room" to try and tackle it.

@kubernetes/test-infra-admins Is there something we can do to get logs off the node and master machines when this issue is hit? When this issue is hit, artifacts is empty (e.g. https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gce/2313/).

@bprashanth (Contributor) commented Nov 4, 2016

This seems to be a startup problem on GCI. @kubernetes/goog-node @vishh

I think the crash is in one of these functions: https://github.com/kubernetes/kubernetes/blob/master/cluster/gce/gci/configure-helper.sh#L1211-L1216
And "pre-warm-mounter" was added very recently in #35821; why do we not just retry on failure?

Here's a bunch of debug info to support the theory:
kubelet not running

beeps@e2e-test-beeps-minion-group-59oq ~ $ sudo systemctl status kubelet
● kubelet.service - Kubernetes kubelet
   Loaded: loaded (/usr/lib/systemd/system/kubelet.service; disabled; vendor preset: disabled)
   Active: inactive (dead)

beeps@e2e-test-beeps-minion-group-59oq ~ $ sudo journalctl -u kubelet
-- No entries --

unit file looks sane

beeps@e2e-test-beeps-minion-group-59oq ~ $ cat /usr/lib/systemd/system/kubelet.service
[Unit]
Description=Kubernetes kubelet
Requires=network-online.target
After=network-online.target

[Service]
Restart=always
RestartSec=10
EnvironmentFile=/etc/default/kubelet
ExecStartPre=/bin/mkdir -p /etc/kubernetes/manifests
ExecStart=/usr/bin/kubelet $KUBELET_OPTS

[Install]
WantedBy=multi-user.target

defaults are messed up

beeps@e2e-test-beeps-minion-group-59oq ~ $ cat /etc/default/kubelet
KUBELET_OPTS="--manifest-url=http://metadata.google.internal/computeMetadata/v1/instance/attributes/google-container-manifest --manifest-url-header=Metadata-Flavor:Google --config=/etc/kubernetes/manifests"

beeps@e2e-test-beeps-minion-group-59oq ~ $ cat /etc/default/docker
cat: /etc/default/docker: No such file or directory

beeps@e2e-test-beeps-minion-group-59oq ~ $ ls /etc/kubernetes/manifests/
beeps@e2e-test-beeps-minion-group-59oq ~ $ 

health monitors are working, at least

beeps@e2e-test-beeps-minion-group-59oq ~ $ sudo journalctl -f -u kube-docker-monitor
-- Logs begin at Thu 2016-11-03 22:48:02 UTC. --
Nov 03 22:53:45 e2e-test-beeps-minion-group-59oq health-monitor.sh[1352]: Start kubernetes health monitoring for docker


beeps@e2e-test-beeps-minion-group-59oq ~ $ sudo journalctl -f -u kubelet-monitor
-- Logs begin at Thu 2016-11-03 22:48:02 UTC. --
Nov 03 23:26:45 e2e-test-beeps-minion-group-59oq health-monitor.sh[1363]: Dload  Upload   Total   Spent    Left  Speed
Nov 03 23:26:45 e2e-test-beeps-minion-group-59oq health-monitor.sh[1363]: 0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (7) Failed to connect to 127.0.0.1 port 10250: Connection refused
Nov 03 23:27:45 e2e-test-beeps-minion-group-59oq health-monitor.sh[1363]: Kubelet is unhealthy!
Nov 03 23:27:45 e2e-test-beeps-minion-group-59oq health-monitor.sh[1363]: % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
Nov 03 23:27:45 e2e-test-beeps-minion-group-59oq health-monitor.sh[1363]: Dload  Upload   Total   Spent    Left  Speed
Nov 03 23:27:45 e2e-test-beeps-minion-group-59oq health-monitor.sh[1363]: 0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (7) Failed to connect to 127.0.0.1 port 10250: Connection refused
Nov 03 23:28:45 e2e-test-beeps-minion-group-59oq health-monitor.sh[1363]: Kubelet is unhealthy!
Nov 03 23:28:45 e2e-test-beeps-minion-group-59oq health-monitor.sh[1363]: % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
Nov 03 23:28:45 e2e-test-beeps-minion-group-59oq health-monitor.sh[1363]: Dload  Upload   Total   Spent    Left  Speed
Nov 03 23:28:45 e2e-test-beeps-minion-group-59oq health-monitor.sh[1363]: 0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (7) Failed to connect to 127.0.0.1 port 10250: Connection refused

node installation passed


beeps@e2e-test-beeps-minion-group-59oq ~ $ sudo systemctl status kube-node-installation
● kube-node-installation.service - Download and install k8s binaries and configurations
   Loaded: loaded (/etc/systemd/system/kube-node-installation.service; enabled; vendor preset: disabled)
   Active: active (exited) since Thu 2016-11-03 22:48:34 UTC; 1h 1min ago
  Process: 1091 ExecStart=/home/kubernetes/bin/configure.sh (code=exited, status=0/SUCCESS)
  Process: 1087 ExecStartPre=/bin/chmod 544 /home/kubernetes/bin/configure.sh (code=exited, status=0/SUCCESS)
  Process: 1082 ExecStartPre=/usr/bin/curl --fail --retry 5 --retry-delay 3 --silent --show-error -H X-Google-Metadata-Request: True -o /home/kubernetes/bin/configure.sh http://metadata.google.internal/computeMetadata/v1/instance/attributes/configure-sh (code=exited, status=0/SUCCESS)
  Process: 1076 ExecStartPre=/bin/mount -o remount,exec /home/kubernetes/bin (code=exited, status=0/SUCCESS)
  Process: 1072 ExecStartPre=/bin/mount --bind /home/kubernetes/bin /home/kubernetes/bin (code=exited, status=0/SUCCESS)
  Process: 1067 ExecStartPre=/bin/mkdir -p /home/kubernetes/bin (code=exited, status=0/SUCCESS)
 Main PID: 1091 (code=exited, status=0/SUCCESS)
   Memory: 0B
      CPU: 0
   CGroup: /system.slice/kube-node-installation.service

Nov 03 22:48:33 e2e-test-beeps-minion-group-59oq configure.sh[1091]: Downloading k8s manifests tar
Nov 03 22:48:33 e2e-test-beeps-minion-group-59oq configure.sh[1091]: % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
Nov 03 22:48:33 e2e-test-beeps-minion-group-59oq configure.sh[1091]: Dload  Upload   Total   Spent    Left  Speed
Nov 03 22:48:33 e2e-test-beeps-minion-group-59oq configure.sh[1091]: [155B blob data]
Nov 03 22:48:33 e2e-test-beeps-minion-group-59oq configure.sh[1091]: == Downloaded https://storage.googleapis.com/kubernetes-staging-ae13154a31/e2e-test-beeps-devel/kubernetes-manifests.tar.gz (SHA1 = 8b94ee874d75d4264c5a9e382ffc210e4fe8d73d) ==
Nov 03 22:48:33 e2e-test-beeps-minion-group-59oq configure.sh[1091]: % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
Nov 03 22:48:33 e2e-test-beeps-minion-group-59oq configure.sh[1091]: Dload  Upload   Total   Spent    Left  Speed
Nov 03 22:48:33 e2e-test-beeps-minion-group-59oq configure.sh[1091]: [234B blob data]
Nov 03 22:48:33 e2e-test-beeps-minion-group-59oq configure.sh[1091]: == Downloaded https://storage.googleapis.com/kubernetes-release/rkt/rkt-v1.18.0 (SHA1 = 75fc8f29c79bc9e505f3e7f6e8fadf2425c21967) ==
Nov 03 22:48:34 e2e-test-beeps-minion-group-59oq configure.sh[1091]: Done for installing kubernetes files

node configuration failed

beeps@e2e-test-beeps-minion-group-59oq ~ $ sudo systemctl status kube-node-configuration
● kube-node-configuration.service - Configure kubernetes node
   Loaded: loaded (/etc/systemd/system/kube-node-configuration.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Thu 2016-11-03 22:53:45 UTC; 54min ago
  Process: 1165 ExecStart=/home/kubernetes/bin/configure-helper.sh (code=exited, status=254)
  Process: 1161 ExecStartPre=/bin/chmod 544 /home/kubernetes/bin/mounter (code=exited, status=0/SUCCESS)
  Process: 1157 ExecStartPre=/bin/chmod 544 /home/kubernetes/bin/configure-helper.sh (code=exited, status=0/SUCCESS)
 Main PID: 1165 (code=exited, status=254)

Nov 03 22:48:34 e2e-test-beeps-minion-group-59oq configure-helper.sh[1165]: Start to configure instance for kubernetes
Nov 03 22:48:34 e2e-test-beeps-minion-group-59oq configure-helper.sh[1165]: Configuring IP firewall rules
Nov 03 22:48:34 e2e-test-beeps-minion-group-59oq configure-helper.sh[1165]: Add rules to accept all inbound TCP/UDP/ICMP packets
Nov 03 22:48:34 e2e-test-beeps-minion-group-59oq configure-helper.sh[1165]: Add rules to accept all forwarded TCP/UDP/ICMP packets
Nov 03 22:48:34 e2e-test-beeps-minion-group-59oq configure-helper.sh[1165]: Creating required directories
Nov 03 22:48:34 e2e-test-beeps-minion-group-59oq configure-helper.sh[1165]: Making /var/lib/kubelet executable for kubelet
Nov 03 22:48:34 e2e-test-beeps-minion-group-59oq configure-helper.sh[1165]: No local SSD disks found.
Nov 03 22:48:34 e2e-test-beeps-minion-group-59oq configure-helper.sh[1165]: Creating kubelet kubeconfig file
Nov 03 22:48:34 e2e-test-beeps-minion-group-59oq configure-helper.sh[1165]: Creating kube-proxy kubeconfig file

It's suspicious that it stopped at kube-proxy, but the tar file exists:

beeps@e2e-test-beeps-minion-group-59oq ~ $ ls /home/kubernetes/kube-docker-files
kube-apiserver.docker_tag  kube-controller-manager.docker_tag  kube-proxy.docker_tag  kube-proxy.tar  kube-scheduler.docker_tag

and so does its kubeconfig

beeps@e2e-test-beeps-minion-group-59oq ~ $ ls /var/lib/kube-proxy/
kubeconfig

and there are only 3 calls between that line and the next one in a successful run, i.e.

Creating kubelet kubeconfig file
Creating kube-proxy kubeconfig file
Assemble docker command line flags

Hence the theory about pre-warm-mounter; the mounter binary exists:

beeps@e2e-test-beeps-minion-group-59oq ~ $ stat /home/kubernetes/bin/mounter
  File: '/home/kubernetes/bin/mounter'
  Size: 1587        Blocks: 8          IO Block: 4096   regular file
Device: 801h/2049d  Inode: 5943        Links: 1
Access: (0544/-r-xr--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2016-11-03 22:48:34.418955732 +0000
Modify: 2016-11-03 22:48:33.532955732 +0000
Change: 2016-11-03 22:48:34.370955732 +0000
 Birth: -

So, just executing the configure helper manually:

beeps@e2e-test-beeps-minion-group-59oq ~ $ sudo /home/kubernetes/bin/configure-helper.sh
Start to configure instance for kubernetes
Configuring IP firewall rules
Add rules to accept all inbound TCP/UDP/ICMP packets
Add rules to accept all forwarded TCP/UDP/ICMP packets
Creating required directories
Making /var/lib/kubelet executable for kubelet
No local SSD disks found.
Creating kubelet kubeconfig file
Creating kube-proxy kubeconfig file
Assemble docker command line flags
Enable docker registry mirror at: https://us-mirror.gcr.io
Docker command line is updated. Restart docker to pick it up
Start loading kube-system docker images
Try to load docker image file /home/kubernetes/kube-docker-files/kube-proxy.tar
d8bd0657b25f: Loading layer [==================================================>] 130.9 MB/130.9 MB
a582cd499e0f: Loading layer [==================================================>] 1.024 kB/1.024 kB
455e869e08c9: Loading layer [==================================================>] 5.077 MB/5.077 MB
14a8cdab7b6e: Loading layer [==================================================>] 45.33 MB/45.33 MB
Start kubelet
Using kubelet binary at /home/kubernetes/bin/kubelet
Start kube-proxy pod
Start image-puller
Start fluentd pod
Done for the configuration for kubernetes

And kubelet starts up with the right defaults

beeps@e2e-test-beeps-minion-group-59oq ~ $ sudo systemctl status kubelet
● kubelet.service - Kubernetes kubelet
   Loaded: loaded (/etc/systemd/system/kubelet.service; disabled; vendor preset: disabled)
   Active: active (running) since Fri 2016-11-04 00:07:46 UTC; 26s ago
 Main PID: 20707 (kubelet)
   Memory: 21.4M
      CPU: 898ms
   CGroup: /system.slice/kubelet.service
           ├─20707 /home/kubernetes/bin/kubelet --v=4 --max-pods=110 --serialize-image-pulls=false --outofdisk-transition-frequency=0 --allow-privileged=true --babysit-daemons=true --cgroup-root=/ --cloud-provider=gce --cluster-dns=10.0.0.10 --cluster-domain=c...
           └─20773 journalctl -k -f

Nov 04 00:07:52 e2e-test-beeps-minion-group-59oq kubelet[20707]: I1104 00:07:52.269866   20707 kubelet.go:1648] skipping pod synchronization - [Kubenet does not have netConfig. This is most likely due to lack of PodCIDR]
Nov 04 00:07:57 e2e-test-beeps-minion-group-59oq kubelet[20707]: I1104 00:07:57.270024   20707 kubelet.go:1648] skipping pod synchronization - [Kubenet does not have netConfig. This is most likely due to lack of PodCIDR]
Nov 04 00:07:57 e2e-test-beeps-minion-group-59oq kubelet[20707]: I1104 00:07:57.529997   20707 conversion.go:134] failed to handle multiple devices for container. Skipping Filesystem stats
Nov 04 00:07:57 e2e-test-beeps-minion-group-59oq kubelet[20707]: I1104 00:07:57.536935   20707 eviction_manager.go:217] eviction manager: no resources are starved
Nov 04 00:08:02 e2e-test-beeps-minion-group-59oq kubelet[20707]: I1104 00:08:02.270188   20707 kubelet.go:1648] skipping pod synchronization - [Kubenet does not have netConfig. This is most likely due to lack of PodCIDR]
Nov 04 00:08:07 e2e-test-beeps-minion-group-59oq kubelet[20707]: I1104 00:08:07.270366   20707 kubelet.go:1648] skipping pod synchronization - [Kubenet does not have netConfig. This is most likely due to lack of PodCIDR]
Nov 04 00:08:07 e2e-test-beeps-minion-group-59oq kubelet[20707]: I1104 00:08:07.546490   20707 conversion.go:134] failed to handle multiple devices for container. Skipping Filesystem stats
Nov 04 00:08:07 e2e-test-beeps-minion-group-59oq kubelet[20707]: I1104 00:08:07.546525   20707 conversion.go:134] failed to handle multiple devices for container. Skipping Filesystem stats
Nov 04 00:08:07 e2e-test-beeps-minion-group-59oq kubelet[20707]: I1104 00:08:07.557709   20707 eviction_manager.go:217] eviction manager: no resources are starved
Nov 04 00:08:12 e2e-test-beeps-minion-group-59oq kubelet[20707]: I1104 00:08:12.270927   20707 kubelet.go:1648] skipping pod synchronization - [Kubenet does not have netConfig. This is most likely due to lack of PodCIDR]
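
For completeness, the manual run above is roughly equivalent to re-driving the failed unit through systemd (unit names are taken from the status output above; this is only a sketch of the recovery step, not a fix for the underlying flake):

# kube-node-configuration's ExecStart is the same configure-helper.sh, so
# restarting the unit re-runs the failed configuration step.
sudo systemctl restart kube-node-configuration
sudo systemctl status kube-node-configuration kubelet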

@bprashanth (Contributor)

Anyway, I have #36202 for a couple of lines of logging.

@saad-ali (Member) commented Nov 4, 2016

Excellent debugging @bprashanth! The evidence you provided strongly indicates PR #35821 is the culprit. The merge queue was blocked most of the day due to this issue (currently 58 outstanding PRs in the queue). So unless we have some other quick fix, we'll need to revert that PR. @vishh @mtaufen @jingxu97 thoughts?

CC @gmarek Warsaw on-call

@bprashanth (Contributor) commented Nov 4, 2016

This failed while pulling the mounter image, which is something you definitely want to retry, especially during node bringup:

beeps@e2e-test-beeps-minion-group-78ao ~ $ sudo journalctl -u  kube-node-configuration --follow
-- Logs begin at Fri 2016-11-04 01:43:39 UTC. --
Nov 04 01:44:28 configure-helper.sh[1146]: Creating kubelet kubeconfig file
Nov 04 01:44:28 configure-helper.sh[1146]: Creating kube-proxy kubeconfig file
Nov 04 01:44:28 configure-helper.sh[1146]: overriding kubectl
Nov 04 01:44:28 configure-helper.sh[1146]: prewarming mounter
Nov 04 01:44:28 configure-helper.sh[1146]: Running mount using a rkt fly container
Nov 04 01:44:29 configure-helper.sh[1146]: Downloading ACI:  0 B/5.9 MB
Nov 04 01:44:29 configure-helper.sh[1146]: Downloading ACI:  16.3 KB/5.9 MB
Nov 04 01:44:30 configure-helper.sh[1146]: Downloading ACI:  5.19 MB/5.9 MB
Nov 04 01:44:30 configure-helper.sh[1146]: Downloading ACI:  5.9 MB/5.9 MB
Nov 04 01:44:49 configure-helper.sh[1146]: run: error getting repository data: Get https://gcr.io/v1/repositories/google_containers/gci-mounter/images: dial tcp 74.125.70.82:443: i/o timeout
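
A minimal sketch of the kind of retry that would cover this (a hypothetical helper, not the actual configure-helper.sh code; the fix that actually went in was the revert below):

# Hypothetical wrapper: re-run a flaky bringup step (e.g. pre-warming the
# gci-mounter image) a few times with a growing delay before failing the unit.
retry() {
  local attempt
  for attempt in 1 2 3 4 5; do
    "$@" && return 0
    echo "attempt ${attempt} of '$*' failed, retrying in $((attempt * 5))s" >&2
    sleep $((attempt * 5))
  done
  return 1
}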

@saad-ali (Member) commented Nov 4, 2016

Submit queue is red again and we have 55 unmerged PRs. I'm going to revert PR #35821.

@thockin (Member) commented Nov 4, 2016

Thanks.


@bprashanth (Contributor)

FWIW, an unretried image pull during node bringup is Russian roulette; that's one of the reasons we docker-load master components/kube-proxy from a .gz.
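
To illustrate the contrast (the tarball path is the one from the configure-helper output earlier in this thread; this is a sketch, not the exact bringup code):

# Loaded from a tarball already on the node's disk: no registry, no network.
docker load -i /home/kubernetes/kube-docker-files/kube-proxy.tar

# The mounter image, by contrast, was fetched over the network at bringup
# (via a rkt fly container), which is exactly where the i/o timeout hit.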

@bprashanth (Contributor)

Now, regarding the weird timeouts: there are some pretty slow tests in the suite. In order, I think these are the slowest (the slowest takes ~30m):

• [SLOW TEST:1629.824 seconds]
[k8s.io] Pods
/usr/local/google/home/beeps/goproj/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:823
  should cap back-off at MaxContainerBackOff [Slow]
  /usr/local/google/home/beeps/goproj/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/common/pods.go:659
------------------------------
• [SLOW TEST:438.495 seconds]
[k8s.io] Services
/usr/local/google/home/beeps/goproj/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:823
  should be able to change the type and ports of a service [Slow]
  /usr/local/google/home/beeps/goproj/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/service.go:805
------------------------------
• [SLOW TEST:425.851 seconds]
[k8s.io] Pods
/usr/local/google/home/beeps/goproj/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:823
  should have their auto-restart back-off timer reset on image update [Slow]
  /usr/local/google/home/beeps/goproj/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/common/pods.go:604
------------------------------
• [SLOW TEST:343.304 seconds]
[k8s.io] ScheduledJob
/usr/local/google/home/beeps/goproj/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:823
  should not schedule new jobs when ForbidConcurrent [Slow]
  /usr/local/google/home/beeps/goproj/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/scheduledjob.go:125
------------------------------
• [SLOW TEST:307.199 seconds]
[k8s.io] ScheduledJob
/usr/local/google/home/beeps/goproj/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:823
  should not schedule jobs when suspended [Slow]
  /usr/local/google/home/beeps/goproj/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/scheduledjob.go:94
------------------------------
• [SLOW TEST:245.555 seconds]
[k8s.io] Dynamic provisioning
/usr/local/google/home/beeps/goproj/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:823
  [k8s.io] DynamicProvisioner Alpha
  /usr/local/google/home/beeps/goproj/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:823
    should create and delete alpha persistent volumes [Slow]

But they should all run in parallel, so I don't think that's a big deal. We should probably optimize anything that isn't specifically soaking and still takes > 15m, though.

Anyway, I've tightened some timeouts in the nodePort test based on a previous guess by Jeff (#34665 (comment)); let's see if they produce more isolated failures: #36208

@jingxu97 (Contributor) commented Nov 4, 2016

I found this thread in Docker, so it seems other users also experience this random timeout issue: moby/moby#13337

Is it possible to build the Docker image locally instead of using a remote repository, to make it more reliable?

@k8s-github-robot

Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/1415/

Run so broken it didn't make JUnit output!

@bprashanth (Contributor)

@jingxu97 The i/o timeout is probably just a normal network-connectivity flake during node startup. There are two issues conflated into this bug: one is the mounter problem, for which we reverted the PR; the second is a suite timeout, which is probably one of our tests or the framework itself doing something weird. The first problem was hitting more often and across more suites than the second, so I'd expect overall flakiness to go down now.

k8s-github-robot pushed a commit that referenced this issue Nov 4, 2016
Automatic merge from submit-queue

Stricter timeouts for nodePort curling

If the timeouts are indeed because of  #34665 (comment), stricter timeouts will probably surface as a more isolated failure
@bprashanth (Contributor)

I was able to grab an strace on a hanging curl. I think we're hitting an established keepalive connection that avoids NAT to the endpoints when we toggle the nodePort on/off. My timeout PR should at least kill the connection; we should also try to get https://github.com/kubernetes/kubernetes/pull/36130/files in.

Debug info:
Working nodeport service:

/ # while true; do curl --keepalive-time 100 --connect-timeout 1 http://104.154.134.229:30945/hostName; done
netserver-1netserver-1netserver-1netserver-2netserver-1netserver-2netserver-0netserver-2netserver-2netserver-2netserver-2

Toggle nodeport off

netserver-1netserver-1netserver-0netserver-0netserver-2curl: (7) Failed to connect to 104.154.134.229 port 30945: Connection refused
curl: (7) Failed to connect to 104.154.134.229 port 30945: Connection refused
curl: (7) Failed to connect to 104.154.134.229 port 30945: Connection refused
curl: (7) Failed to connect to 104.154.134.229 port 30945: Connection refused
curl: (7) Failed to connect to 104.154.134.229 port 30945: Connection refuse

Toggle nodeport back on

curl: (7) Failed to connect to 104.154.134.229 port 30945: Connection refused
curl: (7) Failed to connect to 104.154.134.229 port 30945: Connection refused

... silence

On the node

beeps@e2e-test-beeps-minion-group-vjgg ~ $ ps aux | grep curl
beeps    13861  0.0  0.0   4444   744 pts/2    S+   18:51   0:00 grep --colour=auto curl
root     19315  0.0  0.0  16712  3108 pts/1    S+   17:47   0:00 curl --keepalive-time 100 --connect-timeout 1 http://104.154.134.229:30945/hostName

strace it

strace -p 19315
Process 19315 attached
restart_syscall(<... resuming interrupted call ...>) = 0
clock_gettime(CLOCK_MONOTONIC, {349490, 795942869}) = 0
clock_gettime(CLOCK_MONOTONIC, {349490, 796205339}) = 0
rt_sigaction(SIGPIPE, NULL, {SIG_IGN, [], SA_RESTORER|SA_RESTART, 0x7fbb83c481df}, 8) = 0
rt_sigaction(SIGPIPE, {SIG_IGN, [], SA_RESTORER|SA_RESTART, 0x7fbb83c481df}, NULL, 8) = 0
poll([{fd=3, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 0 (Timeout)
clock_gettime(CLOCK_MONOTONIC, {349490, 796465281}) = 0
clock_gettime(CLOCK_MONOTONIC, {349490, 796526072}) = 0
clock_gettime(CLOCK_MONOTONIC, {349490, 796578494}) = 0
rt_sigaction(SIGPIPE, {SIG_IGN, [], SA_RESTORER|SA_RESTART, 0x7fbb83c481df}, NULL, 8) = 0

What's behind fd 3

root@e2e-test-beeps-minion-group-vjgg:~# ls -lh /proc/19315/fd
total 0
lrwx------ 1 root root 64 Nov  4 18:27 0 -> /dev/pts/1
lrwx------ 1 root root 64 Nov  4 18:27 1 -> /dev/pts/1
lrwx------ 1 root root 64 Nov  4 18:23 2 -> /dev/pts/1
lrwx------ 1 root root 64 Nov  4 18:27 3 -> socket:[32557556]

what's behind inode 32557556

beeps@e2e-test-beeps-minion-group-vjgg ~ $ sudo netstat -apeen | grep 32557556
tcp        0      0 10.240.0.5:55702        104.154.134.229:30945   ESTABLISHED 0          32557556   19315/curl 

What does netfilter say about that connection

root@e2e-test-beeps-minion-group-vjgg:~# cat /proc/net/ip_conntrack | grep 55702
tcp      6 86383 ESTABLISHED src=10.240.0.5 dst=104.154.134.229 sport=55702 dport=30945 src=104.154.134.229 dst=10.240.0.5 sport=30945 dport=55702 [ASSURED] mark=0 use=2

what about iptables for the nodeport

beeps@e2e-test-beeps-minion-group-vjgg ~ $ sudo iptables-save | grep -i 30945
-A KUBE-NODEPORTS -p tcp -m comment --comment "e2e-tests-nettest-72krw/node-port-service:http" -m tcp --dport 30945 -j KUBE-MARK-MASQ
-A KUBE-NODEPORTS -p tcp -m comment --comment "e2e-tests-nettest-72krw/node-port-service:http" -m tcp --dport 30945 -j KUBE-SVC-OJUE3M7AGRFF5BFX
beeps@e2e-test-beeps-minion-group-vjgg ~ $ sudo iptables-save | grep -i KUBE-SVC-OJUE3M7AGRFF5BFX
:KUBE-SVC-OJUE3M7AGRFF5BFX - [0:0]
-A KUBE-NODEPORTS -p tcp -m comment --comment "e2e-tests-nettest-72krw/node-port-service:http" -m tcp --dport 30945 -j KUBE-SVC-OJUE3M7AGRFF5BFX
-A KUBE-SERVICES -d 10.0.241.158/32 -p tcp -m comment --comment "e2e-tests-nettest-72krw/node-port-service:http cluster IP" -m tcp --dport 80 -j KUBE-SVC-OJUE3M7AGRFF5BFX
-A KUBE-SVC-OJUE3M7AGRFF5BFX -m comment --comment "e2e-tests-nettest-72krw/node-port-service:http" -m statistic --mode random --probability 0.33332999982 -j KUBE-SEP-34FVYRBNIRMMAAXC
-A KUBE-SVC-OJUE3M7AGRFF5BFX -m comment --comment "e2e-tests-nettest-72krw/node-port-service:http" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-NCEBCBJDB4QT4UDU
-A KUBE-SVC-OJUE3M7AGRFF5BFX -m comment --comment "e2e-tests-nettest-72krw/node-port-service:http" -j KUBE-SEP-4AUEISXG755Z4CO4

does a new curl work

beeps@e2e-test-beeps-minion-group-vjgg ~ $ docker exec -it 297348353ae2 /bin/sh
/ # curl 104.154.134.229:30945
NOW: 2016-11-04 18:57:51.066517909 +0000 UTC/ 

root@e2e-test-beeps-minion-group-vjgg:~# tcpdump -i any -n dst port 30945
19:10:46.188122 IP 10.240.0.5.44886 > 104.154.134.229.30945: Flags [S], seq 462411288, win 28400, options [mss 1420,sackOK,TS val 351947167 ecr 0,nop,wscale 7], length 0
19:10:46.190133 IP 10.240.0.5.44886 > 104.154.134.229.30945: Flags [.], ack 2227952985, win 222, options [nop,nop,TS val 351947169 ecr 351947168], length 0
19:10:46.190174 IP 10.240.0.5.44886 > 104.154.134.229.30945: Flags [P.], seq 0:93, ack 1, win 222, options [nop,nop,TS val 351947169 ecr 351947168], length 93
19:10:46.191220 IP 10.240.0.5.44886 > 104.154.134.229.30945: Flags [.], ack 129, win 231, options [nop,nop,TS val 351947170 ecr 351947170], length 0
19:10:46.191427 IP 10.240.0.5.44886 > 104.154.134.229.30945: Flags [F.], seq 93, ack 129, win 231, options [nop,nop,TS val 351947171 ecr 351947170], length 0
19:10:46.191815 IP 10.240.0.5.44886 > 104.154.134.229.30945: Flags [.], ack 130, win 231, options [nop,nop,TS val 351947171 ecr 351947171], length 0

is the old curl sending anything

root@e2e-test-beeps-minion-group-vjgg:~# tcpdump -i any -n src port 55702 
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes

... silence..
19:14:31.564385 IP 10.240.0.5.55702 > 104.154.134.229.30945: Flags [.], ack 3785698836, win 222, options [nop,nop,TS val 352172544 ecr 352071989], length 0
... silence..

So that's a keepalive probe.
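
A minimal sketch of the stricter-timeout idea (standard curl flags, not the exact change in #36208): bound the whole request, not just the connect, so a request riding a stale keepalive connection fails fast instead of hanging the suite.

# --connect-timeout only bounds the TCP connect; --max-time bounds the entire
# transfer, which is what kills a hang on an already-established connection.
while true; do
  curl --connect-timeout 1 --max-time 5 http://104.154.134.229:30945/hostName || echo " (failed or timed out)"
  sleep 1
done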

@bprashanth (Contributor)

I don't think this has failed since the revert and my last PR, so I'm closing this (https://k8s-testgrid.appspot.com/google-gke#gci-gke-slow). The auto-filer will reopen it anyway.
