kubernetes-e2e-gci-gke-slow: broken test run #34665

Closed
k8s-github-robot opened this issue Oct 13, 2016 · 110 comments

Labels
kind/flake Categorizes issue or PR as related to a flaky test. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now.

Comments

@k8s-github-robot

Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/710/

Run so broken it didn't make JUnit output!

k8s-github-robot added kind/flake Categorizes issue or PR as related to a flaky test. priority/backlog Higher priority than priority/awaiting-more-evidence. area/test-infra labels Oct 13, 2016
@k8s-github-robot

Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/711/

Run so broken it didn't make JUnit output!

@k8s-github-robot

Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/714/

Run so broken it didn't make JUnit output!

k8s-github-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/backlog Higher priority than priority/awaiting-more-evidence. labels Oct 13, 2016
fejta assigned apelisse and unassigned fejta Oct 13, 2016
@fejta (Contributor) commented Oct 13, 2016

Not test-infra

@k8s-github-robot

Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/722/

Run so broken it didn't make JUnit output!

k8s-github-robot added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Oct 13, 2016
@k8s-github-robot

Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/740/

Run so broken it didn't make JUnit output!

@k8s-github-robot

Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/776/

Run so broken it didn't make JUnit output!

@k8s-github-robot

Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/781/

Run so broken it didn't make JUnit output!

@k8s-github-robot

Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/799/

Run so broken it didn't make JUnit output!

@k8s-github-robot

[FLAKE-PING] @apelisse

This flaky-test issue would love to have more attention.

@k8s-github-robot

Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/801/

Run so broken it didn't make JUnit output!

@k8s-github-robot

Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/813/

Run so broken it didn't make JUnit output!

@k8s-github-robot

Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/816/

Run so broken it didn't make JUnit output!

@k8s-github-robot

Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/837/

Run so broken it didn't make JUnit output!

@k8s-github-robot

Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/843/

Run so broken it didn't make JUnit output!

@k8s-github-robot

Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/853/

Run so broken it didn't make JUnit output!

@k8s-github-robot

[FLAKE-PING] @apelisse

This flaky-test issue would love to have more attention.

@k8s-github-robot

Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/856/

Run so broken it didn't make JUnit output!

@k8s-github-robot

Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/866/

Run so broken it didn't make JUnit output!

@k8s-github-robot

Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/872/

Run so broken it didn't make JUnit output!

@j3ffml (Contributor) commented Nov 3, 2016

All cluster resources were brought up, but the cluster API is reporting that only 2 nodes out of 3 have registered.

There's a good chance that's the same issue with missing image layers that was causing failures on 1.4.5 kubelet. cc @dchen1107

@bprashanth (Contributor) commented Nov 3, 2016

Hmm the node unhealthy issue is basically not debuggable without a live repro, no logs in artifacts: http://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/1391 ?

(I took the links Saad posted and replaced the prefix with http://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow, unless we have a new situation for log aggregation)

@saad-ali (Member) commented Nov 3, 2016

Hmm the node unhealthy issue is basically not debuggable without a live repro, no logs in artifacts: http://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/1391 ?

I'm seeing "found 3 nodes out of expected 4. Your cluster may not be fully functional" in the kubernetes-e2e-gci-gce test suite failing now; any chance that is related?

There's a good chance that's the same issue with missing image layers that was causing failures on 1.4.5 kubelet. cc @dchen1107

@dchen1107 @yujuhong Can you help confirm?

@saad-ali (Member) commented Nov 3, 2016

Also potentially related, kubernetes-kubemark-500-gce tests are running into "Detected 6 ready nodes, found 6 nodes out of expected 7. Your cluster may not be fully functional.":

I'm going to see if I can hunt down a PR that is responsible and revert it.

@bprashanth (Contributor)

Fwiw, from those logs (#34665 (comment)), the node is definitely created by the MIG (managed instance group), but we see no Kubernetes node object. That probably means the VM didn't come up, kept rebooting, was disconnected from the master, or the kubelet had some startup issue. The first three sound like general GCE bugs. Importantly, we don't see a NotReady node.
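
A quick way to confirm that split on a live repro (a sketch only; the instance group and zone names below are placeholders, not values from this run) is to compare what the MIG created against what the apiserver has registered:

# VMs the managed instance group actually created (placeholder group/zone).
gcloud compute instance-groups managed list-instances e2e-test-minion-group --zone us-central1-b

# Node objects the cluster API knows about; a VM that is RUNNING above but
# missing here matches the "created by the MIG, but no node object" symptom.
kubectl get nodes -o wide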

@k8s-github-robot

Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/1396/

Run so broken it didn't make JUnit output!

@saad-ali (Member) commented Nov 3, 2016

continuous-e2e-docker-validation-gci also hit it ("Detected 3 ready nodes, found 3 nodes out of expected 4. Your cluster may not be fully functional."):

Fwiw from those logs (#34665 (comment)), the node is definitely created by the MIG, but we see no kubernetes node object. Probably means the vm didn't come up, kept rebooting, was disconnected from master, or kubelet had some startup issue. First 3 sound like general gce bugs. Importantly, we don't see a NotReady node.

@kubernetes/test-infra-admins Is there a good way to identify whether this is a GCE issue or a k8s one? Whenever this issue is hit, the logs are not copied from the node machines.

@saad-ali (Member) commented Nov 3, 2016

1.4 tests do not seem to be hitting this ("Your cluster may not be fully functional"); only master tests are, which would indicate that the problem is with k8s, not GCE.

@saad-ali (Member) commented Nov 3, 2016

CC @thockin @matchstick

The submit queue has been red most of the day due to this issue. I'm trying to cut the 1.5 branch and I can't find a green build after Nov 1. If we are not able to resolve this by tomorrow morning we will convene a "war room" to try and tackle it.

@kubernetes/test-infra-admins Is there something we can do to get logs off the node and master machines when this issue is hit? When this issue is hit, artifacts is empty (e.g. https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gce/2313/).

@bprashanth (Contributor) commented Nov 4, 2016

This seems to be a startup problem on GCI. @kubernetes/goog-node @vishh

I think the crash is in one of these functions: https://github.com/kubernetes/kubernetes/blob/master/cluster/gce/gci/configure-helper.sh#L1211-L1216
And "pre-warm-mounter" was added very recently in #35821; why do we not just retry on failure?

Here's a bunch of debug info to support the theory:
kubelet not running

beeps@e2e-test-beeps-minion-group-59oq ~ $ sudo systemctl status kubelet
● kubelet.service - Kubernetes kubelet
   Loaded: loaded (/usr/lib/systemd/system/kubelet.service; disabled; vendor preset: disabled)
   Active: inactive (dead)

beeps@e2e-test-beeps-minion-group-59oq ~ $ sudo journalctl -u kubelet
-- No entries --

unit file looks sane

beeps@e2e-test-beeps-minion-group-59oq ~ $ cat /usr/lib/systemd/system/kubelet.service
[Unit]
Description=Kubernetes kubelet
Requires=network-online.target
After=network-online.target

[Service]
Restart=always
RestartSec=10
EnvironmentFile=/etc/default/kubelet
ExecStartPre=/bin/mkdir -p /etc/kubernetes/manifests
ExecStart=/usr/bin/kubelet $KUBELET_OPTS

[Install]
WantedBy=multi-user.target

defaults are messed up

beeps@e2e-test-beeps-minion-group-59oq ~ $ cat /etc/default/kubelet
KUBELET_OPTS="--manifest-url=http://metadata.google.internal/computeMetadata/v1/instance/attributes/google-container-manifest --manifest-url-header=Metadata-Flavor:Google --config=/etc/kubernetes/manifests"

beeps@e2e-test-beeps-minion-group-59oq ~ $ cat /etc/default/docker
cat: /etc/default/docker: No such file or directory

beeps@e2e-test-beeps-minion-group-59oq ~ $ ls /etc/kubernetes/manifests/
beeps@e2e-test-beeps-minion-group-59oq ~ $ 

health monitors are working, at least

beeps@e2e-test-beeps-minion-group-59oq ~ $ sudo journalctl -f -u kube-docker-monitor
-- Logs begin at Thu 2016-11-03 22:48:02 UTC. --
Nov 03 22:53:45 e2e-test-beeps-minion-group-59oq health-monitor.sh[1352]: Start kubernetes health monitoring for docker


beeps@e2e-test-beeps-minion-group-59oq ~ $ sudo journalctl -f -u kubelet-monitor
-- Logs begin at Thu 2016-11-03 22:48:02 UTC. --
Nov 03 23:26:45 e2e-test-beeps-minion-group-59oq health-monitor.sh[1363]: Dload  Upload   Total   Spent    Left  Speed
Nov 03 23:26:45 e2e-test-beeps-minion-group-59oq health-monitor.sh[1363]: 0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (7) Failed to connect to 127.0.0.1 port 10250: Connection refused
Nov 03 23:27:45 e2e-test-beeps-minion-group-59oq health-monitor.sh[1363]: Kubelet is unhealthy!
Nov 03 23:27:45 e2e-test-beeps-minion-group-59oq health-monitor.sh[1363]: % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
Nov 03 23:27:45 e2e-test-beeps-minion-group-59oq health-monitor.sh[1363]: Dload  Upload   Total   Spent    Left  Speed
Nov 03 23:27:45 e2e-test-beeps-minion-group-59oq health-monitor.sh[1363]: 0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (7) Failed to connect to 127.0.0.1 port 10250: Connection refused
Nov 03 23:28:45 e2e-test-beeps-minion-group-59oq health-monitor.sh[1363]: Kubelet is unhealthy!
Nov 03 23:28:45 e2e-test-beeps-minion-group-59oq health-monitor.sh[1363]: % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
Nov 03 23:28:45 e2e-test-beeps-minion-group-59oq health-monitor.sh[1363]: Dload  Upload   Total   Spent    Left  Speed
Nov 03 23:28:45 e2e-test-beeps-minion-group-59oq health-monitor.sh[1363]: 0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (7) Failed to connect to 127.0.0.1 port 10250: Connection refused

node installation passed


beeps@e2e-test-beeps-minion-group-59oq ~ $ sudo systemctl status kube-node-installation
● kube-node-installation.service - Download and install k8s binaries and configurations
   Loaded: loaded (/etc/systemd/system/kube-node-installation.service; enabled; vendor preset: disabled)
   Active: active (exited) since Thu 2016-11-03 22:48:34 UTC; 1h 1min ago
  Process: 1091 ExecStart=/home/kubernetes/bin/configure.sh (code=exited, status=0/SUCCESS)
  Process: 1087 ExecStartPre=/bin/chmod 544 /home/kubernetes/bin/configure.sh (code=exited, status=0/SUCCESS)
  Process: 1082 ExecStartPre=/usr/bin/curl --fail --retry 5 --retry-delay 3 --silent --show-error -H X-Google-Metadata-Request: True -o /home/kubernetes/bin/configure.sh http://metadata.google.internal/computeMetadata/v1/instance/attributes/configure-sh (code=exited, status=0/SUCCESS)
  Process: 1076 ExecStartPre=/bin/mount -o remount,exec /home/kubernetes/bin (code=exited, status=0/SUCCESS)
  Process: 1072 ExecStartPre=/bin/mount --bind /home/kubernetes/bin /home/kubernetes/bin (code=exited, status=0/SUCCESS)
  Process: 1067 ExecStartPre=/bin/mkdir -p /home/kubernetes/bin (code=exited, status=0/SUCCESS)
 Main PID: 1091 (code=exited, status=0/SUCCESS)
   Memory: 0B
      CPU: 0
   CGroup: /system.slice/kube-node-installation.service

Nov 03 22:48:33 e2e-test-beeps-minion-group-59oq configure.sh[1091]: Downloading k8s manifests tar
Nov 03 22:48:33 e2e-test-beeps-minion-group-59oq configure.sh[1091]: % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
Nov 03 22:48:33 e2e-test-beeps-minion-group-59oq configure.sh[1091]: Dload  Upload   Total   Spent    Left  Speed
Nov 03 22:48:33 e2e-test-beeps-minion-group-59oq configure.sh[1091]: [155B blob data]
Nov 03 22:48:33 e2e-test-beeps-minion-group-59oq configure.sh[1091]: == Downloaded https://storage.googleapis.com/kubernetes-staging-ae13154a31/e2e-test-beeps-devel/kubernetes-manifests.tar.gz (SHA1 = 8b94ee874d75d4264c5a9e382ffc210e4fe8d73d) ==
Nov 03 22:48:33 e2e-test-beeps-minion-group-59oq configure.sh[1091]: % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
Nov 03 22:48:33 e2e-test-beeps-minion-group-59oq configure.sh[1091]: Dload  Upload   Total   Spent    Left  Speed
Nov 03 22:48:33 e2e-test-beeps-minion-group-59oq configure.sh[1091]: [234B blob data]
Nov 03 22:48:33 e2e-test-beeps-minion-group-59oq configure.sh[1091]: == Downloaded https://storage.googleapis.com/kubernetes-release/rkt/rkt-v1.18.0 (SHA1 = 75fc8f29c79bc9e505f3e7f6e8fadf2425c21967) ==
Nov 03 22:48:34 e2e-test-beeps-minion-group-59oq configure.sh[1091]: Done for installing kubernetes files

node configuration failed

beeps@e2e-test-beeps-minion-group-59oq ~ $ sudo systemctl status kube-node-configuration
● kube-node-configuration.service - Configure kubernetes node
   Loaded: loaded (/etc/systemd/system/kube-node-configuration.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Thu 2016-11-03 22:53:45 UTC; 54min ago
  Process: 1165 ExecStart=/home/kubernetes/bin/configure-helper.sh (code=exited, status=254)
  Process: 1161 ExecStartPre=/bin/chmod 544 /home/kubernetes/bin/mounter (code=exited, status=0/SUCCESS)
  Process: 1157 ExecStartPre=/bin/chmod 544 /home/kubernetes/bin/configure-helper.sh (code=exited, status=0/SUCCESS)
 Main PID: 1165 (code=exited, status=254)

Nov 03 22:48:34 e2e-test-beeps-minion-group-59oq configure-helper.sh[1165]: Start to configure instance for kubernetes
Nov 03 22:48:34 e2e-test-beeps-minion-group-59oq configure-helper.sh[1165]: Configuring IP firewall rules
Nov 03 22:48:34 e2e-test-beeps-minion-group-59oq configure-helper.sh[1165]: Add rules to accept all inbound TCP/UDP/ICMP packets
Nov 03 22:48:34 e2e-test-beeps-minion-group-59oq configure-helper.sh[1165]: Add rules to accept all forwarded TCP/UDP/ICMP packets
Nov 03 22:48:34 e2e-test-beeps-minion-group-59oq configure-helper.sh[1165]: Creating required directories
Nov 03 22:48:34 e2e-test-beeps-minion-group-59oq configure-helper.sh[1165]: Making /var/lib/kubelet executable for kubelet
Nov 03 22:48:34 e2e-test-beeps-minion-group-59oq configure-helper.sh[1165]: No local SSD disks found.
Nov 03 22:48:34 e2e-test-beeps-minion-group-59oq configure-helper.sh[1165]: Creating kubelet kubeconfig file
Nov 03 22:48:34 e2e-test-beeps-minion-group-59oq configure-helper.sh[1165]: Creating kube-proxy kubeconfig file

It's suspicious that it stopped at kube-proxy, but the tar file exists:

beeps@e2e-test-beeps-minion-group-59oq ~ $ ls /home/kubernetes/kube-docker-files
kube-apiserver.docker_tag  kube-controller-manager.docker_tag  kube-proxy.docker_tag  kube-proxy.tar  kube-scheduler.docker_tag

and so does its kubeconfig

beeps@e2e-test-beeps-minion-group-59oq ~ $ ls /var/lib/kube-proxy/
kubeconfig

and there are only 3 calls between that line and the next one in a successful run, i.e.

Creating kubelet kubeconfig file
Creating kube-proxy kubeconfig file
Assemble docker command line flags

Hence the theory about pre-warm-mounter; the mounter binary exists:

beeps@e2e-test-beeps-minion-group-59oq ~ $ stat /home/kubernetes/bin/mounter
  File: '/home/kubernetes/bin/mounter'
  Size: 1587        Blocks: 8          IO Block: 4096   regular file
Device: 801h/2049d  Inode: 5943        Links: 1
Access: (0544/-r-xr--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2016-11-03 22:48:34.418955732 +0000
Modify: 2016-11-03 22:48:33.532955732 +0000
Change: 2016-11-03 22:48:34.370955732 +0000
 Birth: -

So, just executing the configure helper manually:

beeps@e2e-test-beeps-minion-group-59oq ~ $ sudo /home/kubernetes/bin/configure-helper.sh
Start to configure instance for kubernetes
Configuring IP firewall rules
Add rules to accept all inbound TCP/UDP/ICMP packets
Add rules to accept all forwarded TCP/UDP/ICMP packets
Creating required directories
Making /var/lib/kubelet executable for kubelet
No local SSD disks found.
Creating kubelet kubeconfig file
Creating kube-proxy kubeconfig file
Assemble docker command line flags
Enable docker registry mirror at: https://us-mirror.gcr.io
Docker command line is updated. Restart docker to pick it up
Start loading kube-system docker images
Try to load docker image file /home/kubernetes/kube-docker-files/kube-proxy.tar
d8bd0657b25f: Loading layer [==================================================>] 130.9 MB/130.9 MB
a582cd499e0f: Loading layer [==================================================>] 1.024 kB/1.024 kB
455e869e08c9: Loading layer [==================================================>] 5.077 MB/5.077 MB
14a8cdab7b6e: Loading layer [==================================================>] 45.33 MB/45.33 MB
Start kubelet
Using kubelet binary at /home/kubernetes/bin/kubelet
Start kube-proxy pod
Start image-puller
Start fluentd pod
Done for the configuration for kubernetes

And kubelet starts up with the right defaults

beeps@e2e-test-beeps-minion-group-59oq ~ $ sudo systemctl status kubelet
● kubelet.service - Kubernetes kubelet
   Loaded: loaded (/etc/systemd/system/kubelet.service; disabled; vendor preset: disabled)
   Active: active (running) since Fri 2016-11-04 00:07:46 UTC; 26s ago
 Main PID: 20707 (kubelet)
   Memory: 21.4M
      CPU: 898ms
   CGroup: /system.slice/kubelet.service
           ├─20707 /home/kubernetes/bin/kubelet --v=4 --max-pods=110 --serialize-image-pulls=false --outofdisk-transition-frequency=0 --allow-privileged=true --babysit-daemons=true --cgroup-root=/ --cloud-provider=gce --cluster-dns=10.0.0.10 --cluster-domain=c...
           └─20773 journalctl -k -f

Nov 04 00:07:52 e2e-test-beeps-minion-group-59oq kubelet[20707]: I1104 00:07:52.269866   20707 kubelet.go:1648] skipping pod synchronization - [Kubenet does not have netConfig. This is most likely due to lack of PodCIDR]
Nov 04 00:07:57 e2e-test-beeps-minion-group-59oq kubelet[20707]: I1104 00:07:57.270024   20707 kubelet.go:1648] skipping pod synchronization - [Kubenet does not have netConfig. This is most likely due to lack of PodCIDR]
Nov 04 00:07:57 e2e-test-beeps-minion-group-59oq kubelet[20707]: I1104 00:07:57.529997   20707 conversion.go:134] failed to handle multiple devices for container. Skipping Filesystem stats
Nov 04 00:07:57 e2e-test-beeps-minion-group-59oq kubelet[20707]: I1104 00:07:57.536935   20707 eviction_manager.go:217] eviction manager: no resources are starved
Nov 04 00:08:02 e2e-test-beeps-minion-group-59oq kubelet[20707]: I1104 00:08:02.270188   20707 kubelet.go:1648] skipping pod synchronization - [Kubenet does not have netConfig. This is most likely due to lack of PodCIDR]
Nov 04 00:08:07 e2e-test-beeps-minion-group-59oq kubelet[20707]: I1104 00:08:07.270366   20707 kubelet.go:1648] skipping pod synchronization - [Kubenet does not have netConfig. This is most likely due to lack of PodCIDR]
Nov 04 00:08:07 e2e-test-beeps-minion-group-59oq kubelet[20707]: I1104 00:08:07.546490   20707 conversion.go:134] failed to handle multiple devices for container. Skipping Filesystem stats
Nov 04 00:08:07 e2e-test-beeps-minion-group-59oq kubelet[20707]: I1104 00:08:07.546525   20707 conversion.go:134] failed to handle multiple devices for container. Skipping Filesystem stats
Nov 04 00:08:07 e2e-test-beeps-minion-group-59oq kubelet[20707]: I1104 00:08:07.557709   20707 eviction_manager.go:217] eviction manager: no resources are starved
Nov 04 00:08:12 e2e-test-beeps-minion-group-59oq kubelet[20707]: I1104 00:08:12.270927   20707 kubelet.go:1648] skipping pod synchronization - [Kubenet does not have netConfig. This is most likely due to lack of PodCIDR]
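
For completeness, the manual run above is roughly equivalent to re-driving the failed unit through systemd (unit names are taken from the status output above; this is only a sketch of the recovery step, not a fix for the underlying flake):

# kube-node-configuration's ExecStart is the same configure-helper.sh, so
# restarting the unit re-runs the failed configuration step.
sudo systemctl restart kube-node-configuration
sudo systemctl status kube-node-configuration kubelet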

@bprashanth (Contributor)

Anyway, I have #36202 for a couple of lines of logging.

@saad-ali (Member) commented Nov 4, 2016

Excellent debugging @bprashanth! The evidence you provided strongly indicates PR #35821 is the culprit. The merge queue was blocked most of the day due to this issue (currently 58 outstanding PRs in the queue). So unless we have some other quick fix, we'll need to revert that PR. @vishh @mtaufen @jingxu97 thoughts?

CC @gmarek Warsaw on-call

@bprashanth (Contributor) commented Nov 4, 2016

This failed while pulling the mounter image, which is something you definitely want to retry, especially during node bringup:

beeps@e2e-test-beeps-minion-group-78ao ~ $ sudo journalctl -u  kube-node-configuration --follow
-- Logs begin at Fri 2016-11-04 01:43:39 UTC. --
Nov 04 01:44:28 configure-helper.sh[1146]: Creating kubelet kubeconfig file
Nov 04 01:44:28 configure-helper.sh[1146]: Creating kube-proxy kubeconfig file
Nov 04 01:44:28 configure-helper.sh[1146]: overriding kubectl
Nov 04 01:44:28 configure-helper.sh[1146]: prewarming mounter
Nov 04 01:44:28 configure-helper.sh[1146]: Running mount using a rkt fly container
Nov 04 01:44:29 configure-helper.sh[1146]: Downloading ACI:  0 B/5.9 MB
Nov 04 01:44:29 configure-helper.sh[1146]: Downloading ACI:  16.3 KB/5.9 MB
Nov 04 01:44:30 configure-helper.sh[1146]: Downloading ACI:  5.19 MB/5.9 MB
Nov 04 01:44:30 configure-helper.sh[1146]: Downloading ACI:  5.9 MB/5.9 MB
Nov 04 01:44:49 configure-helper.sh[1146]: run: error getting repository data: Get https://gcr.io/v1/repositories/google_containers/gci-mounter/images: dial tcp 74.125.70.82:443: i/o timeout
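
A minimal sketch of the kind of retry that would cover this (a hypothetical helper, not the actual configure-helper.sh code; the fix that actually went in was the revert below):

# Hypothetical wrapper: re-run a flaky bringup step (e.g. pre-warming the
# gci-mounter image) a few times with a growing delay before failing the unit.
retry() {
  local attempt
  for attempt in 1 2 3 4 5; do
    "$@" && return 0
    echo "attempt ${attempt} of '$*' failed, retrying in $((attempt * 5))s" >&2
    sleep $((attempt * 5))
  done
  return 1
}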

@saad-ali (Member) commented Nov 4, 2016

Submit queue is red again and we have 55 unmerged PRs. I'm going to revert PR #35821.

@thockin (Member) commented Nov 4, 2016

Thanks.


@bprashanth (Contributor)

FWIW, an unretried image pull during node bringup is Russian roulette; that's one of the reasons we docker-load master components/kube-proxy from a .gz.
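
To illustrate the contrast (the tarball path is the one from the configure-helper output earlier in this thread; this is a sketch, not the exact bringup code):

# Loaded from a tarball already on the node's disk: no registry, no network.
docker load -i /home/kubernetes/kube-docker-files/kube-proxy.tar

# The mounter image, by contrast, was fetched over the network at bringup
# (via a rkt fly container), which is exactly where the i/o timeout hit.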

@bprashanth (Contributor)

Now, regarding the weird timeouts: there are some pretty slow tests in the suite. In order, I think these are the slowest (the slowest takes ~30m):

• [SLOW TEST:1629.824 seconds]
[k8s.io] Pods
/usr/local/google/home/beeps/goproj/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:823
  should cap back-off at MaxContainerBackOff [Slow]
  /usr/local/google/home/beeps/goproj/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/common/pods.go:659
------------------------------
• [SLOW TEST:438.495 seconds]
[k8s.io] Services
/usr/local/google/home/beeps/goproj/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:823
  should be able to change the type and ports of a service [Slow]
  /usr/local/google/home/beeps/goproj/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/service.go:805
------------------------------
• [SLOW TEST:425.851 seconds]
[k8s.io] Pods
/usr/local/google/home/beeps/goproj/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:823
  should have their auto-restart back-off timer reset on image update [Slow]
  /usr/local/google/home/beeps/goproj/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/common/pods.go:604
------------------------------
• [SLOW TEST:343.304 seconds]
[k8s.io] ScheduledJob
/usr/local/google/home/beeps/goproj/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:823
  should not schedule new jobs when ForbidConcurrent [Slow]
  /usr/local/google/home/beeps/goproj/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/scheduledjob.go:125
------------------------------
• [SLOW TEST:307.199 seconds]
[k8s.io] ScheduledJob
/usr/local/google/home/beeps/goproj/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:823
  should not schedule jobs when suspended [Slow]
  /usr/local/google/home/beeps/goproj/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/scheduledjob.go:94
------------------------------
• [SLOW TEST:245.555 seconds]
[k8s.io] Dynamic provisioning
/usr/local/google/home/beeps/goproj/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:823
  [k8s.io] DynamicProvisioner Alpha
  /usr/local/google/home/beeps/goproj/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:823
    should create and delete alpha persistent volumes [Slow]

But they should all run in parallel, so I don't think that's a big deal. We should probably optimize anything that isn't specifically soaking and still takes > 15m, though.

Anyway, I've tightened some timeouts in the nodePort test based on a previous guess by Jeff (#34665 (comment)); let's see if they produce more isolated failures: #36208

@jingxu97 (Contributor) commented Nov 4, 2016

I found this thread in Docker, so it seems other users also experience this random timeout issue: moby/moby#13337

Is it possible to build the Docker image locally instead of using a remote repository, to make it more reliable?

@k8s-github-robot

Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/1415/

Run so broken it didn't make JUnit output!

@bprashanth (Contributor)

@jingxu97 The i/o timeout is probably just a normal network-connectivity flake during node startup. There are two issues conflated into this bug: one is the mounter problem, for which we reverted the PR; the second is a suite timeout, which is probably one of our tests or the framework itself doing something weird. The first problem was hitting more often and across more suites than the second, so I'd expect overall flakiness to go down now.

k8s-github-robot pushed a commit that referenced this issue Nov 4, 2016
Automatic merge from submit-queue

Stricter timeouts for nodePort curling

If the timeouts are indeed because of  #34665 (comment), stricter timeouts will probably surface as a more isolated failure
@bprashanth (Contributor)

I was able to grab an strace on a hanging curl. I think we're hitting an established keepalive connection that avoids NAT to the endpoints when we toggle the nodePort on/off. My timeout PR should at least kill the connection; we should also try to get https://github.com/kubernetes/kubernetes/pull/36130/files in.

Debug info:
Working nodeport service:

/ # while true; do curl --keepalive-time 100 --connect-timeout 1 http://104.154.134.229:30945/hostName; done
netserver-1netserver-1netserver-1netserver-2netserver-1netserver-2netserver-0netserver-2netserver-2netserver-2netserver-2

Toggle nodeport off

netserver-1netserver-1netserver-0netserver-0netserver-2curl: (7) Failed to connect to 104.154.134.229 port 30945: Connection refused
curl: (7) Failed to connect to 104.154.134.229 port 30945: Connection refused
curl: (7) Failed to connect to 104.154.134.229 port 30945: Connection refused
curl: (7) Failed to connect to 104.154.134.229 port 30945: Connection refused
curl: (7) Failed to connect to 104.154.134.229 port 30945: Connection refuse

Toggle nodeport back on

curl: (7) Failed to connect to 104.154.134.229 port 30945: Connection refused
curl: (7) Failed to connect to 104.154.134.229 port 30945: Connection refused

... silence

On the node

beeps@e2e-test-beeps-minion-group-vjgg ~ $ ps aux | grep curl
beeps    13861  0.0  0.0   4444   744 pts/2    S+   18:51   0:00 grep --colour=auto curl
root     19315  0.0  0.0  16712  3108 pts/1    S+   17:47   0:00 curl --keepalive-time 100 --connect-timeout 1 http://104.154.134.229:30945/hostName

strace it

strace -p 19315
Process 19315 attached
restart_syscall(<... resuming interrupted call ...>) = 0
clock_gettime(CLOCK_MONOTONIC, {349490, 795942869}) = 0
clock_gettime(CLOCK_MONOTONIC, {349490, 796205339}) = 0
rt_sigaction(SIGPIPE, NULL, {SIG_IGN, [], SA_RESTORER|SA_RESTART, 0x7fbb83c481df}, 8) = 0
rt_sigaction(SIGPIPE, {SIG_IGN, [], SA_RESTORER|SA_RESTART, 0x7fbb83c481df}, NULL, 8) = 0
poll([{fd=3, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 0 (Timeout)
clock_gettime(CLOCK_MONOTONIC, {349490, 796465281}) = 0
clock_gettime(CLOCK_MONOTONIC, {349490, 796526072}) = 0
clock_gettime(CLOCK_MONOTONIC, {349490, 796578494}) = 0
rt_sigaction(SIGPIPE, {SIG_IGN, [], SA_RESTORER|SA_RESTART, 0x7fbb83c481df}, NULL, 8) = 0

What's behind fd 3

root@e2e-test-beeps-minion-group-vjgg:~# ls -lh /proc/19315/fd
total 0
lrwx------ 1 root root 64 Nov  4 18:27 0 -> /dev/pts/1
lrwx------ 1 root root 64 Nov  4 18:27 1 -> /dev/pts/1
lrwx------ 1 root root 64 Nov  4 18:23 2 -> /dev/pts/1
lrwx------ 1 root root 64 Nov  4 18:27 3 -> socket:[32557556]

what's behind inode 32557556

beeps@e2e-test-beeps-minion-group-vjgg ~ $ sudo netstat -apeen | grep 32557556
tcp        0      0 10.240.0.5:55702        104.154.134.229:30945   ESTABLISHED 0          32557556   19315/curl 

What does netfilter say about that connection

root@e2e-test-beeps-minion-group-vjgg:~# cat /proc/net/ip_conntrack | grep 55702
tcp      6 86383 ESTABLISHED src=10.240.0.5 dst=104.154.134.229 sport=55702 dport=30945 src=104.154.134.229 dst=10.240.0.5 sport=30945 dport=55702 [ASSURED] mark=0 use=2

what about iptables for the nodeport

beeps@e2e-test-beeps-minion-group-vjgg ~ $ sudo iptables-save | grep -i 30945
-A KUBE-NODEPORTS -p tcp -m comment --comment "e2e-tests-nettest-72krw/node-port-service:http" -m tcp --dport 30945 -j KUBE-MARK-MASQ
-A KUBE-NODEPORTS -p tcp -m comment --comment "e2e-tests-nettest-72krw/node-port-service:http" -m tcp --dport 30945 -j KUBE-SVC-OJUE3M7AGRFF5BFX
beeps@e2e-test-beeps-minion-group-vjgg ~ $ sudo iptables-save | grep -i KUBE-SVC-OJUE3M7AGRFF5BFX
:KUBE-SVC-OJUE3M7AGRFF5BFX - [0:0]
-A KUBE-NODEPORTS -p tcp -m comment --comment "e2e-tests-nettest-72krw/node-port-service:http" -m tcp --dport 30945 -j KUBE-SVC-OJUE3M7AGRFF5BFX
-A KUBE-SERVICES -d 10.0.241.158/32 -p tcp -m comment --comment "e2e-tests-nettest-72krw/node-port-service:http cluster IP" -m tcp --dport 80 -j KUBE-SVC-OJUE3M7AGRFF5BFX
-A KUBE-SVC-OJUE3M7AGRFF5BFX -m comment --comment "e2e-tests-nettest-72krw/node-port-service:http" -m statistic --mode random --probability 0.33332999982 -j KUBE-SEP-34FVYRBNIRMMAAXC
-A KUBE-SVC-OJUE3M7AGRFF5BFX -m comment --comment "e2e-tests-nettest-72krw/node-port-service:http" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-NCEBCBJDB4QT4UDU
-A KUBE-SVC-OJUE3M7AGRFF5BFX -m comment --comment "e2e-tests-nettest-72krw/node-port-service:http" -j KUBE-SEP-4AUEISXG755Z4CO4

does a new curl work

beeps@e2e-test-beeps-minion-group-vjgg ~ $ docker exec -it 297348353ae2 /bin/sh
/ # curl 104.154.134.229:30945
NOW: 2016-11-04 18:57:51.066517909 +0000 UTC/ 

root@e2e-test-beeps-minion-group-vjgg:~# tcpdump -i any -n dst port 30945
19:10:46.188122 IP 10.240.0.5.44886 > 104.154.134.229.30945: Flags [S], seq 462411288, win 28400, options [mss 1420,sackOK,TS val 351947167 ecr 0,nop,wscale 7], length 0
19:10:46.190133 IP 10.240.0.5.44886 > 104.154.134.229.30945: Flags [.], ack 2227952985, win 222, options [nop,nop,TS val 351947169 ecr 351947168], length 0
19:10:46.190174 IP 10.240.0.5.44886 > 104.154.134.229.30945: Flags [P.], seq 0:93, ack 1, win 222, options [nop,nop,TS val 351947169 ecr 351947168], length 93
19:10:46.191220 IP 10.240.0.5.44886 > 104.154.134.229.30945: Flags [.], ack 129, win 231, options [nop,nop,TS val 351947170 ecr 351947170], length 0
19:10:46.191427 IP 10.240.0.5.44886 > 104.154.134.229.30945: Flags [F.], seq 93, ack 129, win 231, options [nop,nop,TS val 351947171 ecr 351947170], length 0
19:10:46.191815 IP 10.240.0.5.44886 > 104.154.134.229.30945: Flags [.], ack 130, win 231, options [nop,nop,TS val 351947171 ecr 351947171], length 0

is the old curl sending anything

root@e2e-test-beeps-minion-group-vjgg:~# tcpdump -i any -n src port 55702 
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes

... silence..
19:14:31.564385 IP 10.240.0.5.55702 > 104.154.134.229.30945: Flags [.], ack 3785698836, win 222, options [nop,nop,TS val 352172544 ecr 352071989], length 0
... silence..

So that's a keepalive probe.
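
A minimal sketch of the stricter-timeout idea (standard curl flags, not the exact change in #36208): bound the whole request, not just the connect, so a request riding a stale keepalive connection fails fast instead of hanging the suite.

# --connect-timeout only bounds the TCP connect; --max-time bounds the entire
# transfer, which is what kills a hang on an already-established connection.
while true; do
  curl --connect-timeout 1 --max-time 5 http://104.154.134.229:30945/hostName || echo " (failed or timed out)"
  sleep 1
done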

@bprashanth (Contributor)

I don't think this has failed since the revert and my last PR, so I'm closing this (https://k8s-testgrid.appspot.com/google-gke#gci-gke-slow). The auto-filer will reopen it anyway.
