e2e flake: pull-kubernetes-e2e-gce-etcd3 fails [sig-apps] Deployment and others with dial tcp (a node addr):10250: getsockopt: connection refused #50695
Comments
/kind flake
/remove-kind bug
@kubernetes/sig-apps-test-failures
Seeing kubelets (and possibly the entire node) restarting during e2e runs, which disrupts any log/exec/scheduling tests using that node at the time. The test fails with:
From the apiserver log:
From the kubelet log:
You can see the almost 90-second gap and the startup logging occur in the kubelet during that window.
Interesting-looking things from the logs on that kubelet around that time:
Docker shows a gap in its logs:
cc @kubernetes/sig-node-test-failures @kubernetes/sig-node-bugs for ideas on chasing down the kernel issue
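For anyone triaging similar runs, here is a minimal sketch of how one might spot these restart gaps in a downloaded kubelet.log. It assumes klog-style timestamp prefixes (e.g. `I0814 17:08:32.123456`), ignores midnight rollover, and uses an illustrative 60-second threshold, so treat it as a rough filter rather than part of the test harness:

```bash
#!/usr/bin/env bash
# Rough triage helper: print points in a kubelet.log where consecutive klog
# timestamps are more than 60 seconds apart (the restarts above show ~90s).
LOG=${1:-kubelet.log}

awk '
  /^[IWEF][0-9][0-9][0-9][0-9] / {
    split($2, t, "[:.]")                      # hh, mm, ss, microseconds
    secs = t[1] * 3600 + t[2] * 60 + t[3]     # seconds since midnight
    if (prev != "" && secs - prev > 60)
      printf "gap of %d s before line %d: %s\n", secs - prev, NR, $0
    prev = secs
  }
' "$LOG"
```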
Found the same panic in one of the kubelet logs in the original linked failure in this issue:
Kernel panic with the same stack mentioned in #45368.
@liggitt also seen here: moby/moby#30402
It looks like it is a known kernel bug, and it seems there is a fix, but I do not think it was pushed upstream. See this post and the link to the proposed diff. If somebody has access to the test bed where this issue happens, it would be interesting to patch the kernel and rerun this test to reconfirm.
This continues to affect us in myriad ways. https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/51915/pull-kubernetes-e2e-gce-etcd3/52532/ failed namespace cleanup because the node hosting an add-on server crashed with this bug. https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/51915/pull-kubernetes-e2e-gce-etcd3/52532/artifacts/e2e-52532-minion-group-7rs9/kubelet.log shows that was the kubelet hosting the metrics apiserver pod, and https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/51915/pull-kubernetes-e2e-gce-etcd3/52532/artifacts/e2e-52532-minion-group-7rs9/serial-1.log shows the null pointer dereference error and reboot hit that node.
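A quick way to check other runs for the same signature, reusing the artifact layout from the links above; the grep patterns are the usual kernel oops/panic markers and are only a guess at what each serial log contains:

```bash
# Pull the serial console log for a node from the job artifacts and look for
# the crash markers; paths mirror the storage.googleapis.com links above.
RUN=https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/51915/pull-kubernetes-e2e-gce-etcd3/52532
NODE=e2e-52532-minion-group-7rs9

curl -s "$RUN/artifacts/$NODE/serial-1.log" \
  | grep -nE 'unable to handle kernel NULL pointer dereference|Kernel panic|Call Trace' \
  | head -n 20
```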
What might help is to change the default testing OS image to COS, since CVM is being deprecated (part of #51487). We can still keep the CVM test coverage but make it non-blocking until it's officially retired.
Yep, agreed. We should flip to COS. Even @mtaufen agreed. I might try and flip it this week, although I'm pretty swamped, so anyone else should feel free to do it. A more surgical fix you could do right now would be to amend
Automatic merge from submit-queue (batch tested with PRs 52227, 52120)

Use COS for nodes in testing clusters by default, and bump COS.

Addresses part of issue #51487. May assist with #51961 and #50695. CVM is being deprecated, and falls out of support on 2017/10/01. We shouldn't run test jobs on it, so start using COS for all test jobs. The default value of `KUBE_NODE_OS_DISTRIBUTION` for clusters created for testing will now be gci. Test jobs that do not specify this value will now run on clusters using COS (aka GCI) as the node OS, instead of CVM, the previous default. This change only affects testing; non-testing clusters already use COS by default.

In addition, bump the version of COS from `cos-stable-60-9592-84-0` to `cos-stable-60-9592-90-0`.

```release-note
NONE
```

/cc @yujuhong, @mtaufen, @fejta, @krzyzacy
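For jobs that want to opt in before (or independently of) the default flip, a hedged sketch of pinning test-cluster nodes to COS: `KUBE_NODE_OS_DISTRIBUTION=gci` and the image name are taken from the PR text above, while the image/project variable names and the e2e invocation follow the usual cluster/gce conventions and should be verified against your checkout:

```bash
# Pin GCE test-cluster nodes to COS rather than relying on the new default.
export KUBE_NODE_OS_DISTRIBUTION=gci                 # COS (formerly GCI) instead of CVM
export KUBE_GCE_NODE_IMAGE=cos-stable-60-9592-90-0   # the bumped image from this PR
export KUBE_GCE_NODE_PROJECT=cos-cloud               # project hosting COS images (assumed)

# Bring the cluster up, run the e2e suite, and tear it down.
go run hack/e2e.go -- --up --test --down
```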
Was #52120 intended to switch all jobs to a known good image? http://gcsweb.k8s.io/gcs/kubernetes-jenkins/pr-logs/pull/53158/pull-kubernetes-e2e-gce-bazel/32563
This issue hasn't been active in 61 days. It will be closed in 28 days (Dec 28, 2017). You can add the 'keep-open' label to prevent this from happening, or add a comment to keep it open for another 90 days.
The panicking kernel image has been retired.
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
What happened:
A version of PR #47262 failed one run of pull-kubernetes-e2e-gce-etcd3 and passed another. Earlier versions also got varied results. See the whole testing history at https://k8s-gubernator.appspot.com/pr/47262 .
For the 9a64d88 commit, the failed run included this in the build log:
It is worth noting that the build log also showed 6 minutes earlier that all the nodes were up:
And artifacts/nodes.yaml showed that e2e-46505-minion-group-qqjs had address 10.128.0.4.
What you expected to happen:
Consistent test results for a given commit.
How to reproduce it (as minimally and precisely as possible):
I have no good suggestion here.
Anything else we need to know?:
Environment:
Kubernetes version (use `kubectl version`): master
Kernel (e.g. `uname -a`):