
e2e flake: pull-kubernetes-e2e-gce-etcd3 fails [sig-apps] Deployment and others with dial tcp (a node addr):10250: getsockopt: connection refused #50695

Closed
MikeSpreitzer opened this issue Aug 15, 2017 · 16 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. kind/flake Categorizes issue or PR as related to a flaky test. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@MikeSpreitzer
Member

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:
A version of PR #47262 failed one run of pull-kubernetes-e2e-gce-etcd3 and passed another. Earlier versions also got varied results. See the whole testing history at https://k8s-gubernator.appspot.com/pr/47262 .

For the 9a64d88 commit, the failed run included this in the build log:

I0811 01:26:37.551] [sig-apps] Deployment
I0811 01:26:37.552] /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/apps/framework.go:22
I0811 01:26:37.552]   test Deployment ReplicaSet orphaning and adoption regarding controllerRef
I0811 01:26:37.552]   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/apps/deployment.go:116
I0811 01:26:37.552] ------------------------------
I0811 01:26:39.182] [BeforeEach] [sig-instrumentation] Cluster level logging implemented by Stackdriver
I0811 01:26:39.183]   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:139
I0811 01:26:39.183] STEP: Creating a kubernetes client
I0811 01:26:39.183] Aug 11 01:25:36.422: INFO: >>> kubeConfig: /workspace/.kube/config
I0811 01:26:39.184] STEP: Building a namespace api object
I0811 01:26:39.184] STEP: Waiting for a default service account to be provisioned in namespace
I0811 01:26:39.184] [BeforeEach] [sig-instrumentation] Cluster level logging implemented by Stackdriver
I0811 01:26:39.184]   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/instrumentation/logging/stackdrvier/basic.go:43
I0811 01:26:39.184] [It] should ingest system logs from all nodes
I0811 01:26:39.185]   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/instrumentation/logging/stackdrvier/basic.go:151
I0811 01:26:39.185] Aug 11 01:25:38.500: INFO: Using the following filter for log entries: resource.type="gce_instance" AND (resource.labels.instance_id=7683115974146440037 OR resource.labels.instance_id=2309583240800811877 OR resource.labels.instance_id=4062923293439605604 OR resource.labels.instance_id=6051722915058072421)
I0811 01:26:39.185] Aug 11 01:25:38.872: INFO: Waiting for log sink to become operational
I0811 01:26:39.185] Aug 11 01:25:41.835: INFO: Sink e2e-tests-sd-logging-fql29 is operational
I0811 01:26:39.185] STEP: Waiting for some system logs to ingest
I0811 01:26:39.185] Aug 11 01:26:10.358: INFO: Failed to parse Stackdriver LogEntry: Failed to deserialize jsonPayload as json object 
I0811 01:26:39.186] Aug 11 01:26:14.401: INFO: Failed to parse Stackdriver LogEntry: Failed to deserialize jsonPayload as json object 
I0811 01:26:39.186] [AfterEach] [sig-instrumentation] Cluster level logging implemented by Stackdriver
I0811 01:26:39.186]   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:140
I0811 01:26:39.186] Aug 11 01:26:22.447: INFO: Waiting up to 3m0s for all (but 0) nodes to be ready
I0811 01:26:39.186] STEP: Destroying namespace "e2e-tests-sd-logging-fql29" for this suite.
I0811 01:26:39.186] Aug 11 01:26:39.111: INFO: namespace: e2e-tests-sd-logging-fql29, resource: bindings, ignored listing per whitelist
I0811 01:26:39.186] Aug 11 01:26:39.181: INFO: namespace e2e-tests-sd-logging-fql29 deletion completed in 16.72184701s
... skipping 116 lines ...
I0811 01:26:49.630] Aug 11 01:24:57.002: INFO: Waiting for pod ss-0 to enter Running - Ready=false, currently Pending - Ready=false
I0811 01:26:49.631] Aug 11 01:25:07.029: INFO: Waiting for pod ss-0 to enter Running - Ready=false, currently Pending - Ready=false
I0811 01:26:49.631] Aug 11 01:25:17.021: INFO: Waiting for pod ss-0 to enter Running - Ready=false, currently Running - Ready=false
I0811 01:26:49.631] Aug 11 01:25:17.021: INFO: Resuming stateful pod at index 0
I0811 01:26:49.631] Aug 11 01:25:17.049: INFO: Running '/workspace/kubernetes/platforms/linux/amd64/kubectl --server=https://35.192.220.105 --kubeconfig=/workspace/.kube/config exec --namespace=e2e-tests-statefulset-pmwxv ss-0 -- /bin/sh -c touch /tmp/statefulset-continue'
I0811 01:26:49.631] Aug 11 01:25:48.319: INFO: rc: 127
I0811 01:26:49.631] Aug 11 01:25:48.319: INFO: Unexpected error occurred: error running &{/workspace/kubernetes/platforms/linux/amd64/kubectl [kubectl --server=https://35.192.220.105 --kubeconfig=/workspace/.kube/config exec --namespace=e2e-tests-statefulset-pmwxv ss-0 -- /bin/sh -c touch /tmp/statefulset-continue] []  <nil>  Error from server: error dialing backend: dial tcp 10.128.0.4:10250: getsockopt: connection refused
I0811 01:26:49.631]  [] <nil> 0xc420fd2990 exit status 1 <nil> <nil> true [0xc4213a62a0 0xc4213a62b8 0xc4213a62d0] [0xc4213a62a0 0xc4213a62b8 0xc4213a62d0] [0xc4213a62b0 0xc4213a62c8] [0x11bfd50 0x11bfd50] 0xc420d01c80 <nil>}:
I0811 01:26:49.632] Command stdout:
I0811 01:26:49.632] 
I0811 01:26:49.632] stderr:
I0811 01:26:49.632] Error from server: error dialing backend: dial tcp 10.128.0.4:10250: getsockopt: connection refused
I0811 01:26:49.632] 
I0811 01:26:49.632] error:
I0811 01:26:49.632] exit status 1

It is worth noting that, 6 minutes earlier, the build log showed all the nodes up:

I0811 01:20:15.654] Found 5 node(s).
I0811 01:20:15.835] NAME                          STATUS                     AGE       VERSION
I0811 01:20:15.836] e2e-46505-master              Ready,SchedulingDisabled   59s       v1.8.0-alpha.2.1702+e5a191d32a17b4
I0811 01:20:15.836] e2e-46505-minion-group-1jzp   Ready                      34s       v1.8.0-alpha.2.1702+e5a191d32a17b4
I0811 01:20:15.836] e2e-46505-minion-group-2nfk   Ready                      43s       v1.8.0-alpha.2.1702+e5a191d32a17b4
I0811 01:20:15.836] e2e-46505-minion-group-g4dk   Ready                      39s       v1.8.0-alpha.2.1702+e5a191d32a17b4
I0811 01:20:15.836] e2e-46505-minion-group-qqjs   Ready                      34s       v1.8.0-alpha.2.1702+e5a191d32a17b4

And artifacts/nodes.yaml showed that e2e-46505-minion-group-qqjs had address 10.128.0.4.
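Cross-referencing a failing address against artifacts/nodes.yaml can be scripted. A sketch, assuming Node objects shaped like the Kubernetes API (the sample data below is a stand-in for the real artifact, not its contents):

```python
# Map an InternalIP back to a node name from API-shaped Node objects.
# Sample data is illustrative only.
nodes = [
    {"metadata": {"name": "e2e-46505-minion-group-qqjs"},
     "status": {"addresses": [
         {"type": "InternalIP", "address": "10.128.0.4"},
         {"type": "Hostname", "address": "e2e-46505-minion-group-qqjs"},
     ]}},
]

def node_for_address(nodes, addr):
    """Return the name of the node whose InternalIP matches addr, or None."""
    for node in nodes:
        for a in node["status"]["addresses"]:
            if a["type"] == "InternalIP" and a["address"] == addr:
                return node["metadata"]["name"]
    return None

print(node_for_address(nodes, "10.128.0.4"))  # e2e-46505-minion-group-qqjs
```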

What you expected to happen:
Consistent test results for a given commit.

How to reproduce it (as minimally and precisely as possible):
I have no good suggestion here.

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): master
  • Cloud provider or hardware configuration: GCE
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Aug 15, 2017
@k8s-github-robot k8s-github-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Aug 15, 2017
@MikeSpreitzer MikeSpreitzer changed the title e2e flake: pull-kubernetes-e2e-gce-etcd3 fails sig-apps] Deployment and others with dial tcp (a node addr):10250: getsockopt: connection refused e2e flake: pull-kubernetes-e2e-gce-etcd3 fails [sig-apps] Deployment and others with dial tcp (a node addr):10250: getsockopt: connection refused Aug 15, 2017
@MikeSpreitzer
Member Author

/kind flake

@k8s-ci-robot k8s-ci-robot added the kind/flake Categorizes issue or PR as related to a flaky test. label Aug 15, 2017
@MikeSpreitzer
Member Author

/remove-kind bug

@k8s-ci-robot k8s-ci-robot removed the kind/bug Categorizes issue or PR as related to a bug. label Aug 15, 2017
@MikeSpreitzer
Member Author

@kubernetes/sig-apps-test-failures

@k8s-ci-robot k8s-ci-robot added the sig/apps Categorizes an issue or PR as relevant to SIG Apps. label Aug 15, 2017
@k8s-github-robot k8s-github-robot removed the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Aug 15, 2017
@liggitt
Member

liggitt commented Aug 27, 2017

Seeing kubelets (and possibly entire nodes) restarting during e2e runs, which disrupts any log/exec/scheduling tests using that node at the time:
https://storage.googleapis.com/k8s-gubernator/triage/index.html?pr=1&text=10250%3A%20getsockopt%3A%20connection%20refused

seen in https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/51168/pull-kubernetes-e2e-gce-bazel/12561/

the test fails with: an error on the server ("Error: 'dial tcp 10.128.0.6:10250: getsockopt: connection refused'\nTrying to reach: 'https://e2e-12561-minion-group-05pn:10250/logs/'") has prevented the request from succeeding

from the apiserver log:

I0827 00:31:39.749708       9 wrap.go:42] GET /api/v1/nodes/e2e-12561-minion-group-05pn:10250/proxy/logs/: (13.297005175s) 503 
goroutine 558050 [running]:
k8s.io/apiserver/pkg/server/httplog.(*respLogger).recordStatus(0xc42d8631f0, 0x1f7)
	vendor/k8s.io/apiserver/pkg/server/httplog/httplog.go:207 +0xdd
k8s.io/apiserver/pkg/server/httplog.(*respLogger).WriteHeader(0xc42d8631f0, 0x1f7)
logging error output: "Error: 'dial tcp 10.128.0.6:10250: getsockopt: connection refused'\nTrying to reach: 'https://e2e-12561-minion-group-05pn:10250/logs/'"                                                                                                               [[e2e.test/v0.0.0 (linux/amd64) kubernetes/$Format] 104.197.137.193:33864]   

from the kubelet log:

I0827 00:31:02.485982    3528 kubenet_linux.go:528] TearDownPod took 97.425804ms for e2e-tests-deployment-9tjxm/nginx-409829171-plrfn
I0827 00:31:02.495012    3528 kubenet_linux.go:777] Removing e2e-tests-deployment-9tjxm/nginx-409829171-67xpx from 'kubenet' with CNI 'bridge' plugin and runtime: &{ContainerID:67dd123c08490c4490435cf01ebdc410d6ad13002015497b0a32a9acaa3f2a1b NetNS: IfName:eth0 Args:[] CapabilityArgs:map[]}
I0827 00:31:02.510140    3528 status_manager.go:467] Pod "cleanup40-e0d858fd-8abe-11e7-a1fd-0a580a3c150d-vgtb9_e2e-tests-kubelet-vznfv(e11fe840-8abe-11e7-8ae6-42010a800002)" fully terminated and removed from etcd
I0827 00:31:02.520988    3528 kubenet_linux.go:528] TearDownPod took 34.733352ms for e2e-tests-deployment-9tjxm/nginx-409829171-67xpx
I0827 00:31:02.522098    3528 plugins.go:405] Calling network plugin kubenet to tear down pod "nginx-409829171-5zxw4_e2e-tests-deployment-9tjxm"

2017/08/27 00:32:33 proto: duplicate proto type registered: google.protobuf.Duration
2017/08/27 00:32:33 proto: duplicate proto type registered: google.protobuf.Timestamp
Flag --network-plugin-dir has been deprecated, Use --cni-bin-dir instead. This flag will be removed in a future version.
I0827 00:32:33.763347    3145 flags.go:52] FLAG: --address="0.0.0.0"
I0827 00:32:33.763365    3145 flags.go:52] FLAG: --allow-privileged="true"
I0827 00:32:33.763373    3145 flags.go:52] FLAG: --alsologtostderr="false"
I0827 00:32:33.763401    3145 flags.go:52] FLAG: --anonymous-auth="false"
I0827 00:32:33.763407    3145 flags.go:52] FLAG: --application-metrics-count-limit="100"
I0827 00:32:33.763413    3145 flags.go:52] FLAG: --authentication-token-webhook="false"

You can see the almost-90-second gap, followed by startup logging, in the kubelet log during that window.
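Finding such restart gaps can be automated by diffing consecutive klog timestamps. A sketch, using the truncated kubelet lines quoted above as sample input (klog lines carry no year, so 2017 is assumed here):

```python
import re
from datetime import datetime, timedelta

# Sample klog-style lines from the kubelet log quoted above (messages truncated).
LOG = """\
I0827 00:31:02.485982    3528 kubenet_linux.go:528] TearDownPod took 97.425804ms ...
I0827 00:31:02.522098    3528 plugins.go:405] Calling network plugin kubenet ...
I0827 00:32:33.763347    3145 flags.go:52] FLAG: --address="0.0.0.0"
"""

STAMP = re.compile(r"^[IWEF](\d{2})(\d{2}) (\d{2}:\d{2}:\d{2}\.\d+)")

def find_gaps(text, threshold=timedelta(seconds=30)):
    """Yield (gap_seconds, line) where consecutive klog lines are further apart than threshold."""
    prev = None
    for line in text.splitlines():
        m = STAMP.match(line)
        if not m:
            continue
        month, day, clock = m.groups()
        t = datetime.strptime(f"2017-{month}-{day} {clock}", "%Y-%m-%d %H:%M:%S.%f")
        if prev is not None and t - prev > threshold:
            yield (t - prev).total_seconds(), line
        prev = t

for gap, line in find_gaps(LOG):
    print(f"{gap:.0f}s gap before: {line}")
```

On the sample it reports the ~91-second hole before the FLAG startup logging, matching the restart window described above.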

@liggitt
Member

liggitt commented Aug 27, 2017

interesting-looking things from the logs on that kubelet around that time:
https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/51168/pull-kubernetes-e2e-gce-bazel/12561/artifacts/e2e-12561-minion-group-05pn/serial-1.log:

Aug 27 00:31:11 e2e-12561-minion-group-05pn kernel: [  707.884993] cbr0: port 12(veth43179ee6) entered forwarding state
[  710.163266] BUG: unable to handle kernel NULL pointer dereference at 0000000000000078
[  710.173160] IP: [<ffffffff810a1130>] check_preempt_wakeup+0xd0/0x1d0
[  710.179752] PGD 1b94af067 PUD 1b3be4067 PMD 0 
[  710.184667] Oops: 0000 [#1] SMP 
[  710.188253] Modules linked in: tcp_diag inet_diag nfsv3 rpcsec_gss_krb5 nfsv4 dns_resolver sg xt_statistic nf_conntrack_netlink nfnetlink sch_htb ebt_ip ebtable_filter ebtables veth xt_nat xt_recent ipt_REJECT xt_mark xt_comment xt_tcpudp ipt_MASQUERADE iptable_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype ip_tables xt_conntrack x_tables nf_nat nf_conntrack bridge stp llc aufs(C) nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc crct10dif_pclmul crc32_pclmul crc32c_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper psmouse parport_pc i2c_piix4 i2c_core ablk_helper cryptd parport pvpanic evdev pcspkr serio_raw processor button thermal_sys virtio_net ext4 crc16 mbcache jbd2 sd_mod crc_t10dif crct10dif_common virtio_scsi scsi_mod virtio_pci virtio virtio_ring
[  710.269200] CPU: 1 PID: 17112 Comm: exe Tainted: G        WC    3.16.0-4-amd64 #1 Debian 3.16.43-2+deb8u1
[  710.278881] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
[  710.288213] task: ffff88007bfc2110 ti: ffff8802143ac000 task.ti: ffff8802143ac000
[  710.295806] RIP: 0010:[<ffffffff810a1130>]  [<ffffffff810a1130>] check_preempt_wakeup+0xd0/0x1d0
[  710.304827] RSP: 0018:ffff8802143afe60  EFLAGS: 00010006
[  710.310245] RAX: 0000000000000001 RBX: ffff880145a83940 RCX: 0000000000000008
[  710.317490] RDX: 0000000000000001 RSI: ffff880214d86b20 RDI: ffff88021fd12fb8
[  710.324734] RBP: 0000000000000000 R08: ffffffff81610640 R09: 0000000000000001
[  710.331977] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88007bfc2110
[  710.339218] R13: ffff88021fd12f40 R14: 0000000000000000 R15: 0000000000000000
[  710.349523] FS:  000000000153f880(0063) GS:ffff88021fd00000(0000) knlGS:0000000000000000
[  710.357731] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  710.363586] CR2: 0000000000000078 CR3: 00000001ae3c4000 CR4: 00000000001406e0
[  710.370830] Stack:
[  710.372950]  0000000000012f40 ffff88021fd12f40 0000000000012f40 ffff88021fd12f40
[  710.380942]  ffff880214d871a4 0000000000000246 ffff8801ffe0c9c0 ffffffff81095be5
[  710.388947]  ffff880214d86b20 ffffffff810986ca 00007fffffffeffd 0000000000000000
[  710.396958] Call Trace:
[  710.399520]  [<ffffffff81095be5>] ? check_preempt_curr+0x85/0xa0
[  710.405645]  [<ffffffff810986ca>] ? wake_up_new_task+0xda/0x190
[  710.411680]  [<ffffffff81067a49>] ? do_fork+0x139/0x3d0
[  710.417017]  [<ffffffff8151a7f9>] ? stub_clone+0x69/0x90
[  710.422440]  [<ffffffff8151a48d>] ? system_call_fast_compare_end+0x10/0x15
[  710.429423] Code: 39 c2 7d 27 0f 1f 80 00 00 00 00 83 e8 01 48 8b 5b 70 39 d0 75 f5 48 8b 7d 78 48 3b 7b 78 74 15 0f 1f 00 48 8b 6d 70 48 8b 5b 70 <48> 8b 7d 78 48 3b 7b 78 75 ee 48 85 ff 74 e9 e8 8c cb ff ff 48 
[  710.456373] RIP  [<ffffffff810a1130>] check_preempt_wakeup+0xd0/0x1d0
[  710.463046]  RSP <ffff8802143afe60>
[  710.466646] CR2: 0000000000000078
[  710.470593] ---[ end trace 06e67ea027b5f481 ]---
[  710.475322] Kernel panic - not syncing: Fatal exception
[  711.546412] Shutting down cpus with NMI
[  711.551190] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
[  711.561475] Rebooting in 10 seconds..
[  721.540372] ACPI MEMORY or I/O RESET_REG.
SeaBIOS (version 1.8.2-20170517_162014-google)

docker shows a gap in logs:
https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/51168/pull-kubernetes-e2e-gce-bazel/12561/artifacts/e2e-12561-minion-group-05pn/docker.log:

time="2017-08-27T00:31:02.486603349Z" level=debug msg="Calling GET /v1.23/containers/67dd123c08490c4490435cf01ebdc410d6ad13002015497b0a32a9acaa3f2a1b/json" 
time="2017-08-27T00:31:02.487323975Z" level=debug msg="Calling POST /v1.23/containers/5038490df8bb364a6e5e1706a98a0d147f987561c253e5175bab46b5264e0c7e/stop?t=10" 
time="2017-08-27T00:31:02.487397207Z" level=debug msg="Sending 15 to 5038490df8bb364a6e5e1706a98a0d147f987561c253e5175bab46b5264e0c7e" 
time="2017-08-27T00:31:02.522695012Z" level=debug msg="Calling POST /v1.23/containers/67dd123c08490c4490435cf01ebdc410d6ad13002015497b0a32a9acaa3f2atime="2017-08-27T00:31:37.834931235Z" level=debug msg="docker group found. gid: 107" 
time="2017-08-27T00:31:37.835057164Z" level=debug msg="Listener created for HTTP on unix (/var/run/docker.sock)" 
time="2017-08-27T00:31:37.898334222Z" level=info msg="New containerd process, pid: 2279\n" 
time="2017-08-27T00:31:38Z" level=debug msg="containerd: read past events" count=0 
time="2017-08-27T00:31:38Z" level=debug msg="containerd: supervisor running" cpus=2 memory=7499 runtime=docker-runc runtimeArgs=[] stateDir="/run/containerd" 
time="2017-08-27T00:31:37.961909157Z" level=debug msg="containerd connection state change: CONNECTING" 
time="2017-08-27T00:31:38.176040104Z" level=debug msg="containerd connection state change: READY" 
time="2017-08-27T00:31:38Z" level=debug msg="containerd: grpc api on /var/run/docker/libcontainerd/docker-containerd.sock" 
time="2017-08-27T00:31:38.217557339Z" level=debug msg="Using default logging driver json-file" 
time="2017-08-27T00:31:38.217713726Z" level=debug msg="Golang's threads limit set to 53910" 
time="2017-08-27T00:31:38.217759693Z" level=debug msg="[graphdriver] trying provided driver \"aufs\"" 
time="2017-08-27T00:31:38.230880914Z" level=debug msg="Using graph driver aufs" 
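Note the fused line above: a second `time="` prefix starts mid-line, because dockerd died while writing and the post-reboot daemon appended onto the cut-off line. That truncation is itself detectable. A sketch that flags such lines, assuming dockerd's usual `time="..."` log prefix (the sample lines are shortened from the log above):

```python
# Flag dockerd log lines containing more than one time="..." prefix:
# a later write fused onto a line that was cut off mid-write by a crash.
SAMPLE = [
    'time="2017-08-27T00:31:02.487397207Z" level=debug msg="Sending 15 ..."',
    'time="2017-08-27T00:31:02.522695012Z" level=debug msg="Calling POST /v1.23/containers/67dd1'
    'time="2017-08-27T00:31:37.834931235Z" level=debug msg="docker group found. gid: 107"',
]

def truncated_lines(lines):
    """Return lines that embed a second time=\" prefix, i.e. were cut off mid-write."""
    return [line for line in lines if line.count('time="') > 1]

for line in truncated_lines(SAMPLE):
    print("mid-write truncation:", line[:60], "...")
```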

@liggitt
Member

liggitt commented Aug 27, 2017

cc @kubernetes/sig-node-test-failures @kubernetes/sig-node-bugs for ideas on chasing down the kernel issue

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. kind/bug Categorizes issue or PR as related to a bug. labels Aug 27, 2017
@liggitt liggitt added the kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. label Aug 27, 2017
@liggitt
Member

liggitt commented Aug 27, 2017

found the same panic in one of the kubelet logs in the original linked failure in this issue:

http://gcsweb.k8s.io/gcs/kubernetes-jenkins/pr-logs/pull/47262/pull-kubernetes-e2e-gce-etcd3/46505/artifacts/e2e-46505-minion-group-qqjs/

https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/47262/pull-kubernetes-e2e-gce-etcd3/46505/artifacts/e2e-46505-minion-group-qqjs/serial-1.log:

Aug 11 01:25:14 e2e-46505-minion-group-qqjs kernel: [  487.854028] cbr0: port 7(vethb75257a9) entered forwarding state
[  488.027147] BUG: unable to handle kernel NULL pointer dereference at 0000000000000078
[  488.035387] IP: [<ffffffff810a1130>] check_preempt_wakeup+0xd0/0x1d0
[  488.042066] PGD 8cc8d067 PUD b6570067 PMD 0 
[  488.046828] Oops: 0000 [#1] SMP 
[  488.050416] Modules linked in: sg xt_statistic nf_conntrack_netlink nfnetlink sch_htb ebt_ip ebtable_filter ebtables veth xt_nat xt_recent ipt_REJECT xt_mark xt_comment xt_tcpudp ipt_MASQUERADE iptable_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype ip_tables xt_conntrack x_tables nf_nat nf_conntrack bridge stp llc aufs(C) nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc crct10dif_pclmul crc32_pclmul crc32c_intel aesni_intel parport_pc parport psmouse evdev pcspkr serio_raw i2c_piix4 i2c_core aes_x86_64 processor thermal_sys pvpanic lrw gf128mul button glue_helper ablk_helper cryptd virtio_net ext4 crc16 mbcache jbd2 sd_mod crc_t10dif crct10dif_common virtio_scsi scsi_mod virtio_pci virtio virtio_ring
[  488.125886] CPU: 1 PID: 21597 Comm: exe Tainted: G        WC    3.16.0-4-amd64 #1 Debian 3.16.43-2+deb8u1
[  488.135697] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
[  488.145035] task: ffff88020d5fd630 ti: ffff880074fc4000 task.ti: ffff880074fc4000
[  488.152629] RIP: 0010:[<ffffffff810a1130>]  [<ffffffff810a1130>] check_preempt_wakeup+0xd0/0x1d0
[  488.161699] RSP: 0018:ffff880074fc7e60  EFLAGS: 00010006
[  488.167123] RAX: 0000000000000001 RBX: ffff8800bae12040 RCX: 0000000000000008
[  488.174373] RDX: 0000000000000001 RSI: ffff8801bab0eb60 RDI: ffff88021fd12fb8
[  488.181710] RBP: 0000000000000000 R08: ffffffff81610640 R09: 0000000000000001
[  488.188967] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88020d5fd630
[  488.196222] R13: ffff88021fd12f40 R14: 0000000000000000 R15: 0000000000000000
[  488.203574] FS:  0000000001dd6880(0063) GS:ffff88021fd00000(0000) knlGS:0000000000000000
[  488.211776] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  488.217632] CR2: 0000000000000078 CR3: 00000000bad9b000 CR4: 00000000001406e0
[  488.224881] Stack:
[  488.227096]  0000000000012f40 ffff88021fd12f40 0000000000012f40 ffff88021fd12f40
[  488.235101]  ffff8801bab0f1e4 0000000000000246 ffff8801b55d9240 ffffffff81095be5
[  488.243130]  ffff8801bab0eb60 ffffffff810986ca 00007fffffffeffd 0000000000000000
[  488.251205] Call Trace:
[  488.253769]  [<ffffffff81095be5>] ? check_preempt_curr+0x85/0xa0
[  488.259895]  [<ffffffff810986ca>] ? wake_up_new_task+0xda/0x190
[  488.265928]  [<ffffffff81067a49>] ? do_fork+0x139/0x3d0
[  488.271272]  [<ffffffff8151a7f9>] ? stub_clone+0x69/0x90
[  488.276700]  [<ffffffff8151a48d>] ? system_call_fast_compare_end+0x10/0x15
[  488.283704] Code: 39 c2 7d 27 0f 1f 80 00 00 00 00 83 e8 01 48 8b 5b 70 39 d0 75 f5 48 8b 7d 78 48 3b 7b 78 74 15 0f 1f 00 48 8b 6d 70 48 8b 5b 70 <48> 8b 7d 78 48 3b 7b 78 75 ee 48 85 ff 74 e9 e8 8c cb ff ff 48 
[  488.310697] RIP  [<ffffffff810a1130>] check_preempt_wakeup+0xd0/0x1d0
[  488.317381]  RSP <ffff880074fc7e60>
[  488.320977] CR2: 0000000000000078
[  488.324407] ---[ end trace 7bcdfbcc991522e4 ]---
[  488.329132] Kernel panic - not syncing: Fatal exception
[  489.391575] Shutting down cpus with NMI
[  489.396236] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
[  489.406529] Rebooting in 10 seconds..
[  499.389994] ACPI MEMORY or I/O RESET_REG.
SeaBIOS (version 1.8.2-20170524_173944-google)
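Spotting this panic across many node artifacts can be scripted rather than eyeballed. A sketch that scans serial-log text for the panic marker and pulls out the triggering BUG line (the sample text below stands in for a downloaded serial-1.log):

```python
# Condensed sample of the serial log quoted above; a real run would read the file.
SERIAL = """\
[  488.027147] BUG: unable to handle kernel NULL pointer dereference at 0000000000000078
[  488.035387] IP: [<ffffffff810a1130>] check_preempt_wakeup+0xd0/0x1d0
[  488.329132] Kernel panic - not syncing: Fatal exception
[  489.406529] Rebooting in 10 seconds..
"""

def panic_summary(text):
    """Return (bug_line, panic_line) if the serial log shows a kernel panic, else None."""
    bug = panic = None
    for line in text.splitlines():
        if "BUG:" in line and bug is None:
            bug = line
        if "Kernel panic" in line:
            panic = line
    return (bug, panic) if panic else None

summary = panic_summary(SERIAL)
if summary:
    print("panicked:", summary[1])
    print("cause   :", summary[0])
```

Running this over each node's serial-1.log artifact is how one would confirm the same check_preempt_wakeup signature appears in other failed runs.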

@liggitt
Member

liggitt commented Aug 27, 2017

kernel panic with same stack mentioned in #45368

@aledbf
Member

aledbf commented Aug 27, 2017

@liggitt also here moby/moby#30402

@liggitt liggitt removed the sig/apps Categorizes an issue or PR as relevant to SIG Apps. label Aug 27, 2017
@sbezverk
Contributor

sbezverk commented Aug 28, 2017

It looks like this is a known kernel bug, and there appears to be a fix, but I do not think it was pushed upstream. See this post and the link to the proposed diff. If somebody has access to a test bed where this issue happens, it would be interesting to patch the kernel and rerun the test to confirm.
https://serverfault.com/questions/709926/bug-unable-to-handle-kernel-null-pointer-dereference-at-on-google-compute-eng

@liggitt
Member

liggitt commented Sep 5, 2017

@yujuhong
Contributor

yujuhong commented Sep 7, 2017

What might help is to change the default testing OS image to COS since CVM is being deprecated (part of #51487). We can still keep the CVM test coverage but make it non-blocking until it's officially retired.
/cc @abgworrall @dchen1107

@abgworrall
Contributor

Yep, agreed. We should flip to COS. Even @mtaufen agreed. I might try and flip it this week, although I'm pretty swamped, so anyone else feel free to do it.

A more surgical fix, available right now, would be to amend pull-kubernetes-e2e-gce-etcd3.env to specify COS by setting KUBE_NODE_OS_DISTRIBUTION to gci instead of debian.

k8s-github-robot pushed a commit that referenced this issue Sep 12, 2017
Automatic merge from submit-queue (batch tested with PRs 52227, 52120)

Use COS for nodes in testing clusters by default, and bump COS.

Addresses part of issue #51487. May assist with #51961 and #50695.

CVM is being deprecated, and falls out of support on 2017/10/01. We shouldn't run test jobs on it. So start using COS for all test jobs.

The default value of `KUBE_NODE_OS_DISTRIBUTION` for clusters created for testing will now be gci. Testjobs that do not specify this value will now run on clusters using COS (aka GCI) as the node OS, instead of CVM, the previous default.

This change only affects testing; non-testing clusters already use COS by default.

In addition, bump the version of COS from `cos-stable-60-9592-84-0` to `cos-stable-60-9592-90-0`.

```release-note
NONE
```
/cc @yujuhong, @mtaufen, @fejta, @krzyzacy
@liggitt
Member

liggitt commented Sep 29, 2017

@k8s-github-robot

This Issue hasn't been active in 61 days. It will be closed in 28 days (Dec 28, 2017).

cc @MikeSpreitzer

You can add the 'keep-open' label to prevent this from happening, or add a comment to keep it open for another 90 days

@liggitt
Member

liggitt commented Jan 15, 2018

The panicking kernel image has been retired.
/close

@liggitt liggitt closed this as completed Jan 15, 2018