linkerd-proxy crashes with "supplied instant is later than self" (AWS EC2/EKS) #7748

Closed
jberm opened this issue Jan 31, 2022 · 15 comments

@jberm

jberm commented Jan 31, 2022

What is the issue?

Linkerd proxy crashes intermittently with the following error message:

thread 'main' panicked at 'supplied instant is later than self', library/std/src/time.rs:281:48
thread 'main' panicked at 'supplied instant is later than self', library/std/src/time.rs:281:48
stack backtrace:
0:     0x55ca07b4ba84 - <unknown>
1:     0x55ca0713d55c - <unknown>
 ...
37:     0x55ca0708129a - <unknown>
38:                0x0 - <unknown>
thread panicked while panicking. aborting.

How can it be reproduced?

Deploy linkerd 2.11.1-stable to AWS EKS and wait for crashes.

Logs, error output, etc

  • OS and kernel version
[ssm-user@ip-10-0-20-45 bin]$ cat /etc/os-release
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
  • Output for one core from /proc/cpuinfo
[ssm-user@ip-10-0-20-45 bin]$ cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 23
model           : 1
model name      : AMD EPYC 7571
stepping        : 2
microcode       : 0x800126c
cpu MHz         : 2199.758
cache size      : 512 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr arat npt nrip_save
bugs            : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass
bogomips        : 4399.51
TLB size        : 2560 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : AuthenticAMD
cpu family      : 23
model           : 1
model name      : AMD EPYC 7571
stepping        : 2
microcode       : 0x800126c
cpu MHz         : 2199.758
cache size      : 512 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
apicid          : 1
initial apicid  : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr arat npt nrip_save
bugs            : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass
bogomips        : 4399.51
TLB size        : 2560 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management:
  • hypervisor if the system is virtualized
[ssm-user@ip-10-0-20-45 bin]$ ls /sys/hypervisor/
[ssm-user@ip-10-0-20-45 bin]$
  • selected clock source
[ssm-user@ip-10-0-20-45 bin]$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc

output of linkerd check -o short

13:36 $ linkerd check -o short
Linkerd core checks
===================


Status check results are √

Linkerd extensions checks
=========================


Status check results are √

Environment

  • Kubernetes Version: 1.21
  • Cluster Environment: AWS EKS
  • Host OS: Amazon Linux
  • Linkerd version: 2.11.1-stable

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

No response

@jberm jberm added the bug label Jan 31, 2022
@olix0r
Member

olix0r commented Jan 31, 2022

Possibly related to rust-lang/rust#86470

olix0r added a commit to linkerd/linkerd2-proxy that referenced this issue Jan 31, 2022
When comparing instants, we should use saturating varieties to help
ensure that we can't hit panics.

This change bans uses of `std::time::Instant::{duration_since, elapsed,
sub}` via clippy. Uses are ported to using
`Instant::saturating_duration_since`.

Related to linkerd/linkerd2#7748

Signed-off-by: Oliver Gould <ver@buoyant.io>
@olix0r
Member

olix0r commented Jan 31, 2022

@jberm Can you share the output of uname -rv?

olix0r added a commit to olix0r/hyper that referenced this issue Jan 31, 2022
We have reports of runtime panics (linkerd/linkerd2#7748) that sound a
lot like rust-lang/rust#86470. We don't have any evidence that these
panics originate in hyper, but hyperium#2385 reports a similar issue.

Even though this is almost definitely a bug in Rust, it seems most
prudent to actively avoid the uses of `Instant` that are prone to this
bug.

This change replaces uses of `Instant::elapsed` and `Instant::sub` with
calls to `Instant::saturating_duration_since` to prevent this class of
panic.
olix0r added a commit to tower-rs/tower that referenced this issue Jan 31, 2022
We have reports of runtime panics (linkerd/linkerd2#7748) that sound a
lot like rust-lang/rust#86470. We don't have any evidence that these
panics originate in hyper, but #2385 reports a similar issue.

Even though this is almost definitely a bug in Rust, it seems most
prudent to actively avoid the uses of `Instant` that are prone to this
bug.

This change replaces uses of `Instant::elapsed` and `Instant::sub` with
calls to `Instant::saturating_duration_since` to prevent this class of
panic. These fixes should ultimately be made in the standard library,
but this change lets us avoid this problem while we wait for those
fixes.

See also hyperium/hyper#2746
olix0r added a commit to tower-rs/tower that referenced this issue Jan 31, 2022
We have reports of runtime panics (linkerd/linkerd2#7748) that sound a
lot like rust-lang/rust#86470. We don't have any evidence that these
panics originate in tower, but we have some potentially flawed `Instant`
arithmetic that could panic in this way.

Even though this is almost definitely a bug in Rust, it seems most
prudent to actively avoid the uses of `Instant` that are prone to this
bug.

This change replaces uses of `Instant::elapsed` and `Instant::sub` with
calls to `Instant::saturating_duration_since` to prevent this class of
panic. These fixes should ultimately be made in the standard library,
but this change lets us avoid this problem while we wait for those
fixes.

See also hyperium/hyper#2746
olix0r added a commit to hyperium/h2 that referenced this issue Jan 31, 2022
We have reports of runtime panics (linkerd/linkerd2#7748) that sound a
lot like rust-lang/rust#86470. We don't have any evidence that these
panics originate in h2, but there is one use of `Instant::sub` that
could panic in this way.

Even though this is almost definitely a bug in Rust, it seems most
prudent to actively avoid the uses of `Instant` that are prone to this
bug. These fixes should ultimately be made in the standard library, but
this change lets us avoid this problem while we wait for those fixes.

This change replaces uses of `Instant::elapsed` and `Instant::sub` with
calls to `Instant::saturating_duration_since` to prevent this class of
panic.

See also hyperium/hyper#2746
seanmonstar pushed a commit to hyperium/h2 that referenced this issue Feb 1, 2022
We have reports of runtime panics (linkerd/linkerd2#7748) that sound a
lot like rust-lang/rust#86470. We don't have any evidence that these
panics originate in h2, but there is one use of `Instant::sub` that
could panic in this way.

Even though this is almost definitely a bug in Rust, it seems most
prudent to actively avoid the uses of `Instant` that are prone to this
bug. These fixes should ultimately be made in the standard library, but
this change lets us avoid this problem while we wait for those fixes.

This change replaces uses of `Instant::elapsed` and `Instant::sub` with
calls to `Instant::saturating_duration_since` to prevent this class of
panic.

See also hyperium/hyper#2746
seanmonstar pushed a commit to tower-rs/tower that referenced this issue Feb 1, 2022
We have reports of runtime panics (linkerd/linkerd2#7748) that sound a
lot like rust-lang/rust#86470. We don't have any evidence that these
panics originate in tower, but we have some potentially flawed `Instant`
arithmetic that could panic in this way.

Even though this is almost definitely a bug in Rust, it seems most
prudent to actively avoid the uses of `Instant` that are prone to this
bug.

This change replaces uses of `Instant::elapsed` and `Instant::sub` with
calls to `Instant::saturating_duration_since` to prevent this class of
panic. These fixes should ultimately be made in the standard library,
but this change lets us avoid this problem while we wait for those
fixes.

See also hyperium/hyper#2746
@olix0r
Member

olix0r commented Feb 1, 2022

Rust's standard library provides an Instant type, described as follows:

A measurement of a monotonically nondecreasing clock. Opaque and useful only with Duration.

Instants are always guaranteed to be no less than any previously measured instant when created, and are often useful for tasks such as measuring benchmarks or timing how long an operation takes.

Note, however, that instants are not guaranteed to be steady. In other words, each tick of the underlying clock might not be the same length (e.g. some seconds may be longer than others). An instant may jump forwards or experience time dilation (slow down or speed up), but it will never go backwards.

rust-lang/rust#86470 describes a bug in Instant--specifically triggered on Amazon Linux--where this invariant is violated.
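
To make the failure mode concrete, here is a minimal Rust sketch of the saturating pattern that the workarounds rely on. It is not taken from the proxy: the helper name `elapsed_between` is hypothetical, and the "clock went backwards" case is simulated by swapping two instants rather than reproduced on an affected host. `Instant::saturating_duration_since` and `Instant::checked_duration_since` are standard library methods.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper: time from `start` to `now`, clamped to zero
/// if `now` appears to be earlier than `start`.
fn elapsed_between(start: Instant, now: Instant) -> Duration {
    now.saturating_duration_since(start)
}

fn main() {
    let earlier = Instant::now();
    let later = earlier + Duration::from_millis(5);

    // Normal case: the clock moved forward.
    assert_eq!(elapsed_between(earlier, later), Duration::from_millis(5));

    // Simulated violation: "now" is earlier than "start".
    // The saturating variant yields zero instead of panicking.
    assert_eq!(elapsed_between(later, earlier), Duration::ZERO);

    // The checked variant reports the same situation as `None`,
    // which is useful when the caller wants to log the anomaly.
    assert_eq!(earlier.checked_duration_since(later), None);
}
```

The trade-off is that a clamped zero hides the anomaly from callers, so the checked variant may be preferable wherever a backwards jump is worth recording.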

There are a few ways we can attack this problem:

  1. Help the Rust team get enough information about the platform(s) this occurs on so that they can fix the underlying issue.
  2. Wait for the resolution of rust-lang/rust#89926 ("make Instant::{duration_since, elapsed, sub} saturating and remove workarounds"), a proposal to change the behavior of some methods on Instant to avoid the type of panic you've observed.
  3. Audit our code and our dependencies for time subtraction that can overflow. I've done this with most of the code I could find that seems likely to be at play. There may be other instances, though:

It will likely take a few weeks for all of these dependencies to release so that we can pick up a change in an edge release. Or we can take git dependencies on these repos to avoid waiting for a proper release.

It will probably take a few months until we can use a Rust version that has better guards against this kind of panic.

But, still, it would be best if we can help the Rust team nail down more details about the environment where this error occurs. Perhaps it's possible to get the folks working on Amazon Linux involved.
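
One way to gather that kind of detail is a small probe run on a suspect node. The following is only a sketch under stated assumptions: the function name `count_regressions` and the sample count are hypothetical, and it samples from a single thread, so it may well miss a regression that only shows up across CPUs. It counts adjacent `Instant` samples that appear to go backwards.

```rust
use std::time::Instant;

/// Hypothetical helper: counts adjacent sample pairs where the second
/// instant is earlier than the first, i.e. the clock appeared to regress.
fn count_regressions(samples: &[Instant]) -> usize {
    samples
        .windows(2)
        .filter(|pair| pair[1].checked_duration_since(pair[0]).is_none())
        .count()
}

fn main() {
    // Take many back-to-back readings of the monotonic clock.
    let samples: Vec<Instant> = (0..1_000_000).map(|_| Instant::now()).collect();
    println!("regressions observed: {}", count_regressions(&samples));
}
```

A non-zero count, together with the instance type, kernel version, and clock source, would be useful information to attach to the upstream report.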

@olix0r olix0r added the env/eks Amazon EKS label Feb 1, 2022
@carllerche
Contributor

@jberm What AWS instance types are you using? I would guess it is either m5a or t3a.

@jberm
Author

jberm commented Feb 1, 2022

The pods were failing on a t3a.medium instance and only failing on that single instance. Unfortunately we upgraded our AMIs yesterday so I don't have the uname output for that node. The output for the node that replaced it is the following:

$ uname -rv
5.4.172-90.336.amzn2.x86_64 #1 SMP Wed Jan 19 23:08:01 UTC 2022

Maybe someone at AWS can give you the kernel version for the previous AMI.

olix0r added a commit to linkerd/linkerd2-proxy that referenced this issue Feb 1, 2022
When comparing instants, we should use saturating varieties to help
ensure that we can't hit panics.

This change bans uses of `std::time::Instant::{duration_since, elapsed,
sub}` via clippy. Uses are ported to using `Instant::saturating_duration_since`.

Related to linkerd/linkerd2#7748

Signed-off-by: Oliver Gould <ver@buoyant.io>
Co-authored-by: Eliza Weisman <eliza@buoyant.io>
@virenrshah

We just had the same issue.
v1.19.15-eks-9c63c4 192.168.113.9 Amazon Linux 2 5.4.172-90.336.amzn2.x86_64 docker://20.10.7
t3a.xlarge

@olix0r
Member

olix0r commented Feb 10, 2022

Thanks @virenrshah. We've got a few workarounds that will become available as our dependencies release new versions. In the meantime, you could try engaging AWS support or reprovisioning impacted nodes.

I'm told that AWS has reproduced the issue but I'm not aware of how long it will take for fixes to be available on their side.

@virenrshah

Thanks! Does anyone know if this is something I can work around by shifting to a different set of instance types? It looks like both @jberm and I had t3a instance types.
We just had 3 client sites impacted and I am trying to find a solution.

@fcrespofastly

Same here:

  • Cluster created with kops, only happening in one single instance and we're using t3.2xlarge
  • Kernel: 5.11.0-1017-aws
  • OS: Ubuntu 20.04.3 LTS

@fcrespofastly

fcrespofastly commented Feb 14, 2022

Do we have any update on this? This is impacting our production clusters and we're even considering uninstalling linkerd until this gets sorted, something I'd love to avoid if there's any known workaround.

olix0r added a commit to linkerd/linkerd2-proxy that referenced this issue Feb 14, 2022
tokio & tower have been patched to avoid issues described in
linkerd/linkerd2#7748, but they have not yet been released. This change
pins these dependencies to Git to pick up the workarounds.
olix0r added a commit to linkerd/linkerd2-proxy that referenced this issue Feb 14, 2022
tokio & tower have been patched to avoid issues described in
linkerd/linkerd2#7748, but they have not yet been released. This change
pins these dependencies to Git to pick up the workarounds.

Signed-off-by: Oliver Gould <ver@buoyant.io>
@olix0r
Member

olix0r commented Feb 14, 2022

@fcrespofastly As mentioned previously, this is a bug between the Rust standard library and AWS Linux, which has a buggy time source. So it's going to be difficult for us to completely eliminate this issue until it is fixed upstream.

That said, we've put in place workarounds in linkerd2-proxy and several ecosystem projects (tokio, tower, hyper) that should reduce the likelihood of encountering this bug. I've put up linkerd/linkerd2-proxy#1497 to take git dependencies while we wait for tokio & tower to do a proper release and I've published a proxy build with these changes.

You can use this build by setting namespace/workload annotations:

annotations:
  config.linkerd.io/proxy-image: ghcr.io/olix0r/l2-proxy
  config.linkerd.io/proxy-version: instant.495a51ae

Or set it globally by upgrading with the appropriate Helm values.

@fcrespofastly

Hey @olix0r, thanks a lot. I knew it was more on Rust and AWS land, but it was also mentioned: "We've got a few workarounds that will become available as our dependencies release new versions." Hence I was asking about this.

Thanks again!

@nyetwurk

Same here:

* Cluster created with kops, only happening in one single instance and we're using `t3.2xlarge`

* Kernel: 5.11.0-1017-aws

* OS: Ubuntu 20.04.3 LTS

We're seeing this on a different rust application, but also on t3.2xlarge
5.4.0-1041-aws #43-Ubuntu SMP Fri Mar 19 22:06:16 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

olix0r added a commit to linkerd/linkerd2-proxy that referenced this issue Feb 15, 2022
tokio & tower have been patched to avoid issues described in
linkerd/linkerd2#7748, but they have not yet been released. This change
pins these dependencies to Git to pick up the workarounds.

Signed-off-by: Oliver Gould <ver@buoyant.io>
olix0r added a commit to linkerd/linkerd2-proxy that referenced this issue Mar 30, 2022
When comparing instants, we should use saturating varieties to help
ensure that we can't hit panics.

This change bans uses of `std::time::Instant::{duration_since, elapsed,
sub}` via clippy. Uses are ported to using `Instant::saturating_duration_since`.

Related to linkerd/linkerd2#7748

Signed-off-by: Oliver Gould <ver@buoyant.io>
Co-authored-by: Eliza Weisman <eliza@buoyant.io>
(cherry picked from commit bffdb1a)
Signed-off-by: Oliver Gould <ver@buoyant.io>
@stale

stale bot commented May 16, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label May 16, 2022
@olix0r
Member

olix0r commented May 16, 2022

Recent versions of the proxy should be immune to this class of panic.

@olix0r olix0r closed this as completed May 16, 2022
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jun 16, 2022