helper/resource: Kernel Panics on darwin/arm64 (e.g. Apple M1 hardware) #1088

Closed
bflad opened this issue Nov 2, 2022 · 6 comments
Labels
bug: Something isn't working
subsystem/tests: Issues and feature requests related to the testing framework.

Comments

@bflad
Member

bflad commented Nov 2, 2022

SDK version

v2.24.0

Relevant provider source code

Any acceptance testing code seems to trigger it, though it may be more prevalent with providers that are state-only, possibly because their acceptance tests run faster than those of API-based providers.

Terraform Configuration Files

N/A

Expected Behavior

Acceptance testing (or Go code in general) should never cause the operating system kernel to panic.

Actual Behavior

During acceptance testing, the workstation restarts due to a kernel panic.

Steps to Reproduce

TF_ACC=1 go test -count=1 -v ./... in providers such as:

Additional Information

This has happened for a few weeks now to multiple developers across multiple provider codebases. It may be related to #1063. It seems to happen only on darwin/arm64, not on darwin/amd64.

For me, it has happened on a few macOS versions now, with the most recent occurrence (today) on:

ProductName:	macOS
ProductVersion:	12.6.1
BuildVersion:	21G217

Here are some macOS panic snapshots from my workstation: https://gist.github.com/bflad/a709e2c1ddfa980c0c3163565485ee6f

Note that the in-flight task seems to be provider.test.

@bflad added the bug and subsystem/tests labels Nov 2, 2022
@bflad
Member Author

bflad commented Nov 3, 2022

I can confirm these panics also happen when running only the unit tests within helper/resource itself.

@bflad
Member Author

bflad commented Nov 4, 2022

Here are the relevant socket definitions for the values in the kernel panic (socket state: 102 socket flags: 48000 socket flags1: 1000000):

#define SS_ISCONNECTED          0x0002  /* socket connected to a peer */
#define SS_NBIO                 0x0100  /* non-blocking ops */

#define SOF_NODEFUNCT           0x00008000 /* socket cannot be defunct'd */
#define SOF_INCOMP_INPROGRESS   0x00040000 /* incomp socket is being processed */

#define SOF1_INBOUND                    0x01000000 /* Created via a passive listener */
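
To make the mapping explicit, here is a minimal sketch that decodes the three panic values against the definitions quoted above. It assumes the panic report prints the values in hexadecimal, which is consistent with the listed defines. The `flag` type and `decode` helper are hypothetical names for illustration and are not part of any SDK.

```go
package main

import (
	"fmt"
	"strings"
)

// flag pairs a bit value with its name from the socket definitions above.
type flag struct {
	bit  uint32
	name string
}

var (
	stateFlags = []flag{
		{0x0002, "SS_ISCONNECTED"},
		{0x0100, "SS_NBIO"},
	}
	socketFlags = []flag{
		{0x00008000, "SOF_NODEFUNCT"},
		{0x00040000, "SOF_INCOMP_INPROGRESS"},
	}
	socketFlags1 = []flag{
		{0x01000000, "SOF1_INBOUND"},
	}
)

// decode returns the names of the known bits set in value, joined by " | ".
// Any bits missing from the table are reported as a hex remainder.
func decode(value uint32, table []flag) string {
	var names []string
	for _, f := range table {
		if value&f.bit != 0 {
			names = append(names, f.name)
			value &^= f.bit
		}
	}
	if value != 0 {
		names = append(names, fmt.Sprintf("unknown(0x%x)", value))
	}
	return strings.Join(names, " | ")
}

func main() {
	// Assumption: the values in the panic report are hexadecimal.
	fmt.Println("state: ", decode(0x102, stateFlags))
	fmt.Println("flags: ", decode(0x48000, socketFlags))
	fmt.Println("flags1:", decode(0x1000000, socketFlags1))
}
```

Under that assumption, the socket in the panic was connected, non-blocking, marked as not defunct-able, still being processed as an incomplete connection, and created via a passive listener.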

@bflad
Member Author

bflad commented Nov 7, 2022

Since Terraform logs are not flushed to disk often enough to be accurate at the time of the kernel panic, I have gone down the worst-case troubleshooting route: logging to the console and photographing the screen with an external camera in the second or two while the OS writes the panic snapshot, before it forcibly restarts. Simplified logs from these captures have included:

terraform-provider-tls (pre #1091):

[TRACE] sdk.helper_resource: Calling Terraform CLI show command:
  test_name=TestAccResourceCertRequest_UpgradeFromVersion3_4_0
  test_step_number=3

[TRACE] sdk.helper_resource: Calling Terraform CLI show command:
  test_name=TestResourceSelfSignedCert_NotRecreated(?)RenewalUpdateInFuture
  test_step_number=1

terraform-plugin-sdk:

[TRACE] sdk.helper_resource: Calling Terraform CLI refresh command:
  test_name=TestTest_TestStep_ProviderFactories_To_ExternalProviders
  test_step_number=1

[TRACE] sdk.helper_resource: Calling Terraform CLI show command for JSON plan:
  test_name=TestTest_TestStep_ProviderFactories_Import_Inline_WithPersistMatch
  test_step_number=2

The prevalence of show as the last Terraform command is interesting, even in this very small sample. It could be that, because the command needs much less context building, it attempts to call out to providers before the socket is actually ready to receive. Will keep digging.

@bflad
Member Author

bflad commented Nov 8, 2022

Some additional captures from testing last night within terraform-plugin-sdk (TestPanicAtTheDisco being raw logic within the testing framework):

[TRACE] sdk.helper_resource: Calling Terraform CLI show command for JSON plan:
  test_name=TestTest_TestStep_ProviderFactories_Import_Inline_WithPersistMatch
  test_step_number=2

[TRACE] sdk.helper_resource: Calling Terraform CLI destroy command:
  test_name=TestTest_TestStep_ProviderFactories_Import_Inline_WithPersistMatch
  test_step_number=2

[TRACE] sdk.helper_resource: Calling Terraform CLI show command for JSON plan:
  test_name=TestTest_TestStep_ProviderFactories_Import_Inline_WithPersistMatch
  test_step_number=2

[TRACE] sdk.helper_resource: Calling Terraform CLI show command for JSON plan:
  test_name=TestTest_TestStep_ProviderFactories_Import_Inline_WithPersistMatch
  test_step_number=2

[TRACE] sdk.helper_resource: Calling Terraform CLI show command for JSON plan:
  test_name=TestTest_TestStep_ProviderFactories_Import_Inline_WithPersistMatch
  test_step_number=1

[TRACE] sdk.helper_resource: Calling Terraform CLI plan command:
  test_name=TestTest_TestStep_ProviderFactories_Import_Inline_WithPersistMatch
  test_step_number=2

[TRACE] sdk.helper_resource: Calling Terraform CLI refresh command:
  test_name=TestTest_TestStep_ProviderFactories_Import_Inline_WithPersistMatch
  test_step_number=2

=== RUN TestPanicAtTheDisco (prior to manually adding tfexec logging)
Calling apply
starting provider
running command

=== RUN TestPanicAtTheDisco (prior to manually adding tfexec logging)
Calling show
starting provider
running command

=== RUN TestPanicAtTheDisco (strange output ordering?)
tfexec: command: starting command
tfexec: command: starting stdout goroutine
tfexec: command: starting stderr goroutine
tfexec: command: waiting for command to finish
calling init
starting provider

=== RUN TestPanicAtTheDisco
calling show
starting provider
running command
(handoff to tfexec)
tfexec: show: checking compatible
tfexec: show: configuring
tfexec: show: merging environment
tfexec: show: building command args
tfexec: show: running command
tfexec: command: merging writers
tfexec: command: running command
tfexec: command: merging stdout writers
tfexec: command: merging stderr writers
tfexec: command: starting command
tfexec: command: starting stdout goroutine
tfexec: command: starting stderr goroutine
tfexec: command: waiting for command to finish

=== RUN TestPanicAtTheDisco
calling init
starting provider
running command
(handoff to tfexec)
tfexec: init: configuring
tfexec: init: merging environment
tfexec: init: building command args
tfexec: command: merging stdout writers
tfexec: command: merging stderr writers
tfexec: command: starting command
tfexec: command: starting stdout goroutine
tfexec: command: starting stderr goroutine
tfexec: command: waiting for command to finish

All signs seem to point at Terraform CLI starting the go-plugin client (which does perform net.Dial for verification) or making its initial calls to that client. My initial testing of NewClient(...).Client() did not trigger the issue. Creating a real client will mean copying the compiled protobuf code, but I ran out of time to get that far.

@bflad
Member Author

bflad commented Feb 9, 2023

This issue appears to be resolved with macOS Ventura (13.x). Please open a new issue if you still encounter it.

@bflad closed this as completed Feb 9, 2023
@github-actions

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions bot locked as resolved and limited conversation to collaborators Mar 12, 2023