Navigation Menu

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test for runTerraformCmd leaked go-routine #299

Merged
merged 8 commits into from May 20, 2022

Conversation

lornasong
Copy link
Member

@lornasong lornasong commented May 5, 2022

This PR intends to showcase the issue fixed by #276:

  • Commit 1-3: all part of #301 (Sorry, I didn't realize becauseI forked, I couldn't create this PR off of my other branch)
  • Commit 4: added a test which captures the issue; test fails
  • Commit 5: cherry-picked commits from fix: Race condition on execution when context closes #276 that address the issue; test passes

If the changes in #276 look good, can close this PR in favor of merging #276. The tests added in this PR aren't very meaning and lose context once the fix is merged.

Issue details:

Currently, when runTerraformCmd is called, it starts a go-routine to kill the
Terraform CLI on context.Done(). However, when the Terraform CLI completes and
runTerraformCmd() finishes, the go-routine continues running unnecessarily.

If the caller cancels the context down the line, this will stop the go-routine
and it will log the error: "error from kill: os: process already finished" because
the Terraform CLI has already finished.

Added a test for this in cmd_default.go and cmd_linux.go. Have not tried it in
linux yet though.

Note: it's possible to see the race with pre-existing tests e.g. go test -race -run TestContext_sleepTimeoutExpired ./tfexec/internal/e2etest -v

When running with the race detector for Test_runTerraformCmd_default:

==================
WARNING: DATA RACE
Read at 0x00c0002516c8 by goroutine 7:
  bytes.(*Buffer).String()
      /usr/local/go/src/bytes/buffer.go:65 +0x36a
  github.com/hashicorp/terraform-exec/tfexec.Test_runTerraformCmd_default()
      /Users/lornasong/go/src/github.com/hashicorp/terraform-exec/tfexec/cmd_default_test.go:35 +0x360
  testing.tRunner()
  // truncated ...

Previous write at 0x00c0002516c8 by goroutine 8:
  bytes.(*Buffer).grow()
      /usr/local/go/src/bytes/buffer.go:147 +0x3b1
  bytes.(*Buffer).Write()
      /usr/local/go/src/bytes/buffer.go:172 +0xcd
  log.(*Logger).Output()
      /usr/local/go/src/log/log.go:184 +0x466
  log.(*Logger).Printf()
      /usr/local/go/src/log/log.go:191 +0x6e
  github.com/hashicorp/terraform-exec/tfexec.(*Terraform).runTerraformCmd.func1()
      /Users/lornasong/go/src/github.com/hashicorp/terraform-exec/tfexec/cmd_default.go:24 +0x2a5
  // truncated ...
==================
--- FAIL: Test_runTerraformCmd_default (1.01s)
    cmd_default_test.go:35:
                Error Trace:    cmd_default_test.go:35
                Error:          "[INFO] running Terraform command: /bin/echo hello tf-exec!
                                error from kill: os: process already finished
                                " should not contain "error from kill"
                Test:           Test_runTerraformCmd_default
    testing.go:1152: race detected during execution of test

Closes #258
Closes #276

@lornasong lornasong force-pushed the run-cmd branch 4 times, most recently from 608cf7f to 27f77eb Compare May 6, 2022 02:15
…timeframe

Add a timeout to the test to ensure that the terraform apply cancels within
a reasonable time of the 5s timeout.

Currently, this test is not canceling the terraform apply as expected. In the
logs you can see that the test takes 1 min rather than ~5s:
```
    --- PASS: TestContext_sleepTimeoutExpired/sleep-0.12.31 (62.13s)
```

```
=== RUN   TestContext_sleepTimeoutExpired/sleep-0.12.31
    util_test.go:113: [INFO] running Terraform command: /var/folders/6y/gy9gggt14379c_k39vwb50lc0000gn/T/terraform_1378921380/terraform apply -no-color -auto-approve -input=false -lock=true -parallelism=10 -refresh=true
    util_test.go:103: CLI Output:
	    // truncated ...
        time_sleep.sleep: Creating...
        time_sleep.sleep: Still creating... [10s elapsed]
        time_sleep.sleep: Still creating... [20s elapsed]
        time_sleep.sleep: Still creating... [30s elapsed]
        time_sleep.sleep: Still creating... [41s elapsed]
        time_sleep.sleep: Still creating... [51s elapsed]
        time_sleep.sleep: Creation complete after 1m0s [id=2022-05-06T17:40:20Z]

        Apply complete! Resources: 1 added, 0 changed, 0 destroyed.
```
Processes were not being killed because cmd.ProcessState was nil. With this
change, processes will be able to make the request to Kill().

Added a temporary log to printout cmd.ProcessState to demonstrate. Will be
removed in next commit.

Note: this will cause hanging `TestContext_sleepTimeoutExpired` due to a known
Golang issue with killing a command when Stdout or Stderr are set to anything
besides `nil` or `*os.File`. This is because the Kill does not notify the
stdout/stderr subprocesses to stop. `cmd.Wait` (called by `cmd.Run`) waits
indefinitely for those subprocesses to stop.
`TestContext_sleepTimeoutExpired` can occasionally hang when killing a command
that has Stdout or Stderr set to anything besides `nil` or `*os.File`.
golang/go#23019

Use workaround to read from StdoutPipe and StderrPipe rather than setting
Stdout / Stderr
Currently, when runTerraformCmd is called, it starts a go-routine to kill the
Terraform CLI on context.Done(). However, when the Terraform CLI completes and
runTerraformCmd() finishes, the go-routine continues running unnecessarily.

If the caller cancels the context down the line, this will stop the go-routine
and it will log the error: "error from kill: os: process already finished" because
the Terraform CLI has already finished.

Added a test for this in cmd_default.go and cmd_linux.go. Have not tried it in
linux yet though.

When running with the race detector:
```
==================
WARNING: DATA RACE
Read at 0x00c0002516c8 by goroutine 7:
  bytes.(*Buffer).String()
      /usr/local/go/src/bytes/buffer.go:65 +0x36a
  github.com/hashicorp/terraform-exec/tfexec.Test_runTerraformCmd_default()
      /Users/lornasong/go/src/github.com/hashicorp/terraform-exec/tfexec/cmd_default_test.go:35 +0x360
  testing.tRunner()
  // truncated ...

Previous write at 0x00c0002516c8 by goroutine 8:
  bytes.(*Buffer).grow()
      /usr/local/go/src/bytes/buffer.go:147 +0x3b1
  bytes.(*Buffer).Write()
      /usr/local/go/src/bytes/buffer.go:172 +0xcd
  log.(*Logger).Output()
      /usr/local/go/src/log/log.go:184 +0x466
  log.(*Logger).Printf()
      /usr/local/go/src/log/log.go:191 +0x6e
  github.com/hashicorp/terraform-exec/tfexec.(*Terraform).runTerraformCmd.func1()
      /Users/lornasong/go/src/github.com/hashicorp/terraform-exec/tfexec/cmd_default.go:24 +0x2a5
  // truncated ...
==================
@lornasong lornasong marked this pull request as ready for review May 10, 2022 20:22
@radeksimko radeksimko self-requested a review May 11, 2022 09:32
Copy link
Member

@radeksimko radeksimko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @lornasong
Thanks for the PR.

It seems there are still some race conditions pointing to the newly introduced code, see https://gist.github.com/radeksimko/a199a28daa4eb8225aa6886686bac7fd

Would you mind looking into it?

The implementation and tests generally look reasonable to me though. I'm not sure what's the root cause of that race condition.

The main reason we were catching the context cancellation and killing the process was because it provided extra context beyond just "context deadline exceeded" error. However we don't really have any existing tests and don't communicate that as a feature, so I'm okay with loosing that. We can revisit error handling generally as part of #107

In terms of other PRs - I think we can focus on this one, which seems to be a superset of #301 and importantly has tests, so feel free to close #301

tfexec/cmd.go Outdated Show resolved Hide resolved
This also fixes a race condition caused by using the pointer to the io.EOF
@lornasong
Copy link
Member Author

@radeksimko - thank you for reviewing and catching the race condition.

It looks like the race condition is caused by threads sharing access to &io.EOF in errors.As(err, &io.EOF) at cmd.go:246

I was able to reproduce the race condition using go test -race -run TestApplyCmd ./tfexec -v. When I create a locally scoped io.EOF variable, I am no longer able to reproduce the race condition:

eofErr := io.EOF
if errors.As(err, &eofErr) {
    return nil
}

Switching to error.Is, as you caught, actually removes the &io.EOF problem! With that change, I am no longer able to reproduce this race condition. Please let me know if you see otherwise. Thank you

@radeksimko
Copy link
Member

Thanks for the update @lornasong

This seemed to have reduced the race conditions to a point that I can rarely reproduce it via

go test ./... -race -count=1

but unfortunately I still see some race conditions with this

go test ./... -race -count=5

See attached log output:
https://gist.github.com/radeksimko/82dab1f27efc0ff493891b7dc3e82253

I also noticed an error util_test.go:95: unable to determine running version (expected "1.1.9"): read |0: file already closed

string.Builder is a non-comparable type which is not safe for concurrent use
when shared by Cmd.Stdout and Cmd.Stderr. Causes a race condition when accessing
the the builder when Cmd is running.
 - Use waitgroups for more readability
 - Improve handling errors from writeOutput
 - Finish reading from pipes before calling cmd.Wait - fixes a race condition
 that leads to an error :`read |0: file already closed`
 - Because now waiting for pipes to finish reading, need to update waitGroup to
 close buf.Read on context cancel. Otherwise buf.Read blocks until next line
 before stopping. Causes TestContext_sleepTimeoutExpired takes a little too long
 to cancel (~20s)
@lornasong
Copy link
Member Author

Thanks for catching all of these issues @radeksimko. I pushed two commits to try to address the issues in the gist you shared.

The first commit tries to resolve the race condition issues. It looks like the race condition occurs when Stdout and Stderr share a writer that is non-go-routine safe e.g. strings.Builder. I opted to modify the tests to have separate strings.Builders for Stdout and Stderr for a couple reasons:

  • It looks like when sharing a writer, Cmd.Stdout and Cmd.Stderr expects the writer to be a comparable type in order to be goroutine safe. strings.Builder isn't comparable because []byte can't be compared
  • Relatedly, it seems as though sharing a writer was already causing a race condition on the main branch. I was able to reliably reproduce the race condition (on this branch prior to the test modification and main) with TestShow_noInitLocalBackendNonDefaultState v0.13.7 (I manually commented out the other versions). Sample log output from running test on main

Please let me know if you have concerns with going this route.

The second commit should hopefully resolve the read |0: file already closed. From hashicorp/go-plugin#116, I learned that the pipes should finish reading before calling cmd.Wait. Waiting for the pipes to finish reading introduced a new problem where TestContext_sleepTimeoutExpired started failing because buf.ReadBytes() blocks until it reads bytes, which prevents handling the context cancel in a timely way. I added a fix in writeOutput to attempt to address this.

Please me know any additional feedback or issues you find. Thank you

Copy link
Member

@radeksimko radeksimko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for all the updates and for the added context. Tests are now repeatedly passing even under

go test ./... -race -count=5 -timeout=30m

👍🏻

func writeOutput(ctx context.Context, r io.ReadCloser, w io.Writer) error {
// ReadBytes will block until bytes are read, which can cause a delay in
// returning even if the command's context has been canceled. Use a separate
// goroutine to prompt ReadBytes to return on cancel
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I appreciate the comment ❤️

@radeksimko radeksimko added the bug Something isn't working label May 20, 2022
@radeksimko radeksimko merged commit e3ff563 into hashicorp:main May 20, 2022
@lornasong
Copy link
Member Author

Thank you for the review @radeksimko. Appreciate your time and finding all of those issues 🙏

@lornasong lornasong deleted the run-cmd branch May 23, 2022 21:43
rstandt pushed a commit to rstandt/terraform-exec that referenced this pull request Jun 21, 2022
* Update TestContext_sleepTimeoutExpired to check for canceling within timeframe

Add a timeout to the test to ensure that the terraform apply cancels within
a reasonable time of the 5s timeout.

Currently, this test is not canceling the terraform apply as expected. In the
logs you can see that the test takes 1 min rather than ~5s:
```
    --- PASS: TestContext_sleepTimeoutExpired/sleep-0.12.31 (62.13s)
```

```
=== RUN   TestContext_sleepTimeoutExpired/sleep-0.12.31
    util_test.go:113: [INFO] running Terraform command: /var/folders/6y/gy9gggt14379c_k39vwb50lc0000gn/T/terraform_1378921380/terraform apply -no-color -auto-approve -input=false -lock=true -parallelism=10 -refresh=true
    util_test.go:103: CLI Output:
	    // truncated ...
        time_sleep.sleep: Creating...
        time_sleep.sleep: Still creating... [10s elapsed]
        time_sleep.sleep: Still creating... [20s elapsed]
        time_sleep.sleep: Still creating... [30s elapsed]
        time_sleep.sleep: Still creating... [41s elapsed]
        time_sleep.sleep: Still creating... [51s elapsed]
        time_sleep.sleep: Creation complete after 1m0s [id=2022-05-06T17:40:20Z]

        Apply complete! Resources: 1 added, 0 changed, 0 destroyed.
```

* Remove runTerraformCmd check for cmd.ProcessState

Processes were not being killed because cmd.ProcessState was nil. With this
change, processes will be able to make the request to Kill().

Added a temporary log to printout cmd.ProcessState to demonstrate. Will be
removed in next commit.

Note: this will cause hanging `TestContext_sleepTimeoutExpired` due to a known
Golang issue with killing a command when Stdout or Stderr are set to anything
besides `nil` or `*os.File`. This is because the Kill does not notify the
stdout/stderr subprocesses to stop. `cmd.Wait` (called by `cmd.Run`) waits
indefinitely for those subprocesses to stop.

* Read logs from Stderr/out Pipe to avoid hanging

`TestContext_sleepTimeoutExpired` can occasionally hang when killing a command
that has Stdout or Stderr set to anything besides `nil` or `*os.File`.
golang/go#23019

Use workaround to read from StdoutPipe and StderrPipe rather than setting
Stdout / Stderr

* Test for runTerraformCmd leaked go-routine

Currently, when runTerraformCmd is called, it starts a go-routine to kill the
Terraform CLI on context.Done(). However, when the Terraform CLI completes and
runTerraformCmd() finishes, the go-routine continues running unnecessarily.

If the caller cancels the context down the line, this will stop the go-routine
and it will log the error: "error from kill: os: process already finished" because
the Terraform CLI has already finished.

Added a test for this in cmd_default.go and cmd_linux.go. Have not tried it in
linux yet though.

When running with the race detector:
```
==================
WARNING: DATA RACE
Read at 0x00c0002516c8 by goroutine 7:
  bytes.(*Buffer).String()
      /usr/local/go/src/bytes/buffer.go:65 +0x36a
  github.com/hashicorp/terraform-exec/tfexec.Test_runTerraformCmd_default()
      /Users/lornasong/go/src/github.com/hashicorp/terraform-exec/tfexec/cmd_default_test.go:35 +0x360
  testing.tRunner()
  // truncated ...

Previous write at 0x00c0002516c8 by goroutine 8:
  bytes.(*Buffer).grow()
      /usr/local/go/src/bytes/buffer.go:147 +0x3b1
  bytes.(*Buffer).Write()
      /usr/local/go/src/bytes/buffer.go:172 +0xcd
  log.(*Logger).Output()
      /usr/local/go/src/log/log.go:184 +0x466
  log.(*Logger).Printf()
      /usr/local/go/src/log/log.go:191 +0x6e
  github.com/hashicorp/terraform-exec/tfexec.(*Terraform).runTerraformCmd.func1()
      /Users/lornasong/go/src/github.com/hashicorp/terraform-exec/tfexec/cmd_default.go:24 +0x2a5
  // truncated ...
==================

* Use CommandContext to kill instead of manually doing it

* Fix EOF error check to use error.Is()

This also fixes a race condition caused by using the pointer to the io.EOF

* Update tests to use separate string.Builder-s for stdout and stderr

string.Builder is a non-comparable type which is not safe for concurrent use
when shared by Cmd.Stdout and Cmd.Stderr. Causes a race condition when accessing
the the builder when Cmd is running.

* Fixes to runTerraformCmd for race conditions

 - Use waitgroups for more readability
 - Improve handling errors from writeOutput
 - Finish reading from pipes before calling cmd.Wait - fixes a race condition
 that leads to an error :`read |0: file already closed`
 - Because now waiting for pipes to finish reading, need to update waitGroup to
 close buf.Read on context cancel. Otherwise buf.Read blocks until next line
 before stopping. Causes TestContext_sleepTimeoutExpired takes a little too long
 to cancel (~20s)

Co-authored-by: Kyle Carberry <kyle@carberry.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants