Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

testscript: fix "signal: killed" exec errors on MacOS by adding a 200ms sleep on darwin #219

Closed
wants to merge 1 commit into from

Conversation

bep
Copy link
Contributor

@bep bep commented May 2, 2023

On MacOS there are lots of reports of unexpected failing tests with output similar to this:

 [signal: killed]
            FAIL: testscripts/myecho.txt:1: unexpected command failure

On CI builds the workaround has been to downgrade to a builder with MacOS <= 11 (e.g. macos-11 on GitHub).

In development on a MacBook, this is not an option.

This commit works around what seem to be a upstream bug in os.Link by adding a small sleep before TestingM.Run() to allow the write of the test
command symlinks to be ready.

See #200

uname -a                                                                                                             
Darwin bep-mac14.local 22.3.0 Darwin Kernel Version 22.3.0: Thu Jan  5 20:48:54 PST 2023; root:xnu-8792.81.2~2/RELEASE_ARM64_T6000 arm64
go version                                                                                                           
go version go1.20.3 darwin/arm64

Fixes #200

@mvdan
Copy link
Collaborator

mvdan commented May 2, 2023

Thanks for digging into this! It's great to have someone with a machine that is able to reproduce do this investigation, as I was unable to.

// Make sure all of the io operations above gets some time to complete.

I'm curious, what would make them not finish in time? We wait for the io.Copy and os.File.Close calls to finish, which I assume are enough. I guess we could also call https://pkg.go.dev/os#File.Sync, but at least in my past experience, that has never been needed.

How did you arrive at the 50ms? Is that number long enough to make the failures virtually impossible on your laptop? I'm almost tempted to double or even quadruple that number if so, because CI runners are often many times slower than a modern laptop.

I'd also like to have this sleep behind a runtime.GOOS == "darwin"; the other platforms are entirely unaffected.

Also, interesting that CI still failed with the same error on Mac :)

Finally - it would be good to still figure out a way to make this work well without sleeping. Sleeping is always going to be either too short and flakey, or too long and wasteful. If there really isn't anything we can do but sleep, perhaps we could do something a bit better, such as sleeping in a loop until calling the last of the commands as a no-op (e.g. with -h) results in a normal execution.

@mvdan
Copy link
Collaborator

mvdan commented May 2, 2023

Also note that the failure in CI above is from executing go, so it's in theory entirely unrelated to the programs we set up in $PATH. I've always wondered if the cause was that testscript is parallel by default, and executing so many binaries concurrently might make MacOS unhappy.

@bep
Copy link
Contributor Author

bep commented May 2, 2023

How did you arrive at the 50ms? Is that number long enough to make the failures virtually impossible on your laptop? I'm almost tempted to double or even quadruple that number if so, because CI runners are often many times slower than a modern laptop.

That's a rather arbitrary number. I did test higher, but not lower numbers. My test case is pretty aggressive and I can easily run that may times without any error -- but this is the setup part of the scripts, so there's no reason not to go for a longer sleep.

Finally - it would be good to still figure out a way to make this work well without sleeping.

Sure, but there's a lot of of file copying and mkdiring in RunMain, so digging into that sounds like ... work. My best guess is file syncing, but in any case there's certainly a data race going on where the binaries gets executed before they're somehow written to disk (or a variation of that theme).

I'd also like to have this sleep behind a runtime.GOOS == "darwin"; the other platforms are entirely unaffected.

I have force pushed (bad habit ...) a version with 200ms sleep + a runtime.GOOS == "darwin" conditional. Let's see if the test passes (the failing test must be unrelated).

@bep bep force-pushed the fix/unexpected-command-failure branch 2 times, most recently from 7beb349 to 5bad37a Compare May 2, 2023 21:07
@ldemailly
Copy link

That's awesome if it works, can you try File.Sync instead of the sleep (if possible) to see if that solves it as well?
(I guess the issue is there are a lot of Close() and it's not all in one place)

@bep
Copy link
Contributor Author

bep commented May 3, 2023

That's awesome if it works, can you try File.Sync instead of the sleep (if possible) to see if that solves it as well?
(I guess the issue is there are a lot of Close() and it's not all in one place)

I'll see if I can find some time during the day. So, it fixes my problem, so in that sense it works. I run tests on save in my editor, and before this patch I get this error at about 1/4 of the test runs (depending on the size of the project under test), and have to rerun the test, which breaks my workflow pretty bad. Whether this fixes the CI problem is yet to be proven, but the symptoms are the same, so I would be surprised if it didn't.

@bep
Copy link
Contributor Author

bep commented May 3, 2023

OK, I tried a variant to add a Sync, but that didn't fix it. Turns out the problem lives here:

https://github.com/rogpeppe/go-internal/blob/master/testscript/exe.go#L126

  • The links created by os.Link seems take "some time" to be ready on MacOS.
  • Fixing that issue would need to be done in Go (or Apple)
  • Disabling symlinks above (like it's done on Windows) would be a worse fix than this patch, so I suggest merging this and then create an issue in Go that can (maybe) do a better fix in the future.

@bep bep force-pushed the fix/unexpected-command-failure branch from 5bad37a to 7ff90d9 Compare May 3, 2023 07:10
@bep bep changed the title testscript: fix "signal: killed" exec errors testscript: fix "signal: killed" exec errors on MacOS May 3, 2023
@bep bep force-pushed the fix/unexpected-command-failure branch 2 times, most recently from 31202af to f6df25d Compare May 3, 2023 07:38
@mvdan
Copy link
Collaborator

mvdan commented May 3, 2023

Interesting, I wouldn't have guessed that the hard links were to blame. Note that they're hard links though, not symbolic links. I can't quite remember why I chose os.Link rather than os.Symlink, but I imagine it's because hard links are much closer to real files than symbolic links. And symbolic links always have their own set of problems, like breaking if the target is removed.

Disabling symlinks above (like it's done on Windows) would be a worse fix than this patch, so I suggest merging this and then create an issue in Go that can (maybe) do a better fix in the future.

I would actually say the opposite. Copying a binary will generally take much, much less than 50ms or 200ms. A hard link is typically even faster, yes, but here for MacOS we're aiming for avoiding the failures over speed. If disabling hard links on Mac fixes the issue, like we already do on Windows, then I would say that's definitely a better fix than an arbitrary sleep.

On `MacOS` there are lots of reports of unexpected failing tests with output similar to this:

```
 [signal: killed]
            FAIL: testscripts/myecho.txt:1: unexpected command failure
```

On CI builds the workaround has been to downgrade to a builder with MacOS <= 11 (e.g. macos-11 on GitHub).

In development on a MacBook, this is not an option.

This commit works around what seem to be a upstream bug in `os.Link` by adding a small sleep before `TestingM.Run()` to allow the write of the test
command hard links to be ready.

See rogpeppe#200
@bep bep force-pushed the fix/unexpected-command-failure branch from f6df25d to 6f1ce9e Compare May 3, 2023 08:41
@bep
Copy link
Contributor Author

bep commented May 3, 2023

Note that they're hard links though, not symbolic links. I can't quite remember why I chose os.Link rather than os.Symlink

OK, my mistake.

Copying a binary will generally take much, much less than 50ms or 200ms.

You're duplicating the binary n times (where n is the number of commands in the map, which can be many). If you have many commands/big binaries, I doubt you can beat 200ms, but it doesn't matter much.

I have created this PR:

#220

Which also fixes this issue for me. Pick the one you like.

@mvdan
Copy link
Collaborator

mvdan commented May 3, 2023

Both have passed CI now, so they both seem to help. I'm inclined to go with the other PR, just because sleeps are always going to be flakey or slow. I'll leave both PRs up for 24h to see if other Mac users chime in.

@bep
Copy link
Contributor Author

bep commented May 3, 2023

@mvdan note that you should also bump the macos version here:

https://github.com/rogpeppe/go-internal/blob/master/.github/workflows/test.yml#L19

But I'm pretty sure that's not picked up from a PR from a fork, so you should do that as a follow up. As I said, I'm 100% sure that this fixes my problem, but I'm not 100% sure that it also fixes all CI related problems.

@bep
Copy link
Contributor Author

bep commented May 3, 2023

@mvdan I had a second look on this, and I think I want to put in a vote for the "sleep fix". I checked the bin folders for a test I have running:

This patch:

41M

The other patch:

493M

The way the binaries are created are very clever, but it means lots of potentially very large files created on each run if your not able to hard link it.

@mvdan
Copy link
Collaborator

mvdan commented May 3, 2023

Yeah, more than happy to bump the macos version back to latest once we're reasonably sure that the flakes are solved.

The way the binaries are created are very clever, but it means lots of potentially very large files created on each run if your not able to hard link it.

That's true, although these are temporary files, so does the size matter that much? Speaking for myself, a 200ms sleep on every test run would bother me more than using another 50-100MiB worth of /tmp per run. Most of the testscript use cases I've seen in the past declare at most a handful of top-level commands.

I'd of course love to find a solution to continue using hard links. Sleeping doesn't seem like a good solution :) If we knew what to poll for, or somehow tell when the binaries are "ready" to be executed, then at least we could incrementally sleep until done.

@bep
Copy link
Contributor Author

bep commented May 4, 2023

That's true, although these are temporary files, so does the size matter that much? Speaking for myself, a 200ms sleep on every test run would bother me more than using another 50-100MiB worth of /tmp per run.

You understand that my numbers above are 50MiB vs 500MiB, right? I have tests that 1. I run on save in my editor and 2. Typically takes a second or two to finish (so I don't care too much about 200ms extra). For arguments sake, let's say I have a bigger project with more commands so it becomes 1GB on every save in my editor, that's a lot of SSD writes during a year ...

Also, knowing this is a issue with os.Link we can

  • Revert to, say, a 50ms sleep, as I'm pretty sure there's a very small window of failure here.
  • Also, I consider this to be a bug somewhere upstream (Apple or Go) that will eventually be fixed, so a temporary sleep should be a fine temporary patch.

@rogpeppe
Copy link
Owner

rogpeppe commented May 4, 2023

@bep You've probably tried this already, but given that you can reproduce this problem reasonably reliably, it might be interesting to try stat-ing and reading the contents of the link just after creating it. I'm specifically wondering if it can ever happen that that succeeds when the later exec fails. If not, then this could be used as the basis for a polling loop, assuming the availability is a global property of the linked file.

Another more direct way to implement a polling loop is actually to try to exec the commands immediately after creating the links. We could set some environment variable to make the action of the command a no-op in this case.

(For the record, I'm not really in favour of either the sleep or the copy-the-binary solution, unless those really are the only way forward).

@bep
Copy link
Contributor Author

bep commented May 4, 2023

@rogpeppe I have tried a few variations of stating and opening and reading some bytes from the file, but none of those work. I guess one could try to read the entire file in a poll loop, but...

However, I have tested this successfully, which is now my current favourite solution:

#221

@bep bep changed the title testscript: fix "signal: killed" exec errors on MacOS testscript: fix "signal: killed" exec errors on MacOS by adding a 200ms sleep on darwin May 4, 2023
@ldemailly
Copy link

Would symlinks instead of hardlinks work better/out of the box? I vaguely remember issues with hardlinks turning to copies when crossing fs and maybe tmpdir is different fs?

@mvdan
Copy link
Collaborator

mvdan commented May 5, 2023

@ldemailly see #221.

@mvdan
Copy link
Collaborator

mvdan commented May 5, 2023

@rogpeppe I have tried a few variations of stating and opening and reading some bytes from the file, but none of those work. I guess one could try to read the entire file in a poll loop, but...

Thanks for checking those. If symlinks don't have this race problem, then hopefully that's the easiest, as we don't need any sleeping nor polling.

@bep bep closed this May 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

testscript: "signal: killed" exec errors on MacOS 12
4 participants