Parallel envtests getting stuck in flock #1599
Comments
@Porges Does it work if you carry this patch with the non-blocking flag?

diff --git a/go.mod b/go.mod
index 0ce3b6c1..8e8e3f0d 100644
--- a/go.mod
+++ b/go.mod
@@ -12,6 +12,7 @@ require (
github.com/imdario/mergo v0.3.12 // indirect
github.com/onsi/ginkgo v1.16.4
github.com/onsi/gomega v1.13.0
+ github.com/pkg/errors v0.9.1
github.com/prometheus/client_golang v1.11.0
github.com/prometheus/client_model v0.2.0
go.uber.org/goleak v1.1.10
diff --git a/pkg/internal/flock/flock_unix.go b/pkg/internal/flock/flock_unix.go
index 3dae621b..503de43b 100644
--- a/pkg/internal/flock/flock_unix.go
+++ b/pkg/internal/flock/flock_unix.go
@@ -18,7 +18,14 @@ limitations under the License.
package flock
-import "golang.org/x/sys/unix"
+import (
+ "github.com/pkg/errors"
+ "golang.org/x/sys/unix"
+)
+
+var (
+ ErrAlreadyLocked = errors.New("the file is already locked")
+)
// Acquire acquires a lock on a file for the duration of the process. This method
// is reentrant.
@@ -30,6 +37,10 @@ func Acquire(path string) error {
// We don't need to close the fd since we should hold
// it until the process exits.
+ err = unix.Flock(fd, unix.LOCK_NB|unix.LOCK_EX)
+ if errors.Is(err, unix.EWOULDBLOCK) { // This condition requires LOCK_NB.
+ return errors.Wrapf(ErrAlreadyLocked, "cannot lock file %q", path)
+ }
+ return err
- return unix.Flock(fd, unix.LOCK_EX)
}
diff --git a/pkg/internal/testing/addr/manager.go b/pkg/internal/testing/addr/manager.go
index 2326af15..9dcbaa43 100644
--- a/pkg/internal/testing/addr/manager.go
+++ b/pkg/internal/testing/addr/manager.go
@@ -17,6 +17,7 @@ limitations under the License.
package addr
import (
+ "errors"
"fmt"
"io/fs"
"net"
@@ -31,7 +32,7 @@ import (
// TODO(directxman12): interface / release functionality for external port managers
const (
- portReserveTime = 10 * time.Minute
+ portReserveTime = 2 * time.Minute
portConflictRetry = 100
portFilePrefix = "port-"
)
@@ -76,7 +77,8 @@ func (c *portCache) add(port int) (bool, error) {
return false, err
}
// Try allocating new port, by acquiring a file.
- if err := flock.Acquire(fmt.Sprintf("%s/%s%d", cacheDir, portFilePrefix, port)); os.IsExist(err) {
+ path := fmt.Sprintf("%s/%s%d", cacheDir, portFilePrefix, port)
+ if err := flock.Acquire(path); errors.Is(err, os.ErrExist) || errors.Is(err, flock.ErrAlreadyLocked) {
return false, nil
} else if err != nil {
return false, err
@@ -86,22 +88,19 @@ func (c *portCache) add(port int) (bool, error) {
var cache = &portCache{}
-func suggest(listenHost string) (int, string, error) {
+func suggest(listenHost string) (*net.TCPListener, int, string, error) {
if listenHost == "" {
listenHost = "localhost"
}
addr, err := net.ResolveTCPAddr("tcp", net.JoinHostPort(listenHost, "0"))
if err != nil {
- return -1, "", err
+ return nil, -1, "", err
}
l, err := net.ListenTCP("tcp", addr)
if err != nil {
- return -1, "", err
+ return nil, -1, "", err
}
- if err := l.Close(); err != nil {
- return -1, "", err
- }
- return l.Addr().(*net.TCPAddr).Port,
+ return l, l.Addr().(*net.TCPAddr).Port,
addr.IP.String(),
nil
}
@@ -112,10 +111,11 @@ func suggest(listenHost string) (int, string, error) {
// allocated within 1 minute.
func Suggest(listenHost string) (int, string, error) {
for i := 0; i < portConflictRetry; i++ {
- port, resolvedHost, err := suggest(listenHost)
+ listener, port, resolvedHost, err := suggest(listenHost)
if err != nil {
return -1, "", err
}
+ defer listener.Close()
if ok, err := cache.add(port); ok {
return port, resolvedHost, nil
} else if err != nil {
Per this comment (cc @vincepri). We recently updated to v0.9.2 and encountered test failures due to timeouts, where it looks like a goroutine is getting stuck trying to `flock` a file. It seems that #1563 would be the culprit, and downgrading to v0.9.1 did allow our test run to pass.

At the moment we don’t limit the number of parallel envtests (we use `t.Parallel()`), so we are potentially running up to 19 instances (the number of tests we have). We are running our tests in a container based on a VS Code devcontainer image (`uname -a` shows `Linux ca4e9d7b3b89 5.10.43.3-microsoft-standard-WSL2 #1 SMP Wed Jun 16 23:47:55 UTC 2021 x86_64 GNU/Linux`). The host is GitHub’s `ubuntu-latest`.

An example failure is here. The goroutine-dump output from the test run is: