Add -showDebugAddress to enable debugging hanging parallel tests #461

Open

onsi wants to merge 1 commit into master
Conversation

@onsi (Owner) commented Apr 25, 2018

The experience when parallel tests hang currently sucks. (See #323 and #206)

I had some time recently to try out a few ideas and landed on this PR that I'd like feedback on.

The basic premise is that when you run ginkgo -p -showDebugAddress, Ginkgo will emit a debug HTTP address that delivers a JSON payload representing:

  • the current running nodes
  • the current running test on each node
  • any stdout output emitted by that test so far
  • any GinkgoWriter writes emitted by that test so far
  • when the test started running

The hope is that this information can be useful to help debug the tests. The JSON output is a bit noisy (it's a serialization of some of the reporter types).
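For illustration, the payload might look something like this (field names are hypothetical — the actual output is a serialization of the reporter types mentioned above, not this exact shape):

```json
{
  "nodes": [
    {
      "node": 1,
      "currentTest": "MyFeature when the server is up responds quickly",
      "startedAt": "2018-04-25T10:04:00Z",
      "stdout": "some captured stdout\n",
      "ginkgoWriter": "STEP: dialing the server\n"
    },
    {
      "node": 2,
      "currentTest": "MyOtherFeature retries on failure",
      "startedAt": "2018-04-25T10:03:12Z",
      "stdout": "",
      "ginkgoWriter": ""
    }
  ]
}
```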

This would ultimately benefit from some sort of user-friendly client (e.g. ginkgo debug <DEBUG ADDRESS> could emit pretty-printed output, or one could imagine a simple web client that renders the output nicely). There are also alternative ways to do this (e.g. sending a signal to ginkgo to emit this JSON to stdout instead of hitting an HTTP endpoint).

Would appreciate any and all feedback and also contributions to improve the user experience.

@jvshahid (Collaborator)

If this information is available on the coordinating node, why wouldn't a combination of running ginkgo with a new flag (sorry, couldn't come up with a good name) and Ctrl-C be sufficient? If you interrupt ginkgo in this new mode, it could do the following for any child node that is still running:

  1. Send it SIGABRT to capture the stack trace on stderr
  2. Wait for the node to exit
  3. Print the current test that is running on that node and when it started
  4. Print the GinkgoWriter output
  5. Print the standard output
  6. Print the standard error output (which should include the stack trace dump)

I would like to point out that the stacktrace is very important to me when debugging hanging tests.

@onsi (Owner, Author) commented Apr 26, 2018

I was imagining you might want to get visibility into the test without it hanging, but could explore some of these options.

@onsi (Owner, Author) commented Apr 26, 2018

Actually - I've received an alternative solution suggestion that would be easier to implement and reason about. Instead, what if:

ginkgo -p -debug (for example)

caused each node to stream its output to a file on disk. Then the user could examine those files to understand what each node is doing and whether it is stuck.

In this mode a signal could be interpreted by the coordinating process as a prompt to have the nodes emit their stack trace - perhaps to the same file.

Thoughts?

@blgm (Collaborator) commented Apr 26, 2018

A bit more on the 'files on disk' suggestion:

  • Initially it could be controlled by an environment variable, e.g. GINKGO_DEBUG_DIR
  • It would be good to be able to specify the directory, as you might want to put the output somewhere it can be captured (for instance, if running in Concourse)
  • The style of the files could be similar to UNIX syslog, i.e. a high-resolution timestamp followed by human-readable text
  • I don't think there's any need to do clever log rotation etc. for the first implementation
  • Another (even simpler) implementation would be to have each worker log to STDOUT. This could be controlled by a GINKGO_DEBUG environment variable, or with multiple uses of the -v flag, a bit like ssh, which allows things like ssh -v -v -v.

@onsi (Owner, Author) commented Apr 26, 2018

I think I lean towards a flag over an env-var as it is ginkgo's primary interaction model. Can you share why environment variables might be preferred?

@jvshahid (Collaborator)

My only concern with the files on disk is that it could be harder to get at those files if the tests are running on Concourse, for example. Collecting the information and printing it to stdout could help debug hanging tests in CI: when Concourse sends a SIGTERM, ginkgo would have enough time to print the information to stdout before being killed.

@anEXPer commented Apr 26, 2018

Files on disk is easy to look at when debugging failures in CI with hijack, etc, and easy to follow live locally with tail -f.

stdout isn't suitable, both because of redaction concerns in CI, and because of the difficulty of following output from multiple nodes interleaved in a single stream when watching locally.

I prefer writing to a file as a solution over both an API and stdout.

@blgm (Collaborator) commented Apr 27, 2018

@onsi, I don’t have a strong view on whether it’s a flag or an environment variable. I would lean towards an environment variable if it’s a “pre-release” feature for which I want feedback from specific users and which might change radically. A command line flag is more appropriate for something that’s stable and ready for use by anyone.

@onsi (Owner, Author) commented May 6, 2018

I've pushed a file-based approach to the parallel-debug-file branch. Would love folks to take a look and send feedback. @williammartin, I might need your help vendoring a couple of dependencies that had to be pulled in, if we choose to go down this road.

With this approach

ginkgo -p -debug

will emit output from each node to files named ginkgo-node-N.log in the test's directory.

@blgm (Collaborator) commented May 29, 2018

@onsi, after what must seem like a geological age, here are some very quick thoughts on the parallel-debug-file branch:

  • If you run ginkgo recursively, you get a lot of files that are a bit tricky to clean up. You could add them to .gitignore, obviously; I just wonder whether a better user experience would be to put them somewhere else. But that could be fixed in a later iteration.
  • It would be nice if the lines had ISO 8601 timestamps at the start, a bit like a traditional UNIX syslog. That's particularly useful when looking at potential hangs as you have some idea about when the last line was written.
  • I wouldn't worry too much about vendoring just yet. The branch works in Travis. Vendoring in Go is very much in flux, and there are different solutions for different versions of Go.

@onsi (Owner, Author) commented May 30, 2018

Thanks for taking a look, @blgm.

I'd love to find an interface that works well. I think the file approach has merit and agree that ending up with lots of files all over the place isn't great. Any thoughts on an alternative design?

As for timestamps: unfortunately, given the syscall hackery involved in rerouting stdout, I don't think I can inject timestamps.

@blgm (Collaborator) commented May 30, 2018

Perhaps create them all in the same directory with a naming scheme like ginkgo-node-U-N.log, where U is some kind of ID? But that would make it harder to correlate files to the code they relate to.

I'd be in favour of merging this as is, with a caveat that it's an experimental feature subject to change. Most users would not add the --debug flag so would not see any difference. But users who observe occasional Ginkgo hangs would have a tool to help them out. We could then get feedback from these users, and base any future changes on that feedback.

@nodo (Collaborator) commented May 31, 2018

Just out of curiosity, what were the reasons for not following the suggestion from @vito in #323?

I am not super familiar with the way Ginkgo handles parallelism, but each node could trap SIGQUIT and output the stack dump to a known file. Maybe we could also create a new process group with the coordinating node and the nodes running the test, so that a user could kill them all with one command and get the stack dumps.

The advantage of this approach is that there's no need to start Ginkgo with special flags. A disadvantage is that maybe the stack dump is harder to read than logs from the nodes.

@nodo (Collaborator) commented May 31, 2018

Oh, I missed the second comment on this thread.

I was imagining you might want to get visibility into the test without it hanging, but could explore some of these options.

^ Fair enough, but it assumes users are starting ginkgo with the --debug flag. Which is a bit strange... unless they are having issues :)

@onsi (Owner, Author) commented Jul 10, 2018

Good point @nodo

We should update the docs here:
https://onsi.github.io/ginkgo/#ginkgo-and-continuous-integration

to recommend that folks add the debug flag in CI environments.

@blgm (Collaborator) commented Jul 10, 2018

I've merged the file solution via #499. I'll leave this open to remind us to update the docs.

@bruce-ricard (Contributor) commented Apr 15, 2019

I'll leave this open to remind us to update the docs.

Is it working, @blgm? :)
