Node.js SIGABRT (abort) due to EBADF from epoll_ctl() call in libuv uv__io_poll() #4684

Closed
rhansen opened this issue Jan 28, 2021 · 8 comments · Fixed by #4887

@rhansen
Member

rhansen commented Jan 28, 2021

Lately I've noticed frequent node (v14.15.4) crashes when running backend tests. The crashes are caused by a call to abort() in libuv due to errno of EBADF ("bad file descriptor") set by a call to epoll_ctl() in uv__io_poll(). The crashes always seem to happen after the mocha tests finish running (I think) but before nyc prints out coverage stats. It doesn't always happen, so it's probably a race condition somewhere. I poked at the core file a bit but didn't see a smoking gun. Here's the stack trace:

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007ffff783c859 in __GI_abort () at abort.c:79
#2  0x0000000000962331 in uv__io_poll (loop=loop@entry=0x7fffdeffcad8, timeout=<optimized out>) at ../deps/uv/src/unix/linux-core.c:254
#3  0x000000000137c418 in uv_run (loop=0x7fffdeffcad8, mode=UV_RUN_ONCE) at ../deps/uv/src/unix/core.c:385
#4  0x0000000000abce1d in node::worker::Worker::Run() ()
#5  0x0000000000abcf28 in node::worker::Worker::StartThread(v8::FunctionCallbackInfo<v8::Value> const&)::{lambda(void*)#1}::_FUN(void*) ()
#6  0x00007ffff7a14609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#7  0x00007ffff7939293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

If anyone out there is familiar with libuv and can help us debug, that would be greatly appreciated.

@webzwo0i
Member

webzwo0i commented Jan 29, 2021

afaik there are only two worker threads involved atm: MinifyWorker and maybe the one handling soffice.

Looking at some of the failing backend runs I get the impression that it's not nyc but mocha that's having the problems, because the coredump msg is printed before mocha's test summary.

In mocha's changelog: #4465: Worker processes guaranteed (as opposed to "very likely") to exit before Mocha does; fixes a problem when using nyc with Mocha in parallel mode (@boneskull)
Also the latest changelog of workerpool:
Fixes and more robustness in terminating workers. Thanks @boneskull.
Fix #32, #175: the promise returned by Pool.terminate() now waits until subprocesses are dead before resolving. Thanks @boneskull.

However, we are at an older mocha version (that does not depend on workerpool yet) and we don't use parallel mode, but my best guess would still be to update mocha first. We probably need to take care of mochajs/mocha#4175 and mochajs/mocha#4315 because we use that functionality iirc.

@rhansen
Member Author

rhansen commented Feb 1, 2021

I haven't seen any aborts lately. Did we fix it?

@JohnMcLear
Member

I saw one today afaik.

@rhansen
Copy link
Member Author

rhansen commented Feb 7, 2021

I saw one today. They're not happening as often, I think.

@rhansen
Member Author

rhansen commented Feb 23, 2021

I finally figured out the problem: The upstream dirty package (not ueberDB) doesn't wait until it finishes writing queued data before closing the file descriptor. I'll send a PR upstream tomorrow.
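
To illustrate the kind of fix this implies, here is a minimal sketch of "drain the write queue, then close the descriptor". It is not the dirty package's code or its actual patch: the `QueuedWriter` class, its `open()`/`write()`/`close()` methods, and the promise-chain queue are all hypothetical names and design choices made up for this example.

```javascript
const fs = require('fs');

// Hypothetical append-only writer. This is NOT the dirty package's API;
// it only illustrates waiting for queued writes before closing the fd.
class QueuedWriter {
  constructor(handle) {
    this._handle = handle;          // an fs.promises FileHandle
    this._tail = Promise.resolve(); // settles once every queued write has finished
  }

  static async open(path) {
    return new QueuedWriter(await fs.promises.open(path, 'a'));
  }

  // Queue a write. Callers are not required to await the result.
  write(line) {
    this._tail = this._tail.then(() => this._handle.appendFile(`${line}\n`));
    return this._tail;
  }

  // Closing without awaiting _tail first would leave writes in flight
  // against a descriptor that is already closed.
  async close() {
    try {
      await this._tail;
    } finally {
      await this._handle.close();
    }
  }
}
```

A caller could then queue writes with `w.write('...')` without awaiting them and still call `await w.close()` safely, because `close()` itself waits for the queue to drain.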

The SIGABRTs started happening after commit edbe6d5 because that is when we started properly closing the database on exit.

We can probably work around the bug in ueberDB by sleeping a bit between flushing and closing the database.
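
A rough sketch of that workaround, assuming a database object that exposes async `flush()` and `close()` methods. That interface, the `flushSleepClose` helper, and the 100 ms default are assumptions for illustration and are not taken from ueberDB. A fixed delay only makes the race less likely; it does not remove it.

```javascript
// Resolve after roughly `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// `db` is any object with async flush() and close() methods (hypothetical
// interface). The pause gives writes still queued below flush() a chance
// to reach the file before the descriptor is closed.
async function flushSleepClose(db, delayMs = 100) {
  await db.flush();
  await sleep(delayMs);
  await db.close();
}
```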

@JohnMcLear
Member

Hah wow good spot :D

@JohnMcLear
Member

@rhansen does the latest merge close this?

@rhansen
Member Author

rhansen commented Feb 25, 2021

I want to keep this bug open until the upstream bugfix is published, ueberdb is updated to use it, and Etherpad is updated to use the updated ueberdb.
