[Bug]: Worker exits but jest process never finishes #13183

Closed
vilmosioo opened this issue Aug 26, 2022 · 11 comments · Fixed by #13566

@vilmosioo

vilmosioo commented Aug 26, 2022

Version

28.1.3

Steps to reproduce

  1. Run jest with a worker count higher than 1. An empty test that just sleeps for a minute will do.
  2. Kill one of the workers intentionally. In practice they may die for any number of reasons; in our case we suspect OOM.
  3. Jest will hang indefinitely.

A related and recent fix: https://github.com/facebook/jest/pull/13054/files#diff-e99351d361fdd6f55d39c7c41af94576baccc2b9dce73a5bfd0b516c6ab648e9

However, the workers may crash with other signals, and those scenarios are not covered. In our case, after some debugging, these workers are crashing in jest-runtime at the compileFunction.call line, which results in a null exit code that gets ignored; jest-runner then waits on a thread pool that will never fulfil the submitted job.

The signal appears to be SIGKILL instead of SIGABRT, and the exitCode appears to be null. Please see the screenshot of the debugging session below.

[Screenshot of the debugging session, 2022-08-26 17:40]

The above outputs:

ChildProcessWorker exit: exitCode = null, pid = 1378, signal = SIGKILL

We apologize for the lack of a minimal reproduction, but hope the investigation above will substitute for one.
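
For illustration, here is a standalone Node sketch (not Jest source; the spawned command is an arbitrary stand-in) of the exit-event shape described above: a child process killed externally reports an exitCode of null and a signal of SIGKILL.

// Illustration only, not jest-worker code.
import {spawn} from 'child_process';

// Spawn a child that would otherwise run forever.
const child = spawn(process.execPath, ['-e', 'setInterval(() => {}, 1000)']);

child.on('exit', (exitCode, signal) => {
  // Prints: exit: exitCode = null, pid = <pid>, signal = SIGKILL
  console.log(`exit: exitCode = ${exitCode}, pid = ${child.pid}, signal = ${signal}`);
});

// Stand-in for the Linux OOM killer terminating the process externally.
setTimeout(() => child.kill('SIGKILL'), 1000);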

Expected behavior

Jest exits when one of the workers crashes for whatever reason.

Actual behavior

Jest hangs when workers are unexpectedly SIGKILL'ed.

Additional context

This does not happen with TypeScript v4.6.4; it only started after we upgraded to v4.7.4. That may not be relevant, though, since we think this is an OOM.

Environment

System:
ubuntu-jdk-npm:0.40.1920

Binaries:
Typescript 4.7.4
Node 16.13.0
@vilmosioo vilmosioo changed the title [Bug]: Jest workers hand with typescript v4.7 [Bug]: Jest workers hang and process never exits Aug 26, 2022
@vilmosioo vilmosioo changed the title [Bug]: Jest workers hang and process never exits [Bug]: Workers exits but jest process never exits Aug 26, 2022
@vilmosioo vilmosioo changed the title [Bug]: Workers exits but jest process never exits [Bug]: Worker exits but jest process never exits Aug 26, 2022
@vilmosioo vilmosioo changed the title [Bug]: Worker exits but jest process never exits [Bug]: Worker exits but jest process never finishes Aug 26, 2022
@SimenB
Member

SimenB commented Aug 26, 2022

@phawxby thoughts?

@gluxon
Contributor

gluxon commented Aug 26, 2022

Hi everyone — @vilmosioo is my teammate and I wanted to give a bit more info.

We spent a few days investigating this before filing the issue. I do believe the workers are being OOMKilled, but we don't have logging to verify that since this is running in a CircleCI Docker container. I hope the report still contains enough information to be helpful. Please let us know if there are any extra details that would be useful.

We're not actively seeing this issue day to day, but thought reporting it would help the Jest team and other users. Thanks for making Jest.

@phawxby
Contributor

phawxby commented Aug 26, 2022

Exit codes and exit signals were probably the most problematic bit of the fixes that I put in recently because they're extremely inconsistent between platforms. However, I thought I had covered the OOM scenario, and I do have a test which forcibly induces an OOM crash here:
https://github.com/facebook/jest/blob/main/packages/jest-worker/src/workers/__tests__/WorkerEdgeCases.test.js#L238

My questions at this point would be:

  • Does this happen in jest v28? Is this a regression I've introduced?
  • Can you create a repro within the WorkerEdgeCases test suite I can use for investigation?

@gluxon
Contributor

gluxon commented Aug 26, 2022

Thanks for taking a look @phawxby. I think I should have been more specific about the kind of OOM we think we're seeing.

  • Looking over the WorkerEdgeCases.test.js file, it looks like that test catches OOMs within Node itself.
  • I think what's happening for us is that the process is dying externally due to Linux's oom_kill mechanism selecting a worker process to kill. We've seen this happen when the CircleCI container's cgroup hits its allocated memory limit and oom_kill selects the largest process to kill.

Does this happen in jest v28? Is this a regression I've introduced?

This does happen with Jest v28.1.0. Apologies, we actually haven't had a chance to test your fix on v29.0.0 yet. From inspecting the source code, we didn't believe it would take effect, since we're seeing SIGKILL rather than SIGABRT.

I think your fix is strictly an improvement. Thank you. 🙂 We'll work on migrating to Jest v29 soon.

Can you create a repro within the WorkerEdgeCases test suite I can use for investigation?

The full repro is platform-specific, since OOMKiller is Linux-only. I think we can simulate OOMKiller by calling subprocess.kill("SIGKILL") on the worker process. Is that sort of test worth adding here? If so, we can create a PR adding it to WorkerEdgeCases.
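
For illustration, a rough sketch of what such a test might look like using jest-worker's public API; the worker module './my-worker' and its getPid/loop methods are hypothetical stand-ins, not existing fixtures.

// Hypothetical sketch only: './my-worker', getPid and loop are made up.
import {Worker} from 'jest-worker';

test('worker killed externally with SIGKILL', async () => {
  const worker: any = new Worker(require.resolve('./my-worker'), {
    exposedMethods: ['getPid', 'loop'],
    numWorkers: 1,
  });

  const pid = await worker.getPid(); // worker method that returns process.pid
  const pending = worker.loop(); // a task that never resolves on its own

  // Stand-in for the Linux OOM killer terminating the worker process.
  process.kill(pid, 'SIGKILL');

  // Desired behaviour: the pending task settles (rejects or is retried)
  // instead of hanging forever.
  await expect(pending).rejects.toThrow();

  await worker.end();
});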

Exit codes and exit signals were probably the most problematic bit of the fixes that I put in recently because they're extremely inconsistent between platforms.

I'm sure there's a good reason, but I was curious: why might jest-worker want to exit silently depending on the exit code and signal?

https://github.com/facebook/jest/blob/dde24c85698259d9ce887ede9646847d0fff9554/packages/jest-worker/src/workers/ChildProcessWorker.ts#L374-L376

I see the other branches in this conditional either throw an error or retry.

@phawxby
Contributor

phawxby commented Aug 27, 2022

@gluxon you make a good point. That wasn't actually code I originally wrote; I just worked around it, so I didn't really dig into why it does what it does. But you're right:

  • If it's an exit mode we know to be bad (OOM), then we should probably terminate with an error.
  • If it's an exit and we're not actively in the process of shutting down the worker, then we should restart it, because it's still needed.

We probably shouldn't pay any attention to what the exit code or signal actually is unless there's a known problem; otherwise we should just work based on what the intended state is.
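
To make that concrete, here is a rough TypeScript sketch of the idea (not actual jest-worker code; the state names and helpers are illustrative):

// Sketch of the "intended state" idea above, not jest-worker source.
type WorkerState = 'running' | 'shutting-down';

function handleExit(
  state: WorkerState,
  exitCode: number | null,
  signal: NodeJS.Signals | null,
  restart: () => void,
): void {
  // Known-bad exit (e.g. Node aborting on out-of-memory): fail loudly.
  if (signal === 'SIGABRT') {
    throw new Error(`Worker crashed (exitCode=${exitCode}, signal=${signal})`);
  }

  // We did not ask the worker to stop, so whatever killed it (an external
  // SIGKILL, a null exit code, ...) means it is still needed: restart it.
  if (state !== 'shutting-down') {
    restart();
    return;
  }

  // Otherwise the exit was requested as part of shutdown; nothing to do.
}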

@github-actions

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 30 days.

@github-actions github-actions bot added the Stale label Sep 26, 2022
@gluxon
Contributor

gluxon commented Oct 21, 2022

That wasn't actually code I originally wrote; I just worked around it, so I didn't really dig into why it does what it does.

Ah, that clarification is helpful. Thank you for your work to make this more correct for internal OOMs.

We probably shouldn't pay any attention to what the exit code or signal actually is unless there's a known problem; otherwise we should just work based on what the intended state is.

Agree with this. I'm going to try creating a test case that mimics what OOMKiller does and see how the Jest maintainers feel about any changes that make these forms of OOM clearer.

@github-actions github-actions bot removed the Stale label Oct 21, 2022
@gluxon
Contributor

gluxon commented Oct 21, 2022

Created a minimal repro. This test hangs forever and never hits the 3s timeout.

// src/killed.test.ts

export async function wait(ms: number) {
    return new Promise((resolve) => setTimeout(resolve, ms));
}

test("jest-worker externally killed", async () => {
    await wait(2_000);
}, 3_000);

setTimeout(() => {
    // Self-kill to make repro easier.
    process.kill(process.pid);
}, 1_000);

https://github.com/gluxon/test-jest-worker-killed-repro

[Screen recording: the test run hanging indefinitely]

@gluxon
Contributor

gluxon commented Nov 6, 2022

Starting a fix in #13566. Any reviews or feedback welcome!

@SimenB
Member

SimenB commented Nov 7, 2022

@github-actions

github-actions bot commented Dec 8, 2022

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
Please note this issue tracker is not a help forum. We recommend using StackOverflow or our discord channel for questions.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 8, 2022