Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix synchronization between main thread and worker #14541

Merged
merged 10 commits into from May 22, 2022

Conversation

liuxingbaoyu
Copy link
Member

@liuxingbaoyu liuxingbaoyu commented May 11, 2022

Q                       A
Fixed Issues? #14535 (comment)
Patch: Bug Fix?
Major: Breaking Change? ×
Minor: New Feature? ×
Tests Added + Pass?
Documentation PR Link
Any Dependency Changes?
License MIT

This seems to be a bug in nodejs.
But it's hard to reproduce and we want to be compatible with older versions of nodejs.
So fixed in babel.

This PR also fixes unstable CI.

@@ -60,7 +60,7 @@ exports.WorkerClient = class WorkerClient extends Client {
{ env: WorkerClient.#worker_threads.SHARE_ENV },
);

#signal = new Int32Array(new SharedArrayBuffer(4));
#signal = new Int32Array(new SharedArrayBuffer(8));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need doubling the byte length?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

        if (resp) break;
        Atomics.wait(this.#signal, 1, 0, 30);

For convenience, I increased the byte length for sleep.

@JLHwung
Copy link
Contributor

JLHwung commented May 11, 2022

I will re-run CI test for a few more times. Hopefully the last flaky test is fixed.

@liuxingbaoyu
Copy link
Member Author

liuxingbaoyu commented May 11, 2022

Ha ha! build-standalone looks to be the new last one.
🤣

Copy link
Member

@nicolo-ribaudo nicolo-ribaudo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Is it needed to run the loop 100 times? Isn't 2 enough?
  2. When the worker main thread is woken up but we receive undefined, does the worker wake up us again?
  3. Has this bug ever been reproduced outside of our CI? If not, we could consider making the fix a test-only fix rather than adding it to the published code.

@liuxingbaoyu
Copy link
Member Author

liuxingbaoyu commented May 13, 2022

  1. Is it needed to run the loop 100 times? Isn't 2 enough?

I do not know.
This involves the bottom layer of nodejs, I don't know any way to solve this problem 100% theoretically, I have to do this to fix it as much as possible.
Because the runtime environment is different, it may depend on machine performance and synchronous, asynchronous behavior in the code, etc.
There is no additional performance penalty after all.

  1. When the worker main thread is woken up but we receive undefined, does the worker wake up us again?

I don't quite understand what you mean here.

  1. Has this bug ever been reproduced outside of our CI? If not, we could consider making the fix a test-only fix rather than adding it to the published code.

There are currently no reports on this issue.
But it doesn't appear to be related to specific test code, but to release code.

 
This has no effect on users who don't experience the problem, only the first time the message can't be retrieved, the loop will start.

@nicolo-ribaudo
Copy link
Member

@liuxingbaoyu I pushed a commit with some logs to understand better what's causing the bug.

@nicolo-ribaudo nicolo-ribaudo force-pushed the fix-worker-test branch 2 times, most recently from ce563b8 to 7584140 Compare May 16, 2022 16:10
@babel-bot
Copy link
Collaborator

babel-bot commented May 16, 2022

Build successful! You can test your changes in the REPL here: https://babeljs.io/repl/build/51995/

@nicolo-ribaudo nicolo-ribaudo added the PR: Internal 🏠 A type of pull request used for our changelog categories label May 16, 2022
@liuxingbaoyu
Copy link
Member Author

liuxingbaoyu commented May 16, 2022

2. When the worker main thread is woken up but we receive undefined, does the worker wake up us again?

I may now understand a little what this means.

Atomics.wait should work fine.
MessageChannel relies on the event loop.
When #signal is awakened, the message has not been processed in the event loop.
So I use Atomics.wait as a sleep to let the message be processed.

I've also tried setTimeout(0) and setImmediate on the sender side, which helps, but still throws occasional exceptions.
So had to use loop sleep on the receiver side.

The above explanation may not be completely accurate because nodejs is too complicated.

@nicolo-ribaudo
Copy link
Member

^ @addaleax (if you have time to answer) Is it expected that using postMessage on a MessageChannel might not send the message synchronously? For context, we are doing this:

// main.js
      const signal = new Int32Array();
      const subChannel = new MessageChannel();

      // Tell the worker to start doing its job
      worker.postMessage({ signal, port: subChannel.port1 }, [subChannel.port1]);

      // Wait until the worker is done
      Atomics.wait(signal, 0, 0);

      // Read the worker response
      const response = receiveMessageOnPort(subChannel.port2);
// worker.js
parentPort.addListener("message", async ({ signal, port }) => {
  let response = await computeResponse();

  // Send the response to the main thread
  port.postMessage(response);
  port.close();

  // Wake up the main thread
  Atomics.store(signal, 0, 1);
  Atomics.notify(signal, 0);
});

And sometime receiveMessageOnPort returns undefined, we have to wait a bit for the message to be available.

@addaleax
Copy link
Contributor

@nicolo-ribaudo postMessage is fully synchronous.

Are you sure this code is doing what you want it to do? I see that there are calls to multiple Client methods on the same client, e.g.

client.getTokLabels(),
client.getVisitorKeys(),

These would result in multiple .#send() calls, where the first one would set #signal[0] to 1, and so the second Atomics.wait() call would return immediately instead of waiting for a reply from the worker (since the expectation in

let wakeReason = Atomics.wait(this.#signal, 0, 0);
is that #signal[0] is still 0), which means that postMessage() may not have run yet when it returns.

Am I missing something?

@nicolo-ribaudo
Copy link
Member

nicolo-ribaudo commented May 17, 2022

Thanks for your reply!

always resets the signal to 0 before calling postMessage. The main thread is fully synchronous, so there shouldn't be any race condition between these client.* calls.

The confusing thing is that the code works ~90% of the times.

@addaleax
Copy link
Contributor

Ah yeah, that one I was indeed missing. 🙂

Do you have a standalone reproduction? Can you compare the output of a failing/a successful run when running node as node --trace-atomics-wait?

(I think technically you should be using Atomics.store() here, but I’m not sure if it matters in this specific situation. It might also be worth trying not to re-use the same SharedArrayBuffer here, even though you’re right that that shouldn’t really make a difference.)

@nicolo-ribaudo
Copy link
Member

I cannot even reproduce the error locally 😂

I'll continue tweaking this PR, feel free to unsubscribe from notifications and I'll ping you if I have more info.

@liuxingbaoyu
Copy link
Member Author

liuxingbaoyu commented May 17, 2022

@addaleax
Thank you for your participation!
I have tried writing code to reproduce but failed.
setTimeout(0) and setImmediate are somewhat helpful before Atomics.store, but only reduce exceptions by 80%, not completely fixed.

@nicolo-ribaudo
Can you make ci work on multiple nodejs versions? Maybe it's a nodejs version specific bug.
Since the pr has been edited, it is easier for you to make changes.
In addition, it is best to only test eslint for faster speed.
yarn jest "(eslint)/.*parser.*/test"

@nicolo-ribaudo
Copy link
Member

The errors are just random at this point, now they are about some unrelated functions not being callable.

@nicolo-ribaudo
Copy link
Member

Using a new SAB every times seems to have fixed it? Now the logs shows that it always works at the first receiveMessageOnPort call.

@liuxingbaoyu
Copy link
Member Author

Maybe we should enable all tests and try again?

@liuxingbaoyu
Copy link
Member Author

The exception for receiveMessageOnPort did not reappear, but something more strange happened.😂

@nicolo-ribaudo
Copy link
Member

Yup, that's what I was seeing when I commented #14541 (comment) 😬

It looks like v8's internal memory is getting corrupted somehow, and variable values are randomly changed. Let's pretend we didn't see that failure.

@nicolo-ribaudo
Copy link
Member

The new error is on the main branch 😭 😭 😭 😭 😭 😭 😭 😭 😭

https://github.com/babel/babel/runs/6513922366

@liuxingbaoyu
Copy link
Member Author

There is nothing worse than this.😰😰😰
I also suspected this new problem caused by the change of the main branch before, but I have never seen it in the main branch.

@nicolo-ribaudo
Copy link
Member

nicolo-ribaudo commented May 19, 2022

I can reproduce the new error locally 🥳
I just cannot attach a debugger because it's in a for i in {1...100} bash loop.

@liuxingbaoyu
Copy link
Member Author

liuxingbaoyu commented May 19, 2022

My computer is windows system, but it seems to be reproducible on wsl. It crashes without a ton of exceptions though.

@nicolo-ribaudo
Copy link
Member

nicolo-ribaudo commented May 22, 2022

This now fixes the receiveMessageOnPort error (which happens more or less once in 20 runs), but doesn't fix the "things are randomly undefined" error (which happens more or less once in 40 runs).

I'm going to merge this PR for now.

@nicolo-ribaudo nicolo-ribaudo changed the title fix: eslint-parser worker Fix synchronization between main thread and worker May 22, 2022
@nicolo-ribaudo nicolo-ribaudo merged commit 1fcc571 into babel:main May 22, 2022
@liuxingbaoyu
Copy link
Member Author

I may have found the cause of that error.
But I didn't find a good fix.

The reason is that jest-light-runner does not isolate the runtime environment, and the node environment is destroyed in packages\babel-register\test\cache.js.
Do you have any suggestions for jest-light-runner to run a specific test in a standalone environment?

E.g
Determine the file name?
Determine the file content flag?
Add an api that is unique to jest-light-runner?

@github-actions github-actions bot added the outdated A closed issue/PR that is archived due to age. Recommended to make a new issue label Aug 22, 2022
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 22, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
outdated A closed issue/PR that is archived due to age. Recommended to make a new issue PR: Internal 🏠 A type of pull request used for our changelog categories
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants