
[Bug]: Parallel test process execution causing race condition and empty babel transpiled file read from disk cache #30577

Closed
sunilsurana opened this issue Apr 27, 2024 · 6 comments

Comments

@sunilsurana
Member

sunilsurana commented Apr 27, 2024

Version

1.42.1

Steps to reproduce

It can be reproduced on Codespaces or Azure VMs. I created this repo with steps for reproducing the issue: https://github.com/sunilsurana/reprofilesync

This repo mimics the behaviour of the compilationcache.ts logic if it were to execute in parallel processes.
Note that the issue may or may not reproduce on a local laptop, depending on the disk and OS. It reproduces on Codespaces and Azure VMs.

Expected behaviour

The tests should work deterministically when multiple processes start tests at the same time.

Actual behaviour

The test throws an error because it could not resolve an import. The import comes back null because of an empty file read when tests are invoked through multiple processes in parallel.

Additional context

We distribute our tests across multiple workers and processes. So on a single machine, if multiple processes start tests at the same time, the code reading the disk cache reads an empty file because another process has just begun writing to it.
This happens because fs.writeFileSync makes three OS-level calls, which we can see using strace:

open("file.txt", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = 9
pwrite(9, "content\n", 8, 0) = 8
close(9) = 0

The first open syscall truncates the file if it exists, leaving a blank file on disk. If at that moment a reader from another process comes along, it will see that the file exists, but reading it returns a blank file.

We were able to verify this by putting logs in playwright where the read and write were happening at exact millisecond precision and causing test failures.

Environment

System: Azure Windows VM, 8 cores, 32 GB RAM
OS image: Windows Server 2022
Node version: 18 LTS
sunilsurana added a commit to sunilsurana/playwright that referenced this issue Apr 27, 2024
…nspiled files from disk cache

First, add a marker file to check whether the previous write fully completed. Then, in the writeFile section, change the write mode to wx, which uses an exclusive lock; if two writes happen at the same time and we get an EEXIST error, ignore it.
FIX for microsoft#30577
@sunilsurana
Member Author

sunilsurana commented Apr 27, 2024

Created a PR to fix it. We patched Playwright with this PR's changes and verified that it then worked fine.

#30578

@sunilsurana
Member Author

sunilsurana commented Apr 27, 2024

A bit more detailed explanation.

The current code block on main looks like this:

try {
  const content = fs.readFileSync(codePath, 'utf8');
  return content;
} catch {
  // do nothing: cache miss, fall through
}
const codeContent = transpileWithBabel(file);
fs.writeFileSync(codePath, codeContent);
return codeContent;

Now, if the 1st process calls write for codeContent and the 2nd process calls read at the same time, the 2nd process gets an empty file, because writeFileSync first creates an empty file, then writes the content, and then closes the file descriptor.

To fix this, the code is changed to:

if (markerFileExists(codePath)) {
  // The write by another process fully completed.
  try {
    const content = fs.readFileSync(codePath, 'utf8');
    return content;
  } catch {
    // do nothing: fall through and re-transpile
  }
}
const codeContent = transpileWithBabel(file);
try {
  // 'wx' mode opens the file for writing exclusively.
  fs.writeFileSync(codePath, codeContent, { flag: 'wx' });
  fs.writeFileSync(markerPath, '');
} catch (error) {
  if (error.code !== 'EEXIST') throw error; // file already exists: do nothing
}
return codeContent;

Now a process will try to read the file only if the write by some other process has successfully completed.
Also, if say two processes check the marker file at the same time and then enter the write section at the same time, only one will be able to write, since wx mode opens the file for writing exclusively and relies on an OS-level mutex. The other process will get a "file already exists" (EEXIST) error, which can be safely ignored.


We had seen a similar issue in the past, and it was fixed at the time by using the marker file check. But it then still caused EEXIST errors in very rare cases.
Since the commit that removed the marker file check, we see issues in almost every test session, since our test sessions are highly parallelized and start Playwright test processes at the same time.

@Infectious-Smile

+1

@dgozman
Contributor

dgozman commented Apr 29, 2024

@sunilsurana Thank you for the issue!

First of all, I need to clarify your scenario. Are you running multiple npx playwright test invocations in parallel? If you do not, and you still observe the racy behavior, we'd really like to fix that. A real-world repro would be very welcome, even if it only happens 1 out of 100 runs.

However, if you do run multiple npx playwright test in parallel, we do not consider this to be a usual scenario and do not optimize for it. We pretty much recommend running multiple projects inside a single config with a single npx playwright test command. What would be your usecase to run multiple commands in parallel?

As you correctly identified, we have already tried the marker-file approach, and reverted it because of the added complexity and bugs still being reported by users. Therefore, we are not really optimistic that the proposed PR will solve the issue without introducing new bugs, and we need a very strong reason to move in this direction again.

@sunilsurana
Member Author

sunilsurana commented Apr 29, 2024

Hi @dgozman ,
Yes, we run yarn playwright test in parallel, and that is what causes the issue.
The reason is that we are a monorepo, with different packages having different test setups. Some packages have a huge number of tests, which we divide into multiple shards, and our cloud test system treats each shard as an individual "yarn playwright test" invocation.
Our cloud test system runs tests on multiple machines per PR/session, running yarn test within different packages.
The distribution of tests in the cloud is dynamic, and we have cases of multiple packages using a common global setup. So I think this is a genuine case of running Playwright in parallel, and it should be supported.

I have provided a repro repo in the issue description, with the steps to execute.
Were the bugs reported by users due to using the marker file? When the marker file was there, our system was working fine, though rarely we got EEXIST if 3 processes started at the same time, for which I added a check in the fix PR.
Since the 1.38 release, where the marker check got removed, we have been struggling to get reliable runs.

@dgozman
Contributor

dgozman commented May 2, 2024

The current workaround is to pass a unique PWTEST_CACHE_DIR environment variable to each npx playwright test invocation, giving it the absolute path to the cache directory. This way, multiple runs will not conflict when writing to the cache.
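As a sketch, assuming a POSIX shell and hypothetical shard-specific cache paths, the workaround could look like:

```shell
# Give each parallel invocation its own transpilation cache directory,
# so concurrent runs never read or write the same cached files.
PWTEST_CACHE_DIR="$PWD/.pw-cache/shard-1" npx playwright test --shard=1/2 &
PWTEST_CACHE_DIR="$PWD/.pw-cache/shard-2" npx playwright test --shard=2/2 &
wait
```

This trades some duplicated transpilation work for isolation between the runs.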

I'll close this issue for now, but we can reopen if there is more demand for this.

@dgozman dgozman closed this as completed May 2, 2024