
Async fixing #4024

Closed

staabm opened this issue Oct 10, 2018 · 28 comments

Comments

@staabm
Contributor

staabm commented Oct 10, 2018

The project is already able to do linting in a parallel process.

Wouldn't it make sense to run the fixers in parallel per file?
(Assuming this would speed up the process considerably.)

What's your opinion?

@keradus
Member

keradus commented Oct 10, 2018

It's not possible to run the fixers for a single file in parallel, as there are dependencies between fixers; they have to run in the proper order.

It is possible, however, to process files in parallel.

@staabm
Contributor Author

staabm commented Oct 10, 2018

That's what I mean: run all files in parallel, but keep the fixers for each file in the same process.

Would such a thing be welcome?

@keradus
Member

keradus commented Oct 10, 2018

In general - yes.
But please, before raising a PR with code, raise a technical proposal for how to achieve it ;)

@staabm
Contributor Author

staabm commented Oct 10, 2018

For a first prototype I would use a fixed number of subprocesses and run php-cs-fixer per file, while managing the remaining work and the result printing from a master process.
The sole goal would be to check how much time parallelization might save.

More or less, this means parallelizing this loop:
https://github.com/FriendsOfPHP/PHP-CS-Fixer/blob/41dc9e79779b8a0b89f811f93b077af25613c01f/src/Runner/Runner.php#L131
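
A hypothetical sketch of what that master process could look like (purely illustrative, not actual project code; the 4-process cap and the CLI invocation are assumptions, and the file list is taken from the script's arguments):

```php
<?php
// Hypothetical prototype sketch: run `php-cs-fixer fix <file>` in at most N
// concurrent subprocesses, with the master draining the queue and printing results.

$files = array_slice($argv, 1); // assumed: the paths to fix are passed as arguments
$maxProcs = 4;                  // assumed: fixed number of subprocesses
$running = [];

while ([] !== $files || [] !== $running) {
    // top the pool up to $maxProcs subprocesses
    while (count($running) < $maxProcs && [] !== $files) {
        $pipes = [];
        $proc = proc_open(
            'php ./PHP-CS-Fixer/php-cs-fixer fix '.escapeshellarg(array_shift($files)),
            [1 => ['pipe', 'w'], 2 => ['pipe', 'w']],
            $pipes
        );
        $running[] = ['proc' => $proc, 'out' => $pipes[1], 'err' => $pipes[2]];
    }

    // reap finished subprocesses and print their (small, per-file) output
    foreach ($running as $i => $job) {
        if (!proc_get_status($job['proc'])['running']) {
            echo stream_get_contents($job['out']), stream_get_contents($job['err']);
            proc_close($job['proc']);
            unset($running[$i]);
        }
    }

    usleep(10000); // avoid busy-waiting while the subprocesses run
}
```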

@keradus
Member

keradus commented Oct 11, 2018

👎 for that.
Bootstrapping PHP CS Fixer is costly, so if we have 1k files, I don't want to bootstrap PHP CS Fixer 1k times.

Also, during fixing of one file, we are already linting the next one in the background.

Finally, if there are 4 cores, it's pointless to create 1000 processes.

What we need is a WorkerPool / JobPool that would be fed job after job (file after file) and assign each file to one of a fixed number of workers in the pool (== number of cores).
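
A rough sketch of how such a pool could be wired up (illustrative only; `worker.php` is a hypothetical helper that bootstraps the fixer once and then fixes every path it reads from STDIN, and the `src` directory is an assumption):

```php
<?php
// Illustrative WorkerPool sketch: a fixed number of long-lived workers, each
// bootstrapped once, fed file paths as jobs. "worker.php" is a hypothetical
// helper that fixes every path it reads from STDIN, one per line.

$files = [];
foreach (new RecursiveIteratorIterator(new RecursiveDirectoryIterator('src')) as $file) {
    if ('php' === $file->getExtension()) {
        $files[] = (string) $file;
    }
}

$poolSize = 4; // == number of cores
$workers = [];
for ($i = 0; $i < $poolSize; ++$i) {
    $pipes = [];
    $proc = proc_open('php worker.php', [0 => ['pipe', 'r'], 1 => ['pipe', 'w']], $pipes);
    $workers[] = ['proc' => $proc, 'in' => $pipes[0], 'out' => $pipes[1]];
}

// Feed job after job; a real pool would hand the next file to whichever worker
// becomes idle instead of this simple round-robin, and would drain worker output
// via stream_select() while feeding, to avoid full pipe buffers.
foreach ($files as $i => $file) {
    fwrite($workers[$i % $poolSize]['in'], $file."\n");
}

foreach ($workers as $worker) {
    fclose($worker['in']);                    // signal "no more jobs"
    echo stream_get_contents($worker['out']); // collect each worker's report
    proc_close($worker['proc']);
}
```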

@staabm
Contributor Author

staabm commented Oct 11, 2018

I totally agree, and that's also how I imagine it working in the final version.

For the first shot, I want to get a feeling for how much speedup we can get without a big investment in code changes.
After we are confident that it's worth it (e.g. a 5x improvement), I would work on something more useful.

@keradus
Member

keradus commented Oct 11, 2018

Then just use the parallel CLI tool ( http://manpages.ubuntu.com/manpages/xenial/man1/parallel.1.html ) and run it against each file individually.

@staabm
Contributor Author

staabm commented Oct 11, 2018

Some initial numbers:

$ time php ./PHP-CS-Fixer/php-cs-fixer fix ./sabre-dav/lib/
Loaded config default.
   1) sabre-dav/lib/CalDAV/CalendarRoot.php
   2) sabre-dav/lib/CalDAV/SharedCalendar.php
...
 218) sabre-dav/lib/DAV/Exception/TooManyMatches.php
 219) sabre-dav/lib/DAV/Exception/NotImplemented.php
 220) sabre-dav/lib/DAV/Exception/NotFound.php
 221) sabre-dav/lib/DAV/Exception/MethodNotAllowed.php
 222) sabre-dav/lib/DAV/Exception/LockTokenMatchesRequestUri.php
 223) sabre-dav/lib/DAV/Tree.php

Fixed all files in 10.706 seconds, 16.000 MB memory used

real    0m10.945s
user    0m9.764s
sys     0m0.180s

$ time find ./sabre-dav/lib/ | parallel -j 4 php ./PHP-CS-Fixer/php-cs-fixer fix {}
Fixed all files in 0.016 seconds, 10.000 MB memory used
Loaded config default.

Fixed all files in 0.039 seconds, 10.000 MB memory used
Loaded config default.

...

Fixed all files in 0.012 seconds, 10.000 MB memory used
Loaded config default.

Fixed all files in 0.059 seconds, 10.000 MB memory used
Loaded config default.

real    0m19.342s
user    0m52.620s
sys     0m14.296s

with the following setup:

php -v
PHP 7.0.32-1+ubuntu16.04.1+deb.sury.org+1 (cli) (built: Oct  1 2018 11:45:35) ( NTS )
Copyright (c) 1997-2017 The PHP Group
Zend Engine v3.0.0, Copyright (c) 1998-2017 Zend Technologies
    with Zend OPcache v7.0.32-1+ubuntu16.04.1+deb.sury.org+1, Copyright (c) 1999-2017, by Zend Technologies
    with blackfire v1.23.1~linux-x64-non_zts70, https://blackfire.io, by Blackfire

on a 4-CPU Ubuntu 16.04 VMware VM

Takeaway

  • as you already pointed out, the process overhead seems to be huge
  • doing it in parallel as-is is even slower than running it linearly

=> A) an initial obvious optimization would be to try to speed up php-cs-fixer for the special case of fixing only 1 file
=> B) search for a way to make it parallel without processes, e.g. co-routines (in-process), threads, or similar

@keradus
Member

keradus commented Oct 12, 2018

=> A) an initial obvious optimization would be to try to speed up php-cs-fixer for the special case of fixing only 1 file
=> B) search for a way to make it parallel without processes, e.g. co-routines (in-process), threads, or similar

So, as I said: WorkerPool / JobPool.

@staabm
Contributor Author

staabm commented Oct 12, 2018

After having another, more in-depth look into the codebase, I guess we can speed things up by using async IO for some components.

Step 1:
The default Linter, TokenizerLinter, does blocking IO in lintFile.
If we could read those files in a non-blocking fashion, we might be able to do the tokenizing while files are being read from the filesystem (tokenize/fix file A while file B is read).

@staabm
Contributor Author

staabm commented Oct 12, 2018

Please find some initial code showing how an async-IO-based linter could look:

master...staabm:async1

@keradus
Member

keradus commented Oct 13, 2018

Yes, we need to read the file to lint it in TokenizerLinter - but what we have just loaded into the internal buffer and are about to lint is also what we are going to fix in the next step, for which we need to have the file read from IO anyway.

@staabm
Contributor Author

staabm commented Oct 13, 2018

After thinking a bit more about the problem and possible solutions, I came to another idea:

I will try to use pcntl_fork, so we don't have to pay the bootstrap cost for each worker.
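
A minimal sketch of that idea (illustrative assumptions: `$files` comes from the script's arguments and `fixFile()` stands in for the real, already-bootstrapped fixing logic):

```php
<?php
// Hypothetical pcntl_fork sketch: the bootstrap cost is paid once in the parent,
// then each forked child inherits it and fixes its own slice of the file list.

$files = array_slice($argv, 1); // assumed: paths to fix, passed as arguments

// stand-in for the real (already bootstrapped) fixing logic
function fixFile(string $file): void
{
    echo "fixed {$file}\n";
}

$workers = 4; // == number of cores
$chunks = array_chunk($files, max(1, (int) ceil(count($files) / $workers)));

$pids = [];
foreach ($chunks as $chunk) {
    $pid = pcntl_fork();
    if (-1 === $pid) {
        exit("fork failed\n");
    }
    if (0 === $pid) {
        // child: process its slice, then exit without returning to the loop
        array_map('fixFile', $chunk);
        exit(0);
    }
    $pids[] = $pid; // parent: remember the child and keep forking
}

foreach ($pids as $pid) {
    pcntl_waitpid($pid, $status); // wait for every child before finishing/reporting
}
```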

@dmvdbrugge
Contributor

I will try to use pcntl_fork

Don't waste your time:

Note: This extension is not available on Windows platforms.
- PCNTL Introduction

@staabm
Contributor Author

staabm commented Oct 15, 2018

I am fine with a feature-detected parallelization feature (which can work on Windows via the Linux subsystem) - as long as it doesn't require ugly code and leads to a decent perf improvement.

@mfn

mfn commented Dec 23, 2020

I'm picking up from the comment by @staabm: #4024 (comment)

time find ./sabre-dav/lib/ | parallel -j 4 php ./PHP-CS-Fixer/php-cs-fixer fix {}

The issue here is the massive overhead of spawning a new process for each file 😅 as was pointed out already.

Short of coding a proper solution, it's possible to tune this better with xargs and see noticeable improvements, but it has drawbacks.

First, for this experiment, php-cs-fixer needs a one-line change because currently it does not accept multiple paths provided on the command line:

diff --git a/src/Console/ConfigurationResolver.php b/src/Console/ConfigurationResolver.php
index 72352c350..29041bf5e 100644
--- a/src/Console/ConfigurationResolver.php
+++ b/src/Console/ConfigurationResolver.php
@@ -599,7 +599,7 @@ private function computeConfigFiles()
         if ($this->isStdIn() || 0 === \count($path)) {
             $configDir = $this->cwd;
         } elseif (1 < \count($path)) {
-            throw new InvalidConfigurationException('For multiple paths config parameter is required.');
+            $configDir = $this->cwd;
         } elseif (!is_file($path[0])) {
             $configDir = $path[0];
         } else {

This now allows calling php-cs-fixer fix <file1> … <file n>

My sample set (some private project):

$ find app tests -type f -iname \*php | wc -l
    2465

Running a single invocation:

$ time find app tests -type f -iname \*php | xargs ~/src/php-cs-fixer/php-cs-fixer fix --using-cache=no
Loaded config default from "/Users/neo/src/company/project/.php_cs.dist".
Paths from configuration file have been overridden by paths provided as command arguments.

Fixed all files in 98.335 seconds, 52.000 MB memory used

real	1m38.607s
user	1m37.876s
sys	0m0.564s

Now, using xargs:

  • assume we want 4 processes
  • note the total number of files, 2465; divided by 4 processes, let's make this 620 per process
$ time find app tests -type f -iname \*php | xargs -n 620 -P 4 ~/src/php-cs-fixer/php-cs-fixer fix --using-cache=no
Loaded config default from "/Users/neo/src/company/project/.php_cs.dist".
Loaded config default from "/Users/neo/src/company/project/.php_cs.dist".
Loaded config default from "/Users/neo/src/company/project/.php_cs.dist".
Loaded config default from "/Users/neo/src/company/project/.php_cs.dist".
Paths from configuration file have been overridden by paths provided as command arguments.
Paths from configuration file have been overridden by paths provided as command arguments.
Paths from configuration file have been overridden by paths provided as command arguments.
Paths from configuration file have been overridden by paths provided as command arguments.

Fixed all files in 9.588 seconds, 26.000 MB memory used

Fixed all files in 14.088 seconds, 22.000 MB memory used

Fixed all files in 32.412 seconds, 44.000 MB memory used

Fixed all files in 45.755 seconds, 42.000 MB memory used

real	0m46.104s
user	1m41.549s
sys	0m0.812s

Or going more extreme:

$ time find app tests -type f -iname \*php | xargs -n 310 -P 8 ~/src/php-cs-fixer/php-cs-fixer fix --using-cache=no
…
real	0m28.043s
user	1m45.297s
sys	0m0.990s

or

$ time find app tests -type f -iname \*php | xargs -n 155 -P 16 ~/src/php-cs-fixer/php-cs-fixer fix --using-cache=no
…
real	0m23.520s
user	2m43.397s

The improvements don't scale linearly with the number of processes, but they are still very measurable.

Note: tests were conducted on a MacBook Pro (16-inch, 2019), 2.4 GHz 8-core i9.

This is due to the diversity of files to scan, something which was also already pointed out:

an initial obvious optimization would be to try to speed up php-cs-fixer for the special case of fixing only 1 file

However, I do not agree with the conclusion

Takeaway

  • doing it in parallel as-is is even slower than running it linearly

But the answer was given via the follow-up comment #4024 (comment):

So, as I said: WorkerPool / JobPool.

The xargs attempt is of course a "poor pool emulation": due to the diversity of source files to scan and the way xargs (not really) distributes the input, we end up with processes that get the same number of files but very different amounts of work to do.


Further:

I guess some challenges are in the flow of data back (see the sketch below):

  • reporting file processing progress across the pool of processes
  • collecting data back to construct a full cache
  • probably more I forgot; I hardly know the internals of php-cs-fixer 😄
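
To make the first two points concrete, a toy sketch of how results could flow back (purely illustrative; the one-JSON-line-per-file protocol and the merged cache file name are assumptions, not php-cs-fixer internals):

```php
<?php
// Toy sketch: each worker prints one JSON line per processed file, e.g.
//   {"file":"src/Foo.php","fixed":true,"hash":"abc123"}
// and the master merges those lines into a progress counter and a cache array.
// The protocol and the cache file name are assumptions, not php-cs-fixer internals.

$workerOutputs = []; // assumed: stdout streams returned by proc_open() for each worker

$done = 0;
$cache = [];

// Sequential drain for brevity; a real implementation would stream_select()
// over all worker pipes to interleave progress output from the whole pool.
foreach ($workerOutputs as $stdout) {
    while (false !== ($line = fgets($stdout))) {
        $result = json_decode($line, true);
        if (!is_array($result)) {
            continue; // ignore any non-protocol output
        }

        ++$done;                                   // progress reporting
        $cache[$result['file']] = $result['hash']; // data for a shared cache

        printf("[%d] %s %s\n", $done, $result['fixed'] ? 'F' : '.', $result['file']);
    }
}

file_put_contents('.php_cs.cache.merged', json_encode($cache)); // hypothetical merged cache
```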

PS: any idea about the "For multiple paths config parameter is required" limitation?

@keradus
Member

keradus commented Dec 23, 2020

any idea about the "For multiple paths config parameter is required" limitation?

@keradus
Member

keradus commented Dec 23, 2020

Nice reading, @mfn ;)

It was never a big priority for the core project maintainers, as the performance problem usually only happens on the FIRST run of the tool. Then, once we have the .php_cs.cache, one does not have that many modified files to be in deep need of parallelization.

If we were to build in some solution, I would suggest using some 3rd-party solution, so we can use all the hard work from the open source community and not come up with something custom, e.g. Swoole or another lib.
Or maybe we wait for a PHP built-in solution (and perhaps have it polyfilled for older PHP runtimes), like https://wiki.php.net/rfc/fiber

@staabm
Contributor Author

staabm commented Dec 23, 2020

Thx for the follow-up.

We see php-cs-fixer fixing taking several minutes in CI builds - which was the initial motivation for this change.

I think the most straightforward (and lowest-maintenance) way forward would be to allow several file paths as CLI arguments and do the fork/parallel stuff at the bash level with xargs.

Do you guys agree?

@keradus
Member

keradus commented Dec 23, 2020

We see php-cs-fixer fixing taking several minutes in CI builds

Do you have the cache file in your CI? If not, consider adding it.

I think the most straightforward (and lowest-maintenance) way forward would be to allow several file paths as CLI arguments

Possible since the creation of this GitHub issue - just explicitly point to the config that should be used.

@mfn

mfn commented Dec 23, 2020

Possible since the creation of this GitHub issue - just explicitly point to the config that should be used.

🤦

Now it hit me; I didn't get this from the other issue. I can confirm this works with an official release:

time find app tests -type f -iname \*php | xargs -n 155 -P 16 ./bin/php-cs-fixer-v2.phar --config=.php_cs.dist fix --using-cache=no

"Problem solved" then, I guess ;)

@mfn

mfn commented Dec 23, 2020

The drawback here, btw, is that you can't use the "finder" as defined in the .php_cs; I neglected to mention this so far.

I initially tried to come up with some super-smart separation of the finder config from the .php_cs, to first extract the definitive list of files (for feeding it back via xargs), but that wouldn't work for projects where the phar is used.
At least I couldn't make it work: the phar can be included, but you can't just expect to have access to its libraries that way, as it always runs the stub.php. But granted, at this point I did not investigate further.

@staabm
Contributor Author

staabm commented Dec 23, 2020

Do you have the cache file in your CI? If not, consider adding it.

I tried doing so with several different attempts, but it did not work. Maybe the reason is that git does not properly set the modification timestamps of files when cloning, or something similar.. don't know what's going on..

The drawback here, btw, is that you can't use the "finder" as defined in the .php_cs; I neglected to mention this so far.

Maybe it would make sense to add a command which prints a list of files (like e.g. find would), but based on the $finder used in the config, so we can use this result with xargs to parallelize stuff?

@mfn

mfn commented Dec 23, 2020

Maybe it would make sense to add a command which prints a list of files (like e.g. find would), but based on the $finder used in the config, so we can use this result with xargs to parallelize stuff?

phpunit, for example, supports --list-tests/--list-tests-xml; so maybe something like --list-files?

@mfn

mfn commented Dec 23, 2020

I can't believe I bothered coming up with this hack, but it solves my need for faster fixing in CI (where, as pointed out, the cache does not work reliably => I have the same issues with phpstan and CI caching, FWIW) by re-using the defined "finder":

$ cat .php_cs.finder.php
<?php

declare(strict_types = 1);

$finder = PhpCsFixer\Finder
  ::create()
  ->in(__DIR__)
  ->exclude([
    // etc.
    'tmp',
  ]);

if ('true' === getenv('PHP_CS_FIXER_LIST_FILES')) {
    foreach ($finder as $file) {
        echo $file->getPathname(), "\n";
    }
    exit(0);
}

return $finder;
$ head -n 8 .php_cs.dist
<?php

declare(strict_types = 1);

$finder = include './.php_cs.finder.php';

return PhpCsFixer\Config::create()
  ->setFinder($finder)

=> PHP_CS_FIXER_LIST_FILES=true ./bin/php-cs-fixer-v2.phar fix | xargs -n 171 -P 16 ./bin/php-cs-fixer-v2.phar fix --config=.php_cs.dist

🤦 but works.

@staabm
Contributor Author

staabm commented Dec 23, 2020

Just went ahead and added a new list-files command based on the latest observations within this thread:

#5390

keradus added a commit that referenced this issue Apr 17, 2021
…keradus)

This PR was merged into the 2.19-dev branch.

Discussion
----------

feature #4024 added a `list-files` command

As discussed in #4024, this PR proposes a `list-files` command, which from a user perspective can be combined with `xargs` to get some basic parallelization into php-cs-fixer.

This is especially interesting for CI builds, in which the php-cs-fixer built-in caching mechanism doesn't work (e.g. because git does not preserve the original file modification timestamps on clone, or because the CI build is running within a read-only environment).

Example usage:
```
vendor/bin/php-cs-fixer list-files | xargs -n 10 -P 8 vendor/bin/php-cs-fixer fix --config=.php_cs.dist --using-cache=no
```
`-n` defines how many files a single subprocess processes.
`-P` defines how many subprocesses xargs is allowed to spawn for parallel processing (usually similar to the number of CPUs your system has).

See the xargs help page: https://wiki.ubuntuusers.de/xargs/

As can be seen in #4024 (comment), you might get a perf boost by a factor of 3-6x, depending on the number of files being fixed, the number of CPUs, etc.

This isn't the most elegant way parallelization can be applied to the php-cs-fixer case, but it's a really simple one which works with very simple means and therefore has a rather low entry barrier.
You don't need a fancy PHP extension or any complicated setup; simple bash means are enough to get a decent perf boost.
It's also pretty simple from the php-cs-fixer perspective, which keeps the maintenance burden as low as possible.

In case you consider this change acceptable, I am willing to provide automated tests, docs, and everything else required and missing atm.

Any opinions?

Commits
-------

ed65a9e Update doc/usage.rst
c95d430 Update doc/usage.rst
81753b4 try getRealPath()
a3ea03a fix test on windows
a32a8ff fix typo
add9542 to provide multiple files to fix command, you need to specify config file
d3f420d fis CS
d107a1f fix typo
47acf3a Update doc/usage.rst
172d577 fix deprecation
3f6a255 fix CS
d9dbec3 try to fix windows compat
06beee5 fix test
ac8773d .php_cs -> .php-cs-fixer.php
925f47c added in/out test
4e93714 moved ListFilesTest project into test/Fixtures/
725cc86 added separate path for SplFileInfo and sfSplFileInfo
444500c docs
7e92ca7 fix test-expectation
2f9cd68 fix test
5d2104d added ListFilesCommandTest
1f6f32b rm un-used imports
44bbe92 use relative paths instead of real-path to shorten output
22315d9 remove getHelp(), seems to be only used for the FixCommand
8c87ac9 Update src/Console/Command/ListFilesCommand.php
2a25c8c alphabetical order
149d757 removed un-used imports
302733d added a list-files command
@keradus
Member

keradus commented Apr 17, 2021

The list-files command has been added, and an example of how to run the tool in parallel is described in https://github.com/FriendsOfPHP/PHP-CS-Fixer/pull/5390/files#diff-e2f32bf1fe37315337b25b0d5efe2a8dde59b0f5ea2c3d6fddcf3f8b6d0fe9d1R150

If you have any further improvements, we are happy to welcome a PR! :)

@keradus keradus closed this as completed Apr 17, 2021
@mfn

mfn commented Apr 17, 2021

OMG! Can't wait for a release to test this, thank you @staabm and @keradus !
