Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add FixerBlame class and one test #7865

Draft
wants to merge 53 commits into
base: master
Choose a base branch
from

Conversation

connorhu
Copy link

@connorhu connorhu commented Mar 5, 2024

First step to fixing #7171 and similar issues

What are the known problems:

  • slow. Instead of constantly converting tokens to text, the diff should be done on the tokens
  • class names, class locations, array key names
  • slow
  • does not use the result even when producing the output
  • slow
  • low amount of test files
  • and slow

It will remain a draft until the known problems are solved.

Note: Since it works on the same files, the PR is rebased to the Wirone:codito/bombazo branch. Once #7777 is merged, it can be rebased back to the master.

Wirone and others added 30 commits March 1, 2024 02:34
Empty for now, because logic for running fixers must be moved here. Basically it's a PoC of command being hidden - works on PHP 7.4 and 8.3 with latest Composer packages on each version.

Command's arguments/options are more or less what we need in the worker, but it may be changed later.
The same collection of files has to be determined for both sequential and parallel run.
It's working, but needs polishing here and there. It speeds up analysis significantly (like 5x for this codebase on my computer).
Parallel runner must be introduced slowly, and start as an opt-in. When it's stable enough, we can change it to be default one.
Using `ParallelConfig::detect()` can be used for utilising all available resources. Tested by running analysis natively on the host and inside Docker (with core limit set on the OrbStack level). In both scenarios it detects available cores correctly.
Detected by Composer Require Checker
Cache check is done during file iteration - if file is in cache and wasn't changed, then it's skip by iterator and doesn't even go to the worker. That's why `ReadonlyCacheManager` used on worker's side is a bit superfluous, but I thought it's safer to use it instead of `NullCacheManager`.

Anyway, each file is stored to cache when worker reports analysis result to the parallelisation operator, so it's basically similar to sequential analysis, because file write is done only in one process.
Co-authored-by: Julien Falque <julien.falque@gmail.com>
Use absolute file path for reporting analysis result and convert it to relative path when processing on parallelisation operator's side. It gives the same effect in terms of caching, but does not require `DirectoryInterface::getAbsolutePath()`.
It improves caching, because cache save can be triggered more frequently and errors or interrupting during processing large chunk won't cause analysis info being lost (which would lead to re-analysing file that was already processed).
- extract constants to dedicated class with more fitting name
- use constants in every place where communication between runner and worker occurs
There were 2 problems:
- when child processes were spawned, `ArgvInput` was created to pass the arguments/options to the worker, but when fix/check command was executed via command tester, `ArgvInput` was using PHPUnit's argv from top-level command (and was failing with "The "--configuration" option does not exist.")
- Passing `InputInterface` instance from command to process (did an experiment with static property set in command to skip passing input through the Runner instance) was not enough because output was invalid and tests were failing.

I hope we can figure out something better after discussion on code review 😉.
Some of the tests are little hacky at this point, maybe it's possible to refactor a bit to achieve better testability. Good enough for now, though.
@connorhu connorhu force-pushed the feature/real-line-numbers branch 4 times, most recently from 0ae17bc to 6203087 Compare March 5, 2024 09:12
@connorhu connorhu changed the title Added FixerBlame tool and one test feat: Add FixerBlame class and one test Mar 5, 2024
@connorhu connorhu force-pushed the feature/real-line-numbers branch 5 times, most recently from 991cd90 to b66693b Compare March 5, 2024 09:42
@coveralls
Copy link

coveralls commented Mar 5, 2024

Coverage Status

coverage: 95.03% (-1.0%) from 96.029%
when pulling 57bcf57 on connorhu:feature/real-line-numbers
into b68c8ba on PHP-CS-Fixer:master.

@devnix
Copy link

devnix commented Mar 8, 2024

Would this solve the use-case for checkstyle reports? #3601

@connorhu
Copy link
Author

connorhu commented Mar 8, 2024

@devnix Yes, that is my goal.

@Wirone
Copy link
Member

Wirone commented Mar 8, 2024

I don't have currently capacity for thorough review here, but please consider #7777 in your diffing/blaming process, as parallel processing is ready and should be merged before any other crucial changes related to the fixing process are made. With that approach you would need to pass blame info back from worker to the main process.

Also I don't like the experimental namespace, it should rather be in some "normal" namespace, but marked with @internal / @experimental.

@connorhu
Copy link
Author

connorhu commented Mar 8, 2024

@Wirone There's still time for a review after I've solved the known problems, so it's not current yet.

@connorhu connorhu force-pushed the feature/real-line-numbers branch 10 times, most recently from 01be641 to b5d1bc0 Compare March 9, 2024 06:04
@@ -0,0 +1,51 @@
<?php
Copy link
Member

@keradus keradus Mar 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please avoid having all commits of Wirone in your PR, as they are other PR and making diff for your PR crazy

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it's an outcome of my comment, but I only asked for taking the parallel runner into consideration (make sure it will work), not to rebase on my work 😅. It could work if we had branches in the same repo, and a proper target branch was selected, but not like this. Hard to tell what is the best approach here, maybe a draft with a branch based on master (like before), but waiting until my PR is merged?

Copy link
Author

@connorhu connorhu Mar 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to align it with PR #7777 so I rebaseled it to see how much conflict there was (fortunately I only had to resolve 2-3 things). @Wirone's work is the more polished, more important and is expected to be in the master sooner. As mentioned in the PR description, once the parallel feature is merged this can be rebased back to the master and then the commits for the parallel run will be removed. There is no point in starting the review until there are implementation (and performance) issues with the PR anyway.

* with this source code in the file LICENSE.
*/

namespace PhpCsFixer\FixerBlame;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please describe approach you want to take in RFC style before implementing it. it will help reviewers to recognise what and how you want to do, and help you to align on direction before spending time on implementing it.

right now, I'm happy with udiff and don't seeing blame-named class is very confusing.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the problems with the implementation at the moment is that, because of the way udiff works, I have to do a lot of manual work to track exactly which lines, which fixer, made what changes (which lines were removed and which lines were inserted). One sign of this is that I keep converting the token array back to text and running diff on it (and then trying to identify the separate patches in the diff output). The next logical step would be to see how much work it would be to modify udiff to be able to do its job on any data structure (in this case a Token[]). If this is a solvable problem, then 1) I would use less resources 2) I wouldn't have to deal with the problem of diffing the token array.

The theory works anyway, because it can accurately track multi-step transformations, but it's also obvious that in its current form it doesn't just do what it's actually supposed to do.

That's why I can't describe in any style what happens, because it doesn't happen the way I want to see it happen and 1000% I'll have to rewrite the blame class if udiff knows the comparison.

Copy link
Member

@keradus keradus Mar 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I failed to understand the comment.

Please describe the problem you want to solve and describe the concept of solution, before you go deep into blame class.
Without that I fail to understand why you even want to tech udiff to understand Token[]

Copy link
Author

@connorhu connorhu Mar 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the problem?
The problem is that for various outputs, the violations indicate incorrect line and character positions.

What is a possible solution to this problem?
a) remove the outputs or not report the line and character numbers
b) the fixers let the Tokens class know what they have done
c) keep track from fixer to fixer which fixer changes what

What is the problem with which approach?

a) while I'm all for something either working well or not existing at all, this is obviously not the answer
b) touching all fixers, modifying existing interfaces is a lot of work and an unwanted BC break
c) this might work if we can somehow track what happened between two fixer states, because then we only have to track the change within the Runner and don't have to go deeper, but it is inaccurate

What tools do we currently have to monitor changes?
Currently Sebastian's library Diff is available, which he basically created for texts or arrays of texts, and he told me that he didn't want to change that, which is acceptable.
With the implementation I have made, I currently only have the ability to track line level changes and in addition I always have to convert Tokens back to text to diff two states together.
This back-conversion currently means unwanted memory usage and unnecessary iterations, so it's not ideal yet. This is where I am at the moment:

  1. Sebastian's tool is given and cannot be changed
  2. It is not practical to import a new diff library just to use it in this one case
  3. Implementing my own diff just to keep track of token changes is a lot of work and logic that is only present because of one function.

The idea works as long as you don't put two violations in a line:

if ($variable !== false && $variable2 !== false)

That's two yoda style violations, but due to the limitations of diff I see this as one. So far I am not happy with the solution for this. As long as I am not satisfied with what my solution can do, there is no point in talking about how to integrate it nicely.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyway, I'm now wondering if just changing the Tokens class is enough to keep track of the changes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

imagine the case:

<?php
class A {
    public method zzz() {} ## line 2

    public method aaa() {
        var_dump("a" == "b"); ## line 5
    }
}
  • 1st fixer moves method zzz from top of file to bottom of file
  • 2nd fixer change content of method (eg == => ===)

which lines should be reported as modified?
imho 1st fixer: line 2 and 2nd fixer: line 5.
But when 2nd fixer is applied, line 5 is actually line 3.

  • i do not want to touch Tokens
  • i do not want to have own logic for tracking modified lines (Eg PatchInfo)

I suggest a new command/BC break that is not applying one fixer on top of another (like we need for fix), but one after another , separately, always against original file variant - that way, you can determine modified line (with current udiff approach) for original code.
That would work only for check and not for fix (for when we actually want to have those reports) and will make check blind for changes that are based on top of changes done by other rule (so check may detect less things that fix would do, because fix will apply changes on modified file variant)

Copy link
Author

@connorhu connorhu Mar 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I achieved quite nice results on complex code with the solution I made ( https://github.com/connorhu/PHP-CS-Fixer/blob/57bcf57cde0cc5844070eb62ed38a5320085eb35/tests/Fixtures/Integration/misc/blame_test.test#L46-L56 ). The function will only work correctly if it finds the original position where the violation occurs, and now the code can do it (mostly). I'm between two extremes at the moment: in the Runner class I lose some details, and in the Tokens class I get too much detail about the changes. Anyway, the potential of the Tokens class is very promising, but the setCode method makes the changes untraceable (fortunately only 1 fixer uses it as far as I can see).

In the beginning I did as you wrote. I made a "complex" code, set several fixers and saved each step in a separate file, then diffed them. I wrote down what I wanted to see (what the logical end result would be) and iterated until I got what the logical end result is.

I suggest a new command...

I'll sleep on it and think about it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Screenshot 2024-03-18 at 22 59 53

pssst, this one already exists - check (just it's checking one after another and not independently vs original code - the line 5 vs line 3 topic I mentioned)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pssst, this one already exists - check (just it's checking one after another and not independently vs original code - the line 5 vs line 3 topic I mentioned)

I know, so I don't understand why the code you wrote is an issue. Fixing only fixes, checking only checks.
The "think about it" was to see what options are available if you don't want to modify the Tokens class and don't want to work on the diff result. For now it feels like a step backwards, but I have to think about it.

Your example code results in two violations:

  • method order at line 2 char 0
  • triple equal sign comparison operator at line 5 at char 0

In both cases, char 0 is still an incorrect result, but this is a side effect of the diff.

For me it's already an issue:

        var_dump("a" == "b", "c" == "b");

and

        var_dump("a" == "b");
        var_dump("c" == "b");

My solution takes these as one violation. Which is not ok.

I can see that you don't want to deal with such things, which you have the possibility to do anyway, by removing the output formats (or outsourcing them from the core code to a plugin system).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants