Incremental analysis: caching Mutations for multiple runs #1085

Open
Fenikkusu opened this issue Feb 28, 2020 · 9 comments

@Fenikkusu
Contributor

Is your feature request related to a problem? Please describe.
When running on very large projects, generating the mutations can take a significant amount of time on top of running them. As an example, I have a project with roughly 380 files and 26,867 lines of code. This project takes approximately 20 minutes to run from start to finish. I have a second project with roughly 5,000 files and 900,000 lines of code. While I'm not sure of the exact time the second project would take, a rough calculation puts it at over 4 hours for Infection to process.

Describe the solution you'd like
PHP-CS-Fixer has the ability to specify a cache file. This cache file tracks the paths of all the files in the project and their last modified times. When you run PHP-CS-Fixer a second time, it compares the modified times of the files to what is stored in the cache file. If a file has changed, it will check that file again.

I believe a similar feature would be useful in infection. I think it would be useful to actually have two caches in infection. The first cache would be for generating mutations. The system would only process the file to generate the mutations if the files have changed since the last time infection ran.

The second place a cache would be useful is when running the mutations. The system would only run mutations for files that have changed since the last time infection was run.
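
The change detection described above could be sketched roughly as follows. This is only an illustration, not Infection or PHP-CS-Fixer code: the function names and the JSON cache layout are hypothetical, and it compares content hashes instead of modification times (a swapped-in choice, since hashes are not affected by a checkout or a `touch`).

```python
import hashlib
import json
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return a SHA-256 hex digest of the file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def changed_files(paths: list[Path], cache_file: Path) -> list[Path]:
    """Return the paths whose digest differs from the cached one.

    The cache file (hypothetical layout: {"path": "digest"}) is rewritten
    with the current digests on every call.
    """
    previous = json.loads(cache_file.read_text()) if cache_file.exists() else {}
    current = {str(p): file_digest(p) for p in paths}
    cache_file.write_text(json.dumps(current, indent=2, sort_keys=True))
    # A file counts as changed when it is new or its digest no longer matches.
    return [p for p in paths if previous.get(str(p)) != current[str(p)]]
```

On the first run every file is reported as changed; on later runs only new or modified files are returned, and those would be the only ones handed to mutation generation.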

Describe alternatives you've considered
Haven't really thought that far ahead.

@sanmai
Member

sanmai commented Feb 28, 2020

The problem here is that the code and tests are interdependent. Tests are not always at fault for an escaped mutation. Sometimes the code isn't written well enough to be properly tested by the very same test.

Another problem is that any change in any part of the code may cause a new mutation to escape, just as it may cause an old mutation to be caught. Now, suppose there's a cache keyed per mutation, per line of code. Under what conditions would we invalidate it? OK, we can invalidate it if either the test or the subject changes. But what if a mutation's outcome is affected by another file, which is technically not covered by the test?

This is a great idea, but I have too many questions and too few answers, so I'm not even sure where we should begin to implement it.

@Fenikkusu
Contributor Author

From a 10,000-foot view, I would assume the code coverage becomes the determining factor. That is to say, if a file has changed since the last run, infection would automatically consult the code coverage and run all mutations that touch the given file according to that coverage. I believe this would be the most sensible way to invalidate the cache. Infection is already smart enough to skip files based on their coverage. I think it would be a simple enough task to just add the unchanged files to the skipped files, as if they weren't covered.

@Fenikkusu
Contributor Author

Fenikkusu commented Feb 28, 2020

I just took a quick look at the code. I'm not highly knowledgeable about the ins and outs, so there may be a better place, but a brief look at the system, using the concept I mentioned above, makes me think that the MutationGenerator would likely be a starting point for implementing such a feature. In the generate method, check the cache and skip generating mutations for the file if it hasn't changed.

The only problem with this would be ensuring that the mutations still get generated if another file changed and this file is covered by the same test.
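
To make the concern above concrete, here is a rough sketch of coverage-based invalidation. It is not Infection code: `files_to_mutate` and the shape of the `coverage` map (source file to the set of tests that execute it) are assumptions made for illustration.

```python
def files_to_mutate(coverage: dict[str, set[str]], changed: set[str]) -> set[str]:
    """Return source files whose cached mutation results should be discarded.

    `coverage` maps each source file to the tests that execute it.
    A file is stale if it changed, or if it shares at least one covering
    test with a changed file.
    """
    # Collect every test that executes at least one changed file.
    affected_tests: set[str] = set()
    for source in changed:
        affected_tests |= coverage.get(source, set())

    # Changed files are always stale, even when nothing covers them.
    stale = set(changed)
    for source, tests in coverage.items():
        if tests & affected_tests:
            stale.add(source)
    return stale
```

For example, with `{"A.php": {"t1"}, "B.php": {"t1", "t2"}, "C.php": {"t3"}}` and a change to `A.php`, both `A.php` and `B.php` are regenerated while `C.php` is skipped. This sketch does not handle changed test files, nor the case raised earlier where a result depends on a file that the coverage data does not link to the test.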

@theofidry
Member

Unfortunately, I think there are only two or three things that are cacheable:

  • for a given source file, its corresponding AST
  • for a given source file, its corresponding mutations; however, this cache has to be invalidated as soon as the mutator config changes or the infection commit changes, so it is not very reliable or efficient
  • for a given source coverage file, its tests that we collect from parsing it

Anything else is not cacheable.
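
The invalidation rule for cached mutations in the list above could be expressed as a cache key, sketched below. This is an assumption-laden illustration rather than anything Infection does: `mutation_cache_key`, the dict-shaped mutator config, and the version string are all hypothetical.

```python
import hashlib
import json


def mutation_cache_key(source: bytes, mutator_config: dict, tool_version: str) -> str:
    """Derive a cache key for the mutations generated from one source file.

    The key changes whenever the file content, the mutator configuration,
    or the tool version (for example a commit hash) changes, so stale
    entries are simply never looked up again.
    """
    digest = hashlib.sha256()
    digest.update(tool_version.encode("utf-8"))
    digest.update(b"\0")  # separator so adjacent fields cannot run together
    # sort_keys makes the key independent of the config's key order.
    digest.update(json.dumps(mutator_config, sort_keys=True).encode("utf-8"))
    digest.update(b"\0")
    digest.update(source)
    return digest.hexdigest()
```

Folding the config and version into the key avoids an explicit invalidation step, at the cost of regenerating everything after any config or version change, which is the inefficiency noted above.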

IMO for large codebases, if you want the score, you get it from a nightly build that can take hours or dozens of hours, but otherwise infection should most certainly be used incrementally. For now this can be done by restricting it to the changed source files only. Maybe this can be improved, though.

@Fenikkusu
Contributor Author

@theofidry, how do you suggest restricting it? Perhaps I'm missing something or not seeing what you are seeing.

@theofidry
Member

@Fenikkusu
Contributor Author

@theofidry, thank you. I will look into that.

An additional thought did occur to me: since the --filter option exists, I wonder if it might be possible to add a --filter-cache as a midway point. While you can use the filter option in combination with git, I would think it might be possible to simply do the aforementioned comparison against a cache file, and then auto-populate the filter option with the files marked as changed. I'm not sure there is much benefit in doing that, since it can already be done through git.
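
As a rough sketch of the auto-population idea above: given a list of changed paths (from git or from a cache comparison), a filter argument could be assembled like this. `build_filter_argument` is a hypothetical helper, and the comma-separated value format and the `src` default are assumptions that should be checked against the Infection documentation.

```python
from typing import Optional


def build_filter_argument(changed_paths: list[str], source_dir: str = "src") -> Optional[str]:
    """Build a --filter argument from changed paths, or None if nothing applies.

    Only PHP files under `source_dir` are kept; tests, docs and other files
    are ignored. The comma-separated format is an assumption.
    """
    prefix = source_dir.rstrip("/") + "/"
    selected = sorted(
        {p for p in changed_paths if p.startswith(prefix) and p.endswith(".php")}
    )
    if not selected:
        # Nothing relevant changed, so the caller can skip the run entirely.
        return None
    return "--filter=" + ",".join(selected)
```

The list of changed paths could come from something like `git diff --name-only`, in which case the cache file adds little, as noted above.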

@theofidry
Member

Yes, maybe we could have an incremental option that works from the last run.

@maks-rafalko changed the title from "Caching Mutations For Multiple Runs" to "Incremental analysis: caching Mutations for multiple runs" on Aug 9, 2021
@maks-rafalko
Member

Several ideas for inspiration from @hcoles in #1549 (comment):

You might be interested in pitest's incremental analysis feature. This PR looks to implement one of the strategies it employs to speed things up, but others are also possible.

https://pitest.org/quickstart/incremental_analysis/

Reading it again (it's a long time since I wrote it), I'm not sure number 5 is a great idea, but the others give pitest a huge speedup once the data has been collected.
