Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix performance issue due to search in tokens #210

Merged
merged 6 commits into from Jun 22, 2021

Conversation

bagerard
Copy link
Contributor

@bagerard bagerard commented May 20, 2021

Fixes #176

I looked into this and the issue was in the following block

comment_in_line = any(
    token_type == tokenize.COMMENT
    for token_type, _, _, _, _ in self._tokens)

self._tokens is actually containing the tokens from beginning of the file until the physical line yielded by flake8, this means that self._tokens is getting bigger and bigger every time flake8 invokes this plugin on the next lines, thus searching in self._tokens gets more and more expensive for large files as it's inefficient to search in large list by design (the large file is > 12000 lines so that makes a lot of tokens).

I switch the code to process the whole file instead of work line by line, that way it has to search for the comment token only once.

On the long python file attached in #176, this makes the processing time go from 30 sec to just 2 sec 馃殌

Copy link
Member

@sobolevn sobolevn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi! Thanks a lot for your work!

Let's run the CI and check it out. 馃憤

tests/test_comments.py Outdated Show resolved Hide resolved
flake8_eradicate.py Outdated Show resolved Hide resolved
Copy link
Member

@sobolevn sobolevn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like, you've accidentally deleted pyproject.toml

@bagerard bagerard force-pushed the fix_performance_issue branch 2 times, most recently from 77c34db to 40cd990 Compare May 21, 2021 08:51
@bagerard
Copy link
Contributor Author

omg... sorry for that. I'm using pip to install it locally, not poetry so I had to clean those files temporarily...
These manually triggered pipeline are really killing productivity, let's hope github finds another solution to address the cryptomining issues

@bagerard
Copy link
Contributor Author

Alright, now it should be good. I've set it up with poetry locally.
I added an additional piece of code to skip the # -*- coding: utf-8 -*- line in file (so that it doesn't waste time processing the file if that is the only tokenize.COMMENT found in the file) and I added a test for that

@bagerard
Copy link
Contributor Author

Hi, please re-run the pipeline when you have a chance ;)

@sobolevn
Copy link
Member

Done!

@codecov
Copy link

codecov bot commented May 24, 2021

Codecov Report

Merging #210 (5f6b875) into master (6a5aab0) will not change coverage.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff            @@
##            master      #210   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files            1         1           
  Lines           41        42    +1     
  Branches         5         7    +2     
=========================================
+ Hits            41        42    +1     
Impacted Files Coverage 螖
flake8_eradicate.py 100.00% <100.00%> (酶)

Continue to review full report at Codecov.

Legend - Click here to learn more
螖 = absolute <relative> (impact), 酶 = not affected, ? = missing data
Powered by Codecov. Last update 6a5aab0...5f6b875. Read the comment docs.

flake8_eradicate.py Outdated Show resolved Hide resolved
@bagerard
Copy link
Contributor Author

Note that consuming the file_tokens from flake8 made it slightly less trivial/efficient to ignore the file encoding '# -*- coding:' comment so I ended up removing that optimization. It was not a major improvement and anyway the '# -*- coding:' thing is something from the past so it may be better to avoid polluting the code with that (e.g pyupgrade removes them by default https://github.com/asottile/pyupgrade#-coding--comment)

@bagerard
Copy link
Contributor Author

bagerard commented Jun 2, 2021

Would you mind running the pipeline on it ?

@bagerard
Copy link
Contributor Author

bagerard commented Jun 8, 2021

Let me know if there is anything else I should do or if it's fine as it is

@sobolevn
Copy link
Member

sobolevn commented Jun 8, 2021

@bagerard I will return to this PR sometime soon! Please, stay tuned 馃檪

@sobolevn
Copy link
Member

I finally got the time to work on this, sorry for the long wait!


### Features

- Imrpoves performance on long files #210
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo here :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed! Thanks!

@sobolevn sobolevn merged commit b073917 into wemake-services:master Jun 22, 2021
@sobolevn
Copy link
Member

@bagerard thanks a lot for your work! 馃憤

@bagerard bagerard deleted the fix_performance_issue branch June 22, 2021 11:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Very slow on long lists
2 participants