
Avoid hanging forever after a parallel job was killed #7834

Merged
merged 10 commits into pylint-dev:main on Dec 12, 2022

Conversation

daniel-wer
Contributor

Type of Changes

Type: 🐛 Bug fix

Description

Replace multiprocessing.pool with concurrent.futures.ProcessPoolExecutor to avoid deadlocks.

In a multiprocessing.pool, if a process terminates in a non-clean fashion (for example, due to OOM or a segmentation fault), the pool will silently replace said process, but the work that the process was supposed to do will never be done, causing pylint to hang indefinitely. The concurrent.futures.ProcessPoolExecutor will raise a BrokenProcessPool exception in that case, avoiding the hang.
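A minimal, self-contained sketch of this behavior (illustrative only, not pylint's code): the worker kills itself with SIGKILL to stand in for an OOM kill or segfault, and the executor surfaces the failure as BrokenProcessPool instead of waiting forever. Assumes a POSIX system.

```python
# Illustration of the behavior described above, not pylint's actual code.
import os
import signal
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool


def die_uncleanly(_: int) -> int:
    os.kill(os.getpid(), signal.SIGKILL)  # hard-kill this worker (POSIX only)
    return 0  # never reached


if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=2) as executor:
        try:
            executor.submit(die_uncleanly, 1).result()
        except BrokenProcessPool:
            # multiprocessing.Pool would silently respawn the dead worker and
            # the result would simply never arrive; here we can fail fast.
            print("worker died unexpectedly, failing early instead of hanging")
```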

See #3899 (comment) for reproduction instructions.

The hangs caused by this issue ate up quite a few CI minutes for us over time. Although this doesn't fix the underlying sporadic issue (which is likely due to OOM) at least pylint will fail early instead of hanging forever.

A followup could be to recover from this type of error by catching the BrokenProcessPool exception and then instantiating a new ProcessPoolExecutor to continue the linting.
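A hedged sketch of what that follow-up could look like (not part of this PR; `lint_file`, `lint_with_recovery`, and the retry policy are hypothetical placeholders, not pylint APIs): resubmit the unfinished files to a fresh executor when the pool breaks.

```python
# Hypothetical sketch of the follow-up idea: recover from a broken pool by
# creating a new executor for the files that have not been linted yet.
from concurrent.futures import ProcessPoolExecutor, as_completed
from concurrent.futures.process import BrokenProcessPool


def lint_file(path: str) -> str:
    return f"linted {path}"  # placeholder for the real per-file work


def lint_with_recovery(paths: list[str], jobs: int = 2, max_restarts: int = 1) -> dict[str, str]:
    results: dict[str, str] = {}
    pending = set(paths)
    restarts = 0
    while pending:
        try:
            with ProcessPoolExecutor(max_workers=jobs) as executor:
                futures = {executor.submit(lint_file, path): path for path in pending}
                for future in as_completed(futures):
                    path = futures[future]
                    results[path] = future.result()
                    pending.discard(path)
        except BrokenProcessPool:
            restarts += 1
            if restarts > max_restarts:
                raise  # keep this PR's fail-fast behavior after repeated crashes
            # The broken executor is discarded; the loop retries what is left.
    return results


if __name__ == "__main__":
    print(lint_with_recovery(["a.py", "b.py", "c.py"]))
```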

Closes #3899

…tor to avoid deadlocks.

In a multiprocessing.pool, if a process terminates in a non-clean fashion
(for example, due to OOM or a segmentation fault), the pool will silently
replace said process, but the work that the process was supposed to do
will never be done, causing pylint to hang indefinitely.
The concurrent.futures.ProcessPoolExecutor will raise a
BrokenProcessPool exception in that case, avoiding the hang.

@Pierre-Sassoulas
Member

This looks promising, thank you for opening the PR!

@daniel-wer
Contributor Author

Thank you for your work maintaining this great tool 🙏 The tests were failing due to a silly mistake; that should be fixed now.

@coveralls

coveralls commented Nov 24, 2022

Pull Request Test Coverage Report for Build 3675169755

  • 6 of 10 (60.0%) changed or added relevant lines in 2 files are covered.
  • 239 unchanged lines in 18 files lost coverage.
  • Overall coverage increased (+0.05%) to 95.473%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| pylint/lint/parallel.py | 3 | 5 | 60.0% |
| pylint/lint/run.py | 3 | 5 | 60.0% |

| Files with Coverage Reduction | New Missed Lines | % |
| --- | --- | --- |
| pylint/checkers/base/name_checker/checker.py | 1 | 99.23% |
| pylint/pyreverse/mermaidjs_printer.py | 1 | 98.21% |
| pylint/lint/run.py | 2 | 86.36% |
| pylint/checkers/stdlib.py | 3 | 96.69% |
| pylint/extensions/typing.py | 4 | 97.67% |
| pylint/pyreverse/vcg_printer.py | 5 | 92.75% |
| pylint/pyreverse/diagrams.py | 7 | 91.95% |
| pylint/checkers/logging.py | 8 | 94.7% |
| pylint/checkers/format.py | 10 | 96.52% |
| pylint/graph.py | 10 | 86.36% |

Totals Coverage Status
Change from base Build 3534208420: +0.05%
Covered Lines: 17652
Relevant Lines: 18489

💛 - Coveralls


Member

@Pierre-Sassoulas Pierre-Sassoulas left a comment

Thank you for fixing the tests :) Would you mind adding a changelog fragment, please? Also, I don't know if it's feasible, but could we imagine an automated test where we kill a job to test the original issue?

@Pierre-Sassoulas Pierre-Sassoulas added the backport maintenance/2.15.x and Needs backport labels and removed the Work in progress and backport maintenance/2.15.x labels Nov 24, 2022
@DanielNoord DanielNoord self-requested a review November 25, 2022 08:09
@daniel-wer
Contributor Author

I added the changelog fragment and also added a test that an error in the initializer for the parallel jobs doesn't lead to a deadlock (as it would have before this PR).

However, I'm not yet sure how to design the test where we kill a job to test the original issue. One could start a linting run that takes a little longer (what would be the best way to do that?) and then execute something like pgrep pylint | sort -u | tail -n1 | xargs kill --verbose -9 (from this comment). However, I'm not sure how to get the timing right without the test taking up too much time. Also, I don't want to introduce any flakiness to the tests 🤔 Any ideas?


@Pierre-Sassoulas
Member

However, I'm not yet sure how to design the test where we kill a job to test the original issue.

I don't know either. We're going to hit timing/concurrency issues and flakiness if we do it with an external command that kills a process launched inside our test. I've never done that with pytest; I was hoping a magical pytest fixture would exist, to be honest 😄 Maybe a manual test is enough?

@DanielNoord
Collaborator

@daniel-wer Do you understand the failing test?

@daniel-wer
Contributor Author

@DanielNoord I'm not entirely sure. My first suspicion was that it is timing-related. Since I configured the test to time out after 1 second, it could happen that the time until the first pool process is initialized (and the TypeError: 'int' object is not iterable is raised, which leads to a BrokenProcessPool exception) is more than 1 second.

However, the captured stderr indicates that the TypeError was already raised. In all other configurations this led to the expected BrokenProcessPool exception. It might be that the timing was very unlucky in this case 🤔

We could try to configure a more generous timeout of 5 seconds since the timeout will only be triggered if the test fails and won't affect normal test run times. This way we could rule out timing-related issues. What do you think?
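For reference, a standalone sketch of this kind of test (not the exact pylint test; the pytest-timeout plugin is assumed to provide the timeout marker): an initializer that raises marks the pool as broken, so the run fails quickly instead of deadlocking, and the generous timeout only matters if it does hang.

```python
# Standalone sketch, assuming pytest plus pytest-timeout. An initializer that
# raises TypeError breaks the pool, and the pending job then fails with
# BrokenProcessPool instead of hanging the test.
import concurrent.futures
from concurrent.futures.process import BrokenProcessPool

import pytest


def broken_initializer(arguments: int) -> None:
    for _ in arguments:  # TypeError: 'int' object is not iterable
        pass


def noop(value: int) -> int:
    return value


@pytest.mark.timeout(5)  # generous timeout; only triggers if the test hangs
def test_broken_initializer_does_not_deadlock() -> None:
    with pytest.raises(BrokenProcessPool):
        with concurrent.futures.ProcessPoolExecutor(
            max_workers=2, initializer=broken_initializer, initargs=(42,)
        ) as executor:
            executor.submit(noop, 1).result()
```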

@DanielNoord
Collaborator

Let's try the higher timeout indeed! Sorry for the late response!


Collaborator

@DanielNoord DanielNoord left a comment

I just did a check for references to multiprocessing and saw that we have a couple of checks that test whether it is available. Is this also a concern for the concurrent module?

Edit: It seems like it is, so we should probably change all those references and tests as well.

@daniel-wer
Contributor Author

@DanielNoord Thank you, I was not aware of that. Fixed.

@DanielNoord
Collaborator

Although this is probably also important, I thought it might be better to remove those multiprocessing imports. With this change we no longer use multiprocessing, so checking whether we can import it doesn't make a lot of sense. We probably need to import concurrent in those places and see if that works.

@daniel-wer
Contributor Author

Could you give me a pointer to the relevant files? I had a look, and from what I can see the multiprocessing module is used in some of the functional checker tests or to call .current_process() or .cpu_count(). I think neither of these usages can be replaced with the concurrent module (which doesn't have the .current_process() or .cpu_count() functionality).

The only thing I can see is that in pylint/lint/run.py, in addition to multiprocessing.cpu_count(), the existence of the concurrent module should be checked in order to fall back to a single process if it is not available. But I don't see where multiprocessing imports could be removed.
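A hedged sketch of that check (not pylint's actual run.py logic; resolve_jobs is a made-up helper): fall back to a single process when ProcessPoolExecutor cannot be imported, while still sizing the pool with multiprocessing.cpu_count() when the user asks for auto-detection.

```python
# Hypothetical helper, not pylint's API: decide how many parallel jobs to use.
import multiprocessing
import sys

try:
    from concurrent.futures import ProcessPoolExecutor
except ImportError:  # e.g. a platform without full process support
    ProcessPoolExecutor = None  # type: ignore[assignment, misc]


def resolve_jobs(requested_jobs: int) -> int:
    if ProcessPoolExecutor is None:
        # No usable process pool: degrade gracefully to a single process.
        print("concurrent.futures is unavailable, running in a single process",
              file=sys.stderr)
        return 1
    if requested_jobs == 0:
        # Mirrors pylint's -j 0 convention: auto-detect the CPU count.
        return multiprocessing.cpu_count()
    return requested_jobs
```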

@DanielNoord
Collaborator

The last thing was indeed the only thing I was thinking of.

Just to double-check: does determining the number of CPUs with multiprocessing also work for the number of workers that concurrent can use?

@daniel-wer
Contributor Author

@DanielNoord

The last thing was indeed the only thing I was thinking of.

Ok great, I adapted the check.

Just to double-check: does determining the number of CPUs with multiprocessing also work for the number of workers that concurrent can use?

The docs state that the "ProcessPoolExecutor uses the multiprocessing module", so yes, that should be the case. Also, this kind of functionality was not duplicated for the concurrent.futures module, which is rather an alternative interface for interacting with a process pool; it doesn't change underlying process fundamentals like current_process() or cpu_count().

@Pierre-Sassoulas

Maybe a manual test is enough?

Using the latest version of this PR, I again confirmed that killing a random pylint worker process during a parallel linting run no longer leads to a deadlock but immediately exits with a BrokenProcessPool exception, using the reproduction instructions from #3899 (comment).


@DanielNoord DanielNoord self-requested a review December 10, 2022 10:49
@DanielNoord
Collaborator

I hope to find time to review this sometime this weekend!

Collaborator

@DanielNoord DanielNoord left a comment

I pushed two changes to docstrings.

Two last (for real this time) questions.

Sorry for taking so long. Parallelism is hard and I don't want to leave any stuff from the old implementation behind after merging this.

tests/test_check_parallel.py (outdated)
@@ -23,19 +23,23 @@
except ImportError:
    multiprocessing = None  # type: ignore[assignment]
Collaborator

This is one remaining use of it: it is used to get the id of the process. Can we do that with concurrent?

Contributor Author

No, that's not possible with concurrent. As I wrote in my last comment, "this kind of functionality was not duplicated for the concurrent.futures module, which is rather an alternative interface for interacting with a process pool; it doesn't change underlying process fundamentals like current_process() or cpu_count()". The concurrent module uses multiprocessing internally; it is not supposed to be a replacement.
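To illustrate the point (a generic sketch, not the pylint test): because ProcessPoolExecutor is built on multiprocessing, a worker can still call multiprocessing.current_process() (or os.getpid()) to identify itself, even though concurrent.futures exposes no equivalent API of its own.

```python
# Generic sketch: identifying the worker process from inside a
# ProcessPoolExecutor job still goes through multiprocessing / os.
import multiprocessing
import os
from concurrent.futures import ProcessPoolExecutor


def worker_identity(task_id: int) -> str:
    process = multiprocessing.current_process()
    return f"task {task_id} ran in {process.name} (pid {os.getpid()})"


if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=2) as executor:
        for line in executor.map(worker_identity, range(4)):
            print(line)
```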


@Pierre-Sassoulas Pierre-Sassoulas added the backport maintenance/2.15.x label and removed the Needs backport label Dec 10, 2022
@Pierre-Sassoulas Pierre-Sassoulas added this to the 2.15.9 milestone Dec 10, 2022
Member

@Pierre-Sassoulas Pierre-Sassoulas left a comment

LGTM. (I was worried that the backport would conflict with #7814 but we did not merge it)

daniel-wer and others added 2 commits December 12, 2022 12:06
Co-authored-by: Daniël van Noord <13665637+DanielNoord@users.noreply.github.com>
@github-actions
Contributor

🤖 According to the primer, this change has no effect on the checked open source code. 🤖🎉

This comment was generated for commit 69729db

@DanielNoord DanielNoord merged commit 5eca8ec into pylint-dev:main Dec 12, 2022
@DanielNoord
Collaborator

Thanks for all the work here @daniel-wer

github-actions bot pushed a commit that referenced this pull request Dec 12, 2022
* Replace multiprocessing.pool with concurrent.futures.ProcessPoolExecutor to avoid deadlocks.

In a multiprocessing.pool, if a process terminates in a non-clean fashion
(for example, due to OOM or a segmentation fault), the pool will silently
replace said process, but the work that the process was supposed to do
will never be done, causing pylint to hang indefinitely.
The concurrent.futures.ProcessPoolExecutor will raise a
BrokenProcessPool exception in that case, avoiding the hang.

Co-authored-by: Daniël van Noord <13665637+DanielNoord@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
(cherry picked from commit 5eca8ec)
Pierre-Sassoulas pushed a commit that referenced this pull request Dec 13, 2022
* Replace multiprocessing.pool with concurrent.futures.ProcessPoolExecutor to avoid deadlocks.

In a multiprocessing.pool, if a process terminates in a non-clean fashion
(for example, due to OOM or a segmentation fault), the pool will silently
replace said process, but the work that the process was supposed to do
will never be done, causing pylint to hang indefinitely.
The concurrent.futures.ProcessPoolExecutor will raise a
BrokenProcessPool exception in that case, avoiding the hang.

Co-authored-by: Daniël van Noord <13665637+DanielNoord@users.noreply.github.com>
(cherry picked from commit 5eca8ec)

Co-authored-by: Daniel <daniel.werner@scalableminds.com>
tom-cook-veea added a commit to tom-cook-veea/pylint that referenced this pull request Dec 28, 2023
Development

Successfully merging this pull request may close these issues.

Hangs forever after a parallel job gets killed
4 participants