
Avoid hanging forever after a parallel job was killed #7834

Merged
merged 10 commits into pylint-dev:main on Dec 12, 2022

Conversation

daniel-wer
Contributor

Type of Changes

Type: 🐛 Bug fix

Description

Replace multiprocessing.pool with concurrent.futures.ProcessPoolExecutor to avoid deadlocks.

In a multiprocessing.pool, if a process terminates in a non-clean fashion (for example, due to OOM or a segmentation fault), the pool will silently replace said process, but the work that the process was supposed to do will never be done, causing pylint to hang indefinitely. The concurrent.futures.ProcessPoolExecutor will raise a BrokenProcessPool exception in that case, avoiding the hang.
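A minimal, self-contained sketch of this behavior (illustrative only, not pylint's code): the worker kills itself with SIGKILL to stand in for an OOM kill or segfault, and the executor surfaces the failure as BrokenProcessPool instead of waiting forever. Assumes a POSIX system.

```python
# Illustration of the behavior described above, not pylint's actual code.
import os
import signal
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool


def die_uncleanly(_: int) -> int:
    os.kill(os.getpid(), signal.SIGKILL)  # hard-kill this worker (POSIX only)
    return 0  # never reached


if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=2) as executor:
        try:
            executor.submit(die_uncleanly, 1).result()
        except BrokenProcessPool:
            # multiprocessing.Pool would silently respawn the dead worker and
            # the result would simply never arrive; here we can fail fast.
            print("worker died unexpectedly, failing early instead of hanging")
```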

See #3899 (comment) for reproduction instructions.

The hangs caused by this issue ate up quite a few CI minutes for us over time. Although this doesn't fix the underlying sporadic issue (which is likely due to OOM) at least pylint will fail early instead of hanging forever.

A followup could be to recover from this type of error by catching the BrokenProcessPool exception and then instantiating a new ProcessPoolExecutor to continue the linting.
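A hedged sketch of what that follow-up could look like (not part of this PR; `lint_file`, `lint_with_recovery`, and the retry policy are hypothetical placeholders, not pylint APIs): resubmit the unfinished files to a fresh executor when the pool breaks.

```python
# Hypothetical sketch of the follow-up idea: recover from a broken pool by
# creating a new executor for the files that have not been linted yet.
from concurrent.futures import ProcessPoolExecutor, as_completed
from concurrent.futures.process import BrokenProcessPool


def lint_file(path: str) -> str:
    return f"linted {path}"  # placeholder for the real per-file work


def lint_with_recovery(paths: list[str], jobs: int = 2, max_restarts: int = 1) -> dict[str, str]:
    results: dict[str, str] = {}
    pending = set(paths)
    restarts = 0
    while pending:
        try:
            with ProcessPoolExecutor(max_workers=jobs) as executor:
                futures = {executor.submit(lint_file, path): path for path in pending}
                for future in as_completed(futures):
                    path = futures[future]
                    results[path] = future.result()
                    pending.discard(path)
        except BrokenProcessPool:
            restarts += 1
            if restarts > max_restarts:
                raise  # keep this PR's fail-fast behavior after repeated crashes
            # The broken executor is discarded; the loop retries what is left.
    return results


if __name__ == "__main__":
    print(lint_with_recovery(["a.py", "b.py", "c.py"]))
```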

Closes #3899

…tor to avoid deadlocks.

In a multiprocessing.pool, if a process terminates in a non-clean fashion
(for example, due to OOM or a segmentation fault), the pool will silently
replace said process, but the work that the process was supposed to do
will never be done, causing pylint to hang indefinitely.
The concurrent.futures.ProcessPoolExecutor will raise a
BrokenProcessPool exception in that case, avoiding the hang.

@Pierre-Sassoulas
Member

This looks promising, thank you for opening the PR!

@daniel-wer
Contributor Author

Thank you for your work maintaining this great tool 🙏 The tests were failing due to a silly mistake; that should be fixed now.

@coveralls

coveralls commented Nov 24, 2022

Pull Request Test Coverage Report for Build 3675169755

  • 6 of 10 (60.0%) changed or added relevant lines in 2 files are covered.
  • 239 unchanged lines in 18 files lost coverage.
  • Overall coverage increased (+0.05%) to 95.473%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| pylint/lint/parallel.py | 3 | 5 | 60.0% |
| pylint/lint/run.py | 3 | 5 | 60.0% |

| Files with Coverage Reduction | New Missed Lines | % |
| --- | --- | --- |
| pylint/checkers/base/name_checker/checker.py | 1 | 99.23% |
| pylint/pyreverse/mermaidjs_printer.py | 1 | 98.21% |
| pylint/lint/run.py | 2 | 86.36% |
| pylint/checkers/stdlib.py | 3 | 96.69% |
| pylint/extensions/typing.py | 4 | 97.67% |
| pylint/pyreverse/vcg_printer.py | 5 | 92.75% |
| pylint/pyreverse/diagrams.py | 7 | 91.95% |
| pylint/checkers/logging.py | 8 | 94.7% |
| pylint/checkers/format.py | 10 | 96.52% |
| pylint/graph.py | 10 | 86.36% |

Totals Coverage Status
Change from base Build 3534208420: +0.05%
Covered Lines: 17652
Relevant Lines: 18489

💛 - Coveralls


Member

@Pierre-Sassoulas Pierre-Sassoulas left a comment

Thank you for fixing the tests :) Would you mind adding a changelog fragment, please? Also, I don't know if it's feasible, but could we imagine an automated test where we kill a job to test the original issue?

@Pierre-Sassoulas Pierre-Sassoulas added the backport maintenance/2.15.x and Needs backport labels and removed the Work in progress and backport maintenance/2.15.x labels Nov 24, 2022
@DanielNoord DanielNoord self-requested a review November 25, 2022 08:09
@daniel-wer
Contributor Author

I added the changelog fragment and also added a test that an error in the initializer for the parallel jobs doesn't lead to a deadlock (as it would have before this PR).

However, I'm not yet sure how to design the test where we kill a job to test the original issue. One could start a linting run that takes a little longer (what would be the best way to do that?) and then execute something like pgrep pylint | sort -u | tail -n1 | xargs kill --verbose -9 (from this comment). However, I'm not sure how to get the timing right without the test taking up too much time. Also, I don't want to introduce any flakiness to the tests 🤔 Any ideas?


@Pierre-Sassoulas
Member

However, I'm not yet sure how to design the test where we kill a job to test the original issue.

I don't know either. We're going to hit timing/concurrency issues and flakiness if we do it with an external command that kills a process launched inside our test. I've never done that with pytest; I was hoping a magical pytest fixture would exist, to be honest 😄 Maybe a manual test is enough?

@DanielNoord
Collaborator

@daniel-wer Do you understand the failing test?

@daniel-wer
Contributor Author

@DanielNoord I'm not entirely sure. My first suspicion was that it is timing-related. Since I configured the test to time out after 1 second, it could happen that the time until the first pool process is initialized (and the TypeError: 'int' object is not iterable is raised, which leads to a BrokenProcessPool exception) is more than 1 second.

However, the captured stderr indicates that the TypeError was already raised. In all other configurations this led to the expected BrokenProcessPool exception. It might be that the timing was very unlucky in this case 🤔

We could try to configure a more generous timeout of 5 seconds since the timeout will only be triggered if the test fails and won't affect normal test run times. This way we could rule out timing-related issues. What do you think?
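For reference, a standalone sketch of this kind of test (not the exact pylint test; the pytest-timeout plugin is assumed to provide the timeout marker): an initializer that raises marks the pool as broken, so the run fails quickly instead of deadlocking, and the generous timeout only matters if it does hang.

```python
# Standalone sketch, assuming pytest plus pytest-timeout. An initializer that
# raises TypeError breaks the pool, and the pending job then fails with
# BrokenProcessPool instead of hanging the test.
import concurrent.futures
from concurrent.futures.process import BrokenProcessPool

import pytest


def broken_initializer(arguments: int) -> None:
    for _ in arguments:  # TypeError: 'int' object is not iterable
        pass


def noop(value: int) -> int:
    return value


@pytest.mark.timeout(5)  # generous timeout; only triggers if the test hangs
def test_broken_initializer_does_not_deadlock() -> None:
    with pytest.raises(BrokenProcessPool):
        with concurrent.futures.ProcessPoolExecutor(
            max_workers=2, initializer=broken_initializer, initargs=(42,)
        ) as executor:
            executor.submit(noop, 1).result()
```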

@DanielNoord
Collaborator

Let's try the higher timeout indeed! Sorry for the late response!


Collaborator

@DanielNoord DanielNoord left a comment

I just did a check for references to multiprocessing and saw that we have a couple of checks that test whether it is available. Is this also a concern for the concurrent module?

Edit: It seems like it is, so we should probably change all those references and tests as well.

@daniel-wer
Contributor Author

@DanielNoord Thank you, I was not aware of that. Fixed.

@DanielNoord
Collaborator

Although this is probably also important, I thought it might be better to remove those multiprocessing imports. With this change we no longer use multiprocessing, so checking whether we can import it doesn't make a lot of sense. We probably need to import concurrent in those places and see if that works.

@daniel-wer
Contributor Author

Could you give me a pointer to the relevant files? I had a look, and from what I can see the multiprocessing module is used in some of the functional checker tests or to call .current_process() or .cpu_count(). I think neither of these usages can be replaced with the concurrent module (which doesn't have the .current_process() or .cpu_count() functionality).

The only thing I can see is that in pylint/lint/run.py, in addition to multiprocessing.cpu_count(), the existence of the concurrent module should be checked in order to fall back to a single process if it is not available. But I don't see where multiprocessing imports could be removed.
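A hedged sketch of that check (not pylint's actual run.py logic; resolve_jobs is a made-up helper): fall back to a single process when ProcessPoolExecutor cannot be imported, while still sizing the pool with multiprocessing.cpu_count() when the user asks for auto-detection.

```python
# Hypothetical helper, not pylint's API: decide how many parallel jobs to use.
import multiprocessing
import sys

try:
    from concurrent.futures import ProcessPoolExecutor
except ImportError:  # e.g. a platform without full process support
    ProcessPoolExecutor = None  # type: ignore[assignment, misc]


def resolve_jobs(requested_jobs: int) -> int:
    if ProcessPoolExecutor is None:
        # No usable process pool: degrade gracefully to a single process.
        print("concurrent.futures is unavailable, running in a single process",
              file=sys.stderr)
        return 1
    if requested_jobs == 0:
        # Mirrors pylint's -j 0 convention: auto-detect the CPU count.
        return multiprocessing.cpu_count()
    return requested_jobs
```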

@DanielNoord
Collaborator

The last thing was indeed the only thing I was thinking of.

Just to double-check: does determining the number of CPUs with multiprocessing also work for the number of workers that concurrent can use?

@daniel-wer
Contributor Author

@DanielNoord

The last thing was indeed the only thing I was thinking of.

Ok great, I adapted the check.

Just to double-check: does determining the number of CPUs with multiprocessing also work for the number of workers that concurrent can use?

The docs state that the "ProcessPoolExecutor uses the multiprocessing module", so yes, that should be the case. Also, this kind of functionality was not duplicated for the concurrent.futures module, which is rather an alternative interface for interacting with a process pool; it doesn't change underlying process fundamentals like current_process() or cpu_count().

@Pierre-Sassoulas

Maybe a manual test is enough?

Using the latest version of this PR, I again confirmed that killing a random pylint worker process during a parallel linting run no longer leads to a deadlock but immediately exits with a BrokenProcessPool exception, using the reproduction instructions from #3899 (comment).


@DanielNoord DanielNoord self-requested a review December 10, 2022 10:49
@DanielNoord
Collaborator

I hope to find time to review this sometime this weekend!

Collaborator

@DanielNoord DanielNoord left a comment

I pushed two changes to docstrings.

Two last (for real this time) questions.

Sorry for taking so long. Parallelism is hard and I don't want to leave any stuff from the old implementation behind after merging this.

tests/test_check_parallel.py (outdated)
@@ -23,19 +23,23 @@
except ImportError:
    multiprocessing = None  # type: ignore[assignment]
Collaborator

This is one remaining use of it: it is used to get the id of the process. Can we do that with concurrent?

Contributor Author

No, that's not possible with concurrent. As I wrote in my last comment, "this kind of functionality was not duplicated for the concurrent.futures module, which is rather an alternative interface for interacting with a process pool; it doesn't change underlying process fundamentals like current_process() or cpu_count()". The concurrent module uses multiprocessing internally; it is not supposed to be a replacement.
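To illustrate the point (a generic sketch, not the pylint test): because ProcessPoolExecutor is built on multiprocessing, a worker can still call multiprocessing.current_process() (or os.getpid()) to identify itself, even though concurrent.futures exposes no equivalent API of its own.

```python
# Generic sketch: identifying the worker process from inside a
# ProcessPoolExecutor job still goes through multiprocessing / os.
import multiprocessing
import os
from concurrent.futures import ProcessPoolExecutor


def worker_identity(task_id: int) -> str:
    process = multiprocessing.current_process()
    return f"task {task_id} ran in {process.name} (pid {os.getpid()})"


if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=2) as executor:
        for line in executor.map(worker_identity, range(4)):
            print(line)
```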


@Pierre-Sassoulas Pierre-Sassoulas added the backport maintenance/2.15.x label and removed the Needs backport label Dec 10, 2022
@Pierre-Sassoulas Pierre-Sassoulas added this to the 2.15.9 milestone Dec 10, 2022
Member

@Pierre-Sassoulas Pierre-Sassoulas left a comment

LGTM. (I was worried that the backport would conflict with #7814 but we did not merge it)

daniel-wer and others added 2 commits December 12, 2022 12:06
Co-authored-by: Daniël van Noord <13665637+DanielNoord@users.noreply.github.com>
@github-actions
Contributor

🤖 According to the primer, this change has no effect on the checked open source code. 🤖🎉

This comment was generated for commit 69729db

@DanielNoord DanielNoord merged commit 5eca8ec into pylint-dev:main Dec 12, 2022
@DanielNoord
Collaborator

Thanks for all the work here @daniel-wer

github-actions bot pushed a commit that referenced this pull request Dec 12, 2022
* Replace multiprocessing.pool with concurrent.futures.ProcessPoolExecutor to avoid deadlocks.

In a multiprocessing.pool, if a process terminates in a non-clean fashion
(for example, due to OOM or a segmentation fault), the pool will silently
replace said process, but the work that the process was supposed to do
will never be done, causing pylint to hang indefinitely.
The concurrent.futures.ProcessPoolExecutor will raise a
BrokenProcessPool exception in that case, avoiding the hang.

Co-authored-by: Daniël van Noord <13665637+DanielNoord@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
(cherry picked from commit 5eca8ec)
Pierre-Sassoulas pushed a commit that referenced this pull request Dec 13, 2022
* Replace multiprocessing.pool with concurrent.futures.ProcessPoolExecutor to avoid deadlocks.

In a multiprocessing.pool, if a process terminates in a non-clean fashion
(for example, due to OOM or a segmentation fault), the pool will silently
replace said process, but the work that the process was supposed to do
will never be done, causing pylint to hang indefinitely.
The concurrent.futures.ProcessPoolExecutor will raise a
BrokenProcessPool exception in that case, avoiding the hang.

Co-authored-by: Daniël van Noord <13665637+DanielNoord@users.noreply.github.com>
(cherry picked from commit 5eca8ec)

Co-authored-by: Daniel <daniel.werner@scalableminds.com>
tom-cook-veea added a commit to tom-cook-veea/pylint that referenced this pull request Dec 28, 2023
Development

Successfully merging this pull request may close these issues.

Hangs forever after a parallel job gets killed
4 participants