Hangs forever after a parallel job gets killed #3899

Closed
martinpitt opened this issue Oct 13, 2020 · 7 comments · Fixed by #7834

@martinpitt

This is similar to issue #1495 -- there is apparently some giant memory leak in pylint which makes it eat ginormous amounts of RAM. This issue is about mishandling parallel workers with -j -- once one worker dies, pylint runs into a deadlock.

Steps to reproduce

  1. podman run -it --rm --memory=2G fedora:33

This issue is not specific to podman -- you can use docker, or a VM, or anything where you can control the amount of RAM (an equivalent docker command is sketched right after these steps).

  2. Inside the container, prepare:

     dnf install -y python3-pip git
     pip install pylint
     git clone --depth=1 https://github.com/rhinstaller/anaconda
     cd anaconda

  3. Run pylint in parallel; my nproc is 4, thus -j0 selects 4, but let's select -j4 explicitly:

     pylint -j4 pyanaconda/
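
For reference, an equivalent docker command (assuming docker is installed on the host; same image and memory cap as step 1) would be:

docker run -it --rm --memory=2g fedora:33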

Current behavior

As per issue #1495, the 5 pylint processes keep accumulating RAM until one of them gets killed by the OOM killer:

Memory cgroup out of memory: Killed process 89155 (pylint) total-vm:573588kB, anon-rss:564804kB, file-rss:0kB, shmem-rss:8kB, UID:1000 pgtables:1164kB oom_score_adj:0

I still get some remaining output, then pylint hangs forever:

[...]
pyanaconda/payload/dnf/payload.py:41:0: C0411: standard import "from fnmatch import fnmatch" should be placed before "import dnf" (wrong-import-order)
pyanaconda/payload/dnf/payload.py:42:0: C0411: standard import "from glob import glob" should be placed before "import dnf" (wrong-import-order)
pyanaconda/payload/dnf/payload.py:46:0: C0411: third party import "from pykickstart.constants import GROUP_ALL, GROUP_DEFAULT, KS_MISSING_IGNORE, KS_BROKEN_IGNORE, GROUP_REQUIRED" should be placed before "import pyanaconda.localization" (wrong-import-order)
pyanaconda/payload/dnf/payload.py:48:0: C0411: third party import "from pykickstart.parser import Group" should be placed before "import pyanaconda.localization" (wrong-import-order)
pyanaconda/payload/dnf/payload.py:50:0: C0412: Imports from package pyanaconda are not grouped (ungrouped-imports)

The other processes are still around:

martin     89152  5.1  0.2 266584 40680 pts/0    Sl+  17:23   0:24 /usr/bin/python3 /usr/local/bin/pylint -j0 pyanaconda/
martin     89153 26.9  2.8 472352 465536 pts/0   S+   17:23   2:09 /usr/bin/python3 /usr/local/bin/pylint -j0 pyanaconda/
martin     89154 27.4  3.3 541924 536164 pts/0   S+   17:23   2:12 /usr/bin/python3 /usr/local/bin/pylint -j0 pyanaconda/
martin     89156 28.4  3.2 529740 522980 pts/0   S+   17:23   2:16 /usr/bin/python3 /usr/local/bin/pylint -j0 pyanaconda/
martin     89586 19.2  0.8 348184 135620 pts/0   S+   17:30   0:21 /usr/bin/python3 /usr/local/bin/pylint -j0 pyanaconda/

The first three and the fifth are waiting on a futex:

❱❱❱ strace -p 89152
strace: Process 89152 attached
futex(0x55d4edd66550, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY^Cstrace: Process 89152 detached
 <detached ...>

and the fourth is apparently the controller, which waits in a read():

❱❱❱ strace -p 89156
strace: Process 89156 attached
read(3,

After pressing Control-C, I get

Process ForkPoolWorker-1:
Process ForkPoolWorker-2:
Process ForkPoolWorker-4:
Process ForkPoolWorker-5:
Traceback (most recent call last):
Traceback (most recent call last):
  File "/usr/lib64/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib64/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib64/python3.9/multiprocessing/pool.py", line 114, in worker
    task = get()
  File "/usr/lib64/python3.9/multiprocessing/queues.py", line 365, in get
    with self._rlock:
  File "/usr/lib64/python3.9/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
  File "/usr/lib64/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
KeyboardInterrupt
  File "/usr/lib64/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib64/python3.9/multiprocessing/pool.py", line 114, in worker
    task = get()
  File "/usr/lib64/python3.9/multiprocessing/queues.py", line 365, in get
    with self._rlock:
  File "/usr/lib64/python3.9/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
Traceback (most recent call last):
  File "/usr/lib64/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib64/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib64/python3.9/multiprocessing/pool.py", line 114, in worker
    task = get()
  File "/usr/lib64/python3.9/multiprocessing/queues.py", line 365, in get
    with self._rlock:
  File "/usr/lib64/python3.9/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
Traceback (most recent call last):
  File "/usr/lib64/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib64/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib64/python3.9/multiprocessing/pool.py", line 114, in worker
    task = get()
  File "/usr/lib64/python3.9/multiprocessing/queues.py", line 366, in get
    res = self._reader.recv_bytes()
  File "/usr/lib64/python3.9/multiprocessing/connection.py", line 221, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/usr/lib64/python3.9/multiprocessing/connection.py", line 419, in _recv_bytes
    buf = self._recv(4)
  File "/usr/lib64/python3.9/multiprocessing/connection.py", line 384, in _recv
    chunk = read(handle, remaining)
KeyboardInterrupt
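
(For context: the tracebacks above show the workers are multiprocessing ForkPoolWorker processes, and this hang is the known behaviour of multiprocessing.Pool -- when a worker is SIGKILLed mid-task, the pool quietly respawns a worker, but the result of the lost task never arrives, so map() blocks forever. A minimal standalone sketch, not pylint code, that reproduces the same deadlock on the Python 3.9 used here:)

import os
import signal
from multiprocessing import Pool

def work(n):
    if n == 0:
        # stand-in for the OOM killer taking out one worker mid-task
        os.kill(os.getpid(), signal.SIGKILL)
    return n * n

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        # never returns: the killed task's result is never delivered,
        # so map() keeps waiting while the pool respawns a replacement worker
        print(pool.map(work, range(8)))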

Expected behavior

pylint should notice that one worker died and abort immediately, instead of hanging forever (which also blocks CI systems until their timeout expires, and that timeout is usually quite long).
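
(For illustration only -- this is not pylint's code, and the eventual fix in #7834 may take a different route. One way a parallel runner can fail fast is concurrent.futures.ProcessPoolExecutor, which, unlike multiprocessing.Pool, raises BrokenProcessPool as soon as a worker process dies abruptly; check_one_file below is a hypothetical placeholder for the per-file work:)

import sys
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool

def check_one_file(path):
    # hypothetical placeholder for the per-file linting work
    return path, []

def run_checks(files, jobs=4):
    try:
        with ProcessPoolExecutor(max_workers=jobs) as pool:
            # iterating the map() result raises BrokenProcessPool once any worker dies
            return list(pool.map(check_one_file, files))
    except BrokenProcessPool:
        sys.exit("a parallel worker died unexpectedly (out of memory?); aborting")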

pylint --version output

pylint 2.6.0
astroid 2.4.2
Python 3.9.0 (default, Oct  6 2020, 00:00:00) 
[GCC 10.2.1 20200826 (Red Hat 10.2.1-3)]

I tried pip install pylint astroid --pre -U but that doesn't install anything new.

martinpitt added a commit to martinpitt/anaconda that referenced this issue Oct 14, 2020
pylint hits 2 GiB per process pretty often [1], thus still causing OOM
failures [2]. Bump the divisor.

[1] pylint-dev/pylint#1495
[2] pylint-dev/pylint#3899
martinpitt added a commit to martinpitt/anaconda that referenced this issue Oct 14, 2020
pylint hits 2 GiB per process pretty often [1], thus still causing OOM
failures [2]. Bump the divisor.

[1] pylint-dev/pylint#1495
[2] pylint-dev/pylint#3899

Cherry-picked from master PR rhinstaller#2923

Related: rhbz#1885635
martinpitt added a commit to martinpitt/anaconda that referenced this issue Oct 14, 2020
max() was bogus -- we want *both* RAM and number of CPUs to limit the
number of jobs, not take whichever is highest.

[1] pylint-dev/pylint#1495
[2] pylint-dev/pylint#3899

Cherry-picked from master PR rhinstaller#2923

Related: rhbz#1885635
@hippo91
Contributor

hippo91 commented Oct 17, 2020

@martinpitt thanks for this report. The situation will probably be better once pylint-dev/astroid#837 gets merged.

@martinpitt
Author

@hippo91: Indeed, fixing the memory leak is much appreciated, of course. I filed this separately, though, as handling crashed workers is not related to the memory leak itself. Thanks!

hippo91 added the Blocker 🙅 (blocks the next release) and Bug 🪲 labels Oct 23, 2020
@hippo91
Contributor

hippo91 commented Oct 31, 2020

@martinpitt could you check if you still face this issue after #3869 has been merged into master?

@martinpitt
Author

First, here is a quicker reproducer, especially now that the huge memory issue (#1495) has been addressed. After following the original instructions from the description, one can explicitly kill one of the workers and watch how pylint reacts:

pgrep pylint | sort -u | tail -n1 | xargs kill --verbose -9

And it indeed hangs at the end, so still the same situation as a month ago. I applied the patch:

curl https://github.com/PyCQA/pylint/commit/70cdb9d74913f3ccf0b60ed040d0530a42732980.patch | patch /usr/local/lib/python3.9/site-packages/pylint/lint/check_parallel.py

(There's some fuzz and some rejections due to the ChangeLog and such, but these can be ignored.) Re-running the test and the kill shows no change, though: it still hangs forever at the end. So this is not fixed yet.
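
(A less fiddly way to try unreleased changes -- assuming the container has network access, and at the cost of reinstalling pylint -- would be to install straight from git, e.g. pip install git+https://github.com/PyCQA/pylint.git, instead of patching the installed copy.)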

martinpitt added a commit to martinpitt/anaconda that referenced this issue Nov 24, 2020
The default timeout is 6 hours, which just causes unnecessary hangs,
delays, and blocked self-hosted runners if anything goes wrong (such as
pylint-dev/pylint#3899).

A normal run takes about 10 minutes, so set a timeout of 30 minutes.
Use one hour for the full container rebuild+validation+upload to be on
the safe side (it usually takes between 15 and 18 minutes).
@hippo91
Contributor

hippo91 commented Nov 24, 2020

@martinpitt thanks for your feedback. So the issue remains open.

@loeiten

loeiten commented May 30, 2022

I'm also bumping into this issue when using pre-commit on the whole codebase. Running pylint with pre-commit for only a few files seems to work (for example on a few files staged for commit).
Are there any known workarounds?

@loeiten

loeiten commented Apr 19, 2023

I'm also bumping into this issue when using pre-commit on the whole codebase. Running pylint with pre-commit for only a few files seems to work (for example on a few files staged for commit).
Are there any known workarounds?

If anyone is facing the same issue with pre-commit: the workaround in my case was to set the number of processes to 1 with

          - id: pylint
            name: pylint
            entry: pylint
            language: system
            types: [python]
            args:
              [
                -j 1
              ]
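
(Note on the snippet above: in YAML flow syntax, [ -j 1 ] is parsed as a single scalar, so pre-commit hands pylint the one argument "-j 1". If that ever trips up option parsing, splitting flag and value into separate items works as well, e.g.:)

            args: ["-j", "1"]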

However, #7834 might have fixed this.
