Hangs forever after a parallel job gets killed #3899

Closed
martinpitt opened this issue Oct 13, 2020 · 7 comments · Fixed by #7834

@martinpitt

This is similar to issue #1495 -- there is apparently some giant memory leak in pylint which makes it eat ginormous amounts of RAM. This issue is about mishandling parallel workers with -j -- once one worker dies, pylint runs into a deadlock.

Steps to reproduce

  1. podman run -it --rm --memory=2G fedora:33

This issue is not specific to podman -- you can use docker, or a VM, or anything where you can control the amount of RAM (an equivalent docker command is sketched right after these steps).

  2. Inside the container, prepare:

     dnf install -y python3-pip git
     pip install pylint
     git clone --depth=1 https://github.com/rhinstaller/anaconda
     cd anaconda

  3. Run pylint in parallel; my nproc is 4, thus -j0 selects 4, but let's select -j4 explicitly:

     pylint -j4 pyanaconda/
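
For reference, an equivalent docker command (assuming docker is installed on the host; same image and memory cap as step 1) would be:

docker run -it --rm --memory=2g fedora:33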

Current behavior

As per issue #1495, the 5 pylint processes keep accumulating RAM until one of them gets killed by the OOM killer:

Memory cgroup out of memory: Killed process 89155 (pylint) total-vm:573588kB, anon-rss:564804kB, file-rss:0kB, shmem-rss:8kB, UID:1000 pgtables:1164kB oom_score_adj:0

I still get some remaining output, then pylint hangs forever:

[...]
pyanaconda/payload/dnf/payload.py:41:0: C0411: standard import "from fnmatch import fnmatch" should be placed before "import dnf" (wrong-import-order)
pyanaconda/payload/dnf/payload.py:42:0: C0411: standard import "from glob import glob" should be placed before "import dnf" (wrong-import-order)
pyanaconda/payload/dnf/payload.py:46:0: C0411: third party import "from pykickstart.constants import GROUP_ALL, GROUP_DEFAULT, KS_MISSING_IGNORE, KS_BROKEN_IGNORE, GROUP_REQUIRED" should be placed before "import pyanaconda.localization" (wrong-import-order)
pyanaconda/payload/dnf/payload.py:48:0: C0411: third party import "from pykickstart.parser import Group" should be placed before "import pyanaconda.localization" (wrong-import-order)
pyanaconda/payload/dnf/payload.py:50:0: C0412: Imports from package pyanaconda are not grouped (ungrouped-imports)

The other processes are still around:

martin     89152  5.1  0.2 266584 40680 pts/0    Sl+  17:23   0:24 /usr/bin/python3 /usr/local/bin/pylint -j0 pyanaconda/
martin     89153 26.9  2.8 472352 465536 pts/0   S+   17:23   2:09 /usr/bin/python3 /usr/local/bin/pylint -j0 pyanaconda/
martin     89154 27.4  3.3 541924 536164 pts/0   S+   17:23   2:12 /usr/bin/python3 /usr/local/bin/pylint -j0 pyanaconda/
martin     89156 28.4  3.2 529740 522980 pts/0   S+   17:23   2:16 /usr/bin/python3 /usr/local/bin/pylint -j0 pyanaconda/
martin     89586 19.2  0.8 348184 135620 pts/0   S+   17:30   0:21 /usr/bin/python3 /usr/local/bin/pylint -j0 pyanaconda/

The first three and the fifth are waiting on a futex:

❱❱❱ strace -p 89152
strace: Process 89152 attached
futex(0x55d4edd66550, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY^Cstrace: Process 89152 detached
 <detached ...>

and the fourth is apparently the controller, which waits in a read():

❱❱❱ strace -p 89156
strace: Process 89156 attached
read(3,

After pressing Control-C, I get

Process ForkPoolWorker-1:
Process ForkPoolWorker-2:
Process ForkPoolWorker-4:
Process ForkPoolWorker-5:
Traceback (most recent call last):
Traceback (most recent call last):
  File "/usr/lib64/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib64/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib64/python3.9/multiprocessing/pool.py", line 114, in worker
    task = get()
  File "/usr/lib64/python3.9/multiprocessing/queues.py", line 365, in get
    with self._rlock:
  File "/usr/lib64/python3.9/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
  File "/usr/lib64/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
KeyboardInterrupt
  File "/usr/lib64/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib64/python3.9/multiprocessing/pool.py", line 114, in worker
    task = get()
  File "/usr/lib64/python3.9/multiprocessing/queues.py", line 365, in get
    with self._rlock:
  File "/usr/lib64/python3.9/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
Traceback (most recent call last):
  File "/usr/lib64/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib64/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib64/python3.9/multiprocessing/pool.py", line 114, in worker
    task = get()
  File "/usr/lib64/python3.9/multiprocessing/queues.py", line 365, in get
    with self._rlock:
  File "/usr/lib64/python3.9/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
Traceback (most recent call last):
  File "/usr/lib64/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib64/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib64/python3.9/multiprocessing/pool.py", line 114, in worker
    task = get()
  File "/usr/lib64/python3.9/multiprocessing/queues.py", line 366, in get
    res = self._reader.recv_bytes()
  File "/usr/lib64/python3.9/multiprocessing/connection.py", line 221, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/usr/lib64/python3.9/multiprocessing/connection.py", line 419, in _recv_bytes
    buf = self._recv(4)
  File "/usr/lib64/python3.9/multiprocessing/connection.py", line 384, in _recv
    chunk = read(handle, remaining)
KeyboardInterrupt
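
(For context: the tracebacks above show the workers are multiprocessing ForkPoolWorker processes, and this hang is the known behaviour of multiprocessing.Pool -- when a worker is SIGKILLed mid-task, the pool quietly respawns a worker, but the result of the lost task never arrives, so map() blocks forever. A minimal standalone sketch, not pylint code, that reproduces the same deadlock on the Python 3.9 used here:)

import os
import signal
from multiprocessing import Pool

def work(n):
    if n == 0:
        # stand-in for the OOM killer taking out one worker mid-task
        os.kill(os.getpid(), signal.SIGKILL)
    return n * n

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        # never returns: the killed task's result is never delivered,
        # so map() keeps waiting while the pool respawns a replacement worker
        print(pool.map(work, range(8)))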

Expected behavior

pylint should notice that one worker died and abort immediately, instead of hanging forever (which also blocks CI systems until their timeout expires, and that timeout is usually quite long).
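
(For illustration only -- this is not pylint's code, and the eventual fix in #7834 may take a different route. One way a parallel runner can fail fast is concurrent.futures.ProcessPoolExecutor, which, unlike multiprocessing.Pool, raises BrokenProcessPool as soon as a worker process dies abruptly; check_one_file below is a hypothetical placeholder for the per-file work:)

import sys
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool

def check_one_file(path):
    # hypothetical placeholder for the per-file linting work
    return path, []

def run_checks(files, jobs=4):
    try:
        with ProcessPoolExecutor(max_workers=jobs) as pool:
            # iterating the map() result raises BrokenProcessPool once any worker dies
            return list(pool.map(check_one_file, files))
    except BrokenProcessPool:
        sys.exit("a parallel worker died unexpectedly (out of memory?); aborting")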

pylint --version output

pylint 2.6.0
astroid 2.4.2
Python 3.9.0 (default, Oct  6 2020, 00:00:00) 
[GCC 10.2.1 20200826 (Red Hat 10.2.1-3)]

I tried pip install pylint astroid --pre -U but that doesn't install anything new.

martinpitt added a commit to martinpitt/anaconda that referenced this issue Oct 14, 2020
pylint hits 2 GiB per process pretty often [1], thus still causing OOM
failures [2]. Bump the divisor.

[1] pylint-dev/pylint#1495
[2] pylint-dev/pylint#3899
martinpitt added a commit to martinpitt/anaconda that referenced this issue Oct 14, 2020
pylint hits 2 GiB per process pretty often [1], thus still causing OOM
failures [2]. Bump the divisor.

[1] pylint-dev/pylint#1495
[2] pylint-dev/pylint#3899

Cherry-picked from master PR rhinstaller#2923

Related: rhbz#1885635
martinpitt added a commit to martinpitt/anaconda that referenced this issue Oct 14, 2020
max() was bogus -- we want *both* RAM and number of CPUs to limit the
number of jobs, not take whichever is highest.

[1] pylint-dev/pylint#1495
[2] pylint-dev/pylint#3899

Cherry-picked from master PR rhinstaller#2923

Related: rhbz#1885635
@hippo91
Contributor

hippo91 commented Oct 17, 2020

@martinpitt thanks for this report. The situation will probably be better once pylint-dev/astroid#837 gets merged.

@martinpitt
Author

@hippo91: Indeed, fixing the memory leak is much appreciated, of course. I filed this separately, though, as handling crashed workers is not related to the memory leak itself. Thanks!

hippo91 added the Blocker 🙅 (blocks the next release) and Bug 🪲 labels Oct 23, 2020
@hippo91
Contributor

hippo91 commented Oct 31, 2020

@martinpitt could you check if you still face this issue after #3869 has been merged into master?

@martinpitt
Author

First, here is a quicker reproducer, especially now that the huge memory issue (#1495) has been addressed. After following the original instructions from the description, one can explicitly kill one of the workers and watch how pylint reacts:

pgrep pylint | sort -u | tail -n1 | xargs kill --verbose -9

And it indeed hangs at the end, so still the same situation as a month ago. I applied the patch:

curl https://github.com/PyCQA/pylint/commit/70cdb9d74913f3ccf0b60ed040d0530a42732980.patch | patch /usr/local/lib/python3.9/site-packages/pylint/lint/check_parallel.py

(There's some fuzz and some rejections due to the ChangeLog and such, but these can be ignored.) Re-running the test and the kill shows no change, though: it still hangs forever at the end. So this is not fixed yet.
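
(A less fiddly way to try unreleased changes -- assuming the container has network access, and at the cost of reinstalling pylint -- would be to install straight from git, e.g. pip install git+https://github.com/PyCQA/pylint.git, instead of patching the installed copy.)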

martinpitt added a commit to martinpitt/anaconda that referenced this issue Nov 24, 2020
The default timeout is 6 hours, which just causes unnecessary hangs,
delays, and blocked self-hosted runners if anything goes wrong (such as
pylint-dev/pylint#3899).

A normal run takes about 10 minutes, so set a timeout of 30 minutes.
Use one hour for the full container rebuild+validation+upload to be on
the safe side (it usually takes between 15 and 18 minutes).
@hippo91
Contributor

hippo91 commented Nov 24, 2020

@martinpitt thanks for your feedback. So the issue remains open.

@loeiten

loeiten commented May 30, 2022

I'm also bumping into this issue when using pre-commit on the whole codebase. Running pylint with pre-commit for only a few files seems to work (for example on a few files staged for commit).
Are there any known workarounds?

@loeiten

loeiten commented Apr 19, 2023

I'm also bumping into this issue when using pre-commit on the whole codebase. Running pylint with pre-commit for only a few files seems to work (for example on a few files staged for commit).
Are there any known workarounds?

If anyone is facing the same issue with pre-commit: the workaround in my case was to set the number of processes to 1 with

          - id: pylint
            name: pylint
            entry: pylint
            language: system
            types: [python]
            args:
              [
                -j 1
              ]
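
(Note on the snippet above: in YAML flow syntax, [ -j 1 ] is parsed as a single scalar, so pre-commit hands pylint the one argument "-j 1". If that ever trips up option parsing, splitting flag and value into separate items works as well, e.g.:)

            args: ["-j", "1"]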

However, #7834 might have fixed this.
