Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for case where db_pool connections can be lost. #527

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

erickj00001
Copy link

I've encountered a case where eventlet.db_pool can lose a connection even though it was returned using put(), and after that, all greenthreads will wait indefinitely for a connection (assuming max_size=1, or if it happens enough times).

The problem occurs when:
(1) A greenthread was waiting for a connection from the pool
(2) Another thread calls put() to return a connection to the pool, but it has expired, so the first thread will be allowed to create a new connection
(3) The create() method raises an exception that doesn't inherit from Exception, such as eventlet.Timeout

Note that Pool.get uses "except:" instead of "except: Exception" to make sure all exceptions from create() are accounted for, but BaseConnectionPool.get incorrectly uses "except: Exception" in the case where a returned connection was found to be unusable.

Here's a standalone test that demonstrates the problem, with comments that explain in more detail:
https://github.com/erickj00001/test-scripts/blob/c1f041b4428b01cf62c6fb68718d5d30f95224ed/db_pool_test.py#L1-L108

Output:

[ 0.000] [conn-1] Opening (delay 0)...
[ 0.000] [conn-1] Opened
[ 0.000] (main) Checked out connection conn-1
[ 0.000] (test1) Requesting connection
[ 6.007] (main) Returning connection conn-1
[ 6.007] [conn-1] Closed
[ 6.007] [conn-2] Opening (delay 10)...
[ 6.007] Waiting for test1...
[ 8.003] [conn-2] Connection aborted!
[ 8.003] (test1) Timed out
[ 8.003] test1 finished
[ 9.004] (test2) Requesting connection
[17.005] (test2) Timed out
[17.005] --- TEST FAILED ---

After changing the "except Exception:" to "except:" in BaseConnectionPool.get, the test passes:

[ 0.000] [conn-1] Opening (delay 0)...
[ 0.000] [conn-1] Opened
[ 0.000] (main) Checked out connection conn-1
[ 0.000] (test1) Requesting connection
[ 6.005] (main) Returning connection conn-1
[ 6.005] [conn-1] Closed
[ 6.006] [conn-2] Opening (delay 10)...
[ 6.006] Waiting for test1...
[ 8.003] [conn-2] Connection aborted!
[ 8.003] (test1) Timed out
[ 8.003] test1 finished
[ 9.004] (test2) Requesting connection
[ 9.004] [conn-3] Opening (delay 0)...
[ 9.004] [conn-3] Opened
[ 9.004] (test2) Checked out connection conn-3
[ 9.004] --- TEST PASSED ---

@codecov-io
Copy link

codecov-io commented Oct 6, 2018

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Comparison is base (72a29b6) 53% compared to head (ea84ab6) 53%.

Files Patch % Lines
eventlet/db_pool.py 0% 2 Missing ⚠️
Additional details and impacted files
@@          Coverage Diff          @@
##           master   #527   +/-   ##
=====================================
  Coverage      53%    53%           
=====================================
  Files          88     88           
  Lines        9881   9880    -1     
  Branches     1852   1853    +1     
=====================================
  Hits         5335   5335           
+ Misses       4156   4155    -1     
  Partials      390    390           
Flag Coverage Δ
ipv6 22% <0%> (+<1%) ⬆️
py310epolls 52% <0%> (+<1%) ⬆️
py310poll 52% <0%> (+<1%) ⬆️
py310selects 52% <0%> (+<1%) ⬆️
py311epolls 52% <0%> (+<1%) ⬆️
py312epolls 50% <0%> (+<1%) ⬆️
py37epolls 50% <0%> (+<1%) ⬆️
py38epolls 52% <0%> (+<1%) ⬆️
py38openssl 51% <0%> (+<1%) ⬆️
py38poll 52% <0%> (+<1%) ⬆️
py38selects 52% <0%> (-1%) ⬇️
py39dnspython1 50% <0%> (+<1%) ⬆️
py39epolls 52% <0%> (+<1%) ⬆️
py39poll 52% <0%> (+<1%) ⬆️
py39selects 52% <0%> (-1%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@temoto
Copy link
Member

temoto commented Oct 6, 2018

@erickj00001 except: is usually error, because it will also catch SystemExit, KeyboardInterrupt, GeneratorExit. Please add your test to patch, it will probably fail in Python3, where meaning of except: is changed.

@erickj00001
Copy link
Author

It works fine for me in Python 3.

Anyway, it raises the exception again immediately (after decrementing a counter). It's just important that the counter get decremented there. The base class version of get() uses "except:", like I said. Why do you suppose it's correct there, but not in the derived class version?

@temoto
Copy link
Member

temoto commented Oct 6, 2018

I said nothing about except: in base class. Likely it is not appropriate everywhere except hub and greenthread code. Last time I looked at fixing it, proper syntax broke too many tests and I postponed it.

@erickj00001
Copy link
Author

Please help me understand here. Are you saying that there isn't a problem with the code as is? I believe my test script adequately demonstrates that there is a problem, but if you disagree then please let me know your reasoning.

Or, are you just saying that you don't like the syntax of the proposed solution? We can change it to use finally: instead if you prefer, something like this:

try:
    conn = self.create()
finally:
    if conn is None:
        # unconditionally increase the free pool because
        # even if there are waiters, doing a full put
        # would incur a greenlib switch and thus lose the
        # exception stack
        self.current_size -= 1

You mention that you want the test script checked in as part of the patch. Do I understand that to mean you are not able to try it as presented, and you will only run it if I integrate it into the unit test framework? Again, this is not to complain -- I just wish to understand the requirements.

@temoto
Copy link
Member

temoto commented Oct 6, 2018

Clearly there are problems with current code, arguably the biggest one is class Timeout(BaseException). Then except: catching things it shouldn't.

Sorry for ambiguity.

My points are:

  • (important) since you have test, it should be part of patch
  • because all tests pass is requirement before merge (it will guard against recurring problem in same code forever)
  • Eventlet doesn't rely on humans running tests, it's done automatically with help of TravisCI
  • if possible, don't use except:
    • if you target particular exception types, write them
    • if you target all exceptions, write except Exception, but it doesn't work with eventlet.Timeout, so if it works with everything except X - that's a problem with X, so it should be except (Exception, Timeout)
    • but if you don't need exception case at all and could do with try/finally - of course use it

The problem occurs if:
(1) A greenthread was waiting for a connection from the pool
(2) Another thread calls put() to return a connection to the pool,
    but it has expired, so the first thread will be cleared to
    create a new connection
(3) The create() method raises an exception that doesn't inherit
    from Exception, such as eventlet.Timeout

Note that eventlet.pools.Pool.get uses "except:" instead of
"except: Exception" to make sure all exceptions from create() are
accounted for, but eventlet.db_pool.get() incorrectly uses
"except: Exception" in the case where a returned connection was
found to be unusable.
@erickj00001
Copy link
Author

Ok, I've added my script as a unit test (as part of the existing db_pool_test.py). I reduced the sleep delays to the minimum, so as not to slow down the testing unnecessarily.

I've also updated the proposed fix so it uses finally: instead of except:.

@itamarst itamarst added the bug label Dec 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants