Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

loginTimeout is not followed under network partition and clock jump #1468

Closed
1 of 2 tasks
stevenliang16 opened this issue Apr 12, 2019 · 5 comments
Closed
1 of 2 tasks

Comments

@stevenliang16
Copy link

stevenliang16 commented Apr 12, 2019

I'm submitting a ...

  • bug report
  • feature request

Describe the issue
The driver does not follow the login timeout when there is clock jump on client and network partition between client and server. We have set loginTimeout to 10 seconds and socketTimeout to 20 seconds, but we have seen some queries hanging for more than 1 hour.

Driver Version?
42.0.0

Java Version?
1.8.0_191

OS Version?
Ubuntu 16.04

PostgreSQL Version?
11.2

Analysis
There are two bugs that together cause the issue

  1. Use of non-monotonic time System.currentTimeMillis()
    https://github.com/pgjdbc/pgjdbc/blob/REL42.0.0/pgjdbc/src/main/java/org/postgresql/Driver.java#L364-L412
    Login timeout is calculated with System.currentTimeMillis(), which is not monotonic. If there is a backward clockjump after the calculation of expiry and before the calculation of delay, it is possible that delay has a very large value, which is then passed to wait() method (on line 400).
    This wait() method will wait until either another thread invokes the notify() method or the notifyAll() method for this object, or a specified amount of time has elapsed.

  2. No socket timeout set for socket read in enableSSL
    When making a connection, the separate thread will call ConnectionFactoryImpl.openConnectionImpl to open a connection. Inside the function, it calls enableSSL to send ssl startup packet and enable ssl for an established connection. However, this step does not have socket timeout set, which means it can get stuck forever.

  3. How these two bugs cause the issue
    Due to the first bug, we have a very long time to wait inside wait() method. Ideally, the ConnectThread.run() will either make a connection or fail if socketTimeout and connectTimeout are working as expected. However, the second bug causes makeConnection in ConnectThread.run to be stuck forever, which means the calling thread has to wait until the long delay has elapsed.

To Reproduce

  1. Inject a backward clock jump on the client. This can be a real backward clockjump, or you can manually set the delay to a very large value here.
  2. Make enableSSL wait on the socket forever. This can be done by adding an iptables rule that drops all packets to the postgresql port.
    Please note that you need to drop all packets after the connection is established and before enableSSL is called. (one way is to add sleep before enableSSL) Otherwise, you will see Connection timeout error because a connection cannot be established.

Expected behaviour
The query fails after loginTimeout has elapsed.

@bokken
Copy link
Member

bokken commented Apr 13, 2019

Are there other timeouts (outside the Driver class) also using currentTimeMillis?
I am in favor of using nanoTime for these calculations[1], but want to be consistent through project.

[1] - https://stackoverflow.com/a/54566928/676877

@davecramer
Copy link
Member

davecramer commented Apr 13, 2019

Just happened by this one

this.lastStatusUpdate = System.currentTimeMillis() - updateIntervalMs;

That said I think fixing that should be done in another PR

@stevenliang16
Copy link
Author

@bokken @davecramer Thanks for the reply! I am also in favor of using nanoTime for calculating timeouts. Will someone from your team take on the task or should I try to send out a PR for this? Thanks!

stevenliang16 added a commit to scaledata/pgjdbc that referenced this issue Apr 23, 2019
…sl handshake

Summary:
Details of the issue can be found in:
pgjdbc#1468

PR: pgjdbc#1469

Test Plan:
Run manual tests with and without the fix.
Verify that the fix did prevent the issue from happening.

Reviewers: #callisto, garvit

Reviewed By: #callisto, garvit

Tags: #callisto

JIRA Issues: CDM-110595

Differential Revision: https://phabricator.rubrik.com/D87384
@davecramer
Copy link
Member

hmmm apparently we still have some stragglers besides tests that use currentTimeMillis

@davecramer
Copy link
Member

fixed by #1617

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants