42.2.8: Infinite loops with message "Ignoring request to process copy results, already processing" #1592

Open · hyunkshinft opened this issue Oct 25, 2019 · 47 comments

@hyunkshinft (Contributor) commented Oct 25, 2019:

I'm submitting a ...

  • [x] bug report
  • [ ] feature request

Describe the issue

The JDBC driver falls into an infinite loop during streaming. One thread attempts to abort streaming by closing the stream while another thread is still reading from it. The reader occasionally loops indefinitely, printing "Ignoring request to process copy results, already processing".

This looks related to PR #1467.

Driver Version?
42.2.8
Java Version?
java version "1.8.0_131"

OS Version?
Linux ip-10-47-10-161 4.9.51-10.52.amzn1.x86_64 #1 SMP Fri Sep 29 01:16:19 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

PostgreSQL Version?
10.7
To Reproduce
Steps to reproduce the behaviour:

thread 1:

stream.read(); or stream.readPending();

thread 2:

stream.close();
replicationConnection.close()
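
For illustration, a stand-alone sketch of this scenario. The connection URL, credentials, and the slot name test_slot are placeholders, and a logical replication slot is assumed to already exist; the replication API calls are the standard pgjdbc ones:

    import java.nio.ByteBuffer;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.util.Properties;

    import org.postgresql.PGConnection;
    import org.postgresql.PGProperty;
    import org.postgresql.replication.PGReplicationStream;

    public class CloseWhileReadingRepro {
      public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        PGProperty.USER.set(props, "postgres");
        PGProperty.PASSWORD.set(props, "postgres");
        PGProperty.REPLICATION.set(props, "database");
        PGProperty.ASSUME_MIN_SERVER_VERSION.set(props, "9.4");
        PGProperty.PREFER_QUERY_MODE.set(props, "simple");

        Connection replicationConnection =
            DriverManager.getConnection("jdbc:postgresql://localhost/test", props);
        PGReplicationStream stream = replicationConnection.unwrap(PGConnection.class)
            .getReplicationAPI()
            .replicationStream()
            .logical()
            .withSlotName("test_slot") // placeholder: a pre-created logical slot
            .start();

        // Thread 1: keeps reading from the stream.
        Thread reader = new Thread(() -> {
          try {
            while (true) {
              ByteBuffer msg = stream.readPending(); // or stream.read()
              if (msg == null) {
                Thread.sleep(10L);
              }
            }
          } catch (Exception expected) {
            // Expected behaviour: a single exception once the stream is closed.
            expected.printStackTrace();
          }
        });
        reader.start();

        // Thread 2 (here: main): closes the stream while thread 1 is still reading.
        Thread.sleep(5000L);
        stream.close();
        replicationConnection.close();
      }
    }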

Expected behaviour

Thread 1 throws an exception.

And what actually happens

An infinite loop printing the message "Ignoring request to process copy results, already processing".

Logs
If possible, include PostgreSQL logs surrounding the occurrence of the issue.
Additionally, logs from the driver can be obtained by adding

loggerLevel=TRACE&loggerFile=pgjdbc-trace.log 

to the connection string.

davecramer added a commit to davecramer/pgjdbc that referenced this issue Oct 29, 2019
@davecramer (Member):

@hyunkshinft can you test the PR above?

@hyunkshinft (Contributor, Author):

@davecramer thanks for the quick fix. I will try this as soon as I can and let you know of the result.

@hyunkshinft (Contributor, Author):

@davecramer unfortunately this did not fix the problem. FYI, the following is the piece of code that closes the stream.

        try {
            if (context.stream != null && !context.stream.isClosed()) {
                context.stream.close();
            }
            if (context.replicationConnection != null) {
                context.replicationConnection.close();
            }
        } catch (SQLException tolerable) {
            // ignored: we are shutting down anyway
        }

@davecramer (Member):

Hmmm, strange. I would think that should cause an exception to be thrown.
FYI, there's no thread safety here; in reality you should not be closing it from another thread.
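
For illustration, a sketch of that arrangement: signal the reader and let the reader, as the only thread that touches the stream, perform the close. Class and method names here are made up for illustration; only the PGReplicationStream calls are real pgjdbc API.

    import java.nio.ByteBuffer;
    import java.sql.Connection;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicBoolean;

    import org.postgresql.replication.PGReplicationStream;

    class SingleOwnerReader {
      private final PGReplicationStream stream;
      private final Connection replicationConnection;
      private final AtomicBoolean shutdownRequested = new AtomicBoolean(false);

      SingleOwnerReader(PGReplicationStream stream, Connection replicationConnection) {
        this.stream = stream;
        this.replicationConnection = replicationConnection;
      }

      void requestShutdown() {
        shutdownRequested.set(true); // safe to call from any thread
      }

      void readLoop() throws Exception {
        try {
          while (!shutdownRequested.get()) {
            ByteBuffer msg = stream.readPending();
            if (msg == null) {
              TimeUnit.MILLISECONDS.sleep(10);
            } else {
              handle(msg);
            }
          }
        } finally {
          // Only this thread ever touches the stream and its connection.
          if (!stream.isClosed()) {
            stream.close();
          }
          replicationConnection.close();
        }
      }

      private void handle(ByteBuffer wal) {
        // hypothetical application-specific processing
      }
    }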

@hyunkshinft (Contributor, Author):

I thought so too. Your fix seems very obvious to me, but somehow it got past the connection-close check.

@davecramer (Member):

Something else must be going on. Any chance you can debug it or provide a stand-alone test case?

@hyunkshinft (Contributor, Author):

It does not always happen and might not reproduce under a debugger, but I will try. I am sorry, but I don't know the exact scenario yet.

@hyunkshinft (Contributor, Author) commented Oct 30, 2019:

@davecramer one thing I want to note about this particular connection: the WAL records are usually as large as 48 MB, and the end-to-end connection is a very long pipe involving at least two SSH tunnels.

@davecramer (Member):

@hyunkshinft OK, I have some cycles to look at this. Any update before I do?

@hyunkshinft (Contributor, Author):

@davecramer Now that I don't close the stream, I no longer hit the "Ignoring request ..." message. I even added an exception in processCopyResults to fail immediately, as below, in order to avoid excessive logging when this error occurs.

    if ( processingCopyResults.compareAndSet(false,true) == false ) {
      PSQLException error = new PSQLException(GT
          .tr("Ignoring request to process copy results, already processing"),
          PSQLState.UNEXPECTED_ERROR);
      LOGGER.log(Level.INFO, "Ignoring request to process copy results, already processing", error);
      throw error;
//      return null;
    }

I know you implemented this thread-protection code to address the "Unexpected packet type during copy" error, but I am still getting that error without hitting the exception code above.

class java.io.IOException: Unexpected packet type during copy: 109
org.postgresql.core.v3.QueryExecutorImpl::processCopyResults(QueryExecutorImpl.java::1293)
org.postgresql.core.v3.QueryExecutorImpl::readFromCopy(QueryExecutorImpl.java::1064)
org.postgresql.core.v3.CopyDualImpl::readFromCopy(CopyDualImpl.java::37)
org.postgresql.core.v3.replication.V3PGReplicationStream::receiveNextData(V3PGReplicationStream.java::158)
org.postgresql.core.v3.replication.V3PGReplicationStream::readInternal(V3PGReplicationStream.java::123)

My code has retry logic (close the existing stream and connection, then reopen) in case of a PSQLException. It works for a while after a retry, then hits the same error again and again.

class java.io.IOException: Unexpected packet type during copy: 32
org.postgresql.core.v3.QueryExecutorImpl::processCopyResults(QueryExecutorImpl.java::1293)
org.postgresql.core.v3.QueryExecutorImpl::readFromCopy(QueryExecutorImpl.java::1064)
org.postgresql.core.v3.CopyDualImpl::readFromCopy(CopyDualImpl.java::37)
org.postgresql.core.v3.replication.V3PGReplicationStream::receiveNextData(V3PGReplicationStream.java::158)
org.postgresql.core.v3.replication.V3PGReplicationStream::readInternal(V3PGReplicationStream.java::123)

This work-then-fail cycle goes on until it hits an OOM, which happens when the "unexpected packet type" byte coincidentally happens to be 'E' or 'N'. If the driver then reads a large elen or nlen, the process dies with an OOM as below:

class java.lang.OutOfMemoryError: Java heap space
org.postgresql.core.VisibleBufferedInputStream::doubleBuffer(VisibleBufferedInputStream.java::152)
org.postgresql.core.VisibleBufferedInputStream::readMore(VisibleBufferedInputStream.java::136)
org.postgresql.core.VisibleBufferedInputStream::ensureBytes(VisibleBufferedInputStream.java::109)
org.postgresql.core.PGStream::receiveErrorString(PGStream.java::383)
org.postgresql.core.v3.QueryExecutorImpl::receiveErrorResponse(QueryExecutorImpl.java::2463)
org.postgresql.core.v3.QueryExecutorImpl::processCopyResults(QueryExecutorImpl.java::1144)
org.postgresql.core.v3.QueryExecutorImpl::writeToCopy(QueryExecutorImpl.java::1026)
org.postgresql.core.v3.CopyDualImpl::writeToCopy(CopyDualImpl.java::19)

I added checks in receiveErrorResponse() and receiveNoticeResponse() to return/throw an error when elen or nlen is invalid, as below.

    if (elen <= 4 || elen >= 100 * 1024) { // error message greater than 100K? Something is wrong.
      return new PSQLException(GT.tr("Protocol error.  Invalid error message length " + elen),
          PSQLState.PROTOCOL_VIOLATION);
    }

And

    if (nlen <= 4 || nlen >= 100 * 1024) { // notice message greater than 100K? Something is wrong.
      throw new IOException("Protocol error.  Invalid notice message length " + nlen);
    }

And now it fails in a different place.

class java.lang.OutOfMemoryError: Java heap space
org.postgresql.core.PGStream::receive(PGStream.java::473)
org.postgresql.core.v3.QueryExecutorImpl::processCopyResults(QueryExecutorImpl.java::1247)
org.postgresql.core.v3.QueryExecutorImpl::readFromCopy(QueryExecutorImpl.java::1064)
org.postgresql.core.v3.CopyDualImpl::readFromCopy(CopyDualImpl.java::37)
org.postgresql.core.v3.replication.V3PGReplicationStream::receiveNextData(V3PGReplicationStream.java::158)
org.postgresql.core.v3.replication.V3PGReplicationStream::readInternal(V3PGReplicationStream.java::123)
org.postgresql.core.v3.replication.V3PGReplicationStream::readPending(V3PGReplicationStream.java::80)

I have never seen the message "Ignoring request to process copy results, already processing" when I don't stop the stream, yet I am still getting the errors above. Doesn't that mean the "Unexpected ..." error might not be related to a thread race condition in the first place?

@hyunkshinft (Contributor, Author) commented Nov 1, 2019:

@davecramer I will describe our code design here a bit in order to help you understand what is going on.

I have two threads.
The first is a producer that keeps reading from the stream and adding WAL data (ByteBuffer) to a ConcurrentLinkedQueue.
The second is a consumer that takes WAL data (ByteBuffer) from the queue and processes it. The consumer thread never accesses the stream (previously it closed the stream when it decided to exit, but now it does not).

There is flow control in place between the producer and the consumer, implemented through the concurrent queue.

I assume the stream API allocates a new ByteBuffer every time stream.read() or stream.readPending() is called, so the consumer completely owns the buffer once it has been returned by stream.read().
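
A condensed sketch of that producer/consumer layout (names are illustrative; the real code's flow control and shutdown handling are omitted):

    import java.nio.ByteBuffer;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.TimeUnit;

    import org.postgresql.replication.PGReplicationStream;

    class WalPipeline {
      private final ConcurrentLinkedQueue<ByteBuffer> queue = new ConcurrentLinkedQueue<>();

      void run(PGReplicationStream stream) {
        // Producer: the only thread that touches the stream.
        Thread producer = new Thread(() -> {
          try {
            while (!stream.isClosed()) {
              ByteBuffer wal = stream.readPending();
              if (wal == null) {
                TimeUnit.MILLISECONDS.sleep(10);
              } else {
                queue.add(wal); // hand the buffer off; the producer never reuses it
              }
            }
          } catch (Exception e) {
            e.printStackTrace();
          }
        });

        // Consumer: drains the queue and never accesses the stream.
        Thread consumer = new Thread(() -> {
          while (true) {
            ByteBuffer wal = queue.poll();
            if (wal != null) {
              process(wal);
            }
            // real code would block or apply back-pressure instead of spinning
          }
        });

        producer.start();
        consumer.start();
      }

      private void process(ByteBuffer wal) {
        // hypothetical application-specific WAL handling
      }
    }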

@davecramer (Member):

@hyunkshinft your code must have more than one reader. How else would it be possible to get around the lock that was put in to avoid the unexpected-character message? This might also explain the first symptom.

@hyunkshinft (Contributor, Author):

@davecramer I am pretty sure that we have only one reader thread. That is why I have never seen the log message or the exception produced by this code.

    if ( processingCopyResults.compareAndSet(false,true) == false ) {
      PSQLException error = new PSQLException(GT
          .tr("Ignoring request to process copy results, already processing"),
          PSQLState.UNEXPECTED_ERROR);
      LOGGER.log(Level.INFO, "Ignoring request to process copy results, already processing", error);
      throw error;
//      return null;
    }

@davecramer (Member):

OK, let me try to write a test case to test this.

@hyunkshinft (Contributor, Author):

@davecramer I think I found a cause of the "Unexpected packet type ..." problems as well as the OOM, but I am not sure whether it has anything to do with the original infinite-loop issue reported here. As long as I don't close the stream, I have not encountered the infinite loop. What is so bad about the infinite loop is that it produces so many log lines; in production it increases cloud logging costs dramatically without being noticed, since it is just an INFO message.

@davecramer (Member):

@hyunkshinft I'd love to know the cause. Are you able to share?

@hyunkshinft (Contributor, Author) commented Nov 8, 2019:

Here is a proposed fix for the "class java.io.IOException: Unexpected packet type during copy" error and the OOM. The root cause of these problems is that when a SocketTimeoutException is hit in the middle of a CopyData message, the driver should not fail but simply keep waiting, as long as the connection is still connected. Otherwise, the next processCopyResults call starts at a random position in the stream: if the byte at that position is not a known message type, you end up with "Unexpected packet type ...", and if it happens to be a valid message type, the next 4 bytes are taken as the message length, which is completely random. So you end up with unpredictable errors, including OOM, thereafter. A SocketTimeoutException can happen for various reasons: a slow server, network latency, etc.

diff -Naur 42.2.6/pgjdbc/pgjdbc/src/main/java/org/postgresql/core/VisibleBufferedInputStream.java 42.2.6-patched/pgjdbc/pgjdbc/src/main/java/org/postgresql/core/VisibleBufferedInputStream.java
--- 42.2.6/pgjdbc/pgjdbc/src/main/java/org/postgresql/core/VisibleBufferedInputStream.java	2019-11-08 09:27:07.000000000 -0800
+++ 42.2.6-patched/pgjdbc/pgjdbc/src/main/java/org/postgresql/core/VisibleBufferedInputStream.java	2019-11-08 00:58:19.000000000 -0800
@@ -8,6 +8,7 @@
 import java.io.EOFException;
 import java.io.IOException;
 import java.io.InputStream;
+import java.net.SocketTimeoutException;
 
 /**
  * A faster version of BufferedInputStream. Does no synchronisation and allows direct access to the
@@ -138,7 +139,7 @@
       canFit = buffer.length - endIndex;
     }
     int read = wrapped.read(buffer, endIndex, canFit);
-    if (read < 0) {
+    if (read <= 0) {
       return false;
     }
     endIndex += read;
@@ -211,7 +212,12 @@
 
     // then directly from wrapped stream
     do {
-      int r = wrapped.read(to, off, len);
+      int r;
+      try {
+        r = wrapped.read(to, off, len);
+      } catch (SocketTimeoutException e) {
+        return read;
+      }
       if (r <= 0) {
         return (read == 0) ? r : read;
       }
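
For context, a small self-contained illustration (demo code, not pgjdbc internals) of why a read that resumes in the middle of a message breaks the framing. Backend protocol messages are a 1-byte type ('d' for CopyData, 'E' for ErrorResponse, 'N' for NoticeResponse), a self-inclusive Int32 length, and then the payload:

    import java.io.ByteArrayInputStream;
    import java.io.DataInputStream;
    import java.io.IOException;

    public class FramingDemo {
      public static void main(String[] args) throws IOException {
        // A CopyData ('d') message whose 6-byte payload happens to contain 'm' and 'E'.
        byte[] copyData = {'d', 0, 0, 0, 10, 'm', 'E', 0x7f, 0x5a, 1, 2};

        // Correct parse: start at the message boundary.
        parseOne(new DataInputStream(new ByteArrayInputStream(copyData)));
        // -> type=d (100), claimed length=10

        // Broken parse: resume at the first payload byte, as if a partial read
        // had been abandoned after a SocketTimeoutException.
        parseOne(new DataInputStream(new ByteArrayInputStream(copyData, 5, copyData.length - 5)));
        // -> type=m (109): the "Unexpected packet type during copy: 109" case

        // Broken parse: resume one byte later, where the payload byte is 'E';
        // the next 4 payload bytes become an enormous "message length" (the OOM path).
        parseOne(new DataInputStream(new ByteArrayInputStream(copyData, 6, copyData.length - 6)));
      }

      static void parseOne(DataInputStream in) throws IOException {
        int type = in.read();
        int claimedLength = in.readInt(); // whatever 4 bytes happen to follow
        System.out.printf("type=%c (%d), claimed length=%d%n", (char) type, type, claimedLength);
      }
    }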

@hyunkshinft (Contributor, Author):

@davecramer let me know what you think.

@davecramer (Member):

Were you able to confirm that this fixes things for you? I'd like to remove the code I put in and try this with the test case that does a very large copy. It seems like this might be the underlying cause.

@hyunkshinft (Contributor, Author):

Yes, I have been running this code without problems for about a day now. It works for both 42.2.6 and 42.2.8, but as I said, 42.2.8 has other problems, so yes, you may consider reverting your change.

@hyunkshinft (Contributor, Author):

If you can briefly pause the communication during CopyData, you can reproduce this problem easily.

@davecramer (Member) commented Nov 8, 2019 via email

@davecramer (Member):

OK, I removed the lock and put your code in, and I can still get the unexpected-character error pretty easily.

@hyunkshinft (Contributor, Author) commented Nov 8, 2019 via email

@davecramer (Member):

No, not using threaded readers. As I said, JDBC is not thread-safe.
See #1466 and scroll to the end. The folks from Braintree who originally found the problem are testing and have not been able to repeat it. I need to look at this further next week.

@hyunkshinft (Contributor, Author):

Which means you can still reproduce the problem with the new code while the reporter of #1466 cannot?

@davecramer (Member):

That is the short version, yes. But I am using their code to reproduce it, so at the moment I don't have an answer.

@hyunkshinft (Contributor, Author):

It could be that something other than SocketTimeoutException also causes processCopyResults to fail prematurely. If you can reproduce it easily, I suggest you catch all Throwables instead of just SocketTimeoutException and see what else is being caught.

@davecramer (Member):

I caught everything. Still failed.

davecramer self-assigned this Nov 28, 2019

@hyunkshinft (Contributor, Author) commented Dec 24, 2019:

@davecramer I noticed that my proposed fix didn't make it into the official release, so I modified the test created for #1466 in order to reproduce the error quickly and reliably. The test reproduces the "Unexpected packet type" error using version 42.2.9. I also included the test with my proposed fix. Please find the modified test case here:

https://github.com/hyunkshinft/streamProblems#proposed-fix

Hopefully you can incorporate my proposed fix in the next official release.

@davecramer (Member):

@hyunkshinft I did try this and was able to reproduce the error more quickly. I think I may have to revisit this when I come back in the new year. Thanks for reminding me.

@hyunkshinft (Contributor, Author):

@davecramer please note that the proposed fix has been slightly changed.

@davecramer (Member):

OK, I've removed the "hack" and applied your patch to the latest code, and this seems to work. I'd like a few others to test it.

I will push a fix so everyone can test it.

@davecramer (Member):

Thanks, see #1665.

@davecramer (Member):

So it appears that catching and ignoring the socket timeout exception causes other issues with getting async notifies. We rely on the socket timeout, and since the exception is caught and ignored, we never time out.

@davecramer (Member):

@hyunkshinft yeah, while this fixes one problem it causes others. It needs a rethink. Honestly, I think we really need to look at reading data from the backend in a separate thread.

@hyunkshinft (Contributor, Author) commented Jan 10, 2020:

@davecramer Sorry for the late response; I have been on vacation. I agree that it could cause an issue with non-blocking reads. I will take a look next week and propose a quick solution. I think we should handle the socket timeout differently when we are expecting a fresh new message versus when we have already read partial data.

@davecramer (Member):

@hyunkshinft no worries. As I said, I think we need to have reads running in a background thread.

@hyunkshinft (Contributor, Author):

@davecramer is there an easy way to set up a regression test environment locally so that I can test my fix?

@davecramer (Member) commented Jan 14, 2020 via email

@hyunkshinft (Contributor, Author):

@davecramer I have added an additional fix that addresses the problem with non-blocking IO. Hopefully this fixes the test failure you saw. The idea of the fix is that the socket timeout applies only at a message boundary: the timeout is honoured while waiting for the first byte of a message in processNotifies and processCopyResults, but once the first byte of a new message has been read, any socket timeout thereafter is ignored until the full message has been received, in order to preserve the message boundary. The fix is implemented on top of 42.2.6. Please take a look when you have a chance.

https://github.com/hyunkshinft/streamProblems#proposed-fix
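
To make the idea concrete, here is a minimal sketch of the "timeout only at a message boundary" behaviour. This is illustrative code, not the actual patch; the real change is in VisibleBufferedInputStream and the processCopyResults/processNotifies paths mentioned above.

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.SocketTimeoutException;

    final class BoundaryAwareReader {
      private final InputStream in;

      BoundaryAwareReader(InputStream in) {
        this.in = in;
      }

      /** Backend frame: 1-byte type + Int32 length (self-inclusive) + payload. */
      byte[] readMessage() throws IOException {
        int type = in.read(); // a SocketTimeoutException here is allowed to escape to the caller
        if (type < 0) {
          throw new IOException("EOF from backend");
        }
        byte[] lenBytes = readFully(4);
        int len = ((lenBytes[0] & 0xFF) << 24) | ((lenBytes[1] & 0xFF) << 16)
            | ((lenBytes[2] & 0xFF) << 8) | (lenBytes[3] & 0xFF);
        return readFully(len - 4); // payload
      }

      /** Inside a message: retry on timeout instead of surfacing it. */
      private byte[] readFully(int n) throws IOException {
        byte[] buf = new byte[n];
        int off = 0;
        while (off < n) {
          int r;
          try {
            r = in.read(buf, off, n - off);
          } catch (SocketTimeoutException midMessage) {
            continue; // preserve framing: wait for the rest of this message
          }
          if (r < 0) {
            throw new IOException("EOF in the middle of a message");
          }
          off += r;
        }
        return buf;
      }
    }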

@davecramer (Member):

@hyunkshinft OK, better, but now the setNetworkTimeout test is failing; see #1679

@hyunkshinft (Contributor, Author):

@davecramer thanks for checking this out. I am going to address the test failure in #1681.

@hyunkshinft (Contributor, Author):

@davecramer I have fixed all the unit test failures in #1681. Would you please review that PR, or review and incorporate the latest changes into yours for inclusion in the next release?

davecramer added a commit that referenced this issue Jan 30, 2020

* fix: Fixes issue #1592 where one thread is reading the copy and another thread closes the connection (#1594)
paplorinc pushed a commit to paplorinc/pgjdbc that referenced this issue Feb 10, 2020
* origin/master: (427 commits)
  refactor: make PSQLState enum consts for integrity constraint violations (pgjdbc#1699)
  [maven-release-plugin] prepare for next development iteration
  [maven-release-plugin] prepare release REL42.2.10
  pass gpg key through arguments
  add passphrase to release mvn task
  chore: update signing key
  Metadata queries improvment (pgjdbc#1694)
  WIP release notes for 42.2.10 (pgjdbc#1688)
  chore(deps): bump checkstyle from 8.28 to 8.29 in /pgjdbc (pgjdbc#1691)
  Cleanup PGProperty, sort values, and add some missing to docs (pgjdbc#1686)
  fix: Fixes issue pgjdbc#1592 where one thread is reading the copy and another thread closes the connection (pgjdbc#1594)
  Fixing LocalTime rounding (losing precision) (pgjdbc#1570)
  sync error message value with tested value (pgjdbc#1664)
  add DatabaseMetaDataCacheTest to test suite to run it (pgjdbc#1685)
  Fix Network Performance of PgDatabaseMetaData.getTypeInfo() method (pgjdbc#1668)
  fix: Issue pgjdbc#1680 updating a boolean field requires special handling to set it to t or f instead of true or false (pgjdbc#1682)
  fix testSetNetworkTimeoutEnforcement test failure (pgjdbc#1681)
  minor: fix checkstyle violation of unused import (pgjdbc#1683)
  fix: pgjdbc#1677 NumberFormatException when fetching PGInterval with small value (pgjdbc#1678)
  fix: actually use milliseconds instead of microseconds for timeouts (pgjdbc#1653)
  ...
@ejboy commented May 31, 2022:

@hyunkshinft @davecramer I wonder if the change in https://github.com/pgjdbc/pgjdbc/pull/1681/files#diff-1c9bbe3aa0c1f2d0d7a6ecef10d981c2be064a526e8f91e2b47b04823fd50dedR142 is intentional, i.e. immediately returning false instead of setting available to false?

The problem with that change is that the nextStreamAvailableCheckTime field is not updated, which can lead to repeated calls to hasMessagePending each blocking for 1 millisecond.
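
For anyone following along, a self-contained illustration of the rate-limited availability check being described, and of why skipping the timestamp update makes every subsequent call pay the ~1 ms poll again. Field and method names below are illustrative rather than pgjdbc's actual ones:

    import java.util.concurrent.TimeUnit;

    class AvailabilityChecker {
      private final long minCheckDelayMillis = 8;
      private long nextCheckTime = 0;

      boolean hasMessagePending() {
        long now = System.currentTimeMillis();
        if (now < nextCheckTime) {
          return false; // rate-limited: skip the expensive poll
        }
        boolean available = blockingPollForUpToOneMillisecond();
        // The key line: if the timeout path returns before this update,
        // the next call blocks for ~1 ms again instead of being rate-limited.
        nextCheckTime = now + minCheckDelayMillis;
        return available;
      }

      private boolean blockingPollForUpToOneMillisecond() {
        try {
          TimeUnit.MILLISECONDS.sleep(1); // stands in for a socket read with a 1 ms timeout
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
        return false;
      }
    }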

@davecramer (Member) commented Oct 11, 2022 via email

@ejboy commented Oct 13, 2022:

Thank you for replying.

> Have you tested your solution? Can you provide a PR with the change so we can review it?

Yes, we've run into this issue due to some odd usage of the driver on our side, where it manifested as essentially 1 millisecond being wasted, which was unexpected for us. The fix has been in place and has been tested for a couple of months.

I will prepare a PR soon. (It may take a week or more, because I need to run it by legal, etc.)
