Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

javax.net.ssl.SSLException: handshake timed out #907

Closed
gurudatta11 opened this issue Nov 21, 2019 · 22 comments
Closed

javax.net.ssl.SSLException: handshake timed out #907

gurudatta11 opened this issue Nov 21, 2019 · 22 comments
Labels
status/cannot-reproduce We cannot reproduce this issue

Comments

@gurudatta11
Copy link

gurudatta11 commented Nov 21, 2019

System Architecture

We as client can communicate to different devices (servers) with different ip addresses.
All the servers share a common root certificate to expose it as TLS, but unique key/ keystore per device.

Expected Behavior

No Handshake time out should occur.

When making a call to few devices concurrently (let say less than some threshold 4) error will not occur, but if we invoke calls concurrently more than this threshold , getting to see handshake timeout issues.

Debugged with option -Djavax.net.debug=ssl , still no luck on figuring out why the issue is happening.

There is no problem with the server, when we try like one on one with a server, we never encountered handshake timeout issue, but when tried on multiple servers concurrently few error out and few get success in handshake.

I think there is some concurrency issues going on in reactor netty, unable to figure out where.
Please share architecture diagram if there is any for reactory netty .

Any pointers would be helpful to resolve this issue.

Actual Behavior

Getting Below error:

2019-11-20 13:07:22,108 361283 [reactor-http-epoll-8] WARN  r.n.http.client.HttpClientConnect - [id: 0xab3f418a, L:/<ip1>:40554 - R:<ip2>/<ip2>:5000] The connection observed an
 error
javax.net.ssl.SSLException: handshake timed out
        at io.netty.handler.ssl.SslHandler$5.run(SslHandler.java:2011)
        at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
        at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:150)
        at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:510)
        at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:413)
        at io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1050)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(Thread.java:748)
reactor-http-epoll-8, called closeOutbound()
reactor-http-epoll-8, closeOutboundInternal()
reactor-http-epoll-8, SEND TLSv1.2 ALERT:  warning, description = close_notify
reactor-http-epoll-8, WRITE: TLSv1.2 Alert, length = 26
reactor-http-epoll-8, called closeInbound()
reactor-http-epoll-8, fatal error: 80: Inbound closed before receiving peer's close_notify: possible truncation attack?
javax.net.ssl.SSLException: Inbound closed before receiving peer's close_notify: possible truncation attack?
%% Invalidated:  [Session-1010, TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256]
reactor-http-epoll-8, SEND TLSv1.2 ALERT:  fatal, description = internal_error
reactor-http-epoll-8, Exception sending alert: java.io.IOException: writer side was already closed.
2019-11-20 13:07:22,110 361285 [org.springframework.kafka.KafkaListenerEndpointContainer#1-1-C-1] ERROR com.bmg.service.HttpService - javax.net.ssl.SSLException: handshake timed out, {}
reactor.core.Exceptions$ReactiveException: javax.net.ssl.SSLException: handshake timed out
        at reactor.core.Exceptions.propagate(Exceptions.java:326)
        at reactor.core.publisher.BlockingSingleSubscriber.blockingGet(BlockingSingleSubscriber.java:91)
        at reactor.core.publisher.Mono.block(Mono.java:1494)

client make a call to server
gets HandshakeTimeout after around 1.15 min to 1.40 min

Since default HandshakeTimeout is 10 secs, also tried setting below system variable property , but still handshake timeout occurs.
-Dreactor.netty.tcp.sshHandshakeTimeout=120000
HandshakeTimeout didn't occur after 2 minutes are per above config, it occurred around same range 1.15 min to 1.40 min,

Even though error says handshakeTimeout, it feels like this call is being internally queued and tried after certain time and then handshakeTimeout occurs.

Steps to Reproduce

JdkSsl context being used by reactor netty.
Getting HttpClient as below:
Note: getting a newConnection (instead of HttpClient.create() ) else there is weird concurrency problem going on, instead of hitting one server it's hitting different server and also used to get reactor.netty.http.client.PrematureCloseException (Reference: https://projectreactor.io/docs/netty/release/reference/index.html#_connect) hence using newConnection.

public HttpClient getHttpClient(SslContext sslContext, int connectTimeOutInMilliSeconds,
    int readTimeOutInMilliSeconds) {

  HttpClient httpClient = HttpClient.newConnection().tcpConfiguration(tcpClient ->
      tcpClient
          .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, connectTimeOutInMilliSeconds)
          .doOnConnected(connection -> connection
              .addHandlerLast(new ReadTimeoutHandler(readTimeOutInMilliSeconds,
                  TimeUnit.MILLISECONDS))
              .addHandlerLast((new WriteTimeoutHandler(readTimeOutInMilliSeconds,
                  TimeUnit.MILLISECONDS)))));
  if (sslContext != null) {
    httpClient = httpClient.secure(sslContextSpec -> sslContextSpec.sslContext(sslContext));
  }
  return httpClient;
}

Getting sslContext as below:

  private SslContext getTrustAllSslWebClient() {
    try {
      return SslContextBuilder
          .forClient()
          .trustManager(InsecureTrustManagerFactory.INSTANCE)
          .build();
    } catch (SSLException e) {
//ignore
    }
  }

Minimal yet complete reproducer code (or URL to code)

This is difficult to reproduce without complete production setup.

Possible Solution

Your Environment

  • Reactor version(s) used:
    reactor-netty :0.8.13:RELEASE

  • Other relevant libraries versions (eg. netty, ...):
    netty -> 4.1.43.FINAL
    derived based on spring boot parent version given below

 <parent>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-parent</artifactId>
    <version>2.1.10.RELEASE</version>
    <relativePath/> <!-- lookup parent from repository -->
  </parent>

JVM version (e.g. java -version)

openjdk version "1.8.0_222"
Also tried on java11.

OS version (e.g. uname -a)

4.15.0-66-generic #75-Ubuntu

@gurudatta11 gurudatta11 added status/need-triage A new issue that still need to be evaluated as a whole type/bug A general bug labels Nov 21, 2019
@gurudatta11
Copy link
Author

is it related to worker count ? some how the error as handshake time out is misleading.
-Dreactor.netty.ioWorkerCount=16

Created issue at netty project, got deferred to here, fyi, netty/netty#9792

@violetagg
Copy link
Member

@gurudatta11 Is it possible that you open too many connections?

Look here #796 (comment)

Try to switch to a fixed connection pool and tell us whether you see the issue with this configuration.

@violetagg violetagg added for/user-attention This issue needs user attention (feedback, rework, etc...) and removed status/need-triage A new issue that still need to be evaluated as a whole labels Nov 21, 2019
@gurudatta11
Copy link
Author

SSLException with handshake time out is gone,
with either of the below fixes

  1. Setting worker count to a higher number
    reactor.netty.ioWorkerCount=128

  2. By making a fixed connection pool with max connections a higher number.

@gurudatta11
Copy link
Author

Weird thing is I don't think it's not handshake time out exception at all, it's connection pool or worker count problem. The way the error is propagated is wrong.
@violetagg

@violetagg
Copy link
Member

violetagg commented Nov 22, 2019

@gurudatta11 Did you check the link to the issue above? Note that with version 0.9 the connection pool is fixed by default, but with version 0.8 it is elastic by default.

@gurudatta11
Copy link
Author

@violetagg
sorry, I have not seen the link you provided.
After checking out, #796 (comment)
yes, that will fix the issue if the number of connections are less than the given fixed limit (500) or user defined.

But not sure, if that's the actual fix, if connections are being queued internally in the thread pool after the max limit, what's the expected behavior ??
If below two points are not valid, this issue can be closed.

  • Is that supposed to timeout if the connection pool is full, instead of being queued to get a connection from pool ?
  • Even if times out, why is error being thrown as different error (handshake time out).
    (may be log as warn what the actual error is , or output actual error)

Also logging as warn, if connection pool reached maximum limit might be helpful.

For my current setup:
I check the number of processors are 4,
so workers are being given as 4 , as per code, and default acquireTimeout is 45 seconds reactor netty
https://github.com/reactor/reactor-netty/blob/master/src/main/java/reactor/netty/resources/LoopResources.java#L47

https://github.com/reactor/reactor-netty/blob/master/src/main/java/reactor/netty/resources/ConnectionProvider.java#L47

Since in my prod code, I used new connection,
https://github.com/reactor/reactor-netty/blob/master/src/main/java/reactor/netty/resources/NewConnectionProvider.java#L49

Also, Is it possible to update existing documentation with high level architecture diagram for reactor-netty.

@violetagg
Copy link
Member

@violetagg
sorry, I have not seen the link you provided.
After checking out, #796 (comment)
yes, that will fix the issue if the number of connections are less than the given fixed limit (500) or user defined.

But not sure, if that's the actual fix, if connections are being queued internally in the thread pool after the max limit, what's the expected behavior ??

If you use the fixed ConnectionProvider you cannot go beyond the max connections, which is not the case when you use elastic. elastic is by default for 0.8.x, while fixed is by default for 0.9.x

If below two points are not valid, this issue can be closed.

  • Is that supposed to timeout if the connection pool is full, instead of being queued to get a connection from pool ?

If you use the fixed ConnectionProvider there is acquire timeout, which is not the case when you use elastic.

  • Even if times out, why is error being thrown as different error (handshake time out).
    (may be log as warn what the actual error is , or output actual error)

Once a connection is acquired/created, TLS handshake starts. If there are too many connections then it might be that 10s (the default TSL handshake timeout) is not enough.

Also logging as warn, if connection pool reached maximum limit might be helpful.

Please create an enhancement issue for this

For my current setup:
I check the number of processors are 4,
so workers are being given as 4 , as per code, and default acquireTimeout is 45 seconds reactor netty
https://github.com/reactor/reactor-netty/blob/master/src/main/java/reactor/netty/resources/LoopResources.java#L47

https://github.com/reactor/reactor-netty/blob/master/src/main/java/reactor/netty/resources/ConnectionProvider.java#L47

Since in my prod code, I used new connection,
https://github.com/reactor/reactor-netty/blob/master/src/main/java/reactor/netty/resources/NewConnectionProvider.java#L49

This provider does not use pooling. Which means we will create a connection every time and then this connection will be thrown after finishing. There is no acquire timeout.

Also, Is it possible to update existing documentation with high level architecture diagram for reactor-netty.

Please create an enhancement issue for this

@gurudatta11
Copy link
Author

Feature requests:
#916
and
#917
are created

@violetagg
As mentioned before, I don't see any issue with SSL timeout in general (it's not taking 10 secs for ssl handshake, after debugging the network and server).
in production without changing code, by setting environment variable, we observed zero issues with SSL handshake timeout.

Setting worker count to a higher number
reactor.netty.ioWorkerCount=128

But when the required connections are more than the configured max connection, it's reporting as ssl handshake timeout.

@violetagg violetagg added this to the 0.9.x Maintenance Backlog milestone Mar 20, 2020
@Nikhilkoneru
Copy link

Nikhilkoneru commented Jul 1, 2020

Even I had the same issue. Try this, it fixed my issue. The problem is not with connection timeout its with read timeout

@Bean
    public WebClient webClient(){
        HttpClient httpClient = HttpClient.create()
                .tcpConfiguration(tcpClient -> {
                    tcpClient = tcpClient.option(ChannelOption.CONNECT_TIMEOUT_MILLIS, 60000);
                    tcpClient = tcpClient.doOnConnected(conn -> conn
                            .addHandlerLast(new ReadTimeoutHandler(60000, TimeUnit.MILLISECONDS)));
                    return tcpClient;
                });
        ClientHttpConnector connector = new ReactorClientHttpConnector(httpClient);

        return WebClient.builder().baseUrl(endpoint).clientConnector(connector).build();
    }

@uvarajk
Copy link

uvarajk commented Apr 29, 2021

@violetagg - We're receiving the SSL Handshake Timout Exception in our production environment.

Captured few details for the reference those are mentioned below :

  1. In the Debug Log below are the default connection pool configurations
    r.n.resources.PooledConnectionProvider : Creating a new [http] client pool [PoolFactory{evictionInterval=PT0S, leasingStrategy=fifo, maxConnections=500, maxIdleTime=-1, maxLifeTime=-1, metricsEnabled=false, pendingAcquireMaxCount=1000, pendingAcquireTimeout=45000}] for [/172.25.0.12:443]

  2. We are able to see that the application is working fine for some time(4 to 5 hours) and slowly the responses are received late
    Taken <<24442>> milliseconds to retrieve data from endpoint
    Taken <<32875>> milliseconds to retrieve data from endpoint
    Taken <<2532046>> milliseconds to retrieve data from endpoint

  3. During the delayed responses following are the errors/exceptions
    io.netty.handler.timeout.ReadTimeoutException: null
    io.netty.handler.ssl.SslHandshakeTimeoutException: handshake timed out after 10000ms

  4. We have configured the following timeout's for the HTTPClient
    web-client.connect-timeout=5
    web-client.read-timeout=5
    web-client.use-connection-pooling=true
    reactor.netty.ioWorkerCount=128
    web-client.response-timeout=5

It would be a great help if we could know what's going wrong.

@violetagg
Copy link
Member

@uvarajk Please open a new issue, with Reactor Netty version, Java, OS, the scenario.

What component provides those properties?

web-client.connect-timeout=5
web-client.read-timeout=5
web-client.use-connection-pooling=true
reactor.netty.ioWorkerCount=128
web-client.response-timeout=5

You have response-timeout why do you specify read-timeout?

@uvarajk
Copy link

uvarajk commented Apr 29, 2021

@violetagg opened a new issue @ #1617 (comment)

@violetagg violetagg modified the milestones: 0.9.x Backlog, 1.0.x Backlog Jun 16, 2021
@santitigaga
Copy link

already has some solution for this issue?

@arun-a-nayagam
Copy link

Hi,
We are facing the same issue. Is the solution to set reactor.netty.ioWorkerCount to a higher value like 128?
Also, how to set this value (reactor.netty.ioWorkerCount=128)?
We are using Java SpringBoot Webflux code using webclient.

@violetagg
Copy link
Member

@santitigaga @arun-a-nayagam Please open a new issue with a reproducible example.

@violetagg
Copy link
Member

Closing this issue as there is no enough information in order to proceed with the investigation

@violetagg violetagg added status/cannot-reproduce We cannot reproduce this issue and removed type/bug A general bug for/user-attention This issue needs user attention (feedback, rework, etc...) labels Nov 1, 2021
@violetagg violetagg removed this from the 1.0.x Backlog milestone Nov 1, 2021
@DInTheName
Copy link

Why not fix the problem?

@forewei
Copy link

forewei commented Dec 4, 2022

I have the same code, different servers, some servers will have a handshake timeout, some will not

@DInTheName
Copy link

DInTheName commented Dec 4, 2022 via email

@DInTheName
Copy link

DInTheName commented Nov 28, 2023 via email

@dyrone
Copy link

dyrone commented Mar 21, 2024

upgrade jdk to 1.8.251

@DInTheName
Copy link

DInTheName commented Mar 26, 2024 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
status/cannot-reproduce We cannot reproduce this issue
Projects
None yet
Development

No branches or pull requests

9 participants