TLSv1.3 can fail with HTTP/2 and Session Tickets Enabled #10041

carl-mastrangelo · 2020-02-18T04:46:29Z

Netty version

4.1.45 + TCNative 2.0.28

JVM version (e.g. `java -version`)

Java 1.8

OS version (e.g. `uname -a`)

Linux / Mac

Repro

The steps to reproduce this are fairly difficult, and I don't know enough of the OpenSSL API, but I can give the manual steps I took to get here. Netflix is trying to enable TLSv1.3 on some of its servers, but it results in some corrupted SSL connections. This appears most commonly during connection startup, but I think it can happen at any point.

Typical errors look like SSL_ERROR_RX_RECORD_TOO_LONG in Firefox, or ERR_SSL_PROTOCOL_ERROR in Chrome, but this is in fact due to data corruption in Netty. The core issue is a bad interaction in the two overloads ofSslHandler.wrap:

netty/handler/src/main/java/io/netty/handler/ssl/SslHandler.java

Lines 1003 to 1058 in 136db86

    
           private SSLEngineResult wrap(ByteBufAllocator alloc, SSLEngine engine, ByteBuf in, ByteBuf out) 
        
                   throws SSLException { 
        
               ByteBuf newDirectIn = null; 
        
               try { 
        
                   int readerIndex = in.readerIndex(); 
        
                   int readableBytes = in.readableBytes(); 
        
                   // We will call SslEngine.wrap(ByteBuffer[], ByteBuffer) to allow efficient handling of 
        
                   // CompositeByteBuf without force an extra memory copy when CompositeByteBuffer.nioBuffer() is called. 
        
                   final ByteBuffer[] in0; 
        
                   if (in.isDirect() || !engineType.wantsDirectBuffer) { 
        
                       // As CompositeByteBuf.nioBufferCount() can be expensive (as it needs to check all composed ByteBuf 
        
                       // to calculate the count) we will just assume a CompositeByteBuf contains more then 1 ByteBuf. 
        
                       // The worst that can happen is that we allocate an extra ByteBuffer[] in CompositeByteBuf.nioBuffers() 
        
                       // which is better then walking the composed ByteBuf in most cases. 
        
                       if (!(in instanceof CompositeByteBuf) && in.nioBufferCount() == 1) { 
        
                           in0 = singleBuffer; 
        
                           // We know its only backed by 1 ByteBuffer so use internalNioBuffer to keep object allocation 
        
                           // to a minimum. 
        
                           in0[0] = in.internalNioBuffer(readerIndex, readableBytes); 
        
                       } else { 
        
                           in0 = in.nioBuffers(); 
        
                       } 
        
                   } else { 
        
                       // We could even go further here and check if its a CompositeByteBuf and if so try to decompose it and 
        
                       // only replace the ByteBuffer that are not direct. At the moment we just will replace the whole 
        
                       // CompositeByteBuf to keep the complexity to a minimum 
        
                       newDirectIn = alloc.directBuffer(readableBytes); 
        
                       newDirectIn.writeBytes(in, readerIndex, readableBytes); 
        
                       in0 = singleBuffer; 
        
                       in0[0] = newDirectIn.internalNioBuffer(newDirectIn.readerIndex(), readableBytes); 
        
                   } 
        
                   for (;;) { 
        
                       ByteBuffer out0 = out.nioBuffer(out.writerIndex(), out.writableBytes()); 
        
                       SSLEngineResult result = engine.wrap(in0, out0); 
        
                       in.skipBytes(result.bytesConsumed()); 
        
                       out.writerIndex(out.writerIndex() + result.bytesProduced()); 
        
                       switch (result.getStatus()) { 
        
                       case BUFFER_OVERFLOW: 
        
                           out.ensureWritable(engine.getSession().getPacketBufferSize()); 
        
                           break; 
        
                       default: 
        
                           return result; 
        
                       } 
        
                   } 
        
               } finally { 
        
                   // Null out to allow GC of ByteBuffer 
        
                   singleBuffer[0] = null; 
        
                   if (newDirectIn != null) { 
        
                       newDirectIn.release(); 
        
                   } 
        
               } 
        
           }

At the bottom, on line 1043, If the engine.wrap call results in a BUFFER_OVERFLOW, it resize the buffer to try again. The bug here is that when this happens, the original data that was attempted to be written gets lost. In my case, the first 68 or so bytes discarded, leaving a partial response to be written out. This is the first half of the bug.

The second part is the setup to this bug, and is the other wrap overload:

https://github.com/netty/netty/blob/netty-4.1.45.Final/handler/src/main/java/io/netty/handler/ssl/SslHandler.java#L801-L882

On line 821, the call to allocateOutNetBuf attempts to create a buffer large enough to hold the write. In my case, the readable data is 46 bytes. This results in a call to ReferenceCountedOpenSslEngine.calculateMaxLengthForWrap(), which adds 22 bytes of extra headroom, resulting in a 68 byte buffer. This is normally the correct amount if session tickets are not enabled. When they are, several hundred additional bytes are needed. These session ticket bytes are included by the OpenSSL library, and don't appear to be account for by Netty.

This series of events is most common when turning on TLSv1.3, HTTP/2, and Session Tickets. In my experimentation, the SSL handshake actually succeeds, but then crashes shortly after. The sequence of events looks like:

Call openssl s_client -tls1_3 -connect 127.0.0.1:7006 -alpn h2 -debug -msg
SSL Handshake proceeds successfully, resulting in the client logging a successful handshake[1].
On the server side, SslHandler.setHandshakeSuccess is invoked, and fires the event up the pipeline.
ApplicationProtocolNegotiationHandler captures the handshake event, sees h2 has been picked, and installs Http2FrameCodec, and invokes handlerAdded.
This causes the frame window to be written and flushed, along with the initial settings.
It calls all the way back down into the wrap() function as mentioned above, trying to wrap the 46 bytes of application data.
In a normal, non session ticket case, this sends the initial 46 bytes, which looks something like [2]. Note the initial 17 03 03, which correctly indicates this is application data.
In the failure case, Session Tickets get added in by the SSL library, the response is too big, causing engine.wrap to return -1. When attempted again, the session tickets are written out, but they are missing the 5 byte header that identifies them.
OpenSSL ends up writing bogus 450 byte packet, which the client reads, misinterprets as a bad length, and closes the connection.

I was able to force this to succeed by manually growing the out buffer in the wrap call to a very large size. This allows the initial engine.wrap to succeed, send the session ticket through, following by the application data. When this happens, openssl prints out Post-Handshake New Session Ticket arrived.

@normanmaurer I'm really not sure how to fix this, I have packet captures of most of this, with the failure case, the success case, and the non-session ticket case. I don't know enough of the OpenSSL API to make a call on what should happen.

[1]:

SSL handshake has read 3475 bytes and written 304 bytes
Verification: OK
---
New, TLSv1.3, Cipher is TLS_AES_256_GCM_SHA384
Server public key is 2048 bit
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
ALPN protocol: h2
Early data was not sent
Verify return code: 0 (ok)
---

[2]

read from 0x7fed4e50e0b0 [0x7fed4f00c803] (5 bytes => 5 (0x5))
0000 - 17 03 03 00 3f                                    ....?
<<< ??? [length 0005]
    17 03 03 00 3f
read from 0x7fed4e50e0b0 [0x7fed4f00c808] (63 bytes => 63 (0x3F))
0000 - 3c 0f 39 fe 64 c2 26 c5-8d 97 29 f0 ea ba ee aa   <.9.d.&...).....
0010 - 0f 58 d1 1b 3d 67 3a 36-3d 52 8f f6 ec ae a6 2f   .X..=g:6=R...../
0020 - 04 78 a4 16 97 40 40 2f-f2 58 ec a2 eb bd d3 24   .x...@@/.X.....$
0030 - 60 33 88 8c 8d 77 f2 6f-5e 9f ec 29 2d e3 2f      `3...w.o^..)-./
<<< TLS 1.3 [length 0001]
    17

The text was updated successfully, but these errors were encountered:

normanmaurer · 2020-02-18T14:17:30Z

@carl-mastrangelo thanks a lot of the detailed description... I will look into it.

normanmaurer · 2020-02-18T15:55:26Z

@carl-mastrangelo is this netty-tcnative-boringssl-static ?

carl-mastrangelo · 2020-02-18T16:07:51Z

@normanmaurer yes. I haven't tried with the dynamic artifacts.

normanmaurer · 2020-02-18T16:16:02Z

@carl-mastrangelo ok cool... Thanks for verifying...

@davidben we use SSL_max_seal_overhead to calculate the maximal overhead for encrypting. Does this not include what is needed for the session tickets here ? Or do we need to add some extra call ?

davidben · 2020-02-19T03:17:15Z

Ah yeah, we should probably be incorporating any of the pending handshake bytes into that. I didn't get to looking at this today, but I'll poke at it tomorrow. (Mostly I need to check whether anything else uses SSL_max_seal_overhead and would want the current behavior.)

Note this will mean that SSL_max_seal_overhead will change over the course of the connection, depending on what random incidental messages are queued up (NewSessionTicket at the start and then the occasional KeyUpdate, if anyone ever sent them). Would that be fine for you?

davidben · 2020-02-19T03:19:00Z

(This probably changed for you because we used to send NewSessionTicket during the server handshake, but now we defer it to the first SSL_write. Depending on the sizes of tickets, sizes of transport buffers, and exact I/O patterns, sending it eagerly could result in deadlocks or spurious TCP resets and we wanted to surprising behaviors or random cliffs like that.)

normanmaurer · 2020-02-19T05:41:53Z

@davidben yes this would be fine :) please ping me once you have a fix and I will see how we can adopt it. Unfortunately we tracking the chromium-stable branch so it may take some time before we can use it :(

carl-mastrangelo · 2020-02-19T18:17:35Z

@davidben It looks like SSL_write is designed to return -1 when the buffer is not big enough. I think there is another bug here, because a subsequent write seems to lose bytes.

Put another way, It seems like the code is design to handle this case, but isn't.

davidben · 2020-02-19T19:03:53Z

Netty does a number of odd things with assumptions about write overheads in order to implement the SSLEngine API, which is probably where the dropping of data comes from.

(Really the API Netty wants isn't SSL_write or BIOs in the first place but alas we've yet to have the time to build the BIO-less API.)

davidben · 2020-02-19T19:04:49Z

(We quite extensibly test things assuming a single-byte write buffer so it would be quite surprising if data was getting dropped in BoringSSL.)

carl-mastrangelo · 2020-02-19T23:33:10Z

If that's the case, this issue is 2 bugs:

Netty is calling SSL methods out of order, resulting in some dropped bytes.
BoringSSL needs to return the ticket-adjusted size for SSL_max_seal_overhead.

Fixing the BoringSSL issue will mask the first one.

carl-mastrangelo · 2020-02-21T19:15:33Z

cc: @kyagna

hyperxpro · 2020-02-23T08:23:50Z

I know this my question is out the frame but I couldn't find any answer for it.

How to enable Session Tickets?

normanmaurer · 2020-02-24T09:26:33Z

@hyperxpro https://github.com/netty/netty/blob/netty-4.1.45.Final/handler/src/main/java/io/netty/handler/ssl/OpenSslSessionContext.java#L98

normanmaurer · 2020-02-24T14:14:33Z

@davidben is there a "maximum" number of bytes per SessionTicket that we can reserve here ?

Lukasa · 2020-02-24T14:33:48Z

I think reserving a maximum number of bytes here is unwise. The problem that has to be addressed is that Netty is assuming a behaviour from BoringSSL that doesn't match what it actually does.

Netty seems to be assuming that SSL_write will write only the TLS record encapsulating the application data, and nothing else. Based on what @davidben has said above, that's not true: it will also potentially emit other queued records. While in principle Netty could enumerate what those records are and produce a worst-case bound, it's not the right design pattern: that behaviour can change without warning, and at any time.

I think Netty has to work to break the assumption it's making here: it's not reflective of what BoringSSL actually does.

normanmaurer · 2020-02-24T18:22:40Z

@Lukasa @davidben I think the "real problem" is that for whatever reason we seem to "loose" data. Netty should just try to expand the buffer if it not fits in and try again. But something is not working as expected here in this case :/

normanmaurer · 2020-02-25T20:06:01Z

@carl-mastrangelo ok good news is that I can reproduce it... Now the fun begins. Will keep you posted .

normanmaurer · 2020-02-26T13:40:54Z

@davidben @Lukasa @carl-mastrangelo ok update here... it is a netty bug and I have a fix ready, just working on a unit test now. So no surprise here, BoringSSL works as expected :)

…renceCountedOpenSslEngine.wrap(...) Motivation: We did not correctly account for produced bytes when SSL_write(...) returns -1 in all cases. This could lead to lost data and so a corrupt SSL connection. Modifications: Always ensure we calculate the produced bytes correctly Result: Fixes #10041

normanmaurer · 2020-02-26T15:06:07Z

Fixed by #10063 ...

#10063) Motivation: We did not correctly account for produced bytes when SSL_write(...) returns -1 in all cases. This could lead to lost data and so a corrupt SSL connection. Modifications: - Always ensure we calculate the produced bytes correctly - Add unit tests Result: Fixes #10041

normanmaurer · 2020-02-27T07:52:02Z

@davidben we fixed a bug in netty that solves the problem for us but I would still be interested to hear if you have any plans to adjust SSL_max_seal_overhead to include the tickets etc.

#10063) Motivation: We did not correctly account for produced bytes when SSL_write(...) returns -1 in all cases. This could lead to lost data and so a corrupt SSL connection. Modifications: - Always ensure we calculate the produced bytes correctly - Add unit tests Result: Fixes #10041

carl-mastrangelo · 2020-02-27T16:15:05Z

@normanmaurer thanks!

netty#10063) Motivation: We did not correctly account for produced bytes when SSL_write(...) returns -1 in all cases. This could lead to lost data and so a corrupt SSL connection. Modifications: - Always ensure we calculate the produced bytes correctly - Add unit tests Result: Fixes netty#10041

normanmaurer added this to the 4.1.46.Final milestone Feb 18, 2020

normanmaurer self-assigned this Feb 26, 2020

normanmaurer added the defect label Feb 26, 2020

normanmaurer mentioned this issue Feb 26, 2020

Correctly calculate the produced bytes in all cases when calling Refe… #10063

Merged

normanmaurer closed this as completed in #10063 Feb 27, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TLSv1.3 can fail with HTTP/2 and Session Tickets Enabled #10041

TLSv1.3 can fail with HTTP/2 and Session Tickets Enabled #10041

carl-mastrangelo commented Feb 18, 2020 •

edited

normanmaurer commented Feb 18, 2020

normanmaurer commented Feb 18, 2020

carl-mastrangelo commented Feb 18, 2020

normanmaurer commented Feb 18, 2020

davidben commented Feb 19, 2020

davidben commented Feb 19, 2020 •

edited

normanmaurer commented Feb 19, 2020

carl-mastrangelo commented Feb 19, 2020

davidben commented Feb 19, 2020

davidben commented Feb 19, 2020

carl-mastrangelo commented Feb 19, 2020

carl-mastrangelo commented Feb 21, 2020 •

edited

hyperxpro commented Feb 23, 2020 •

edited

normanmaurer commented Feb 24, 2020

normanmaurer commented Feb 24, 2020

Lukasa commented Feb 24, 2020

normanmaurer commented Feb 24, 2020

normanmaurer commented Feb 25, 2020

normanmaurer commented Feb 26, 2020

normanmaurer commented Feb 26, 2020

normanmaurer commented Feb 27, 2020

carl-mastrangelo commented Feb 27, 2020

TLSv1.3 can fail with HTTP/2 and Session Tickets Enabled #10041

TLSv1.3 can fail with HTTP/2 and Session Tickets Enabled #10041

Comments

carl-mastrangelo commented Feb 18, 2020 • edited

Netty version

JVM version (e.g. java -version)

OS version (e.g. uname -a)

Repro

normanmaurer commented Feb 18, 2020

normanmaurer commented Feb 18, 2020

carl-mastrangelo commented Feb 18, 2020

normanmaurer commented Feb 18, 2020

davidben commented Feb 19, 2020

davidben commented Feb 19, 2020 • edited

normanmaurer commented Feb 19, 2020

carl-mastrangelo commented Feb 19, 2020

davidben commented Feb 19, 2020

davidben commented Feb 19, 2020

carl-mastrangelo commented Feb 19, 2020

carl-mastrangelo commented Feb 21, 2020 • edited

hyperxpro commented Feb 23, 2020 • edited

normanmaurer commented Feb 24, 2020

normanmaurer commented Feb 24, 2020

Lukasa commented Feb 24, 2020

normanmaurer commented Feb 24, 2020

normanmaurer commented Feb 25, 2020

normanmaurer commented Feb 26, 2020

normanmaurer commented Feb 26, 2020

normanmaurer commented Feb 27, 2020

carl-mastrangelo commented Feb 27, 2020

carl-mastrangelo commented Feb 18, 2020 •

edited

JVM version (e.g. `java -version`)

OS version (e.g. `uname -a`)

davidben commented Feb 19, 2020 •

edited

carl-mastrangelo commented Feb 21, 2020 •

edited

hyperxpro commented Feb 23, 2020 •

edited