Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TLSv1.3 can fail with HTTP/2 and Session Tickets Enabled #10041

Closed
carl-mastrangelo opened this issue Feb 18, 2020 · 22 comments · Fixed by #10063
Closed

TLSv1.3 can fail with HTTP/2 and Session Tickets Enabled #10041

carl-mastrangelo opened this issue Feb 18, 2020 · 22 comments · Fixed by #10063
Assignees
Labels
Milestone

Comments

@carl-mastrangelo
Copy link
Member

carl-mastrangelo commented Feb 18, 2020

Netty version

4.1.45 + TCNative 2.0.28

JVM version (e.g. java -version)

Java 1.8

OS version (e.g. uname -a)

Linux / Mac

Repro

The steps to reproduce this are fairly difficult, and I don't know enough of the OpenSSL API, but I can give the manual steps I took to get here. Netflix is trying to enable TLSv1.3 on some of its servers, but it results in some corrupted SSL connections. This appears most commonly during connection startup, but I think it can happen at any point.

Typical errors look like SSL_ERROR_RX_RECORD_TOO_LONG in Firefox, or ERR_SSL_PROTOCOL_ERROR in Chrome, but this is in fact due to data corruption in Netty. The core issue is a bad interaction in the two overloads ofSslHandler.wrap:

private SSLEngineResult wrap(ByteBufAllocator alloc, SSLEngine engine, ByteBuf in, ByteBuf out)
throws SSLException {
ByteBuf newDirectIn = null;
try {
int readerIndex = in.readerIndex();
int readableBytes = in.readableBytes();
// We will call SslEngine.wrap(ByteBuffer[], ByteBuffer) to allow efficient handling of
// CompositeByteBuf without force an extra memory copy when CompositeByteBuffer.nioBuffer() is called.
final ByteBuffer[] in0;
if (in.isDirect() || !engineType.wantsDirectBuffer) {
// As CompositeByteBuf.nioBufferCount() can be expensive (as it needs to check all composed ByteBuf
// to calculate the count) we will just assume a CompositeByteBuf contains more then 1 ByteBuf.
// The worst that can happen is that we allocate an extra ByteBuffer[] in CompositeByteBuf.nioBuffers()
// which is better then walking the composed ByteBuf in most cases.
if (!(in instanceof CompositeByteBuf) && in.nioBufferCount() == 1) {
in0 = singleBuffer;
// We know its only backed by 1 ByteBuffer so use internalNioBuffer to keep object allocation
// to a minimum.
in0[0] = in.internalNioBuffer(readerIndex, readableBytes);
} else {
in0 = in.nioBuffers();
}
} else {
// We could even go further here and check if its a CompositeByteBuf and if so try to decompose it and
// only replace the ByteBuffer that are not direct. At the moment we just will replace the whole
// CompositeByteBuf to keep the complexity to a minimum
newDirectIn = alloc.directBuffer(readableBytes);
newDirectIn.writeBytes(in, readerIndex, readableBytes);
in0 = singleBuffer;
in0[0] = newDirectIn.internalNioBuffer(newDirectIn.readerIndex(), readableBytes);
}
for (;;) {
ByteBuffer out0 = out.nioBuffer(out.writerIndex(), out.writableBytes());
SSLEngineResult result = engine.wrap(in0, out0);
in.skipBytes(result.bytesConsumed());
out.writerIndex(out.writerIndex() + result.bytesProduced());
switch (result.getStatus()) {
case BUFFER_OVERFLOW:
out.ensureWritable(engine.getSession().getPacketBufferSize());
break;
default:
return result;
}
}
} finally {
// Null out to allow GC of ByteBuffer
singleBuffer[0] = null;
if (newDirectIn != null) {
newDirectIn.release();
}
}
}

At the bottom, on line 1043, If the engine.wrap call results in a BUFFER_OVERFLOW, it resize the buffer to try again. The bug here is that when this happens, the original data that was attempted to be written gets lost. In my case, the first 68 or so bytes discarded, leaving a partial response to be written out. This is the first half of the bug.

The second part is the setup to this bug, and is the other wrap overload:

https://github.com/netty/netty/blob/netty-4.1.45.Final/handler/src/main/java/io/netty/handler/ssl/SslHandler.java#L801-L882

On line 821, the call to allocateOutNetBuf attempts to create a buffer large enough to hold the write. In my case, the readable data is 46 bytes. This results in a call to ReferenceCountedOpenSslEngine.calculateMaxLengthForWrap(), which adds 22 bytes of extra headroom, resulting in a 68 byte buffer. This is normally the correct amount if session tickets are not enabled. When they are, several hundred additional bytes are needed. These session ticket bytes are included by the OpenSSL library, and don't appear to be account for by Netty.

This series of events is most common when turning on TLSv1.3, HTTP/2, and Session Tickets. In my experimentation, the SSL handshake actually succeeds, but then crashes shortly after. The sequence of events looks like:

  1. Call openssl s_client -tls1_3 -connect 127.0.0.1:7006 -alpn h2 -debug -msg
  2. SSL Handshake proceeds successfully, resulting in the client logging a successful handshake[1].
  3. On the server side, SslHandler.setHandshakeSuccess is invoked, and fires the event up the pipeline.
  4. ApplicationProtocolNegotiationHandler captures the handshake event, sees h2 has been picked, and installs Http2FrameCodec, and invokes handlerAdded.
  5. This causes the frame window to be written and flushed, along with the initial settings.
  6. It calls all the way back down into the wrap() function as mentioned above, trying to wrap the 46 bytes of application data.
  7. In a normal, non session ticket case, this sends the initial 46 bytes, which looks something like [2]. Note the initial 17 03 03, which correctly indicates this is application data.
  8. In the failure case, Session Tickets get added in by the SSL library, the response is too big, causing engine.wrap to return -1. When attempted again, the session tickets are written out, but they are missing the 5 byte header that identifies them.
  9. OpenSSL ends up writing bogus 450 byte packet, which the client reads, misinterprets as a bad length, and closes the connection.

I was able to force this to succeed by manually growing the out buffer in the wrap call to a very large size. This allows the initial engine.wrap to succeed, send the session ticket through, following by the application data. When this happens, openssl prints out Post-Handshake New Session Ticket arrived.

@normanmaurer I'm really not sure how to fix this, I have packet captures of most of this, with the failure case, the success case, and the non-session ticket case. I don't know enough of the OpenSSL API to make a call on what should happen.

[1]:

SSL handshake has read 3475 bytes and written 304 bytes
Verification: OK
---
New, TLSv1.3, Cipher is TLS_AES_256_GCM_SHA384
Server public key is 2048 bit
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
ALPN protocol: h2
Early data was not sent
Verify return code: 0 (ok)
---

[2]

read from 0x7fed4e50e0b0 [0x7fed4f00c803] (5 bytes => 5 (0x5))
0000 - 17 03 03 00 3f                                    ....?
<<< ??? [length 0005]
    17 03 03 00 3f
read from 0x7fed4e50e0b0 [0x7fed4f00c808] (63 bytes => 63 (0x3F))
0000 - 3c 0f 39 fe 64 c2 26 c5-8d 97 29 f0 ea ba ee aa   <.9.d.&...).....
0010 - 0f 58 d1 1b 3d 67 3a 36-3d 52 8f f6 ec ae a6 2f   .X..=g:6=R...../
0020 - 04 78 a4 16 97 40 40 2f-f2 58 ec a2 eb bd d3 24   .x...@@/.X.....$
0030 - 60 33 88 8c 8d 77 f2 6f-5e 9f ec 29 2d e3 2f      `3...w.o^..)-./
<<< TLS 1.3 [length 0001]
    17
@normanmaurer
Copy link
Member

@carl-mastrangelo thanks a lot of the detailed description... I will look into it.

@normanmaurer normanmaurer added this to the 4.1.46.Final milestone Feb 18, 2020
@normanmaurer
Copy link
Member

@carl-mastrangelo is this netty-tcnative-boringssl-static ?

@carl-mastrangelo
Copy link
Member Author

@normanmaurer yes. I haven't tried with the dynamic artifacts.

@normanmaurer
Copy link
Member

@carl-mastrangelo ok cool... Thanks for verifying...

@davidben we use SSL_max_seal_overhead to calculate the maximal overhead for encrypting. Does this not include what is needed for the session tickets here ? Or do we need to add some extra call ?

@davidben
Copy link

Ah yeah, we should probably be incorporating any of the pending handshake bytes into that. I didn't get to looking at this today, but I'll poke at it tomorrow. (Mostly I need to check whether anything else uses SSL_max_seal_overhead and would want the current behavior.)

Note this will mean that SSL_max_seal_overhead will change over the course of the connection, depending on what random incidental messages are queued up (NewSessionTicket at the start and then the occasional KeyUpdate, if anyone ever sent them). Would that be fine for you?

@davidben
Copy link

davidben commented Feb 19, 2020

(This probably changed for you because we used to send NewSessionTicket during the server handshake, but now we defer it to the first SSL_write. Depending on the sizes of tickets, sizes of transport buffers, and exact I/O patterns, sending it eagerly could result in deadlocks or spurious TCP resets and we wanted to surprising behaviors or random cliffs like that.)

@normanmaurer
Copy link
Member

@davidben yes this would be fine :) please ping me once you have a fix and I will see how we can adopt it. Unfortunately we tracking the chromium-stable branch so it may take some time before we can use it :(

@carl-mastrangelo
Copy link
Member Author

@davidben It looks like SSL_write is designed to return -1 when the buffer is not big enough. I think there is another bug here, because a subsequent write seems to lose bytes.

Put another way, It seems like the code is design to handle this case, but isn't.

@davidben
Copy link

Netty does a number of odd things with assumptions about write overheads in order to implement the SSLEngine API, which is probably where the dropping of data comes from.

(Really the API Netty wants isn't SSL_write or BIOs in the first place but alas we've yet to have the time to build the BIO-less API.)

@davidben
Copy link

(We quite extensibly test things assuming a single-byte write buffer so it would be quite surprising if data was getting dropped in BoringSSL.)

@carl-mastrangelo
Copy link
Member Author

If that's the case, this issue is 2 bugs:

  1. Netty is calling SSL methods out of order, resulting in some dropped bytes.
  2. BoringSSL needs to return the ticket-adjusted size for SSL_max_seal_overhead.

Fixing the BoringSSL issue will mask the first one.

@carl-mastrangelo
Copy link
Member Author

carl-mastrangelo commented Feb 21, 2020

cc: @kyagna

@hyperxpro
Copy link
Contributor

hyperxpro commented Feb 23, 2020

I know this my question is out the frame but I couldn't find any answer for it.

How to enable Session Tickets?

@normanmaurer
Copy link
Member

@normanmaurer
Copy link
Member

@davidben is there a "maximum" number of bytes per SessionTicket that we can reserve here ?

@Lukasa
Copy link
Contributor

Lukasa commented Feb 24, 2020

I think reserving a maximum number of bytes here is unwise. The problem that has to be addressed is that Netty is assuming a behaviour from BoringSSL that doesn't match what it actually does.

Netty seems to be assuming that SSL_write will write only the TLS record encapsulating the application data, and nothing else. Based on what @davidben has said above, that's not true: it will also potentially emit other queued records. While in principle Netty could enumerate what those records are and produce a worst-case bound, it's not the right design pattern: that behaviour can change without warning, and at any time.

I think Netty has to work to break the assumption it's making here: it's not reflective of what BoringSSL actually does.

@normanmaurer
Copy link
Member

@Lukasa @davidben I think the "real problem" is that for whatever reason we seem to "loose" data. Netty should just try to expand the buffer if it not fits in and try again. But something is not working as expected here in this case :/

@normanmaurer
Copy link
Member

@carl-mastrangelo ok good news is that I can reproduce it... Now the fun begins. Will keep you posted .

@normanmaurer
Copy link
Member

@davidben @Lukasa @carl-mastrangelo ok update here... it is a netty bug and I have a fix ready, just working on a unit test now. So no surprise here, BoringSSL works as expected :)

@normanmaurer normanmaurer self-assigned this Feb 26, 2020
normanmaurer added a commit that referenced this issue Feb 26, 2020
…renceCountedOpenSslEngine.wrap(...)

Motivation:

We did not correctly account for produced bytes when SSL_write(...) returns -1 in all cases. This could lead to lost data and so a corrupt SSL connection.

Modifications:

Always ensure we calculate the produced bytes correctly

Result:

Fixes #10041
@normanmaurer
Copy link
Member

Fixed by #10063 ...

normanmaurer added a commit that referenced this issue Feb 27, 2020
#10063)


Motivation:

We did not correctly account for produced bytes when SSL_write(...) returns -1 in all cases. This could lead to lost data and so a corrupt SSL connection.

Modifications:

- Always ensure we calculate the produced bytes correctly
- Add unit tests

Result:

Fixes #10041
@normanmaurer
Copy link
Member

@davidben we fixed a bug in netty that solves the problem for us but I would still be interested to hear if you have any plans to adjust SSL_max_seal_overhead to include the tickets etc.

normanmaurer added a commit that referenced this issue Feb 27, 2020
#10063)

Motivation:

We did not correctly account for produced bytes when SSL_write(...) returns -1 in all cases. This could lead to lost data and so a corrupt SSL connection.

Modifications:

- Always ensure we calculate the produced bytes correctly
- Add unit tests

Result:

Fixes #10041
@carl-mastrangelo
Copy link
Member Author

@normanmaurer thanks!

ihanyong pushed a commit to ihanyong/netty that referenced this issue Jul 31, 2020
netty#10063)

Motivation:

We did not correctly account for produced bytes when SSL_write(...) returns -1 in all cases. This could lead to lost data and so a corrupt SSL connection.

Modifications:

- Always ensure we calculate the produced bytes correctly
- Add unit tests

Result:

Fixes netty#10041
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants