
Better-behaved HBONE pooling #931

Merged
merged 46 commits into istio:master from bleggett/fix-pooling on Apr 26, 2024

Conversation

bleggett
Contributor

@bleggett bleggett commented Apr 16, 2024

This works around several issues found in #864 relating to clients (loadgen mostly) that open many connections very rapidly.

  1. hyper's pooling library has rather basic scheduling controls, to the point where we can easily overwhelm "send_request" such that it never resolves. Tweaking server/client stream limits does not work around this, as the pooling scheduler does not respect them when selecting pooled streams, and will simply allow a single conn to get backed up.

  2. hyper's pooling library doesn't really have the concept of multiple HTTP/2 connections to the same dest, which makes sense as the HTTP/2 spec doesn't recommend this (the spec says SHOULD NOT instead of MUST NOT, which in spec-speak is "we can't stop you, I guess")

  3. Racing to create connections creates problems when a lot of connection attempts are made at once: without throttling, many connections can actually be established ("cross the finish line" at the same time) when we only need one, and this noise can compromise the ability of the local or dest ztunnel to respond to other workloads.

Notes:

  • By design, this pool does not manage Hyper client state, nor is it tightly coupled with it. All it does is hold references to Hyper clients that can be reused, and evict its own held references to those Hyper clients when they cannot be reused. The rest is managing async access.

  • This pool is deliberately designed to create backpressure when one client opens many connections to the same destination at the same time. It does this by locking connection-mutative operations per src/dest key.

  • The only write lock should be the inner per-src/dest-key lock. The only operations inside the pool that must be ordered are (1) ensuring that we have a key writelock in the outer map and (2) locking that inner key writelock. The rest should not demand any sort of ordering to be safe (a rough sketch of this two-level structure follows these notes).

  • This pool does not decrement stream counts - when a connection hits its max stream count, the pool pops the reference and removes it from the rotation, leaving the connection to die once whatever is using it drops all the refs; a new connection is created on the next go-round. This leads to simpler accounting, while still maintaining the protections against connection storms that we need.
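
A rough sketch of the two-level locking described in these notes, purely for illustration - the names (`WorkloadKey`, `ConnClient`, `checkout_or_connect`) and the use of `RwLock<HashMap<...>>` are simplifications, not this PR's actual types (the real pool uses a concurrent map plus pingora_pool for the inner storage):

```rust
use std::collections::HashMap;
use std::sync::Arc;
use tokio::sync::{Mutex, RwLock};

// Illustrative stand-ins for the real key/client types.
type WorkloadKey = u64;

#[derive(Clone)]
struct ConnClient; // placeholder for a cloneable hyper HTTP/2 client handle

#[derive(Default)]
struct PoolSketch {
    // Outer map: one writelock per src/dest key. Only connection-mutative
    // work for a key takes that key's lock; other keys are unaffected.
    established_conn_writelock: RwLock<HashMap<WorkloadKey, Arc<Mutex<()>>>>,
    // Inner storage of reusable client handles (pingora_pool in the PR).
    connected_pool: RwLock<HashMap<WorkloadKey, Vec<ConnClient>>>,
}

impl PoolSketch {
    async fn checkout_or_connect(&self, key: WorkloadKey) -> ConnClient {
        // Step 1: ensure a per-key mutex exists in the outer map.
        let key_lock = {
            let mut map = self.established_conn_writelock.write().await;
            map.entry(key)
                .or_insert_with(|| Arc::new(Mutex::new(())))
                .clone()
        };
        // Step 2: serialize connection-mutative work for *this* key only.
        let _guard = key_lock.lock().await;
        if let Some(existing) = self
            .connected_pool
            .read()
            .await
            .get(&key)
            .and_then(|conns| conns.first().cloned())
        {
            return existing; // reuse: clone the handle and multiplex streams
        }
        let new_client = ConnClient; // the real pool would dial + handshake here
        self.connected_pool
            .write()
            .await
            .entry(key)
            .or_default()
            .push(new_client.clone());
        new_client
    }
}

#[tokio::main]
async fn main() {
    let pool = PoolSketch::default();
    let _c1 = pool.checkout_or_connect(7).await; // "dials" (sketch: just constructs)
    let _c2 = pool.checkout_or_connect(7).await; // reuses the pooled handle
}
```

In the actual PR the outer and inner storage are concurrent structures rather than a single `RwLock<HashMap>`, so they aren't global chokepoints; the sketch only keeps the two-step ordering described above.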

@istio-testing istio-testing added the do-not-merge/work-in-progress Block merging of a PR because it isn't ready yet. label Apr 16, 2024
@istio-testing
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@istio-testing istio-testing added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Apr 16, 2024
Contributor

@ilrudie ilrudie left a comment

I might be missing something but it seems like correctly resolving hash collisions is critical for the pool's functionality and not yet implemented or tested.

@istio-testing istio-testing added the needs-rebase Indicates a PR needs to be rebased before being merged label Apr 17, 2024
@bleggett bleggett force-pushed the bleggett/fix-pooling branch 3 times, most recently from f2c21fb to 0a22c61 Compare April 17, 2024 23:47
@bleggett
Contributor Author

Added a form of conn timeout to evict connections from the pool that haven't been used in a while.

Once evicted, they will be closed by hyper when clients close, due to streamcount or refcount hitting 0.
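
Roughly, that eviction looks like the following (the `PooledEntry` struct and field names here are hypothetical, just to illustrate dropping idle references and letting hyper close the underlying connection):

```rust
use std::time::{Duration, Instant};

// Hypothetical pooled entry: the real pool stores hyper client handles.
struct PooledEntry {
    last_used: Instant, // refreshed whenever the entry is checked out
}

// Periodic eviction pass: drop our references to idle entries. Hyper then
// closes the actual HTTP/2 connection once nothing else holds a handle to it.
fn evict_idle(entries: &mut Vec<PooledEntry>, idle_timeout: Duration) {
    let now = Instant::now();
    entries.retain(|e| now.duration_since(e.last_used) < idle_timeout);
}

fn main() {
    let mut entries = vec![PooledEntry { last_used: Instant::now() }];
    // With a zero timeout, even a fresh entry counts as idle and is evicted.
    evict_idle(&mut entries, Duration::from_secs(0));
    assert!(entries.is_empty()); // the idle entry was dropped from the pool
}
```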

@istio-testing istio-testing removed the needs-rebase Indicates a PR needs to be rebased before being merged label Apr 18, 2024
@bleggett bleggett marked this pull request as ready for review April 18, 2024 20:24
@bleggett bleggett requested a review from a team as a code owner April 18, 2024 20:24
@bleggett
Contributor Author

Switched to a proper concurrent hashmap that actually has a decent pedigree, which should allow for nonblocking reads (at the cost of using a little more memory, TANSTAAFL)
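
For context on what that buys: a concurrent map lets readers look up per-key state without contending on a single global lock. A tiny sketch, using the `flurry` crate purely as an illustrative choice (not necessarily the crate this PR uses; it would need `flurry` in Cargo.toml):

```rust
use std::sync::Arc;
use tokio::sync::Mutex;

fn main() {
    // Map from src/dest key hash to that key's connection writelock.
    let locks: flurry::HashMap<u64, Arc<Mutex<()>>> = flurry::HashMap::new();

    // A writer registers a per-key lock once...
    locks.pin().insert(42, Arc::new(Mutex::new(())));

    // ...and readers (including readers of other keys) don't take a global
    // write lock to find it.
    let found = locks.pin().get(&42).cloned();
    assert!(found.is_some());
}
```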

@bleggett bleggett changed the title WIP: Better-behaved HBONE pooling Better-behaved HBONE pooling Apr 18, 2024
@istio-testing istio-testing removed the do-not-merge/work-in-progress Block merging of a PR because it isn't ready yet. label Apr 18, 2024
Contributor

@ilrudie ilrudie left a comment

I'm not fully through the PR, but this seemed worth commenting on early, so here goes:

// and has no actual hyper/http/connection logic.
connected_pool: Arc<pingora_pool::ConnectionPool<ConnClient>>,
// this must be an atomic/concurrent-safe list-of-locks, so we can lock per-key, not globally, and avoid holding up all conn attempts
established_conn_writelock: HashMap<u64, Option<Arc<Mutex<()>>>>,
Contributor

Thinking more about this, it feels really non-rusty. I suppose the issue is pingora, and there's not, at present, a very good way around this, so I don't have a suggestion.

Contributor Author

What does? The use of locks?

Contributor

Just locks which don't wrap the thing they're locking...

Contributor Author

@bleggett bleggett Apr 19, 2024

Ah, yep. It's not very pretty, agreed.

But the kind of locking we have to do is not something the pools out there will do for us, and pingora_pool has some use, at least.

The important thing about a Mutex is what holds it, what's inside is sort of moot. The mutex here is almost a semaphore arguably, which is why it feels weird.

Contributor

It's moot unless we want to fork or contribute to pingora (or, I guess, lock the entire pool, which we don't care to do), but the point of wrapping the mutex around the thing being accessed is more than cosmetic. It ties holding the lock and accessing the thing together in a way that we can't do with pingora. We'll need to be more careful since we can't take advantage of that.

Contributor Author

@bleggett bleggett Apr 24, 2024

If we wanted to make this simpler later, I would advocate we throw out pingora and implement our own inner data structure from scratch; what we want to do is non-standard, and pingora is not likely to want or need what we would have to change to make it work like this - nor is any other "generic" HTTP/2 pool.

Which is why for now I just use it and then do the other stuff outside.
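
For readers skimming this thread, the contrast being debated is roughly the following - purely illustrative, not code from the PR:

```rust
use std::sync::Mutex;

// Idiomatic Rust: the Mutex owns the data, so the connections simply cannot
// be touched without holding the lock.
struct WrappedPool {
    conns: Mutex<Vec<u64 /* connection handles */>>,
}

// What this PR has to do, since the connections live inside pingora_pool:
// a standalone Mutex<()> acts as a per-key gate (closer to a semaphore), and
// the surrounding code must remember to hold it around pool mutations.
struct GatedPool {
    gate: Mutex<()>,
    // connections actually live in pingora_pool::ConnectionPool, outside the Mutex
}
```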

Member

@howardjohn howardjohn left a comment

I'll review more closely tomorrow; I only had a chance for a quick look. As such, many of my comments are probably not accurate.

// After waiting, we found an available conn, but for whatever reason couldn't use it.
// (streamcount maxed, etc)
// Start a new one, clone the underlying client, return a copy, and put the other back in the pool.
None => {
Member

If we have multiple waiters, can they all end up here at once and spam a bunch of new connections?

Contributor Author

@bleggett bleggett Apr 19, 2024

No, because we only get a NONE here if we found a conn, checked it out, and it was at maxed streamcount.

(and we don't put it back in that case - we leave it popped to die a natural death once streamcount + refcount == 0)

If we popped it, no one else can.
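
In other words (an illustrative sketch, not the PR's actual checkout code), popping is what prevents a stampede - only the waiter that popped the maxed-out connection takes the "open a new one" path:

```rust
#[derive(Clone)]
struct Conn {
    streams: u32,
    max_streams: u32,
}

impl Conn {
    fn at_max(&self) -> bool {
        self.streams >= self.max_streams
    }
}

// Pop a conn; if it's usable, put a handle back and share it. If it was at
// its max stream count, leave it popped (it dies once its users drop their
// refs) and open exactly one replacement.
fn checkout(pooled: &mut Vec<Conn>) -> Conn {
    match pooled.pop() {
        Some(conn) if !conn.at_max() => {
            pooled.push(conn.clone());
            conn
        }
        _ => Conn { streams: 0, max_streams: 100 }, // "dial" a fresh conn
    }
}

fn main() {
    let mut pooled = vec![Conn { streams: 100, max_streams: 100 }];
    let fresh = checkout(&mut pooled); // the maxed conn is dropped from rotation
    assert_eq!(fresh.streams, 0);
    assert!(pooled.is_empty()); // no other waiter can ever pop that maxed conn
}
```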

@howardjohn
Member

I am seeing unexpected behavior. Opening up 99 connections at once, I got 9 connections. Only 1 of them closes after the idle timeout; the rest are zombies. Also, if I then open up 99 more, I get the same - 8 more zombies, 16 total.

@bleggett
Contributor Author

bleggett commented Apr 22, 2024

I am seeing unexpected behavior. Opening up 99 connections at once, I got 9 connections. Only 1 of them closes after the idle timeout; the rest are zombies. Also, if I then open up 99 more, I get the same - 8 more zombies, 16 total.

I may have broken it with recent commits then, and the tests aren't sufficient (this needs more tests) - test_pool_100_clients_singleconn needs to catch this.

@bleggett
Contributor Author

/test test

@bleggett
Contributor Author

  1. The hyper bug fixed in hyperium/hyper#3647 ("http: fix connections that are never closed") is worked around for now with server keepalives (see the sketch after this list). This will do until we take a bumped hyper, and we need server keepalives anyway, as a safety valve to keep misbehaving clients (even if they are ours) from hanging connections.

  2. I am not 100% happy with the test rig - it is crude, it is synthetic, but it does the job and makes this that much harder to break, and it's certainly covering more behavior than before.
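
For reference, the server keepalive mentioned in point 1 is hyper's HTTP/2 server keepalive ping support; a minimal sketch of configuring it (the durations are made up, and this assumes hyper 1.x with the `http2`/`server` features plus hyper-util's Tokio executor):

```rust
use std::time::Duration;

use hyper::server::conn::http2;
use hyper_util::rt::TokioExecutor;

// Build an HTTP/2 server connection builder that pings idle clients and
// tears down connections whose peers stop answering, so a misbehaving
// client can't hang a connection open indefinitely.
fn h2_server_builder() -> http2::Builder<TokioExecutor> {
    let mut builder = http2::Builder::new(TokioExecutor::new());
    builder
        .keep_alive_interval(Duration::from_secs(10))
        .keep_alive_timeout(Duration::from_secs(20));
    builder
}
```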

@istio-testing istio-testing merged commit a44892b into istio:master Apr 26, 2024
2 of 3 checks passed
@istio-testing
Contributor

In response to a cherrypick label: new pull request created: #983

@bleggett
Contributor Author

bleggett commented Apr 26, 2024

(ty vm to @howardjohn for tracking down that hyper bug, hyper code gives me the willies - that's HTTP/2's fault, not hyper's fault)

Labels
cherrypick/release-1.22 Set this label on a PR to auto-merge it to the release-1.22 branch
size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.