
[Design change] Connection-level packet locking #2418

Open
edwardneal opened this issue Mar 19, 2024 · 6 comments
Labels
💡 Enhancement (New feature request) · 🙌 Up-for-Grabs (Anyone interested in working on it can ask to be assigned)

Comments

@edwardneal

This partially comes from an effort to clear the way for PR #2384, but it feeds into a wider point around the use of thread-level locking constructs. I've taken a first look at the physical connection locking, and I think some of these locks might be adjustable.

Background

Speaking a bit more broadly here, my understanding of the TDS protocol is that the TDS-level packets are atomic - they can't be partially sent, and if a request requires multiple packets to be sent to the server then those packets need to be sent before another request can be started on the same logical connection. MARS multiplexes several logical connections onto one physical connection, interleaving TDS packets for several logical connections on the single physical connection.

As I understand it, a request can be aborted in two ways: setting flags on the next packet to be sent, or sending an Attention packet to cancel the server processing of the request.

At the moment though, we lock at the physical level on SNISslStream and SNINetworkStream using a ConcurrentQueueSemaphore. This protects against multiple callers writing to the streams at the same time, but we never expose the streams directly to clients - with the exception of the SSL negotiation, we abstract it behind the use of an SNIPacket or PacketHandle. We also allocate a new TaskCompletionSource in ConcurrentQueueSemaphore on every read or write, which could also be eliminated if we hoist the lock upwards.

Proposal

I think it'd be clearer and less contentious to move the locking to SNITcpHandle and SNINpHandle, in their Send, SendAsync, Receive and ReceiveAsync methods. At the moment, these methods use two different mechanisms: Send and Receive have the equivalent of one or two Monitor.Enter statements wrapping the stream-level locks, while SendAsync and ReceiveAsync just fall back directly onto the stream-level locks (assuming that a packet is never fragmented.) It's a slightly inconsistent approach, and I'd like to merge this into a single pair of SemaphoreSlims - one for reading, one for writing.
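
For illustration, here's a minimal hand-written sketch of that handle-level pairing. It isn't the driver's actual code - the class, field and method shapes are simplified assumptions - it just shows one semaphore per direction, shared by the sync and async paths:

using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch only - the real SNITcpHandle/SNINpHandle methods have
// different signatures and error handling.
internal sealed class PacketGateSketch
{
    private readonly Stream _stream;
    private readonly SemaphoreSlim _sendGate = new SemaphoreSlim(1, 1);
    private readonly SemaphoreSlim _receiveGate = new SemaphoreSlim(1, 1);

    public PacketGateSketch(Stream stream) => _stream = stream;

    public void Send(byte[] packet)
    {
        _sendGate.Wait();                 // sync path still blocks, but in exactly one place
        try { _stream.Write(packet, 0, packet.Length); }
        finally { _sendGate.Release(); }
    }

    public async Task SendAsync(byte[] packet)
    {
        await _sendGate.WaitAsync().ConfigureAwait(false);   // async path yields the thread
        try { await _stream.WriteAsync(packet, 0, packet.Length).ConfigureAwait(false); }
        finally { _sendGate.Release(); }
    }

    // Receive/ReceiveAsync would mirror the above using _receiveGate.
}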

Since, according to the TDS protocol specification, an Attention packet can never interrupt a packet that is already being sent, Attention packets wouldn't be able to opt out of packet-level locking as they currently can. I don't think this capability is actually used though - the stream-level ConcurrentQueueSemaphore already guarantees that an Attention packet will run into one lock it can't bypass, even when it bypasses the existing Send lock.

Besides the above, I think this might also improve the persistent issue around thread pool blocking. Packet transmission is naturally on the hot path, and the use of Monitor.Enter (whether via a lock statement or explicitly) blocks the entire thread. My guess is that the thread pool blocking isn't primarily caused by an excess of work or by long network waits. I think the root cause is that when the thread pool has a capacity of X threads, frequently sending more than X packets at once leaves every pool thread but one blocked on the same thread-level lock, which in turn means there isn't enough capacity left to process the write callback and release that lock. This would explain why MARS makes the problem so much worse, and why widening the thread pool improves matters. If my guess is right, removing a frequently-contended thread-level lock might have an impact here.
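
To make that suspected failure mode concrete, here's a generic sync-over-async illustration (not the driver's code; the names are invented): every pool thread blocks inside the same lock waiting for a completion that itself needs a free pool thread, so once the pool is saturated nothing can release the lock.

using System.IO;
using System.Threading.Tasks;

// Purely illustrative model of the suspected starvation pattern - not the driver's code.
internal static class StarvationSketch
{
    private static readonly object _gate = new object();

    public static void BlockingSend(Stream stream, byte[] packet)
    {
        lock (_gate)   // Monitor.Enter: blocks the whole pool thread, not just the task
        {
            Task pending = stream.WriteAsync(packet, 0, packet.Length);
            // Blocking here waits on completion work that may itself need a free
            // pool thread - the kind of thread we are currently occupying.
            pending.GetAwaiter().GetResult();
        }
    }
}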

The actual code change here should be quite small, but given how fundamental it is I'd like to get a wider view on the idea before proceeding. Are there any key landmines here, or is there anything I need to consider on this?

@JRahnama added this to Needs triage in SqlClient Triage Board via automation Mar 20, 2024
@Wraith2
Contributor

Wraith2 commented Mar 20, 2024

We also allocate a new TaskCompletionSource in ConcurrentQueueSemaphore on every read or write, which could also be eliminated if we hoist the lock upwards.

No we don't.

// try sync wait with 0 which will not block to see if we need to do an async wait
if (_semaphore.Wait(0, cancellationToken))
{
    return Task.CompletedTask;
}
else
{
    // slow path (elided): only here is a TaskCompletionSource allocated and queued to wait on
}

We only create a TCS if we need one to wait on something and we quite rarely need to wait.

At the moment though, we lock at the physical level on SNISslStream and SNINetworkStream using a ConcurrentQueueSemaphore. This protects against multiple callers writing to the streams at the same time, but we never expose the streams directly to clients

The logical layer multiplexes multiple logical handles on top of a single physical handle so it's entirely possible and theoretically normal for users to be operating on commands or readers which are connected through logical handles to a single physical handle. We currently synchronize at the last possible point which allows the maximum scope for users to work in parallel. Moving that up could have observable behavioural changes, you'd have to try it and see. In order to detect if any change you make could cause a change you'd have to have tests which completely characterize the current possible interaction patterns and we don't have that.

if a request requires multiple packets to be sent to the server then those packets need to be sent before another request can be started on the same logical connection

Apart from attention packets, which are out of band. Each packet is still atomic, but out-of-band packets go straight to the top of the queue.

Attention packet can never interrupt a packet being sent, Attention packets wouldn't be able to opt out of packet-level locking as they currently can. I don't think this capability is actually used though

It should be how cancellation is implemented. IIRC there is an out-of-band flag on the packet which indicates that it can skip any queued packets and be sent next.

I think it'd be clearer and less contentious to move the locking to SNITcpHandle and SNINpHandle,

What workloads have you used to get contention information? I'd be interested to see what situations cause contention. When I've profiled fairly standard single-thread, single-user scenarios I've not found lock contention to be a cause of performance issues other than in connection pooling.

Are there any key landmines here, or is there anything I need to consider on this?

Yup, behavioural changes in multithreaded code. Sure it's only the one mine but it's a big explosion and it's utter hell to debug.

I'm interested in how the low-level locking got your attention above the parser lock. I usually consider parser locking and the interaction between internal async and sync paths through the parser to be the really complex part of the async problem. I started looking at ways to implement async OpenTransaction and found I couldn't identify a way to do it.

@edwardneal
Author

Thanks @Wraith2 - you're right about the TCS allocation, I'd misread that point. I agree that contention on the underlying transport's pretty unlikely. I'm stating the obvious here, but I expect it'd be triggered by high required throughput over a high-latency or low-bandwidth connection; that'll always be slow, but if we can ensure that the packets are sent in-order at the layer above, we can remove the lock on the stream and entirely eliminate the allocation and lock contention in this case.

The logical layer multiplexes multiple logical handles on top of a single physical handle so it's entirely possible and theoretically normal for users to be operating on commands or readers which are connected through logical handles to a single physical handle. We currently synchronize at the last possible point which allows the maximum scope for users to work in parallel. Moving that up could have observable behavioural changes, you'd have to try it and see. In order to detect if any change you make could cause a change you'd have to have tests which completely characterize the current possible interaction patterns and we don't have that.

We do, and I've been looking at MARS a bit here. MARS also generates its own lock contention which is difficult to isolate from that of the underlying connection. I'm testing a few adjustments to try to reduce that and get a clear benchmark.

I'm not sure I agree with the idea of synchronising at the last possible point though. We currently guarantee that reads and writes to the stream are going to happen under two mutexes, but as you've said: attention packets go to the head of the queue anyway. The transport won't know about that, so we've immediately got to implement synchronisation on the layer above anyway - and if we're already syncing on the logical handles, there's an implicit guarantee that the physical handles are also synchronised.

This isn't currently the case: while SNITcpHandle.Send implements the normal synchronisation process, SendAsync just performs an immediate write to the stream. That's not a complicated change to make... but if it introduces more thread-level locks on task-level async, it's going to narrow the thread pool and hurt performance.

apart from attention packets which are out of band. each packet is still atomic but out of band packets go straight to the top of the queue.

It should be how cancellation is implemented. IIRC there is an out-of-band flag on the packet which indicates that it can skip any queued packets and be sent next.

Yes, I agree. There are two ways to cancel a request: if all of the SNI packets have been sent, an Attention packet will be marked as out-of-band and immediately sent to the server to cancel the request. If not all of the SNI packets have been sent, the next one can have the relevant status bits set, and no further packets need to be sent. That logic's handled within TdsParserStateObject.WritePacket.
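
As an aside, here's a heavily simplified sketch of that decision. The names (PacketStatus, AllPacketsSent and so on) are invented for illustration; the real logic lives in TdsParserStateObject.WritePacket and uses the driver's own types.

using System;

// Illustrative only - not the actual cancellation code.
internal sealed class CancellationSketch
{
    [Flags]
    internal enum PacketStatus : byte { EndOfMessage = 0x01, IgnoreThisEvent = 0x02 }

    internal bool AllPacketsSent;
    internal PacketStatus NextPacketStatus;

    internal void Cancel()
    {
        if (AllPacketsSent)
        {
            // The request is fully on the wire: send a separate Attention packet,
            // marked out-of-band so it can jump ahead of any queued packets.
            SendAttention(outOfBand: true);
        }
        else
        {
            // Still mid-request: set the relevant status bits on the next packet
            // instead; nothing extra needs to be sent.
            NextPacketStatus |= PacketStatus.IgnoreThisEvent | PacketStatus.EndOfMessage;
        }
    }

    private void SendAttention(bool outOfBand) { /* transport-specific, omitted */ }
}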

I'm interested in how the low level locking got your attention above the parser lock. I usually consider parser locking and the interaction between internal async and sync paths through the parser to be the really complex part of the async problem.

The parser lock did grab my attention and I agree that it needs to be changed, but it's also shared between the managed and the native SNI implementations, and between the Core and Framework projects. I don't know whether the TDS parser has behavioural parity between the two projects, and the relevant files don't diff well enough for me to find out.

For me to look at the parser lock, I think we'd need to have merged TdsParserStateObject, TdsParser and SqlCommand. This would probably mean also switching the .NET Framework project over to a TdsParserStateObjectManaged implementation first. I'm open to helping with that, but it's not a quick process.

I'm thinking about the low-level locking in the context of end-to-end async support, lining up the sync primitives (i.e. making sure that they yield the thread to the next task rather than blocking the entire thread pool thread.) I'm also sanity-checking the places where we lock to see how they line up with the TDS spec, which is where this issue comes from.

More broadly around the async problem: when I try to work inwards from the public-facing methods then I run into TdsParser and the parser lock immediately, and I find the callbacks become difficult to reason on. I find it's simpler to work upwards from the physical transport because the managed implementation only exists in the .NET Core project, and it's built on Stream.ReadAsync (which I trust not to have any async-over-sync issues!) The TdsParserStateObject interface also works as a reasonable layer of abstraction to mask TdsParser (and its differing implementations between Core and Framework) even if that interface means the sync-over-async issues are pushed up to it...

@Wraith2
Contributor

Wraith2 commented Mar 21, 2024

If you want to make changes in the managed MARS implementation you may want to start with my rewrite that got reverted (#1357); it passed all the testing that we have, but something was amiss.

All the async stuff in this library is hard to reason about. Merging the codebases, which we've been working towards for years, is required before that effort is even worth starting sensibly.

What is it about the IO-level locks that is causing problems? Are they simply offensive somehow?

@edwardneal
Author

Thanks, I'll take a look at that to see how it works.

The IO locks are part of a larger set of ideas/work; they're just a useful place to start. They're built atop .NET rather than on another layer of the library, they're restricted to the .NET Core projects (so they won't come into conflict with the .NET Framework one), and they're close enough to the bare protocol that I can reason about them more easily and refer to the TDS protocol documentation. The actual goal is to try to unpick some of the extra complexity in the async area of the library, and I think the most sensible approach is to work upwards from the physical layer, layer by layer: cleaning up thread-level sync primitives, making sure that any locks are in the right place, adding any ValueTask-based methods required, and so on. At the moment the lock placement at the physical layer doesn't precisely line up with the spec, so if this is masking any possible synchronisation bugs then I'd like to find and fix them in the existing logic before starting to replace the existing async logic with ValueTask-based methods.

I understand your point about the codebase merge, and I've got a number of PRs open which should deal with most of the low-hanging fruit. There'll be a second round of AE-related classes once .NET Standard is dropped, which I think leaves about a dozen of the largest files.

@DavoudEshtehari moved this from Needs triage to Ideas for Future in SqlClient Triage Board Mar 26, 2024
@DavoudEshtehari added the 💡 Enhancement (New feature request) and 🙌 Up-for-Grabs (Anyone interested in working on it can ask to be assigned) labels and removed the untriaged label Mar 26, 2024
@edwardneal
Copy link
Author

I've tested a few ideas for the last few days, but not had much success. I don't think it's going to be practical to lift the stream-level lock up to the packet - the SNIHandle interface expects ReceiveAsync and SendAsync to return SNI_SUCCESS_IO_PENDING, then to perform the IO in a background task. I could remove the lock from the streams and then manually daisy-chain tasks together for the lock entry/exit, but it's largely pointless - reading and writing the packets only calls ReadAsync/WriteAsync on the stream once, so it'd just make the code more complex and make no difference to functionality.
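
For context, the shape of that contract is roughly as follows. This is a simplification written for this discussion, not the actual SNITcpHandle source, and the constant value and names are illustrative; the point is that a lock taken inside the method can't simply be released when the method returns, because the IO completes later on another thread.

using System.IO;
using System.Threading.Tasks;

// Simplified sketch of the "return pending, complete later" contract - not real SNI code.
internal sealed class PendingIoSketch
{
    private const uint SNI_SUCCESS_IO_PENDING = 997;   // illustrative value
    private readonly Stream _stream;

    public PendingIoSketch(Stream stream) => _stream = stream;

    public uint SendAsync(byte[] packet)
    {
        // Start the write and return immediately; completion is reported later
        // through a callback, on whatever thread finishes the IO.
        _stream.WriteAsync(packet, 0, packet.Length)
               .ContinueWith(t => OnSendComplete(t.IsFaulted ? 1u : 0u));
        return SNI_SUCCESS_IO_PENDING;
    }

    private void OnSendComplete(uint sniError) { /* hand the result back to the caller's callback */ }
}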

I'm going to test a slightly more impactful approach:

  • Replace the Monitor.(Try)Enter calls and lock statements with ConcurrentQueueSemaphore. This'll still block the thread, but it'll at least do so explicitly.
  • Check to see if SendAsync's locking for out-of-band packets needs to align with Send's.
  • Add the more modern task-based SendAsync and ReceiveAsync methods to the SNIHandle implementation which don't call the async IO handlers and which only block the task while they wait (a rough sketch follows this list). Even though nothing will make use of these, it'll lay the foundation for later work.
  • Take a closer look at the MARS handles and connections. A lot of the locks here make sense, but I think some parts can be made lockless.
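
A rough sketch of what one of those task-based methods could look like - a hypothetical shape only, not an existing SNIHandle API; Memory/ValueTask are chosen here so the synchronous-completion path doesn't allocate:

using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical shape only. The wait happens with await (yielding the thread)
// rather than Monitor.Enter, and the stream IO is awaited directly instead of
// going through the callback-based async IO handlers.
internal sealed class TaskBasedHandleSketch
{
    private readonly Stream _stream;
    private readonly SemaphoreSlim _receiveGate = new SemaphoreSlim(1, 1);

    public TaskBasedHandleSketch(Stream stream) => _stream = stream;

    public async ValueTask<int> ReceiveAsync(Memory<byte> buffer, CancellationToken ct = default)
    {
        await _receiveGate.WaitAsync(ct).ConfigureAwait(false);
        try
        {
            return await _stream.ReadAsync(buffer, ct).ConfigureAwait(false);
        }
        finally
        {
            _receiveGate.Release();
        }
    }
}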

Separately I've also collected a few performance improvements for the connection opening/pre-login process along the way. I need to benchmark these properly, but I'll submit a separate PR as general performance enhancements.

@edwardneal
Author

After more testing, I'm going to wait for the codebase merge to be completed before making the lower-level locking changes.

I have a reasonable first draft which made MARS handles and connections almost completely lockless, and this pulled the locking for all SNIHandle-derived classes into a spec-consistent state. This failed performance testing though: SqlDataReader.ReadAsync would sometimes take significantly longer on MARS connections (where it previously didn't.) Removing the locks made it more likely that threads could make progress, and this made it more likely that TdsParserStateObject.ReadSniSyncOverAsync would be called, which eventually blocked the thread pool for long enough that the command timed out. This was particularly common for reading (n)varchars, as would be expected.

The rest of the performance benchmarking was fairly promising: a lot of the execution time variations subsided, and the execution times for the DataTypeReaderRunner and DataTypeReaderAsyncRunner benchmarks reached parity (shifting from 17ms to 11ms on my laptop on one execution.) Memory usage was very variable, largely as a result of extra state objects/task continuations being allocated.

To go any further, the async-over-sync issue in TdsParserStateObject needs to be removed: without this, any synchronisation improvements will just make thread pool exhaustion more likely in upstream code. Since that can't be cleanly addressed without the codebases being merged, I'll put this on hold until that's done.

@Wraith2 - I think this highlights that although working upwards from the physical-level locking can have some pretty significant performance improvements, your approach of working downwards from the parser lock is lower risk. Once the existing codebase merge-related PRs are merged, I'll help with TdsParser so we can start working through the async enhancements and unblock this.
