PubSub Subscriber with StreamingPull Suddenly Stops Regularly #11793
Comments
Can you please share a minimal but complete project that would reproduce the problem? A console application should be fine. I understand your theory requires something specific to happen to trigger the issue, so this repro project may not reproduce the problem consistently; that's fine. I'd still expect it to reproduce the issue if we run it for long enough. We need to see exactly what your code looks like - for instance, you mention |
My apologies, I meant StartAsync. Will follow up with a sample console app. |
Thanks for the code and the detailed explanation. One remaining question: which Google.Cloud.PubSub.V1 version are you using, exactly? I'll try to get to this later today or tomorrow. |
3.7.0 - thanks! We deployed the change to OrderingKey yesterday; it usually takes a day to see the StreamingPull drop off. Will be checking for the remainder of the week to see if it follows the same pattern as before. |
It seems we are still in the same situation even after changing the OrderingKey to the file prefix ("FileTypeA", "FileTypeB", ...) instead of "FileTypeA_20240228", "FileTypeA_20240229", "FileTypeB_20240228", "FileTypeB_20240229", ... Every time the daily batch of messages is processed (approx. 9 individual files per day with 5 different file prefixes, originating from a batch of files uploaded to a bucket), there are fewer open streaming pulls, until eventually they stop altogether. If it helps, some of the message processing requires a ModifyAckDeadline because it takes longer than 10 minutes. Perhaps this is part of the problem, perhaps not. |
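For illustration (a sketch of the pattern described above, with made-up project and topic names rather than anything from this thread), publishing with a prefix-based ordering key looks roughly like this:

```csharp
using Google.Cloud.PubSub.V1;
using Google.Protobuf;

// Sketch: publish a daily batch file keyed by file prefix rather than the full file name.
var publisher = await new PublisherClientBuilder
{
    TopicName = TopicName.FromProjectTopic("my-project", "daily-files"),
    // Ordering keys are only honoured when message ordering is enabled on the publisher.
    Settings = new PublisherClient.Settings { EnableMessageOrdering = true }
}.BuildAsync();

await publisher.PublishAsync(new PubsubMessage
{
    OrderingKey = "FileTypeA",                                  // previously "FileTypeA_20240228"
    Data = ByteString.CopyFromUtf8("FileTypeA_20240228.txt")    // e.g. the file name as payload
});
```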
Thanks for the extra info. I'll take a look as soon as I can but just for expectations it will probably be early next week. |
Assigning to @jskeet as he's more familiar with this library than I am. |
Okay, I'm looking into this now - I'm afraid my experience with diagnosing Pub/Sub issues is that there'll be a lot of back and forth with various questions and tests before we get to the root problem. Thanks so much for all the info you've already provided. I can explain one thing off the bat: the reason that

Now, to understand your situation a bit better:
All of this is just trying to play divide-and-conquer at the moment (and giving me enough information to reproduce the issue myself, which will massively reduce the back-and-forth). I don't have any concrete ideas about what's happening at the moment - but if I can reproduce the issue myself, I can easily add a load of logging etc. I'm going to try a simple repro to start with of just starting a single client on my Windows box, using the same topic configuration that you've got, and occasionally adding messages to it, with a mixture of "needs modify ack deadline" and "ack quickly". |
Okay, my simplest tests didn't show anything - which is pretty much what I expected. I'll try leaving the code running for multiple hours (if I can remember not to shut my laptop down!) before moving onto the next scenarios... |
Nearly 24 hours later, the subscriber is still running fine, and the number of open streaming pull requests is stable (goes up and down a bit, as expected, but returns to the right level). I've sent large and small batches of messages, with a mixture of different simulated processing times. No joy reproducing the problem yet :( (I'll stop this test now, as it's not getting anywhere.) |
Dropping in to add a very similar case we're encountering. We've got a few .NET apps on GKE that use the Pub/Sub queues. They're fairly high-volume queues - processing ~110k messages a day, peaking at around 3-4k messages an hour. It works fine on weekdays, but on weekends the number of messages published drops down a lot and we might get 1-2 an hour. We recently updated the PubSub NuGet package from So far our workaround is to just restart it, but we are looking at something more automated if we can't land on a fix in the upstream library. I hope this additional context might help with determining the cause of the issue. |
@Mihier-Roy: Are you able to reproduce the issue outside GKE? Given the mention of Docker boundaries in the linked NodeJS issue, I'm wondering whether I might be able to reproduce this in a "simple" Docker container. (Rather than having to deploy to GKE.) Will give it a try. Any more information you can provide would be really useful - see my earlier comments for the sort of thing I'm interested in. (If you could let us know your topic configuration too, that'd be great. Things like message ordering and exactly-once delivery have some interesting effects in terms of how the clients behave.) |
@jskeet: I haven't tried to reproduce outside of GKE just yet, but I'll take a stab at it today/tomorrow! I hope it's something we can reproduce in a Docker container. Some additional info:
As for topic configuration:
|
Getting a break at last from the day-to-day to respond. @jskeet - your name is familiar; I think I've read your writing or watched some videos in the past. Anyhow, here's some info you requested and some findings of our own: we're definitely running in Linux containers with the .NET 6 runtime - mcr.microsoft.com/dotnet/aspnet:6.0, in fact. Thanks for clarifying about the Build vs Run context for the cancellation token; it seems relevant to our workaround. As we pored over the code, that token was quite difficult to track down. There are many cancellation tokens in play, and tokens created from others, IIRC (it was many lines of code ago). We found a workaround that seems stable. In a shared library, we've had a method that sets up the subscriber.
Previously:
Because if the process was being shut down gracefully - during deployment, for example - we'd want the subscriber to shut down gracefully as well. We now have the following implementation in which we no longer pass the ambient cancellationToken to BuildAsync.
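The snippet itself isn't shown above, so as a rough sketch of the change being described (the subscription and project names are placeholders):

```csharp
using Google.Cloud.PubSub.V1;

var builder = new SubscriberClientBuilder
{
    SubscriptionName = SubscriptionName.FromProjectSubscription("my-project", "my-subscription")
};

// Previously: var subscriber = await builder.BuildAsync(stoppingToken);
// Now the ambient stoppingToken is not passed, so its cancellation cannot affect the build step.
var subscriber = await builder.BuildAsync();
```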
However, since StartAsync doesn't take a cancellation token and we're awaiting, there doesn't seem to be a way to tell the subscriber client to stop. Our consumer is a BackgroundService
Background service is registered as a Singleton service
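Presumably the registration looks something like this (a sketch with a hypothetical type name, since the shared-library code is generic):

```csharp
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;

var host = Host.CreateDefaultBuilder(args)
    .ConfigureServices(services =>
    {
        // The consumer is registered as a singleton, and the same instance runs as the hosted service.
        services.AddSingleton<PubSubConsumerService>();
        services.AddHostedService(sp => sp.GetRequiredService<PubSubConsumerService>());
    })
    .Build();

await host.RunAsync();
```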
As I said, this is in a library that we use for several services to wire up subscribers, so it uses generics. A more concrete implementation might be easier to debug. Our fix works, so... I think what was happening is this: since we were passing the stoppingToken, I'm thinking something somewhere was flipping stoppingToken.IsCancellationRequested to true, and our re-subscribe loop exited, causing it to stop altogether. This is speculation on my part, but it led to a solution that has been working for some time now. However... something isn't quite right with this scenario. What's causing it to stop? This single change to NOT pass that token into BuildAsync fixed the problem. One error was logged recently which may (or may not) be related; this was after our fix was in place. We don't retain logs long enough to have any from before our fix.
|
@philvmx1: Your new code is still somewhat confusing. Firstly, the cancellation token here is pointless:

```csharp
var subscriberCancellationToken = new CancellationToken();
var subscriber = await subscriberClientBuilder.BuildAsync(subscriberCancellationToken);
```

That's just equivalent to not passing in a cancellation token:

```csharp
var subscriber = await subscriberClientBuilder.BuildAsync();
```

Secondly, this comment is pretty misleading in its current location:

```csharp
// DO NOT USE THE CANCELLATION TOKEN THAT IS PASSED INTO THIS METHOD.
// We need a new cancellation token because the SubscriberClient can cancel the token itself. If the
// SubscriberClient cancels, we want to restart the subscription. The caller of ProcessMessages should re-invoke this
// method when it completes.
```

It looks like you're talking about the cancellation token passed into

The task returned by

If you want a kind of

Note that in your original code, when you were passing |
@philvmx1: Separately, regarding this:
That looks like it could very well be related. That certainly looks like it's a single channel failing - which would then (if I'm reading the code correctly) stop the whole

Of course, that just means that the next question is why you're seeing that exception. That's where it may be GKE-specific. I've tried a long-running task in plain Docker (not running on GCP) and it's working fine for me. I'd still be really interested in knowing whether you can reproduce the error at all when not running in GKE. Additionally, if you're collecting metrics about when you're restarting the |
@jskeet I realized that the comment is not correct now that you've cleared up that the cancellation token in BuildAsync is not passed to the built client. It doesn't quite explain why it seems to have solved our problem. There must be something else; I'll look closely at whether anything else changed. What I can't quite understand is how the client doesn't reconnect on its own after the other side hangs up. Or if it can't, why it wouldn't throw, in which case we would catch, log an error (which we never did see in the logs, or this would have been much clearer to us) and StartAsync again - even in the old code. |
I believe that's what it is doing, effectively, by making the task returned by

I would expect errors like this to usually indicate a fatal problem - something we wouldn't want to just retry, or at least not for long. (It's possible that there's already retry going on, although we normally don't retry for streaming calls.) If the application wants to retry the whole subscription, that's up to the app (and it sounds like you're doing that). It's hard to say anything more until we can reproduce the failure though. |
If StartAsync completed in any state, we should expect to see the Warning logged at the end of the following (our log level is >= Warning).
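The code block itself isn't reproduced above; a sketch of the kind of loop being described (the builder, handler, and logger fields are assumed, not taken from the thread):

```csharp
protected override async Task ExecuteAsync(CancellationToken stoppingToken)
{
    while (!stoppingToken.IsCancellationRequested)
    {
        try
        {
            var subscriber = await _subscriberClientBuilder.BuildAsync();
            // Completes when the subscriber stops, whether cleanly or with a fault.
            await subscriber.StartAsync(_handler);
        }
        catch (Exception ex)
        {
            _logger.LogError(ex, "Subscriber faulted");
        }

        // If StartAsync completed in any state, this should appear (log level >= Warning).
        _logger.LogWarning("Subscriber stopped; re-subscribing");
    }
}
```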
|
So how did you capture the log you included in #11793 (comment)? I'd assumed that was already from awaiting StartAsync. Note that the code you've shown above wouldn't log a warning if

Rather than getting snippets of code in various places, I'd really like to go back to a minimal but complete example which actually demonstrates the problem. It's not clear to me what you observed with the code in #11793 (comment). Did you actually run that code and see the problem, or was it just meant to be sample code to show how you set things up? (It's really important that we end up with shared code which does definitely demonstrate the problem.) |
Error should have been logged in the BackgroundWorker in #11793 (comment) via

Understood about the working sample; I am working on creating something that models the problem with much less abstraction. I do have test environments to run it in. What I don't have is a lot of dedicated time for this, since I'm running a project with a fixed deadline, so I have to squeeze in time to try things, build isolated test code, and dig for additional information. Digging into metrics in DataDog, we found that each gRPC POST lasts 15 minutes until an error. Interestingly, the number of errors tapers off over time, as does the number of requests. Minimal problem-model code I have so far:
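The snippet didn't survive here, so purely as an illustrative stand-in (a minimal .NET 6 console subscriber with a simulated slow handler; the project and subscription names are placeholders, not the poster's actual code):

```csharp
using Google.Cloud.PubSub.V1;

var subscriber = await new SubscriberClientBuilder
{
    SubscriptionName = SubscriptionName.FromProjectSubscription("test-project", "test-subscription")
}.BuildAsync();

await subscriber.StartAsync(async (msg, ct) =>
{
    Console.WriteLine($"Received {msg.MessageId} (ordering key '{msg.OrderingKey}')");
    await Task.Delay(TimeSpan.FromSeconds(30), ct);   // simulate slow processing
    return SubscriberClient.Reply.Ack;
});
```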
csproj
|
@philvmx1: Any updates with more logging? |
@jskeet I had it running in a dev env until Friday when trunk deployed over it. Logging in the dev env showed a lot of detail but no issues. Even then, it didn't model prod very well since we weren't doing any daily file drops there. We haven't seen the issue in production since Feb 21, which has been good for us but not for understanding the issue at hand. Not understanding what caused the instability makes it difficult to accept that it's stable going forward. At the moment, we don't have any other information. The only notable change RE Pub/Sub is going from OrderingKey FileName (FileType__.txt) to OrderingKey FilePrefix (FileType). That made its way into prod on Feb 29. Is there something in the OrderingKey algo that could cause streaming pulls to cease pulling as the number of OrderingKeys increases? I made some prelim tests on that theory in dev but I used random ordering keys not TestFile_20240101_1010.txt, TestFile_20240102_1001.txt, ... |
No, I wouldn't expect so - although I can't say I'm an expert on the precise details of all the options in Pub/Sub. I'm still hopeful that this was really due to the perpetual retry loop we can currently get into if auth fails with an error of "internal" (which we're still discussing internally, in terms of the best fix). |
Any more information on this? I'm still hopeful that this is due to the perpetual retry on certain types of auth failure - which we're tracking internally. |
This is the first part of addressing #11793. The next step will be to allow a custom retry predicate to be provided, which can examine the previous exceptions etc, with a default which will fail after a certain time, or after multiple internal failures for example. (The list of exceptions here could easily turn into a map of from "status code to number of exceptions".) We may also want to implement something similar for Publisher operations, but that can be done separately.
It's been a while, but I have new information regarding this issue. We put OrderingKey and only-once delivery on an existing pull subscriber. This is now causing the subscription to stop processing. As this one sees a much higher volume of messages, it happens much more quickly. We're thinking it's related to exceptions being bubbled up to StartAsync. Documentation for StartAsync says exceptions are handled as if the reply was a NACK. We're going to try returning NACK explicitly instead of bubbling the exception to see if that resolves the issue for us. |
Can you clarify exactly what you mean by "exceptions being bubbled up to StartAsync"? Do you mean that the subscriber task (returned by StartAsync) ends up being faulted? (Very similarly, when you say the subscription "stops processing" do you mean it just hangs like before, or that the task returned by StartAsync fails?) |
Clarifications:
stripped down version of code that describes the scenario:
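That code block isn't reproduced above; per the correction a couple of comments below, the handler catches failures, logs, and returns Nack, so the shape is presumably something like this (HandleAsync and _logger are hypothetical names):

```csharp
await subscriber.StartAsync(async (msg, ct) =>
{
    try
    {
        await HandleAsync(msg, ct);          // application-specific processing
        return SubscriberClient.Reply.Ack;
    }
    catch (Exception ex)
    {
        _logger.LogError(ex, "Handling failed; nacking {MessageId}", msg.MessageId);
        return SubscriberClient.Reply.Nack;  // repeated nacks eventually dead-letter the message
    }
});
```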
|
Sorry, I have a correction to make to the above comment - turns out we ARE returning NACK from our handler rather than rethrowing.
|
I'm confused - if the task is faulted, then it's not hanging like before - because before, the task didn't end up faulted, it just retried forever. Note that if the task is faulted, your while loop would end up throwing. |
Sorry, it's not faulting like I thought it was. Digging deeper into our own abstraction, we're catching, logging, and returning NACK in the Func passed to StartAsync. |
Okay... rather than me trying to assemble a complete picture from the multiple comments, could you add a new comment with a reasonably full description of what you're observing? I think that'll be a better way forward. (Note that this will probably be my last comment for the day - I stopped actual work a couple of hours ago.) |
To clarify/correct myself even further: our subscriber hangs when we frequently reply with NACK while using message ordering and only-once delivery. After 15 attempts, each NACKed, these messages go to the DLQ. There is a strong correlation between messages going to the DLQ and pulls halting. |
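For reference, a subscription combining message ordering, only-once delivery, and a dead-letter policy is created along these lines (a sketch only; the project, topic, and MaxDeliveryAttempts values are placeholders rather than the configuration used in this thread):

```csharp
using Google.Cloud.PubSub.V1;

var api = await SubscriberServiceApiClient.CreateAsync();
await api.CreateSubscriptionAsync(new Subscription
{
    SubscriptionName = SubscriptionName.FromProjectSubscription("my-project", "my-subscription"),
    TopicAsTopicName = TopicName.FromProjectTopic("my-project", "my-topic"),
    EnableMessageOrdering = true,
    EnableExactlyOnceDelivery = true,
    AckDeadlineSeconds = 60,
    DeadLetterPolicy = new DeadLetterPolicy
    {
        DeadLetterTopic = TopicName.FromProjectTopic("my-project", "my-dlq-topic").ToString(),
        MaxDeliveryAttempts = 15
    }
});
```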
Able to repro! Set up the program so it NACKs all messages, create a topic, publish a message with any ordering key used previously... not pulled! Did additional testing with a simple project that can repro the issue. The heart of the logic is this:
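That block isn't shown here either; from the description that follows, the heart of it is presumably along these lines (the ack/nack decision keyed off the ordering key; names are illustrative):

```csharp
await subscriber.StartAsync((msg, ct) =>
{
    // Nack anything whose ordering key starts with "b" so it eventually dead-letters; ack the rest.
    var reply = msg.OrderingKey.StartsWith("b")
        ? SubscriberClient.Reply.Nack
        : SubscriberClient.Reply.Ack;
    Console.WriteLine($"{msg.OrderingKey}: {reply}");
    return Task.FromResult(reply);
});
```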
Then, in the web console I publish messages with OrderingKey either "a1" or "b1". Once a "b1" is NACK'd all 5 tries, it goes to the DLQ, but no other "b1" can be pulled by either the locally running app or the web console. When I stop the locally running app, the web console almost immediately pulls the "b1". It seems like the PubSubClient is not "releasing" the OrderingKey when the message has been NACK'd the maximum number of times and gets sent to the DLQ. How would it? Does the client have a way to know that the message will go to the DLQ this time? I tried throwing an exception instead of returning NACK, but the client just swallows that and treats it as a NACK, which gets us into the same place again anyway. Additionally... if I publish multiple OrderingKeys that will NACK, such as b1, b2, b3, b4, b5, b6, ..., b10, it eventually just stops receiving new messages. This is happening sometime after b5 or b6. When I restart, it starts getting the new messages. UPDATE: I removed Only Once Delivery and it no longer stalled. So NACK + OrderingKey + Only Once is the combo that causes it to stall. |
Right, I've managed to reproduce that. I'm going to try without the dead-lettering now, just to see if that's relevant. Once I've got a really minimal example, I'll create a PR with code that anyone can run. |
#12961 is the repro. Sample output:
The last "recoverable error" part happens multiple times, but I don't think that's related to the issue. (I'll track that internally with a separate issue.) The important part is that we don't get the "Nack 2" message. |
(I won't be able to spend any more time on this today, but it's good that we've managed to reproduce it.) |
Okay, having played with this a bit more and given the subscriber a bit longer, I see the following pattern:
I usually see that the specific ordering key is "blocked" for about 2 minutes after the final nack - but I've seen longer and shorter times too. (And sometimes the gap between delivery attempts is on the order of a second - sometimes it's longer.) While I haven't seen any documentation for this (and I'll ask the Pub/Sub team to confirm that this is the intended behavior and document it), I don't think the client is doing anything wrong here. Now, this issue has gone through quite a few iterations - but does this description explain everything you're seeing? |
We weren't seeing a recovery like you're seeing, even after days. But then again, we have several to hundreds of OrderingKeys rather than just 1. What happens if you use more than one OrderingKey, such as "nack1", "nack2", "nack3", ... up to "nack6"? It seems to have made a difference when there are more than 4 or 5. |
Thanks for the suggestion - I'll give that a go (although it might be on Monday). |
NP. IIRC, it stopped getting "ack" too after that. |
Could you clarify precisely what you mean by that? |
I mean that it stopped receiving messages entirely after messages with several OrderingKeys went to the DLQ within a short window (on the order of minutes, perhaps). I didn't dig into this much since I have some deadlines coming up very soon. However, there were some very strange behaviors when sending a combination of "ack" and "nack" with different OrderingKeys, and some mixed in without an OrderingKey. I was basically doing exploratory testing at the time. We saw things like the same message being handled again even though it was already ACK'd. I would suggest sending OrderingKeys randomly chosen from a set of "ack1" - "ack6", "nack1" - "nack6", and "" (no ordering key). |
Just tried with 10 ordering keys, and I got the same behavior as before:
Will try with a mixture of acks and nacks (and no ordering key)... I'd rather not get into randomness though. |
This is a consistent problem for us and many others. My theory is that the server side hangs up in the middle of a keepalive, and the RpcException satisfies IsRpcCancellation at the line linked below, which shuts down the subscription.
Since we expect StartNew to continue to run and RPC connection issues to be resolved, this causes our workers to stop silently, leaving us dead in the water.
google-cloud-dotnet/apis/Google.Cloud.PubSub.V1/Google.Cloud.PubSub.V1/SubscriberClientImpl.SingleChannel.cs, line 232 (commit 2e2659b)