[PubSub Kafka] Dapr didn't reconcile missing subscription. Manual restart of dapr was required. #3350

Closed
olitomlinson opened this issue Feb 10, 2024 · 11 comments
Labels
kind/bug Something isn't working stale


olitomlinson commented Feb 10, 2024

runtime 1.12.2 (yes - I'm aware that 1.12.3 contains a subscriber fix, but I'm pretty sure that isn't the issue here)
EKS


  1. On Feb 9th, I noticed that one of our Apps wasn't working as expected. Upon further investigation, we found that a particular Consumer Group (ds-application.authz) was missing 1 (out of 3) Topic subscriptions in Kafka.

The ds-application.authz Consumer Group was expected to be subscribed to the following Topics:

ds-applicationdatabricks-created - subscribed as expected
ds-applicationrealm-created - subscribed as expected
ds-applicationrealm-user-created - missing

Note: The consuming service only has 1 replica/pod, and each of the 3 Topics above has only 1 partition, so it's a really simple set-up.


Just a note that the ds-applicationrealm-user-created Topic hasn't always been missing from the Consumer Group

I have logs from the 26th and 29th of January where I can see messages being processed from the ds-applicationrealm-user-created Topic as expected

time="2024-01-29T19:11:59.061442928Z" level=debug msg="Processing Kafka message: ds-applicationrealm-user-created/0/3 [key=]" app_id=authz component="general-purpose-pubsub (pubsub.kafka/v1)" instance=sciplat-svc-authz-deployment-664d59564c-228hh scope=dapr.contrib type=log ver=1.12.2


  2. My first thought was to check the sidecar logs to ensure that dapr was aware of the Topic (even if it was missing from the Consumer Group in Kafka). It was. Many logs list all three Topics as expected, for example:

time="2024-02-04T12:21:26.071093619Z" level=debug msg="client/metadata fetching metadata for [ds-applicationdatabricks-created ds-applicationrealm-created ds-applicationrealm-user-created] from broker luma-kafka-kafka-bootstrap.dotm-services:9092
" app_id=authz component="general-purpose-pubsub (pubsub.kafka/v1)" instance=sciplat-svc-authz-deployment-664d59564c-228hh scope=dapr.contrib type=log ver=1.12.2


  3. Given that other Apps in the system which make use of the same PubSub component were operating correctly, there was little more I could do here. There was no point in sending a canary message, given that the Consumer Group had lost its subscription.

Therefore, I restarted the deployment at approx 11:47am on the 9th

image

Immediately, ds-applicationrealm-user-created Topic joined the Consumer Group, as expected

image

Attached below are daprd logs from the sidecar for the last 7 days, so they cover the period both before and after the redeployment.

extract-2024-02-10T20_29_06.226Z.csv
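
The check described in step 2 can be repeated over a log extract like the one attached. The following is only a minimal sketch: it assumes each log line is available as a plain string (the layout of the CSV export is not shown in this issue), and the function names are made up for illustration. The two patterns mirror the daprd messages quoted above.

```python
import re

# Patterns mirror the two daprd log messages quoted in this issue.
METADATA_RE = re.compile(r"client/metadata fetching metadata for \[([^\]]*)\]")
PROCESSING_RE = re.compile(r"Processing Kafka message: (\S+)/\d+/\d+")


def topic_activity(log_lines):
    """Return (topics named in metadata fetches, topics with processed messages)."""
    known, processed = set(), set()
    for line in log_lines:
        match = METADATA_RE.search(line)
        if match:
            known.update(match.group(1).split())
        match = PROCESSING_RE.search(line)
        if match:
            processed.add(match.group(1))
    return known, processed


def silent_topics(log_lines):
    """Topics the sidecar knows about but never logged a processed message for."""
    known, processed = topic_activity(log_lines)
    return sorted(known - processed)
```

A topic reported by `silent_topics` is not necessarily unsubscribed (it may simply have had no traffic in the window, and the processing message is only emitted at debug level), so the result is a hint to go and check the Consumer Group in Kafka, not proof.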


Things of interest

Thing 1

Only after I did the redeployment (and normal service resumed) do I see a specific log which says:

"Subscribed and listening to topics: [ds-applicationrealm-created ds-applicationrealm-user-created ds-applicationdatabricks-created]"

This log does not appear in the days before the restart, which in itself is not a problem, as my log retention doesn't go back far enough to when the sidecar started up earlier in January.

However, it does go to show that whatever code path generated this log is essential to recovering the subscription.
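
To see when (and with which topics) that message was emitted in a log extract, something along these lines could be used. This is a sketch under assumptions: log lines are plain strings, and the helper names are hypothetical. Only the message text itself comes from the log quoted above.

```python
import re

# Message text as quoted above; the topic list sits inside the square brackets.
SUBSCRIBED_RE = re.compile(r"Subscribed and listening to topics: \[([^\]]*)\]")


def subscribe_events(log_lines):
    """Return the topic list of every 'Subscribed and listening' line, in log order."""
    events = []
    for line in log_lines:
        match = SUBSCRIBED_RE.search(line)
        if match:
            events.append(match.group(1).split())
    return events


def missing_from_last_subscribe(log_lines, expected_topics):
    """Expected topics absent from the most recent subscribe message.

    Returns None when the retained logs contain no subscribe message at all.
    """
    events = subscribe_events(log_lines)
    if not events:
        return None
    return sorted(set(expected_topics) - set(events[-1]))
```

A `None` result matches the situation described here, where the retained logs before the restart simply do not reach back to the original subscribe call.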

image

Thing 2

Broken Pipe

could be a red herring

You will see lots of "broken pipe" and "timeout" errors coming from dapr related to kafka. However, even after I restarted the deployment, the errors continued, even though messages are now being processed. Therefore I'm 99% sure that this is a red herring.

See the graph below, which shows the "broken pipe" errors still occurring after the restart.

image
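
The same before/after comparison can be made directly on the log lines. The sketch below assumes each line starts with the `time="..."` field seen in the logs above, that the timestamps are UTC, and that the restart time (approx 11:47am on the 9th) is also expressed in UTC, which the issue does not state. The function names are illustrative only.

```python
from datetime import datetime

TIME_PREFIX = 'time="'
ERROR_MARKERS = ("broken pipe", "timeout")


def parse_time(line):
    """Parse the leading time="..." field into a naive datetime (assumed UTC)."""
    if not line.startswith(TIME_PREFIX):
        return None
    raw = line[len(TIME_PREFIX):].split('"', 1)[0].rstrip("Z")
    base, _, fraction = raw.partition(".")
    # The logs carry nanoseconds; strptime's %f accepts at most 6 digits.
    micros = (fraction + "000000")[:6]
    return datetime.strptime(f"{base}.{micros}", "%Y-%m-%dT%H:%M:%S.%f")


def errors_around(log_lines, restart_at):
    """Count lines mentioning 'broken pipe' or 'timeout' before and after restart_at."""
    counts = {"before": 0, "after": 0}
    for line in log_lines:
        lowered = line.lower()
        if not any(marker in lowered for marker in ERROR_MARKERS):
            continue
        stamp = parse_time(line)
        if stamp is None:
            continue
        counts["before" if stamp < restart_at else "after"] += 1
    return counts
```

For this issue the call would look like `errors_around(lines, datetime(2024, 2, 9, 11, 47))`. A non-zero "after" count while messages are flowing again supports the red-herring reading.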
@olitomlinson olitomlinson added the kind/bug Something isn't working label Feb 10, 2024
@olitomlinson olitomlinson changed the title from "[PubSub] Dapr didn't reconcile missing kafka subscriber. Manual restart of dapr was required." to "[PubSub Kafka] Dapr didn't reconcile missing subscription. Manual restart of dapr was required." Feb 10, 2024
@KrylixZA

Is this potentially related to dapr/docs#3707?

@KrylixZA

I have seen this behaviour many times on my services that subscribe to multiple Kafka topics. We worked around the problem by creating a single Dapr component for each topic. This guaranteed that there was no misbehaviour of the consumer groups and running instances of the application.
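
For readers who want to try the one-component-per-topic workaround, the sketch below renders one Kafka pubsub component manifest per topic. It is an illustration, not the configuration used in the comment above: the `pubsub-<topic>` naming scheme, the `authType` of `none`, and the choice to pass the same consumer group to every component are all assumptions.

```python
# Minimal Dapr Kafka pubsub component manifest; values are filled in per topic.
COMPONENT_TEMPLATE = """\
apiVersion: dapr.io/v1alpha1
kind: Component
metadata:
  name: {name}
spec:
  type: pubsub.kafka
  version: v1
  metadata:
    - name: brokers
      value: "{brokers}"
    - name: consumerGroup
      value: "{consumer_group}"
    - name: authType
      value: "none"
"""


def component_name(topic):
    """Hypothetical naming scheme: one component per topic."""
    return f"pubsub-{topic}"


def per_topic_components(topics, brokers, consumer_group):
    """Map each topic to the YAML text of its own dedicated component."""
    return {
        topic: COMPONENT_TEMPLATE.format(
            name=component_name(topic),
            brokers=brokers,
            consumer_group=consumer_group,
        )
        for topic in topics
    }
```

Each subscription would then need to point its `pubsubname` at the component created for its topic, so the application and its subscriptions have to change along with the components.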

@olitomlinson
Author

> Is this potentially related to dapr/docs#3707?

Hmmm I don't think it is - On my Topics we only have 1 partition (for now), and just one replica of the consuming service, so the scenario is super super simple.

At first glance, the issue you raised in the Doc repo is touching on a more complex matter. I could be wrong of course!

@KrylixZA

> > Is this potentially related to dapr/docs#3707?
>
> Hmmm I don't think it is - On my Topics we only have 1 partition (for now), and just one replica of the consuming service, so the scenario is super super simple.
>
> At first glance, the issue you raised in the Doc repo is touching on a more complex matter. I could be wrong of course!

Nope, you're correct. If it's one replica and 1 partition, it's definitely not related.

@olitomlinson
Author

@KrylixZA Either way, I really appreciate you taking the time and effort to connect the dots!

@berndverst
Member

Are these dynamic subscriptions in code, or are these created declaratively (yaml)? That could make a huge difference for this issue. It might then be that the runtime does not consistently provide subscription information to each sidecar.

I'd encourage you to experiment with both approaches to see if this changes things for reproducing the problem.

My expectation is that declarative pubsub might be more error prone.
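
For anyone wanting to try the programmatic route for comparison: the app answers `GET /dapr/subscribe` with a JSON list of subscriptions, which the sidecar reads at startup. The sketch below uses only the Python standard library. The component name and topics are taken from the logs in this issue, while the `/events/...` routes are hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler

PUBSUB_NAME = "general-purpose-pubsub"  # component name as it appears in the logs
TOPICS = [
    "ds-applicationdatabricks-created",
    "ds-applicationrealm-created",
    "ds-applicationrealm-user-created",
]


def subscriptions():
    """Subscription list served to the sidecar; routes are made up for illustration."""
    return [
        {"pubsubname": PUBSUB_NAME, "topic": topic, "route": f"/events/{topic}"}
        for topic in TOPICS
    ]


class SubscribeHandler(BaseHTTPRequestHandler):
    """Serves only the subscription discovery endpoint."""

    def do_GET(self):
        if self.path != "/dapr/subscribe":
            self.send_error(404)
            return
        body = json.dumps(subscriptions()).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

A real app would also need to handle the `POST` requests the sidecar sends to each route, and would be started with something like `HTTPServer(("", 8080), SubscribeHandler).serve_forever()` on whatever port is configured as the app port.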

@olitomlinson
Author

I know we had the conversation on Discord, but I just want to add it here too for completeness.


Yes, we are using declarative subscriptions, but I don't believe this is related to the issue, as I can see the missing Topic echoed out in the logs, and those logs are emitted from components-contrib.

example :

time="2024-02-04T12:21:26.071093619Z" level=debug msg="client/metadata fetching metadata for [ds-applicationdatabricks-created ds-applicationrealm-created ds-applicationrealm-user-created] from broker luma-kafka-kafka-bootstrap.dotm-services:9092
" app_id=authz component="general-purpose-pubsub (pubsub.kafka/v1)" instance=sciplat-svc-authz-deployment-664d59564c-228hh scope=dapr.contrib type=log ver=1.12.2


github-actions bot commented Apr 7, 2024

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged (pinned, good first issue, help wanted or triaged/resolved) or other activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale label Apr 7, 2024
@olitomlinson
Author

Bump

@github-actions github-actions bot removed the stale label Apr 7, 2024

github-actions bot commented May 7, 2024

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged (pinned, good first issue, help wanted or triaged/resolved) or other activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale label May 7, 2024

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as pinned, good first issue, help wanted or triaged/resolved. Thank you for your contributions.

@github-actions github-actions bot closed this as not planned May 14, 2024
3 participants