[PubSub Kafka] Dapr didn't reconcile missing subscription. Manual restart of dapr was required. #3350
Comments
Is this potentially related to dapr/docs#3707?
I have seen this behaviour many times on my services that subscribe to multiple Kafka topics. We worked around the problem by creating a separate Dapr component for each topic. This guaranteed that there was no misbehaviour of the consumer groups and the running instances of the application.
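For reference, the workaround above amounts to declaring one `pubsub.kafka` component per topic, each with its own consumer group, so no two subscriptions share consumer-group state. A minimal sketch (the component name, broker address, and consumer group value here are illustrative placeholders, not taken from the reporter's setup):

```yaml
# Hypothetical example: one Dapr pubsub component per topic.
apiVersion: dapr.io/v1alpha1
kind: Component
metadata:
  name: pubsub-realm-user-created   # one component per topic
spec:
  type: pubsub.kafka
  version: v1
  metadata:
  - name: brokers
    value: "my-kafka:9092"             # placeholder broker address
  - name: consumerGroup
    value: "authz.realm-user-created"  # unique group per component
  - name: authType
    value: "none"                      # adjust auth to your cluster
```

Each subscribing topic then references its own component by name, at the cost of one extra Kafka consumer connection per topic.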
Hmmm I don't think it is - On my Topics we only have 1 partition (for now), and just one replica of the consuming service, so the scenario is super super simple. At first glance, the issue you raised in the docs repo is touching on a more complex matter. I could be wrong of course!
Nope, you're correct. If it's one replica and 1 partition, it's definitely not related.
@KrylixZA Either way, I really appreciate you taking the time and effort to connect the dots!
Are these dynamic subscriptions in code, or are these created declaratively (yaml)? That could make a huge difference for this issue. It might then be that the runtime does not consistently provide subscription information to each sidecar. I'd encourage you to experiment with both approaches to see if this changes things for reproducing the problem. My expectation is that declarative pubsub might be more error prone.
I know we had the conversation on Discord, but I just want to add it here too for completeness. Yes, we are using declarative subscriptions, but I don't believe this is related to the issue, as I can see the missing Topic is echoed out in the logs, and those logs are emitted from components-contrib. Example:

time="2024-02-04T12:21:26.071093619Z" level=debug msg="client/metadata fetching metadata for [ds-applicationdatabricks-created ds-applicationrealm-created ds-applicationrealm-user-created] from broker luma-kafka-kafka-bootstrap.dotm-services:9092
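For context, a declarative subscription in a setup like this would look roughly as follows. The pubsub component name, topic, and app id are taken from the log lines in this thread; the subscription name and route path are hypothetical placeholders:

```yaml
# Sketch of a Dapr declarative subscription (v2alpha1 CRD).
apiVersion: dapr.io/v2alpha1
kind: Subscription
metadata:
  name: realm-user-created-sub       # hypothetical name
spec:
  pubsubname: general-purpose-pubsub # component name seen in the logs
  topic: ds-applicationrealm-user-created
  routes:
    default: /realm-user-created     # placeholder endpoint on the app
scopes:
- authz                              # app_id seen in the logs
```

With declarative subscriptions the sidecar, not the app, is responsible for discovering these resources at startup, which is why a sidecar restart can change which topics end up in the consumer group.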
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged (pinned, good first issue, help wanted or triaged/resolved) or other activity occurs. Thank you for your contributions. |
Bump |
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as pinned, good first issue, help wanted or triaged/resolved. Thank you for your contributions. |
Dapr runtime version: 1.12.2 (yes - I'm aware that 1.12.3 contains a subscriber fix, but I'm pretty sure that isn't the issue here)
Environment: EKS
The Consumer Group (ds-application.authz) was missing 1 (out of 3) Topic subscriptions in Kafka. The ds-application.authz Consumer Group was expected to be subscribed to the following Topics:

- ds-applicationdatabricks-created - subscribed as expected
- ds-applicationrealm-created - subscribed as expected
- ds-applicationrealm-user-created - missing

Note: The consuming service only has 1 replica/pod, and there is only 1 partition in each of the 3 Topics above, so it's a really simple set-up.
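One way to confirm which topics a consumer group is actually assigned to is Kafka's own CLI (the broker address is a placeholder; run this from a pod or host that has the Kafka tools on its path):

```shell
# Describe the group's members and their topic/partition assignments.
kafka-consumer-groups.sh \
  --bootstrap-server my-kafka:9092 \
  --describe \
  --group ds-application.authz \
  --members --verbose
```

A healthy group here should list all three ds-application* topics in the assignment column; the missing topic simply won't appear.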
Just a note that the ds-applicationrealm-user-created Topic hasn't always been missing from the Consumer Group. I have logs from the 26th and 29th of January where I can see messages being processed from the ds-applicationrealm-user-created Topic as expected:

time="2024-01-29T19:11:59.061442928Z" level=debug msg="Processing Kafka message: ds-applicationrealm-user-created/0/3 [key=]" app_id=authz component="general-purpose-pubsub (pubsub.kafka/v1)" instance=sciplat-svc-authz-deployment-664d59564c-228hh scope=dapr.contrib type=log ver=1.12.2
time="2024-02-04T12:21:26.071093619Z" level=debug msg="client/metadata fetching metadata for [ds-applicationdatabricks-created ds-applicationrealm-created ds-applicationrealm-user-created] from broker luma-kafka-kafka-bootstrap.dotm-services:9092" app_id=authz component="general-purpose-pubsub (pubsub.kafka/v1)" instance=sciplat-svc-authz-deployment-664d59564c-228hh scope=dapr.contrib type=log ver=1.12.2
Therefore, I restarted the deployment at approx 11:47am on the 9th. Immediately, the ds-applicationrealm-user-created Topic joined the Consumer Group, as expected.

Attached below are daprd logs from the sidecar for the last 7 days, so logs are present before and after the redeployment.
extract-2024-02-10T20_29_06.226Z.csv
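For anyone hitting the same state, the manual recovery described above is just a rollout restart of the deployment so that daprd re-establishes its subscriptions on startup (the deployment name is inferred from the instance field in the log lines; the namespace is a placeholder):

```shell
# Restart the pods so the Dapr sidecar re-subscribes to all topics.
kubectl rollout restart deployment/sciplat-svc-authz-deployment -n my-namespace
kubectl rollout status  deployment/sciplat-svc-authz-deployment -n my-namespace
```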
Things of interest
Thing 1
Only after I did the redeployment (and normal service resumed) do I see a specific log which says:
"Subscribed and listening to topics: [ds-applicationrealm-created ds-applicationrealm-user-created ds-applicationdatabricks-created]"
This log does not appear in the days before the restart, which in itself is not a problem, as my log retention doesn't go back far enough to when the sidecar started up earlier in January. However, it does go to show that whatever code pathway generates this log is essential to recovering the subscription.
Thing 2
Broken Pipe (could be a red herring)
You will see lots of "broken pipe" and "timeout" errors coming from dapr related to Kafka. However, even after I restarted the deployment, the errors continued, even though messages are now being processed. Therefore I'm 99% sure that this is a red herring.
See graph below, which shows the "broken pipe" errors still occurring after the restart.