[azeventhubs] Calling `Run` again on an already errored `Processor` causes a panic #22785

PaulBernier · 2024-04-26T19:03:16Z

Bug Report

azure-sdk-for-go/sdk/messaging/azeventhubs/processor.go

Line 290 in aae700f

close(p.runCalled)
github.com/Azure/azure-sdk-for-go/sdk/messaging/azeventhubs v1.1.0
go version go1.22.0 darwin/amd64

When a Processor.Run returns with an error, I thought I could just restart that processor by calling Run again. Doing so results in a panic:

panic: close of closed channel

goroutine 46 [running]:
github.com/Azure/azure-sdk-for-go/sdk/messaging/azeventhubs.(*Processor).initNextClientsCh(0xc0002cd380, {0xf1e1668?, 0xc0000be6e0?})
        /Users/pbernier/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/messaging/azeventhubs@v1.1.0/processor.go:290 +0x145
github.com/Azure/azure-sdk-for-go/sdk/messaging/azeventhubs.(*Processor).runImpl(0xc0002cd380, {0xf1e1668, 0xc0000be6e0})
        /Users/pbernier/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/messaging/azeventhubs@v1.1.0/processor.go:249 +0xc7
github.com/Azure/azure-sdk-for-go/sdk/messaging/azeventhubs.(*Processor).Run(0xc0001c4700?, {0xf1e1668, 0xc0000be6e0})
        /Users/pbernier/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/messaging/azeventhubs@v1.1.0/processor.go:228 +0x1d

What did you expect or want to happen?

Maybe I missed it, but I did not see anything in the docs or comments saying that a Processor is not supposed to be reused. I want to confirm that the right way to "restart" is to create a new Processor.
Rather than an uncaught panic, a dedicated error should be produced.
Ideally the recovery process could be as simple as:

for {
	if err := processor.Run(processorCtx); err != nil {
		r.logger.Error("Processor error", zap.Error(err))
		continue
	}
	break
}


github-actions · 2024-04-26T19:03:55Z

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @jfggdl.

PaulBernier · 2024-04-29T17:55:27Z

A bit more context on why I even want to restart the processor: I implemented an AWS DynamoDB checkpoint store, and the calls can be throttled for some time until autoscaling kicks in. CheckpointStore failures propagate back to the processor, causing it to exit. But those errors are recoverable, as the checkpoint store will be able to recover. I know that the Processor could also fail on non-recoverable errors, such as an incorrect ConsumerGroup.

So the crux of what I am trying to solve here is really being able to restart the processor on CheckpointStore errors, if you can think of any proper handling of this.

richardpark-msft · 2024-04-29T18:27:59Z

@PaulBernier, I'm looking at this to see how difficult it would be as it doesn't seem unreasonable at all.

Now, for any CheckpointStore, you'll probably just want to add a more forgiving retry policy. It's not really something you need or want to do at the Event Hubs level. Does the AWS DynamoDB client have a configurable retry policy? In Azure we have a RetryPolicy that can get passed into our clients:

azure-sdk-for-go/sdk/azcore/policy/policy.go

Lines 87 to 88 in 03baedf

    
type RetryOptions struct {
	// MaxRetries specifies the maximum number of attempts a failed operation will be retried

PaulBernier · 2024-04-29T20:40:45Z

Yes, the AWS DynamoDB client has retries, but sometimes it can take more time than we would like to adjust. I agree with the direction of making the checkpointing more robust. But I still want a fail-safe approach, and restarting the Processor would be the last-resort means of reviving the consumption process.

For now I am implementing that last resort restart by recreating a new Processor, and that's fine, just more verbose/tricky than my ideal Processor restart:

for {
	if err := processor.Run(processorCtx); err != nil {
		r.logger.Error("Processor error", zap.Error(err))
		time.Sleep(20 * time.Second)
		continue
	}
	break
}

The panic I pointed out is the main thing I'd expect to be fixed from this issue, any other improvement would just be a bonus :)

richardpark-msft · 2024-05-03T23:39:59Z

@PaulBernier, the fix here was just to make the Processor indicate it's single-use only - once stopped, it can't be restarted.

It'll return proper errors now, with the release that's coming out next week.

jhendrixMSFT removed the question and Service Attention labels Apr 26, 2024

jhendrixMSFT assigned richardpark-msft Apr 26, 2024

github-actions bot added the needs-team-triage label Apr 26, 2024

jhendrixMSFT removed the needs-team-triage label Apr 26, 2024

richardpark-msft closed this as completed in a61ff96 May 3, 2024

richardpark-msft reopened this May 3, 2024

richardpark-msft closed this as completed May 3, 2024
