Customizable HTTP Server Actions (Sink) Retry Policies #12869

Open
jandillmann opened this issue Apr 12, 2024 · 2 comments

What would you like to be added or enhanced?

When using an HTTP Server Action (Sink), the response from the custom backend (connector, HTTP server) can either succeed (2xx response status code), fail (4xx or 5xx response status code), or time out.

For unsuccessful responses, the action is retried only when the HTTP server does not reply at all or returns a 429 Too Many Requests error; otherwise the message is discarded and counted as a failure. Apart from the timeout and the health check interval, there is no option to customize this behaviour.

It would be helpful to further customize this behaviour, for example:

  • By specifying which response codes or ranges should be treated as successful, retried, or failed (and discarded)
  • And/or by specifying a separate health check endpoint and the response codes that determine whether the HTTP server is online and accepting requests, or unavailable, in which case messages should be retried later.
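
To make the first bullet concrete, here is a minimal sketch (Python, purely illustrative; `classify` and its parameters are hypothetical names, not part of any EMQX configuration or API) of how a user-defined policy could map status codes or ranges to outcomes:

```python
def classify(status_code: int,
             success=(range(200, 300),),   # code ranges treated as success
             retry=(429, 503)) -> str:     # codes that should be retried
    """Classify an HTTP sink response under a user-configurable policy."""
    if any(status_code in r for r in success):
        return "success"
    if status_code in retry:
        return "retry"    # keep the message and try again later
    return "discard"      # count as a failure and drop the message

# For the KrakenD case described below, 500 could be added to the retry set:
assert classify(500, retry=(429, 500, 503)) == "retry"
assert classify(404) == "discard"
```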

Why is this needed?

The HTTP endpoint I'm working with is behind a KrakenD API gateway. KrakenD returns a 500 Internal Server Error when the backend is temporarily unavailable, and there does not seem to be a way to customize this (Lua scripts only work when the backend does send a response).

@zmstone (Member) commented Apr 14, 2024

Hi @jandillmann
I think 503 is a good indication of a temporary error; we could even consider it a bug that retry is not done for this error code.
In general, one should never retry a 500; however, I agree that we should probably add an option like "Always Retry (N times)",
i.e. the retry attempt counter should not exceed N.

In stream/event sourcing, so-called poison-pill messages are quite common: a message that causes the server to raise an exception or even crash.
When there are multiple consumers (HTTP servers, in the case of an HTTP sink) behind a load balancer (or gateway), retrying a poison-pill message without a limit may eventually bring down all the consumers.
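
As a sketch of that idea (a hypothetical helper, not EMQX code; it assumes the `requests` library is available), a bounded retry loop caps how many times a poison-pill message can be re-sent before it is dropped:

```python
import time

import requests  # assumed HTTP client; any client works the same way


def send_with_bounded_retry(url: str, payload: dict,
                            max_retries: int = 3,
                            retry_codes=(429, 503)) -> bool:
    """Deliver a message, retrying at most `max_retries` times so a
    poison-pill message cannot loop forever behind the load balancer."""
    for attempt in range(max_retries + 1):
        try:
            resp = requests.post(url, json=payload, timeout=5)
        except requests.RequestException:
            pass  # no reply at all: treat as retryable
        else:
            if 200 <= resp.status_code < 300:
                return True                   # delivered
            if resp.status_code not in retry_codes:
                return False                  # permanent failure, discard
        time.sleep(2 ** attempt)              # simple exponential backoff
    return False                              # retry budget exhausted, discard
```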

@jandillmann (Author)

Yes, I agree, 500s should not be retried indefinitely. But a configurable small number of retries might be OK for my use case – it is usually only once a day or even less often that a request does not go through. Retrying 429 and 503 sounds like a reasonable choice to me, too.

As for KrakenD returning a 500 instead of a more suitable 503 (or making it configurable), I will create an issue there.

kjellwinblad added a commit to kjellwinblad/emqx that referenced this issue Apr 25, 2024
Previously, if an HTTP request received a 503 (Service Unavailable)
status, it was marked as a failure without retrying. This has now been
fixed so that the request is retried a configurable number of times.

Fixes:
https://emqx.atlassian.net/browse/EMQX-12217
emqx#12869 (partly)