Customizable HTTP Server Actions (Sink) Retry Policies #12869

Open
jandillmann opened this issue Apr 12, 2024 · 2 comments

What would you like to be added or enhanced?

When using an HTTP Server Action (Sink), the response from the custom backend (connector, HTTP server) can either succeed (2xx response status code), fail (4xx or 5xx response status code), or time out.

For unsuccessful responses, the action is retried only when the HTTP server does not reply at all or returns a 429 Too Many Requests error; otherwise the message is discarded and counted as a failure. Apart from the timeout and the health check interval, there is no option to customize this behaviour.

It would be helpful to further customize this behaviour, for example:

  • By specifying which response codes or ranges should be treated as successful, retried, or failed (and discarded)
  • And/or by specifying a separate health check endpoint and the response codes that determine whether the HTTP server is online and accepting requests, or unavailable, in which case messages should be retried later.
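
To make the first bullet concrete, here is a minimal sketch (Python, purely illustrative; `classify` and its parameters are hypothetical names, not part of any EMQX configuration or API) of how a user-defined policy could map status codes or ranges to outcomes:

```python
def classify(status_code: int,
             success=(range(200, 300),),   # code ranges treated as success
             retry=(429, 503)) -> str:     # codes that should be retried
    """Classify an HTTP sink response under a user-configurable policy."""
    if any(status_code in r for r in success):
        return "success"
    if status_code in retry:
        return "retry"    # keep the message and try again later
    return "discard"      # count as a failure and drop the message

# For the KrakenD case described below, 500 could be added to the retry set:
assert classify(500, retry=(429, 500, 503)) == "retry"
assert classify(404) == "discard"
```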

Why is this needed?

The HTTP endpoint I'm working with is behind a KrakenD API gateway. KrakenD returns a 500 Internal Server Error when the backend is temporarily unavailable, and there does not seem to be a way to customize this (Lua scripts only work when the backend does send a response).

@zmstone (Member) commented Apr 14, 2024

Hi @jandillmann
I think 503 is a good indication of a temporary error; we could even consider it a bug that retry is not done for this error code.
In general, one should never retry a 500; however, I agree that we should probably add an option like "Always Retry (N times)",
i.e. the retry attempt counter should not exceed N.

In stream/event sourcing, so-called poison-pill messages are quite common: a message that causes the server to raise an exception or even crash.
When there are multiple consumers (HTTP servers, in the case of an HTTP sink) behind a load balancer (or gateway), retrying a poison-pill message without a limit may eventually bring down all the consumers.
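
As a sketch of that idea (a hypothetical helper, not EMQX code; it assumes the `requests` library is available), a bounded retry loop caps how many times a poison-pill message can be re-sent before it is dropped:

```python
import time

import requests  # assumed HTTP client; any client works the same way


def send_with_bounded_retry(url: str, payload: dict,
                            max_retries: int = 3,
                            retry_codes=(429, 503)) -> bool:
    """Deliver a message, retrying at most `max_retries` times so a
    poison-pill message cannot loop forever behind the load balancer."""
    for attempt in range(max_retries + 1):
        try:
            resp = requests.post(url, json=payload, timeout=5)
        except requests.RequestException:
            pass  # no reply at all: treat as retryable
        else:
            if 200 <= resp.status_code < 300:
                return True                   # delivered
            if resp.status_code not in retry_codes:
                return False                  # permanent failure, discard
        time.sleep(2 ** attempt)              # simple exponential backoff
    return False                              # retry budget exhausted, discard
```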

@jandillmann (Author)

Yes, I agree, 500s should not be retried indefinitely. But a configurable small number of retries might be OK for my use case – it is usually only once a day or even less often that a request does not go through. Retrying 429 and 503 sounds like a reasonable choice to me, too.

As for KrakenD returning a 500 instead of a more suitable 503 (or making it configurable), I will create an issue there.

kjellwinblad added a commit to kjellwinblad/emqx that referenced this issue Apr 25, 2024
Previously, if an HTTP request received a 503 (Service Unavailable)
status, it was marked as a failure without retrying. This has now been
fixed so that the request is retried a configurable number of times.

Fixes:
https://emqx.atlassian.net/browse/EMQX-12217
emqx#12869 (partly)