
lb_try_interval/lb_try_duration do not pick up new backends on config reload #4442

Closed · simonw opened this issue Nov 25, 2021 · 18 comments
Labels: feature ⚙️ New feature or request

simonw commented Nov 25, 2021

Given the following Caddyfile:

{
    auto_https off
}
:80 {
    reverse_proxy localhost:8003 {
        lb_try_duration 30s
        lb_try_interval 1s
    }
}

I start Caddy with caddy run and run a separate server on port 8003 (datasette -p 8003 in my case), and it proxies correctly. If I shut down the 8003 server and try to hit http://localhost/ I get the desired behaviour: my browser spins for up to 30s, and if I restart the 8003 server during that time the request is proxied through and returned from the backend.

What I'd really like to be able to do, though, is start up a new server on another port (in production, on another IP/port combination) and have traffic resume against the new server.

So I tried editing the Caddyfile to use localhost:8004 instead, started up my backend on port 8004, then used caddy reload to load in the new configuration... and my request to port 80 continued to spin. It appears Caddy didn't notice that there was now a new backend configuration for this reverse_proxy.

It would be really cool if that lb_try_interval/lb_try_duration feature could respond to updated configurations and seamlessly forward paused traffic to the new backend.

(This started as a Twitter conversation: https://twitter.com/mholt6/status/1463656086360051714)

mholt added the feature ⚙️ New feature or request label Nov 25, 2021
mholt (Member) commented Nov 25, 2021

Thanks for the interesting feature request. First I've heard of one like this! As mentioned on Twitter, I'm currently refactoring the reverse proxy so maybe you're in luck. But I'll have to design a solution first and see if it'll fit.

In the meantime, do you know the new backend addresses ahead of time? You could configure all the future backends that might be used, along with a few lines of health check configuration to ensure they're skipped from the rotation until they're available.
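For illustration, a rough sketch of what that could look like, adapted from your Caddyfile (localhost:8004 and localhost:8005 are placeholders for whatever the future backends would be, and the health check path and interval are just assumptions):

:80 {
    reverse_proxy localhost:8003 localhost:8004 localhost:8005 {
        lb_try_duration 30s
        lb_try_interval 1s

        # Active health checks mark unreachable backends as unhealthy
        # so the load balancer skips them until they start responding.
        health_uri /
        health_interval 2s
    }
}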

The way it works now: Config reloads are graceful, but they're also isolated, meaning that they won't affect anything from a previous config, including active connections. So Caddy does "notice" that there's a new backend configuration, but it won't upset existing connections.

simonw (Author) commented Nov 25, 2021

The way I'm currently planning to build this, I don't think I'll know the backend addresses ahead of time, sadly.

My current plan is to build with Kubernetes. I want to be able to do the following:

  • Launch a pod, and have pod1.example.com route traffic to that pod
  • Shut down that pod for replacement, such that any HTTP requests to pod1.example.com "pause" awaiting an available backend
  • Launch the replacement pod, which could have a totally different internal IP and port within the Kubernetes cluster
  • Reconfigure Caddy such that pod1.example.com points to the new pod
  • ... and ideally such that any paused HTTP requests are load-balanced through to the new backend, seamlessly, with the users only experiencing a few extra seconds of load time

simonw (Author) commented Nov 25, 2021

Oh I've just had the most horrible idea... could something like this work?

{
    auto_https off
}
pod1.example.com:80 {
    reverse_proxy pod1-actual.example.com:80 {
        lb_try_duration 30s
        lb_try_interval 1s
    }
}

pod1-actual.example.com:80 {
    reverse_proxy localhost:8003
}

The idea being that I can change the definition of pod1-actual.example.com and refresh the configuration and get the behaviour I'm looking for!

francislavoie (Member)

If you feel like watching Matt talk about it for 50 minutes 😅, he streamed the beginning of the work on refactoring the upstreams logic: https://youtu.be/hj7yzXb11jU

tl;dw, right now the upstreams are a static list, but the plan is to make it possible to have a dynamic list of upstreams, and the source could be whatever (you could write a custom module to provide the list on the fly, via SRV, or maybe fetch from HTTP and cache it for a few seconds, I dunno, whatever you like).

Point of note @mholt: for this to work, though, it would need to fetch the list of upstreams on every retry iteration, not just once before the loop.

Oh I've just had the most horrible idea... could something like this work?

The idea being that I can change the definition of pod1-actual.example.com and refresh the configuration and get the behaviour I'm looking for!

Hmm, I don't think so, because Caddy still needs to load a new config, and the new one won't take effect while there are still pending requests. Any config change replaces the entire server's config; it's not targeted.

simonw (Author) commented Nov 25, 2021

I tried that locally using this:

{
    auto_https off
}
:80 {
    reverse_proxy :8991 {
        lb_try_duration 30s
        lb_try_interval 1s
    }
}
:8991 {
    reverse_proxy localhost:8003
}

But sadly it doesn't seem to work - when I shut down the 8003 server I start getting errors straight away, with a log line that looks like this:

2021/11/25 01:42:17.987 ERROR http.log.error dial tcp [::1]:8003: connect: connection refused {"request": {"remote_addr": "127.0.0.1:60384", "proto": "HTTP/1.1", "method": "GET", "host": "localhost", "uri": "/", "headers"

simonw (Author) commented Nov 25, 2021

(I bet I could get this to work by running two instances of Caddy though...)

francislavoie (Member) commented Nov 25, 2021

FWIW, most of the time lb_try_duration works best when there's more than one upstream configured. You could use lb_policy first, which makes all requests go to the first upstream; if the first goes down, requests are routed to a second backup upstream while you fix the first and bring it back online, and then you're free to update the backup, etc. Doing it in steps makes it smoother, with little to no perceptible lag for the users.
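For example, something roughly like this (localhost:8004 is a placeholder for the standby upstream):

:80 {
    reverse_proxy localhost:8003 localhost:8004 {
        # "first" always picks the first available upstream in the order listed
        lb_policy first
        lb_try_duration 30s
        lb_try_interval 1s
    }
}

With that in place, requests go to localhost:8003 while it's up and only fall over to localhost:8004 when it isn't.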

(I bet I could get this to work by running two instances of Caddy though...)

Yeah that's true, probably.

mholt (Member) commented Nov 25, 2021

I wonder if an API endpoint that just adds/removes backends without a config reload could be helpful.

The other piece of this would be that we'd have to get the list of upstreams in each iteration of the for loop instead of just once before the loop. Or maybe we'd only have to do that if the list of upstreams changed since that iteration started. Hmmm.

simonw (Author) commented Nov 25, 2021

I wonder if an API endpoint that just adds/removes backends without a config reload could be helpful.

I would definitely find that useful!

@princemaple

Hmm, silly question: why not route through services instead of directly to pods?

simonw (Author) commented Nov 25, 2021

Hmm, silly question: why not route through services instead of directly to pods?

I'm still getting my head around Kubernetes terminology, but the root challenge I have here is that, because my applications use SQLite for persistent data, the pods/containers/whatever are stateful: I can't do the normal Kubernetes thing of bringing up a new version, serving traffic from both versions for a few seconds, and scaling down the old version.

Even if I wasn't using such an unconventional persistence mechanism, I'd still be interested in holding traffic like this - for things like running expensive database schema changes, where having ten seconds without any traffic can help me finish a major upgrade without any visible downtime.

@princemaple

You can still do the same if you use a service on top of your pods. The only difference is that K8s would be the one resolving the IP & port for you. Caddy can simply proxy to the service address without knowing which pod is behind it.

When you bring down your only pod, the service is effectively down, and Caddy can do its normal retries until a new pod is online, i.e. the service is back up.
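As a rough sketch, assuming Caddy runs inside the cluster and the Service is named pod1 in the default namespace (the DNS name is whatever your cluster actually assigns):

pod1.example.com:80 {
    # The Service address stays stable while pods come and go,
    # so no Caddy reload is needed when a pod is replaced.
    reverse_proxy pod1.default.svc.cluster.local:80 {
        lb_try_duration 30s
        lb_try_interval 1s
    }
}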

simonw (Author) commented Nov 25, 2021

Oh I see what you mean! Yes that's a fantastic idea, I shall try that.

@princemaple

The issue is still valid though. You'd want to proxy directly to pods in many scenarios, so you can take advantage of lb_policy and others.

mholt added this to the v2.5.0 milestone Dec 2, 2021
mholt added a commit that referenced this issue Dec 8, 2021
Also get upstreams at every retry loop iteration instead of just once
before the loop. See #4442.
mholt (Member) commented Dec 13, 2021

I've implemented the getting of upstreams "per retry" in #4470. The actual API endpoint to adjust the upstreams specifically will have to come in a future PR.

francislavoie pushed a commit that referenced this issue Feb 21, 2022
Also get upstreams at every retry loop iteration instead of just once
before the loop. See #4442.
mholt added a commit that referenced this issue Mar 7, 2022
* reverseproxy: Begin refactor to enable dynamic upstreams

Streamed here: https://www.youtube.com/watch?v=hj7yzXb11jU

* Implement SRV and A/AAA upstream sources

Also get upstreams at every retry loop iteration instead of just once
before the loop. See #4442.

* Minor tweaks from review

* Limit size of upstreams caches

* Add doc notes deprecating LookupSRV

* Provision dynamic upstreams

Still WIP, preparing to preserve health checker functionality

* Rejigger health checks

Move active health check results into handler-specific Upstreams.

Improve documentation regarding health checks and upstreams.

* Deprecation notice

* Add Caddyfile support, use `caddy.Duration`

* Interface guards

* Implement custom resolvers, add resolvers to http transport Caddyfile

* SRV: fix Caddyfile `name` inline arg, remove proto condition

* Use pointer receiver

* Add debug logs

Co-authored-by: Francis Lavoie <lavofr@gmail.com>
mholt modified the milestones: v2.5.0, 2.x Mar 8, 2022
gc-ss commented Mar 28, 2022

I've implemented the getting of upstreams "per retry" in #4470. The actual API endpoint to adjust the upstreams specifically will have to come in a future PR.

I would like to understand the impact of #4470 with respect to @simonw's OP.

So, yes, a Kubernetes service would be a good, albeit extremely heavyweight, way to solve the problem he has. That said, I would argue that in the Kubernetes world, cert-manager, Istio/Envoy, etc. would provide all the benefits (and more) of Caddy.

In a simpler world, where I do not run Kubernetes and Caddy is thus extremely valuable, I am picturing how #4470 helps here:

  1. @simonw updates another SRV record to announce localhost:8004
  2. lb_policy first, which had noticed that localhost:8003 went down, now notices that localhost:8004 is up and redirects traffic there
  3. The reason this works is that @simonw had preemptively published SRV records announcing localhost:8001-8010 at Caddy startup, assuming he knows his (new) services will open those ports now and in the future (my suggestion to him to try, if that can be the case). The LB will automagically ignore ports that aren't reachable yet, but once they become reachable it will redirect traffic there

Let me know if I'm missing any detail

mholt (Member) commented Mar 28, 2022

@gc-ss Yeah, so there's a couple of ways to handle those kinds of situations:

  • If you know the addresses when you load the config, you can just put them all in the config and the load balancer will skip the unavailable backends (once it notices they are unavailable; this often happens pretty quickly but in some traffic loads this is not ideal).
  • You can have the dynamic upstreams feature do an SRV lookup at every single iteration (i.e. no caching) and then simply add the host to the SRV record when it is up, and remove it when it is down. You don't need to pre-load them into the SRV records.

(Although, upon inspection, I didn't actually implement a way to disable the caching of the SRV results, but you can set a Refresh value of 1 nanosecond which will probably do the trick in the meantime.)
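In Caddyfile terms that would look roughly like this (the SRV name is hypothetical, and refresh 1ns is just the no-caching workaround mentioned above):

:80 {
    reverse_proxy {
        dynamic srv _http._tcp.pod1.example.internal {
            refresh 1ns
        }
        lb_try_duration 30s
        lb_try_interval 1s
    }
}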

francislavoie (Member)

@simonw does the Dynamic Upstreams feature solve your use case? https://caddyserver.com/docs/caddyfile/directives/reverse_proxy#dynamic-upstreams

Also worth noting, we just merged #4756, which adds lb_retries, i.e. a number of retries to perform. Probably not directly useful for you here, but I wanted to mention it because this issue is related to retries.
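For reference, that would look something like this in a Caddyfile (the retry count is arbitrary):

:80 {
    reverse_proxy localhost:8003 {
        lb_retries 10
        lb_try_interval 1s
    }
}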

I think we can probably close this issue now.

mholt closed this as completed Jul 14, 2022