
lb_try_interval/lb_try_duration do not pick up new backends on config reload #4442

Closed · simonw opened this issue Nov 25, 2021 · 18 comments
Labels: feature ⚙️ New feature or request

simonw commented Nov 25, 2021

Given the following Caddyfile:

{
    auto_https off
}
:80 {
    reverse_proxy localhost:8003 {
        lb_try_duration 30s
        lb_try_interval 1s
    }
}

I start Caddy with caddy run and run a separate server on port 8003 (datasette -p 8003 in my case), and it proxies correctly. If I shut down the 8003 server and try to hit http://localhost/ I get the desired behaviour: my browser spins for up to 30s, and if I restart the 8003 server during that time the request is proxied through and returned from the backend.

What I'd really like to be able to do, though, is start up a new server on another port (in production, on another IP/port combination) and have traffic resume against the new server.

So I tried editing the Caddyfile to use localhost:8004 instead, started up my backend on port 8004, then used caddy reload to load in the new configuration... and my request to port 80 continued to spin. It appears Caddy didn't notice that there was now a new backend configuration for this reverse_proxy.

It would be really cool if that lb_try_interval/lb_try_duration feature could respond to updated configurations and seamlessly forward paused traffic to the new backend.

(This started as a Twitter conversation: https://twitter.com/mholt6/status/1463656086360051714)

mholt added the feature ⚙️ New feature or request label Nov 25, 2021
mholt (Member) commented Nov 25, 2021

Thanks for the interesting feature request. First I've heard of one like this! As mentioned on Twitter, I'm currently refactoring the reverse proxy so maybe you're in luck. But I'll have to design a solution first and see if it'll fit.

In the meantime, do you know the new backend addresses ahead of time? You could configure all the future backends that might be used, along with a few lines of health check configuration to ensure they're skipped from the rotation until they're available.
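For illustration, a rough sketch of what that could look like, adapted from your Caddyfile (localhost:8004 and localhost:8005 are placeholders for whatever the future backends would be, and the health check path and interval are just assumptions):

:80 {
    reverse_proxy localhost:8003 localhost:8004 localhost:8005 {
        lb_try_duration 30s
        lb_try_interval 1s

        # Active health checks mark unreachable backends as unhealthy
        # so the load balancer skips them until they start responding.
        health_uri /
        health_interval 2s
    }
}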

The way it works now: Config reloads are graceful, but they're also isolated, meaning that they won't affect anything from a previous config, including active connections. So Caddy does "notice" that there's a new backend configuration, but it won't upset existing connections.

simonw (Author) commented Nov 25, 2021

The way I'm currently planning to build this, I don't think I'll know the backend addresses ahead of time, sadly.

My current plan is to build with Kubernetes. I want to be able to do the following:

  • Launch a pod, and have pod1.example.com route traffic to that pod
  • Shut down that pod for replacement, such that any HTTP requests to pod1.example.com "pause" awaiting an available backend
  • Launch the replacement pod, which could have a totally different internal IP and port within the Kubernetes cluster
  • Reconfigure Caddy such that pod1.example.com points to the new pod
  • ... and ideally such that any paused HTTP requests are load-balanced through to the new backend, seamlessly, with the users only experiencing a few extra seconds of load time

simonw (Author) commented Nov 25, 2021

Oh I've just had the most horrible idea... could something like this work?

{
    auto_https off
}
pod1.example.com:80 {
    reverse_proxy pod1-actual.example.com:80 {
        lb_try_duration 30s
        lb_try_interval 1s
    }
}

pod1-actual.example.com:80 {
    reverse_proxy localhost:8003
}

The idea being that I can change the definition of pod1-actual.example.com and refresh the configuration and get the behaviour I'm looking for!

francislavoie (Member)

If you feel like watching Matt talk about it for 50 minutes 😅, he streamed the beginning of the work on refactoring the upstreams logic: https://youtu.be/hj7yzXb11jU

tl;dw, right now the upstreams are a static list, but the plan is to make it possible to have a dynamic list of upstreams, and the source could be whatever (you could write a custom module to provide the list on the fly, via SRV, or maybe fetch from HTTP and cache it for a few seconds, I dunno, whatever you like).

Point of note @mholt: for this to work, though, it would need to fetch the list of upstreams on every retry iteration, not just once before the loop.

Oh I've just had the most horrible idea... could something like this work?

The idea being that I can change the definition of pod1-actual.example.com and refresh the configuration and get the behaviour I'm looking for!

Hmm, I don't think so, because Caddy still needs to load a new config, and the new one won't take effect while there are still pending requests. Any config change replaces the entire server's config; it's not targeted.

simonw (Author) commented Nov 25, 2021

I tried that locally using this:

{
    auto_https off
}
:80 {
    reverse_proxy :8991 {
        lb_try_duration 30s
        lb_try_interval 1s
    }
}
:8991 {
    reverse_proxy localhost:8003
}

But sadly it doesn't seem to work - when I shut down the 8003 server I start getting errors straight away, with a log line that looks like this:

2021/11/25 01:42:17.987 ERROR http.log.error dial tcp [::1]:8003: connect: connection refused {"request": {"remote_addr": "127.0.0.1:60384", "proto": "HTTP/1.1", "method": "GET", "host": "localhost", "uri": "/", "headers"

simonw (Author) commented Nov 25, 2021

(I bet I could get this to work by running two instances of Caddy though...)

francislavoie (Member) commented Nov 25, 2021

FWIW, most of the time lb_try_duration works best when there's more than one upstream configured. You could use lb_policy first, which makes all requests go to the first upstream; if the first goes down, requests are routed to a second backup upstream while you fix the first and bring it back online, and then you're free to update the backup, etc. Doing it in steps makes it smoother, with little to no perceptible lag for the users.
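For example, something roughly like this (localhost:8004 is a placeholder for the standby upstream):

:80 {
    reverse_proxy localhost:8003 localhost:8004 {
        # "first" always picks the first available upstream in the order listed
        lb_policy first
        lb_try_duration 30s
        lb_try_interval 1s
    }
}

With that in place, requests go to localhost:8003 while it's up and only fall over to localhost:8004 when it isn't.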

(I bet I could get this to work by running two instances of Caddy though...)

Yeah that's true, probably.

mholt (Member) commented Nov 25, 2021

I wonder if an API endpoint that just adds/removes backends without a config reload could be helpful.

The other piece of this would be that we'd have to get the list of upstreams in each iteration of the for loop instead of just once before the loop. Or maybe we'd only have to do that if the list of upstreams changed since that iteration started. Hmmm.

simonw (Author) commented Nov 25, 2021

I wonder if an API endpoint that just adds/removes backends without a config reload could be helpful.

I would definitely find that useful!

@princemaple

Hmm, silly question: why not route through services instead of directly to pods?

simonw (Author) commented Nov 25, 2021

Hmm, silly question: why not route through services instead of directly to pods?

I'm still getting my head around Kubernetes terminology, but the root challenge I have here is that, because my applications use SQLite for persistent data, the pods/containers/whatever are stateful: I can't do the normal Kubernetes thing of bringing up a new version, serving traffic from both versions for a few seconds, and scaling down the old version.

Even if I wasn't using such an unconventional persistence mechanism, I'd still be interested in holding traffic like this - for things like running expensive database schema changes, where having ten seconds without any traffic can help me finish a major upgrade without any visible downtime.

@princemaple

You can still do the same if you use a service on top of your pods. The only difference is that K8s would be the one resolving the IP & port for you. Caddy can simply proxy to the service address without knowing which pod is behind it.

When you bring down your only pod, the service is effectively down, and Caddy can do its normal retries until a new pod is online, i.e. the service is back up.
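As a rough sketch, assuming Caddy runs inside the cluster and the Service is named pod1 in the default namespace (the DNS name is whatever your cluster actually assigns):

pod1.example.com:80 {
    # The Service address stays stable while pods come and go,
    # so no Caddy reload is needed when a pod is replaced.
    reverse_proxy pod1.default.svc.cluster.local:80 {
        lb_try_duration 30s
        lb_try_interval 1s
    }
}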

simonw (Author) commented Nov 25, 2021

Oh I see what you mean! Yes that's a fantastic idea, I shall try that.

@princemaple

The issue is still valid though. You'd want to proxy directly to pods in many scenarios, so you can take advantage of lb_policy and others.

mholt added this to the v2.5.0 milestone Dec 2, 2021
mholt added a commit that referenced this issue Dec 8, 2021
Also get upstreams at every retry loop iteration instead of just once
before the loop. See #4442.
mholt (Member) commented Dec 13, 2021

I've implemented the getting of upstreams "per retry" in #4470. The actual API endpoint to adjust the upstreams specifically will have to come in a future PR.

francislavoie pushed a commit that referenced this issue Feb 21, 2022
Also get upstreams at every retry loop iteration instead of just once
before the loop. See #4442.
mholt added a commit that referenced this issue Mar 7, 2022
* reverseproxy: Begin refactor to enable dynamic upstreams

Streamed here: https://www.youtube.com/watch?v=hj7yzXb11jU

* Implement SRV and A/AAA upstream sources

Also get upstreams at every retry loop iteration instead of just once
before the loop. See #4442.

* Minor tweaks from review

* Limit size of upstreams caches

* Add doc notes deprecating LookupSRV

* Provision dynamic upstreams

Still WIP, preparing to preserve health checker functionality

* Rejigger health checks

Move active health check results into handler-specific Upstreams.

Improve documentation regarding health checks and upstreams.

* Deprecation notice

* Add Caddyfile support, use `caddy.Duration`

* Interface guards

* Implement custom resolvers, add resolvers to http transport Caddyfile

* SRV: fix Caddyfile `name` inline arg, remove proto condition

* Use pointer receiver

* Add debug logs

Co-authored-by: Francis Lavoie <lavofr@gmail.com>
mholt modified the milestones: v2.5.0, 2.x Mar 8, 2022
gc-ss commented Mar 28, 2022

I've implemented the getting of upstreams "per retry" in #4470. The actual API endpoint to adjust the upstreams specifically will have to come in a future PR.

I would like to understand the impact of #4470 with respect to @simonw's OP.

So, yes, a Kubernetes service would be a good, albeit extremely heavyweight, way to solve the problem he has. That said, I would argue that in the Kubernetes world, cert-manager, Istio/Envoy, etc. would provide all the benefits (and more) of Caddy.

In a simpler world, where I do not run Kubernetes and Caddy is thus extremely valuable, I am picturing how #4470 helps here:

  1. @simonw updates another SRV record to announce localhost:8004
  2. lb_policy first, which had noticed that localhost:8003 went down, now notices that localhost:8004 is up and redirects traffic there
  3. The reason this works is that @simonw had preemptively published SRV records announcing localhost:8001-8010 at Caddy startup, assuming he knows his (new) services will open those ports now and in the future (my suggestion to him to try, if that can be the case). The LB will automagically ignore ports that aren't reachable yet, but once they become reachable it will redirect traffic there

Let me know if I'm missing any detail

mholt (Member) commented Mar 28, 2022

@gc-ss Yeah, so there's a couple of ways to handle those kinds of situations:

  • If you know the addresses when you load the config, you can just put them all in the config and the load balancer will skip the unavailable backends (once it notices they are unavailable; this often happens pretty quickly but in some traffic loads this is not ideal).
  • You can have the dynamic upstreams feature do an SRV lookup at every single iteration (i.e. no caching) and then simply add the host to the SRV record when it is up, and remove it when it is down. You don't need to pre-load them into the SRV records.

(Although, upon inspection, I didn't actually implement a way to disable the caching of the SRV results, but you can set a Refresh value of 1 nanosecond which will probably do the trick in the meantime.)
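In Caddyfile terms that would look roughly like this (the SRV name is hypothetical, and refresh 1ns is just the no-caching workaround mentioned above):

:80 {
    reverse_proxy {
        dynamic srv _http._tcp.pod1.example.internal {
            refresh 1ns
        }
        lb_try_duration 30s
        lb_try_interval 1s
    }
}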

francislavoie (Member)

@simonw does the Dynamic Upstreams feature solve your use case? https://caddyserver.com/docs/caddyfile/directives/reverse_proxy#dynamic-upstreams

Also worth noting, we just merged #4756, which adds lb_retries, i.e. a number of retries to perform. Probably not directly useful for you here, but I wanted to mention it because this issue is related to retries.
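For reference, that would look something like this in a Caddyfile (the retry count is arbitrary):

:80 {
    reverse_proxy localhost:8003 {
        lb_retries 10
        lb_try_interval 1s
    }
}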

I think we can probably close this issue now.

mholt closed this as completed Jul 14, 2022