Delay in Caddy using changes to reverse_proxy upstream #6195

Closed

paularlott opened this issue Mar 27, 2024 · 9 comments · Fixed by #6202 · May be fixed by quic-go/quic-go#4407 or #6213
Labels
upstream ⬆️ Relates to some dependency of this project

Comments

@paularlott

I've written a plugin for Caddy which watches a Nomad/Consul cluster and builds ingress routes to our cluster based on that data; essentially it generates a collection of reverse_proxy statements pointing at the running containers.

However, I've noticed that after reloading the Caddyfile, existing clients continue to use the previous configuration for 2-3 minutes, or until the browser cache is cleared.

Initially I thought this was something I'd done wrong in my code, so I extracted a minimal Caddyfile and applied it manually against a stock version of Caddy (2.7.6 running in Docker), with the same results.

Initially I might have a reverse_proxy block such as:

reverse_proxy {
  dynamic srv {
    name php-pool-1.service.consul
    refresh 10s
    dial_timeout 5s
  }
  import reverseProxyConfig
  transport http {
    keepalive off
  }
}

After a deployment, the php-pool-1 name changes, e.g.:

reverse_proxy {
  dynamic srv {
    name php-pool-2.service.consul
    refresh 10s
    dial_timeout 5s
  }
  import reverseProxyConfig
  transport http {
    keepalive off
  }
}

If I start browsing the site I can see Caddy logging that it's querying DNS for php-pool-1.service.consul, which is correct.

If I now reload with a new Caddyfile that changes the upstream to php-pool-2.service.consul, I can see Caddy logging that it's still querying DNS for php-pool-1.service.consul rather than the new php-pool-2.service.consul. It keeps using the old upstream until I either clear the browser cache or wait 2-3 minutes, after which it moves to php-pool-2.service.consul.

If I use a different browser to connect directly after the reload, that connection uses php-pool-2.service.consul, which is what I expect.
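
For reference, testing with curl instead of a browser opens a fresh connection for each request, so it should always reach the upstream from the currently loaded config (hostname as used in the Caddyfile below; curl assumed to be available on the host):

# each invocation opens a new connection, so connection reuse can't mask the reload
curl -skv --http1.1 https://ingresstest.fortix.systems/ -o /dev/null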

The minimal Caddyfile I have is:

{
  admin localhost:2019

  log {
    output stdout
    level DEBUG
    format console
  }

  grace_period 3s
  debug
}

(reverseProxyConfig) {
  header_up +X_FORWARDED_PORT 443
  lb_policy least_conn
  lb_try_duration 5s
  lb_try_interval 250ms
  fail_duration 2s
  unhealthy_status 5xx
}

(logsConfig) {
  log {
    output stdout
    level DEBUG
    format console
  }
}

# Health Check
:8080 {
  respond /health-check OK 200
}

*.fortix.systems {
  import logsConfig
  encode zstd gzip

  @wildcard_86 host ingresstest.fortix.systems
  handle @wildcard_86 {
    reverse_proxy {
      dynamic srv {
        name php-pool-46db85dd489abc89543ff3ff288c7975.service.consul
        refresh 10s
        dial_timeout 5s
      }
      import reverseProxyConfig
      transport http {
        keepalive off
      }
    }
  }

  handle {
    abort
  }
}

From the documentation (https://caddyserver.com/docs/caddyfile/options#grace-period), including grace_period should force connections to close after the configured period (3s in my case), but that doesn't seem to be happening, so I'm guessing I'm missing something obvious here.
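
For completeness, the active config can be queried via the admin API after a reload to confirm that the new SRV name really has been loaded (admin endpoint as configured above; grep is only used here to pick out the relevant lines):

curl -s localhost:2019/config/ | grep php-pool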

Thank you for any guidance

@mholt
Member

mholt commented Mar 27, 2024

Hmmm, thanks for the report. I'll try to look into this soon.

@mholt
Member

mholt commented Mar 28, 2024

What is in your logs before, during, and after a config change?

I would expect the server socket to shut down after the grace period and sever existing connections to clients.

As for the dynamic upstreams, they are global and persist through config reloads, but they are keyed by the SRV address, so if those change, the old ones should no longer be used, I think.
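
If it helps narrow things down, it may also be worth checking the SRV lookup outside of Caddy, for example against Consul's DNS interface (assuming a local agent listening on the default DNS port 8600):

dig @127.0.0.1 -p 8600 +short SRV php-pool-2.service.consul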

@paularlott
Author

I've attached a dump of the logs (log.gz). I started Caddy with:

docker run --rm --network host --privileged --cap-add NET_ADMIN -v /root/config/Caddyfile:/etc/caddy/Caddyfile -v /root/config/data:/data caddy:2.7.6-alpine > log 2> error_log

I accessed a page on the site, then entered the container, modified the Caddyfile, and ran caddy reload --config /etc/caddy/Caddyfile

The initial service record was php-pool-dda10003a747409ee137207837a8eae7.service.consul and I changed it to xxx-php-pool-dda10003a747409ee137207837a8eae7.service.consul

Reload happened at 2024/03/28 03:58:58.357

The old record was used until 2024/03/28 04:00:04.674, at which point requests failed because in this case the new upstream doesn't exist; the behaviour is the same whether or not the upstream exists. Sometimes it holds on longer, but usually 2-3 minutes.

Caddyfile used:

{
  admin localhost:2019

  log {
    output stdout
    level DEBUG
    format console
  }

  shutdown_delay 1s
  grace_period 3s
  debug
}

# Snippets

(reverseProxyConfig) {
  header_up +X_FORWARDED_PORT 443
  lb_policy least_conn
  lb_try_duration 5s
  lb_try_interval 250ms
  fail_duration 2s
  unhealthy_status 5xx

  health_uri /
  health_interval 10s
  health_timeout 5s
  health_status 5xx
}

(logsConfig) {
  log {
    output stdout
    level DEBUG
    format console
  }
}

# Health Check
:8080 {
  respond /health-check OK 200
}

ingresstest.fortix.systems {
  import logsConfig
  encode zstd gzip

  reverse_proxy {
    dynamic srv {
      name php-pool-dda10003a747409ee137207837a8eae7.service.consul
      refresh 10s
      dial_timeout 5s
    }
    import reverseProxyConfig
    transport http {
      versions 2
      keepalive off
      keepalive_interval 5s
      keepalive_idle_conns 2
      read_buffer 16KiB
      write_buffer 16KiB
    }
  }

}

Thanks

@WeidiDeng
Member

From the log I found that all the requests are HTTP/3. It's due to old h3 connections remaining active after a configuration reload. I'll see what I can do with quic-go, which Caddy depends on for HTTP/3.
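
In the meantime, a possible workaround is to disable HTTP/3 so that browsers fall back to h1/h2 connections, which do get severed on reload; something like this in the global options block (using the standard servers/protocols options):

{
  servers {
    protocols h1 h2
  }
}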

WeidiDeng added the upstream ⬆️ label Mar 29, 2024
@WeidiDeng
Member

@paularlott Can you try xcaddy build close-h3-connections to see if it's fixed?
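
(For anyone following along: that builds Caddy from the branch of that name, assuming xcaddy is installed, and the resulting binary can be run in place of the stock one.)

xcaddy build close-h3-connections
./caddy version
./caddy run --config /etc/caddy/Caddyfile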

@paularlott
Author

@WeidiDeng I've only had a chance for a quick test so far, but it looks like that resolves the issue, thank you.

@mholt
Member

mholt commented Mar 29, 2024

@WeidiDeng You are a WIZARD 🎉 🧙‍♂️ can't wait to review this 😃

@WeidiDeng
Member

@paularlott Do you mind giving this patch a go? This properly closes HTTP/3 servers without interrupting ongoing requests unless the grace period is reached.

@paularlott
Author

@WeidiDeng I've just built it and deployed it as the ingress controller for some test sites and it seems to be following the changes to the upstream SRV records as it should. I'll leave it running so we can give it a good test.

Thank you
