
lb_policy 'first' never skipping to second upstream #4432

Closed
FunDeckHermit opened this issue Nov 22, 2021 · 4 comments

Labels: question ❔ Help is being requested

Comments

FunDeckHermit commented Nov 22, 2021

I have the following config:

*.example.com {
    tls {
        on_demand
    }
    reverse_proxy {
        to 10.14.0.6:800 10.14.0.5:8080
        lb_policy first
        lb_try_duration 50ms
        lb_try_interval 50ms
    }
}

The first upstream is sometimes gone for periods of time, and I would like Caddy to use the second as a fallback. Whatever settings I use and whatever duration/interval I choose (including 0ms), it never forwards to 10.14.0.5:8080.

The browser just displays "HTTP Error 502".

francislavoie (Member)

See #4245 for an explanation of how it works.

You need to set a long enough lb_try_duration for Caddy to have a chance to try more than once, and you need to set dial_timeout short enough (less than the try duration) that Caddy gives up on the first upstream before the retry window runs out.

You should also probably enable passive and/or active health checks, which let you avoid relying on the dial_timeout on every request (your first backend is skipped immediately if health checks say it's likely still down).
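
For example, something roughly like this (the timing values here are only illustrative, not tested recommendations):

*.example.com {
    # tls on_demand omitted for brevity
    reverse_proxy {
        transport http {
            # must be shorter than lb_try_duration so a failed dial doesn't eat the whole retry window
            dial_timeout 1s
        }
        to 10.14.0.6:800 10.14.0.5:8080
        lb_policy first
        # keep retrying upstreams for up to 5s per request
        lb_try_duration 5s
        lb_try_interval 250ms
    }
}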

francislavoie added the question ❔ Help is being requested label Nov 22, 2021

FunDeckHermit commented Nov 22, 2021

Ah, that explains a lot.

Using #4245 as a base, it still doesn't select the next upstream. I have tried every combination of dial_timeout, lb_try_duration, and lb_try_interval imaginable. Does the on_demand TLS interfere with the timeout?

# Still not working example
*.example.com {
    tls {
        on_demand
    }
    reverse_proxy {
        transport http {
            dial_timeout 2s
        }
        to 10.14.0.6:800 10.14.0.5:8080
        lb_policy first
        lb_try_duration 10s
        lb_try_interval 4s
    }
}

francislavoie (Member) commented Nov 23, 2021

Does the on_demand TLS interfere with the timeout?

No, that's orthogonal.

Please enable the debug global option and watch the logs when making a request, to see which backends it's trying.

I think you probably have to enable passive health checks via fail_duration for it to work, though, because otherwise the first policy might never mark the first upstream as unhealthy. I'll need to double-check.
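
For reference, a minimal sketch of both suggestions (the 30s fail_duration is just an example value, not a recommendation):

{
    # global options block; must be at the top of the Caddyfile
    debug
}

*.example.com {
    # tls config omitted for brevity
    reverse_proxy {
        to 10.14.0.6:800 10.14.0.5:8080
        lb_policy first
        # passive health check: after a failed request, treat that upstream as down for 30s
        fail_duration 30s
    }
}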

FunDeckHermit (Author)

After adding fail_duration and some fiddling with the settings, I got it working with this config:

# Working example
*.example.com {
    tls {
        on_demand
    }
    reverse_proxy {
        transport http {
            dial_timeout 600ms
        }
        to 10.14.0.6 10.14.0.5:8080
        lb_policy first
        lb_try_duration 3s
        lb_try_interval 1s
        fail_duration 20s
    }
}

Not sure about the optimal values, but this feels quite snappy to me.
