
lb_policy 'first' never skipping to second upstream #4432

Closed
FunDeckHermit opened this issue Nov 22, 2021 · 4 comments

Labels: question ❔ Help is being requested

Comments

FunDeckHermit commented Nov 22, 2021

I have the following config:

*.example.com {
    tls {
        on_demand
    }
    reverse_proxy {
        to 10.14.0.6:800 10.14.0.5:8080
        lb_policy first
        lb_try_duration 50ms
        lb_try_interval 50ms
    }
}

The first upstream is sometimes gone for periods of time, and I would like Caddy to use the second as a fallback. Whatever settings I use and whatever duration/interval I choose (including 0ms), it never forwards to 10.14.0.5:8080.

The browser just displays "HTTP Error 502".

francislavoie (Member)

See #4245 for an explanation of how it works.

You need to set a long enough lb_try_duration for Caddy to have a chance to try more than once, and you need to set dial_timeout short enough (less than the try duration) that Caddy gives up on the first upstream before the retry window runs out.

You should also probably enable passive and/or active health checks, which let you avoid relying on the dial_timeout on every request (your first backend is skipped immediately if health checks say it's likely still down).
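
For example, something roughly like this (the timing values here are only illustrative, not tested recommendations):

*.example.com {
    # tls on_demand omitted for brevity
    reverse_proxy {
        transport http {
            # must be shorter than lb_try_duration so a failed dial doesn't eat the whole retry window
            dial_timeout 1s
        }
        to 10.14.0.6:800 10.14.0.5:8080
        lb_policy first
        # keep retrying upstreams for up to 5s per request
        lb_try_duration 5s
        lb_try_interval 250ms
    }
}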

francislavoie added the question ❔ Help is being requested label Nov 22, 2021

FunDeckHermit commented Nov 22, 2021

Ah, that explains a lot.

Using #4245 as a base, it still doesn't select the next upstream. I have tried every combination of dial_timeout, lb_try_duration, and lb_try_interval imaginable. Does the on_demand TLS interfere with the timeout?

# Still not working example
*.example.com {
    tls {
        on_demand
    }
    reverse_proxy {
        transport http {
            dial_timeout 2s
        }
        to 10.14.0.6:800 10.14.0.5:8080
        lb_policy first
        lb_try_duration 10s
        lb_try_interval 4s
    }
}

francislavoie (Member) commented Nov 23, 2021

Does the on_demand TLS interfere with the timeout?

No, that's orthogonal.

Please enable the debug global option and watch the logs when making a request, to see which backends it's trying.

I think you probably have to enable passive health checks via fail_duration for it to work, though, because otherwise the first policy might never mark the first upstream as unhealthy. I'll need to double-check.
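
For reference, a minimal sketch of both suggestions (the 30s fail_duration is just an example value, not a recommendation):

{
    # global options block; must be at the top of the Caddyfile
    debug
}

*.example.com {
    # tls config omitted for brevity
    reverse_proxy {
        to 10.14.0.6:800 10.14.0.5:8080
        lb_policy first
        # passive health check: after a failed request, treat that upstream as down for 30s
        fail_duration 30s
    }
}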

FunDeckHermit (Author)

After adding fail_duration and some fiddling with the settings, I got it working with this config:

# Working example
*.example.com {
    tls {
        on_demand
    }
    reverse_proxy {
        transport http {
            dial_timeout 600ms
        }
        to 10.14.0.6 10.14.0.5:8080
        lb_policy first
        lb_try_duration 3s
        lb_try_interval 1s
        fail_duration 20s
    }
}

Not sure about the optimal values, but this feels quite snappy to me.
