Fail-safe scale-up when metrics plugin returns no data #622

Open
erulabs opened this issue Mar 8, 2023 · 4 comments

erulabs commented Mar 8, 2023

You may be aware that datadog is experiencing a large outage today. This means that nomad-autoscaler, when using datadog as a source, is unable to collect any metrics.

There is currently an on_error option to either ignore or fail (ignore moves on to other checks, while fail stops all actions). It seems to me a third option would be reasonable here, perhaps called scale.

The idea would be that a check which fails to get any metrics, and which is set to on_error = "scale", is considered active. In the example below, if datadog goes offline or no metrics are reported, the nomad-autoscaler would trigger a scale-up and add additional instances according to the check's delta.

policy {
  check "api_needs_uppies" {
    source = "datadog"
    on_error = "scale"
    query = "ewma_3(avg:api.concurrent_request_per_container.poll)"
    query_window = "5m"
    group = "concurrent_reqs"
    strategy "threshold" {
      lower_bound = 3
      delta = 10
    }
  }
  check "api_needs_downies" {
    source = "datadog"
    on_error = "ignore"
    query = "ewma_3(avg:api.concurrent_request_per_container.poll)"
    query_window = "10m"
    group = "concurrent_reqs"
    strategy "threshold" {
      upper_bound = 1.5
      delta = -4
    }
  }
}

The end result is that if our metrics become unavailable, we fail safe and scale up towards our max.
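
To sketch what I mean, here's roughly how the error branch could look inside the autoscaler (a hypothetical Go sketch, not the actual nomad-autoscaler code; the check and checkResult types and the handleQueryError function are made up for illustration):

// Hypothetical sketch of how a check handler could treat the proposed
// on_error = "scale" value. Types and names are illustrative only and
// do not match the real nomad-autoscaler internals.
package main

import (
  "errors"
  "fmt"
)

type checkResult struct {
  scale bool  // whether this check wants to trigger its strategy
  delta int64 // instances to add (negative to remove)
}

type check struct {
  name    string
  onError string // "ignore", "fail", or the proposed "scale"
  delta   int64  // delta from the check's strategy block
}

// handleQueryError decides what to do when the metrics source returns
// no data or an error.
func handleQueryError(c check, queryErr error) (checkResult, error) {
  switch c.onError {
  case "ignore":
    // Skip this check and move on to the others.
    return checkResult{}, nil
  case "scale":
    // Proposed behaviour: treat the check as active and apply its
    // strategy once per failed evaluation (still clamped by the
    // policy's max and cooldown elsewhere in the evaluation).
    return checkResult{scale: true, delta: c.delta}, nil
  default: // "fail"
    return checkResult{}, fmt.Errorf("check %q failed: %w", c.name, queryErr)
  }
}

func main() {
  c := check{name: "api_needs_uppies", onError: "scale", delta: 10}
  res, err := handleQueryError(c, errors.New("datadog: no data returned"))
  fmt.Println(res, err) // {true 10} <nil>
}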

Any interest in this feature? I will probably write this for our purposes over at @classdojo, but I suspect it would be a good mainline feature for nomad-autoscaler in general!

lgfa29 (Contributor) commented Mar 10, 2023

Thanks for the suggestion and PR @erulabs!

Having an APM outage does sound like a huge pain... but I'm not sure what the safe option would be in this case 🤔

Since this is (hopefully) an unexpected situation, I worry that it may not be immediately obvious what the consequences of using on_error = "scale" would be, causing an avalanche of scale-ups when your APM does go down.

It would also be good to have some way of distinguishing between a long-term failure and a short blip. We may not want to fully scale on the first error.

This all leads me to believe that we would need a more advanced way of configuring this, but I don't have any good ideas right now 😞

The first thing that pops into my head is a special check block that is used when all previous checks fail. So something like:

policy {
  check "api_needs_uppies" {
    # ...
  }

  check "api_needs_downies" {
    # ...
  }

  error_handler {
    failures_before_triggering = 10
    strategy "fixed-value" {
      value = 5
    }
  }
}

That block wouldn't take any source or query value and would just return a specific count (or use any other strategy that doesn't rely on an APM), but it's kind of a weird one 😅
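
To make the failures_before_triggering idea concrete, here is a rough Go sketch of the counting logic (purely hypothetical; the errorHandler type and observe method don't exist in the autoscaler today):

// Hypothetical sketch of the failures_before_triggering gating described
// above. Names are made up for illustration.
package main

import "fmt"

type errorHandler struct {
  failuresBeforeTriggering int
  consecutiveFailures      int
  fixedValue               int64 // the fixed-value strategy's target count
}

// observe records the outcome of a policy evaluation and returns a
// target count (and true) only once all checks have failed enough
// times in a row, so a single blip is absorbed.
func (h *errorHandler) observe(allChecksFailed bool) (int64, bool) {
  if !allChecksFailed {
    h.consecutiveFailures = 0
    return 0, false
  }
  h.consecutiveFailures++
  if h.consecutiveFailures < h.failuresBeforeTriggering {
    return 0, false
  }
  return h.fixedValue, true
}

func main() {
  h := &errorHandler{failuresBeforeTriggering: 10, fixedValue: 5}
  for i := 0; i < 12; i++ {
    if count, trigger := h.observe(true); trigger {
      // Triggers only from the 10th consecutive failed evaluation onward.
      fmt.Printf("evaluation %d: scale to %d\n", i+1, count)
    }
  }
}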

@jrasell do you have any thoughts on this one?

erulabs (Author) commented Mar 13, 2023

@lgfa29 Interesting idea! I think failures_before_triggering would be nice, but IMO scaling up even on a blip of missing metrics is not the worst-case scenario - one assumes a sane cooldown and max would prevent a runaway scale-up.

Certainly your solution is more elegant, although would you be able to say "error_handler -> strategy -> scale up by delta/percentage" with that pattern? Re: "We may not want to fully scale on first error", my idea was that on_error = "scale" would only scale by the selected strategy, as if the check had returned in range. That is, using delta or percentage, a failed metric check wouldn't fully scale; it would just scale once per failure.

I think I still prefer the simpler fail-safe "scale-up-on-error" - although I must admit it might be somewhat enterprise-y!

lgfa29 (Contributor) commented Mar 17, 2023

one assumes a sane cooldown and max would prevent a run-away scale-up.

For a handful of policies I would agree, but my concern is for deployments with several jobs, created by different teams, where the sum of the max values may be larger than the cluster can support. The APM is the common piece among them, and if it goes down we can have cascading failures.

would you be able to say "error_handler -> strategy -> scale-up by delta/percentage" with that pattern?

Not right now, but I think it's a fairly simple strategy to implement 🤔

It would be similar to fixed-value, but it would take a relative value, like a delta or percentage as you mentioned, and add or multiply that against the current count.

So the policy would look something like this:

policy {
  check "api_needs_uppies" {
    # ...
  }

  check "api_needs_downies" {
    # ...
  }

  error_handler {
    failures_before_triggering = 10
    strategy "relative-value" {
      delta = 1 # Add 1 new instance if all checks fail.
    }
  }
}

Depending on your evaluation interval and how long the outage lasts, this policy can give you a more controlled scale-up and maybe help with the thundering herd problem.

    |
max |_ _ _ _ _ _ _ _  ________________
    |      __________|                |
    |_____|                           |__________
    |     
    +----------------------------------------------
          |___________________________|
                      APM down
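
For what it's worth, the core calculation of such a relative-value strategy could be as small as this (a hypothetical Go sketch; the relativeValueConfig type and run function are made up and this isn't the real strategy plugin interface):

// Hypothetical core of a "relative-value" strategy: apply a delta or a
// percentage to the current count, clamped to the policy's max.
package main

import (
  "fmt"
  "math"
)

type relativeValueConfig struct {
  delta      int64   // e.g. 1 adds one instance per triggered evaluation
  percentage float64 // e.g. 0.2 grows the group by 20%; 0 means unused
}

// run computes the new desired count from the current one. Clamping to
// maxCount means repeated failures cannot run away past the policy max.
func run(cfg relativeValueConfig, current, maxCount int64) int64 {
  next := current + cfg.delta
  if cfg.percentage != 0 {
    next = int64(math.Ceil(float64(current) * (1 + cfg.percentage)))
  }
  if next > maxCount {
    next = maxCount
  }
  return next
}

func main() {
  cfg := relativeValueConfig{delta: 1}
  fmt.Println(run(cfg, 4, 20)) // 5: one new instance per failed evaluation
}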

Would this match the approach you were thinking about?

erulabs (Author) commented May 2, 2023

@lgfa29 Yes, I believe the error_handler trigger would be exactly what we're looking for, in order to "fail open" as it were. What would it take to implement this? I'd love to help!

Because we're hosted on AWS, and because the outage that triggered this investigation was datadog's, we're also looking into https://github.com/lob/nomad-autoscaler-cloudwatch-apm
