Fail-safe scale-up when metrics plugin returns no data #622

Open
erulabs opened this issue Mar 8, 2023 · 4 comments

erulabs commented Mar 8, 2023

You may be aware that datadog is experiencing a large outage today. This means that nomad-autoscaler, when using datadog as a source, is unable to collect any metrics.

There is currently an on_error option to either ignore or fail (ignore moves on to other checks, while fail stops all actions). It seems to me a third option would be reasonable here, perhaps called scale.

The idea would be that a check which fails to get any metrics, and which is set to on_error = "scale", is considered active. In the example below, if datadog goes offline or no metrics are reported, the nomad-autoscaler would trigger a scale-up and add additional instances according to the check's delta.

policy {
  check "api_needs_uppies" {
    source = "datadog"
    on_error = "scale"
    query = "ewma_3(avg:api.concurrent_request_per_container.poll)"
    query_window = "5m"
    group = "concurrent_reqs"
    strategy "threshold" {
      lower_bound = 3
      delta = 10
    }
  }
  check "api_needs_downies" {
    source = "datadog"
    on_error = "ignore"
    query = "ewma_3(avg:api.concurrent_request_per_container.poll)"
    query_window = "10m"
    group = "concurrent_reqs"
    strategy "threshold" {
      upper_bound = 1.5
      delta = -4
    }
  }
}

The end result is that if our metrics become unavailable, we fail safe and scale up towards our max.
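
To sketch what I mean, here's roughly how the error branch could look inside the autoscaler (a hypothetical Go sketch, not the actual nomad-autoscaler code; the check and checkResult types and the handleQueryError function are made up for illustration):

// Hypothetical sketch of how a check handler could treat the proposed
// on_error = "scale" value. Types and names are illustrative only and
// do not match the real nomad-autoscaler internals.
package main

import (
  "errors"
  "fmt"
)

type checkResult struct {
  scale bool  // whether this check wants to trigger its strategy
  delta int64 // instances to add (negative to remove)
}

type check struct {
  name    string
  onError string // "ignore", "fail", or the proposed "scale"
  delta   int64  // delta from the check's strategy block
}

// handleQueryError decides what to do when the metrics source returns
// no data or an error.
func handleQueryError(c check, queryErr error) (checkResult, error) {
  switch c.onError {
  case "ignore":
    // Skip this check and move on to the others.
    return checkResult{}, nil
  case "scale":
    // Proposed behaviour: treat the check as active and apply its
    // strategy once per failed evaluation (still clamped by the
    // policy's max and cooldown elsewhere in the evaluation).
    return checkResult{scale: true, delta: c.delta}, nil
  default: // "fail"
    return checkResult{}, fmt.Errorf("check %q failed: %w", c.name, queryErr)
  }
}

func main() {
  c := check{name: "api_needs_uppies", onError: "scale", delta: 10}
  res, err := handleQueryError(c, errors.New("datadog: no data returned"))
  fmt.Println(res, err) // {true 10} <nil>
}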

Any interest in this feature? I will probably write this for our purposes over at @classdojo, but I suspect it would be a good mainline feature for nomad-autoscaler in general!

lgfa29 (Contributor) commented Mar 10, 2023

Thanks for the suggestion and PR @erulabs!

Having an APM outage does sound like a huge pain... but I'm not sure what the safe option would be in this case 🤔

Since this is (hopefully) an unexpected situation, I worry that it may not be immediately obvious what the consequences of using on_error = "scale" would be, causing an avalanche of scale-ups when your APM does go down.

It would also be good to have some way of distinguishing between a long-term failure and a short blip. We may not want to fully scale on the first error.

This all leads me to believe that we would need a more advanced way of configuring this, but I don't have any good ideas right now 😞

The first thing that pops into my head is a special check block that is used when all previous checks fail. So something like:

policy {
  check "api_needs_uppies" {
    # ...
  }

  check "api_needs_downies" {
    # ...
  }

  error_handler {
    failures_before_triggering = 10
    strategy "fixed-value" {
      value = 5
    }
  }
}

That block wouldn't take any source or query value and would just return a specific count (or use any other strategy that doesn't rely on an APM), but it's kind of a weird one 😅
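
To make the failures_before_triggering idea concrete, here is a rough Go sketch of the counting logic (purely hypothetical; the errorHandler type and observe method don't exist in the autoscaler today):

// Hypothetical sketch of the failures_before_triggering gating described
// above. Names are made up for illustration.
package main

import "fmt"

type errorHandler struct {
  failuresBeforeTriggering int
  consecutiveFailures      int
  fixedValue               int64 // the fixed-value strategy's target count
}

// observe records the outcome of a policy evaluation and returns a
// target count (and true) only once all checks have failed enough
// times in a row, so a single blip is absorbed.
func (h *errorHandler) observe(allChecksFailed bool) (int64, bool) {
  if !allChecksFailed {
    h.consecutiveFailures = 0
    return 0, false
  }
  h.consecutiveFailures++
  if h.consecutiveFailures < h.failuresBeforeTriggering {
    return 0, false
  }
  return h.fixedValue, true
}

func main() {
  h := &errorHandler{failuresBeforeTriggering: 10, fixedValue: 5}
  for i := 0; i < 12; i++ {
    if count, trigger := h.observe(true); trigger {
      // Triggers only from the 10th consecutive failed evaluation onward.
      fmt.Printf("evaluation %d: scale to %d\n", i+1, count)
    }
  }
}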

@jrasell do you have any thoughts on this one?

erulabs (Author) commented Mar 13, 2023

@lgfa29 Interesting idea! I think failures_before_triggering would be nice, but IMO scaling up even on a blip of missing metrics is not the worst-case scenario - one assumes a sane cooldown and max would prevent a runaway scale-up.

Certainly your solution is more elegant, although would you be able to say "error_handler -> strategy -> scale up by delta/percentage" with that pattern? Re: "We may not want to fully scale on first error", my idea was that on_error = "scale" would only scale by the selected strategy, as if the check had returned in range. That is, using delta or percentage, a failed metric check wouldn't fully scale; it would just scale once per failure.

I think I still prefer the simpler fail-safe "scale-up-on-error" - although I must admit it might be somewhat enterprise-y!

lgfa29 (Contributor) commented Mar 17, 2023

one assumes a sane cooldown and max would prevent a run-away scale-up.

For a handful of policies I would agree, but my concern is for deployments with several jobs, created by different teams, where the sum of the max values may be larger than the cluster can support. The APM is the common piece among them, and if it goes down we can have cascading failures.

would you be able to say "error_handler -> strategy -> scale-up by delta/percentage" with that pattern?

Not right now, but I think it's a fairly simple strategy to implement 🤔

It would be similar to fixed-value, but it would take a relative value, like a delta or percentage as you mentioned, and add or multiply that against the current count.

So the policy would look something like this:

policy {
  check "api_needs_uppies" {
    # ...
  }

  check "api_needs_downies" {
    # ...
  }

  error_handler {
    failures_before_triggering = 10
    strategy "relative-value" {
      delta = 1 # Add 1 new instance if all checks fail.
    }
  }
}

Depending on your evaluation interval and how long the outage lasts, this policy can give you a more controlled scale-up and maybe help with the thundering herd problem.

    |
max |_ _ _ _ _ _ _ _  ________________
    |      __________|                |
    |_____|                           |__________
    |     
    +----------------------------------------------
          |___________________________|
                      APM down
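
For what it's worth, the core calculation of such a relative-value strategy could be as small as this (a hypothetical Go sketch; the relativeValueConfig type and run function are made up and this isn't the real strategy plugin interface):

// Hypothetical core of a "relative-value" strategy: apply a delta or a
// percentage to the current count, clamped to the policy's max.
package main

import (
  "fmt"
  "math"
)

type relativeValueConfig struct {
  delta      int64   // e.g. 1 adds one instance per triggered evaluation
  percentage float64 // e.g. 0.2 grows the group by 20%; 0 means unused
}

// run computes the new desired count from the current one. Clamping to
// maxCount means repeated failures cannot run away past the policy max.
func run(cfg relativeValueConfig, current, maxCount int64) int64 {
  next := current + cfg.delta
  if cfg.percentage != 0 {
    next = int64(math.Ceil(float64(current) * (1 + cfg.percentage)))
  }
  if next > maxCount {
    next = maxCount
  }
  return next
}

func main() {
  cfg := relativeValueConfig{delta: 1}
  fmt.Println(run(cfg, 4, 20)) // 5: one new instance per failed evaluation
}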

Would this match the approach you were thinking about?

erulabs (Author) commented May 2, 2023

@lgfa29 Yes, I believe the error_handler trigger would be exactly what we're looking for, in order to "fail open" as it were. What would it take to implement this? I'd love to help!

Because we're hosted on AWS, and because the outage that triggered this investigation was datadog's, we're also looking into https://github.com/lob/nomad-autoscaler-cloudwatch-apm
