
Autoscaler underprovisions for uneven low latency traffic #15000

Open
Peilun-Li opened this issue Mar 11, 2024 · 2 comments
Labels
kind/question Further information is requested

Comments


Peilun-Li commented Mar 11, 2024

Ask your question here:

Hi community, we have potentially skewed, low-latency traffic targeting a CPU-bound Knative service. With concurrency-based autoscaling, we are seeing high p90+ latency. After we manually increase min-scale to an overprovisioned level, the p90+ latency goes back to normal. We suspect the autoscaler is underprovisioning, and we want to understand the reasons and explore potential solutions.

Hypothetical traffic pattern & example service settings:

  1. We receive one request every 10ms. Plus, at the start tick of each second, we receive 10 requests in parallel.
  2. The service is CPU-bound and can only process one request at a time (i.e. containerConcurrency=1). Additional requests have to wait in queue. Each request takes 10ms to process.

Expected behavior: the autoscaler scales the service up to 11 (or higher, considering the target utilization percentage).
Actual behavior: the autoscaler underprovisions the service and we see higher p90+ latency (the small queueing sketch below illustrates the gap).
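
To make the latency gap concrete, here is a minimal queueing sketch (not Knative code; it assumes FIFO dispatch to the least-loaded replica and a fixed 10ms service time) that replays one second of this traffic against different replica counts:

package main

import (
	"fmt"
	"sort"
	"time"
)

// simulate replays one second of the hypothetical traffic (one request every
// 10ms plus a burst of 10 extra requests at t=0) against n replicas with
// containerConcurrency=1 and a 10ms service time, returning sorted latencies.
func simulate(n int) []time.Duration {
	const serviceTime = 10 * time.Millisecond
	var arrivals []time.Duration
	for t := 0; t < 1000; t += 10 {
		arrivals = append(arrivals, time.Duration(t)*time.Millisecond)
	}
	for i := 0; i < 10; i++ { // the burst at the start tick of the second
		arrivals = append(arrivals, 0)
	}
	sort.Slice(arrivals, func(i, j int) bool { return arrivals[i] < arrivals[j] })

	freeAt := make([]time.Duration, n) // when each replica next becomes idle
	var latencies []time.Duration
	for _, a := range arrivals {
		best := 0 // dispatch to the replica that frees up first
		for i := range freeAt {
			if freeAt[i] < freeAt[best] {
				best = i
			}
		}
		start := a
		if freeAt[best] > start {
			start = freeAt[best] // request waits in the queue
		}
		freeAt[best] = start + serviceTime
		latencies = append(latencies, freeAt[best]-a)
	}
	sort.Slice(latencies, func(i, j int) bool { return latencies[i] < latencies[j] })
	return latencies
}

func main() {
	for _, n := range []int{2, 11} {
		lat := simulate(n)
		fmt.Printf("replicas=%d p90=%v max=%v\n", n, lat[len(lat)*90/100], lat[len(lat)-1])
	}
}

With 2 replicas the burst queues up and p90 sits well above the 10ms service time (p90=30ms, max=60ms in this toy model); with 11 replicas every request finishes in 10ms.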

We studied the autoscaler logic for the concurrency-based metric a bit, and here's our understanding (definitely correct us if we are wrong): the concurrency the autoscaler tracks is actually AverageConcurrency. Using the above hypothetical traffic example, for each second:

// https://github.com/knative/serving/blob/main/vendor/knative.dev/networking/pkg/http/stats/request.go#L96-L104 
func (s *RequestStats) compute(now time.Time) {
	if durationSinceChange := now.Sub(s.lastChange); durationSinceChange > 0 {
		durationSecs := durationSinceChange.Seconds()
		s.secondsInUse += durationSecs // this will be 1 second after accumulation 
		s.computedConcurrency += s.concurrency * durationSecs // this will be 11*0.01+10*0.01+...+2*0.01+(1*0.01)*90=65*0.01+90*0.01=1.55
		s.computedProxiedConcurrency += s.proxiedConcurrency * durationSecs
		s.lastChange = now
	}
}

// https://github.com/knative/serving/blob/main/vendor/knative.dev/networking/pkg/http/stats/request.go#L144-L147
	if s.secondsInUse > 0 {
		report.AverageConcurrency = s.computedConcurrency / s.secondsInUse // this will be 1.55
		report.AverageProxiedConcurrency = s.computedProxiedConcurrency / s.secondsInUse
	}

With that (AverageConcurrency=1.55), it looks like the autoscaler will only try to scale up to 2, even though we have a peak concurrency of 11, i.e., the autoscaler underprovisions from the perspective of peak concurrency (though the result certainly makes sense from the perspective of average concurrency).
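
As a sanity check on that arithmetic (this only replays the numbers from the inline comment above, under the assumption that concurrency drains from 11 down to 2 over the first 100ms and then stays at 1 for the remaining 900ms; it is not Knative code):

package main

import "fmt"

func main() {
	// Time-weighted concurrency over one second: levels 11, 10, ..., 2 for
	// 10ms (0.01s) each, then a single in-flight request for the last 900ms.
	var computed float64
	for c := 11; c >= 2; c-- {
		computed += float64(c) * 0.01
	}
	computed += 1.0 * 0.01 * 90
	secondsInUse := 1.0
	fmt.Printf("AverageConcurrency ≈ %.2f\n", computed/secondsInUse) // ≈ 1.55
}

So the reported AverageConcurrency lands at ~1.55 even though the instantaneous concurrency briefly hits 11.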

Questions:

  1. Is our above understanding correct?
  2. I understand that average concurrency is desired in most cases and provides a good balance, but we're curious whether there's any way to make it more reactive to such a low-latency, uneven traffic pattern. Ideally we could have a toggle, set on a per-service/revision basis, to tune the sensitivity of the concurrency metric; e.g., if both average concurrency and peak concurrency were reported, a config ratio could tune autoscaling sensitivity (a small sketch of this idea follows after the formula):
autoscaler_concurrency = (1 - sensitivity_ratio) * average_concurrency + sensitivity_ratio * peak_concurrency
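
For concreteness, here is a minimal sketch of that idea (a hypothetical knob, not an existing Knative setting):

package main

import "fmt"

// blendedConcurrency mixes average and peak concurrency: ratio=0 reproduces
// today's average-based behavior, ratio=1 scales purely on the peak.
// (Hypothetical, for illustration only.)
func blendedConcurrency(avg, peak, ratio float64) float64 {
	return (1-ratio)*avg + ratio*peak
}

func main() {
	avg, peak := 1.55, 11.0
	for _, r := range []float64{0, 0.5, 1} {
		fmt.Printf("sensitivity_ratio=%.1f -> concurrency=%.2f\n", r, blendedConcurrency(avg, peak, r))
	}
}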

TIA for any insights and help!

Peilun-Li added the kind/question label on Mar 11, 2024
skonto (Contributor) commented May 20, 2024

Hi @Peilun-Li, the KPA autoscaler scrapes the QP (queue-proxy) pods every 2 secs, and each QP reports its metrics every 1 sec. It is true that the concurrency metric is averaged over that 1 sec reporting period. The autoscaler also takes into consideration the proxied requests from the activator by subtracting them from the final value. It calculates the desired pod count based on some window (panic or stable) and assigns a bucket per scrape done (for a 60 sec stable window that means 30 buckets). Then it calculates a window average to decide the metric to be used (there is an option for a weighted average too). Thus, if you don't sustain a concurrency level for enough time within each reporting period you will not see the replicas you expect; that is because the existing replicas served the traffic.
With containerConcurrency set to 1 the target will be 0.7 (the utilization factor is 70%), so you would need a reported concurrency level of ~7.7 to see 11 replicas created.
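
To make that arithmetic concrete, a rough sketch of desired = ceil(observed / target) with the numbers from this thread (not the actual autoscaler code, which additionally applies stable/panic windowing and scale-rate limits):

package main

import (
	"fmt"
	"math"
)

func main() {
	containerConcurrency := 1.0
	utilization := 0.70 // default container-concurrency-target-percentage
	target := containerConcurrency * utilization // per-replica concurrency target

	// What the reported average of 1.55 buys you:
	observed := 1.55
	fmt.Printf("observed=%.2f target=%.2f -> desired replicas=%.0f\n",
		observed, target, math.Ceil(observed/target)) // 3, nowhere near 11

	// Sustained concurrency needed to reach 11 replicas:
	fmt.Printf("concurrency needed for 11 replicas ≈ %.1f\n", 11*target) // ≈ 7.7
}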

I suspect one way to deal with the above scenario is to use rps as the metric, since it is calculated as a rate over time (independently of how requests arrive within the 1 sec reporting period). For example, in the above workload you have 110 rps. You could then have:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: autoscale-go
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/target: "14.29"
        autoscaling.knative.dev/metric: "rps"
    spec:
      containers:
...

The above will assign 10 rps per replica (14.29 * 0.7 ≈ 10; alternatively you could change the utilization factor to 100% and set the target to 10).
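
As a quick check of that sizing (arithmetic only, not autoscaler code):

package main

import (
	"fmt"
	"math"
)

func main() {
	observedRPS := 110.0     // 100 steady requests/sec plus the burst of 10
	annotatedTarget := 14.29 // autoscaling.knative.dev/target
	utilization := 0.70      // default target utilization
	perReplica := annotatedTarget * utilization // effective rps per replica
	fmt.Printf("per-replica rps=%.3f -> desired replicas=%.0f\n",
		perReplica, math.Ceil(observedRPS/perReplica))
}

which lines up with the target = 10.003 and Desired StablePodCount = 11 in the log below.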
Here is what I observed with a constant rate of 110 rps (expected):

{"severity":"DEBUG","timestamp":"2024-05-20T13:52:02.062853698Z","logger":"autoscaler","caller":"scaling/autoscaler.go:190","message":"For metric rps observed values: stable = 110.000; panic = 110.000; target = 10.003 Desired StablePodCount = 11, PanicPodCount = 11, ReadyEndpointCount = 11, MaxScaleUp = 11000, MaxScaleDown = 5","commit":"38e22f9-dirty","knative.dev/key":"default/autoscale-go-00001"}

Could you try the above with your use case and see if that helps? Also, for any testing done it would be helpful to enable debug-level logging for the autoscaler pod and report the logs (they contain valuable info about how the autoscaler behaves).

Peilun-Li (Author) commented

Thanks for the context and idea @skonto, yeah I think an RPS metric would help and we can try that, but I feel it comes with two pain points:

  1. Calculating an "accurate" rps target depends on knowing the actual traffic pattern (down to concurrency at the millisecond level, as in the hypothetical traffic example), while in reality that fine-grained traffic pattern can fluctuate and be hard to measure. So we would usually need to iterate on the autoscaling target many times for the RPS metric to work effectively, and we may also need to keep a close watch in case the traffic pattern changes and the target needs re-tuning.
  2. This approach can itself be viewed as consistent "overprovisioning", e.g., in the hypothetical example each service replica could ideally handle 100 rps (with an even traffic distribution) while we only assign it 10 rps. So imagine our traffic is unevenly distributed (as in the hypothetical pattern) during the daytime, and evenly distributed at nighttime (e.g., without the "at the start tick of each second, we receive 10 requests in parallel" burst at night). Then during the nighttime we would mostly be overprovisioning unnecessarily (a rough sketch of this follows below).
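
A rough sketch of that nighttime overprovisioning concern, using the hypothetical numbers from this thread (illustration only):

package main

import (
	"fmt"
	"math"
)

func main() {
	nightRPS := 100.0           // even traffic: one request every 10ms, no burst
	perReplicaCapacity := 100.0 // 10ms per request at containerConcurrency=1
	rpsTargetPerReplica := 10.0 // what the 14.29 target at 70% utilization assigns

	needed := math.Ceil(nightRPS / perReplicaCapacity)
	provisioned := math.Ceil(nightRPS / rpsTargetPerReplica)
	fmt.Printf("night: needed=%.0f provisioned=%.0f (~%.0fx overprovisioned)\n",
		needed, provisioned, provisioned/needed)
}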

Great suggestion on enabling debug logging for autoscaler, will try that :)
