# Benchmark

This benchmark focuses on testing KFServing performance with and without the Knative queue proxy/activator on the request path.

- The Knative queue proxy runs as a sidecar next to the KFServing main container (you can verify this as shown below) and does the following:
  - Enforces the concurrency level for the pod
  - Emits metrics for autoscaling (KPA)
  - Enforces timeouts
  - Performs readiness probes
  - Limits queue length
  - Supports distributed tracing
  - Handles graceful shutdown
- The Knative activator buffers requests while pods are scaled down to zero and reports metrics to the autoscaler. It also acts as a load balancer, distributing load across the pods as they become available without exceeding their concurrency settings, which protects the app from bursts so that requests do not queue up in the user pods.
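
Because the queue proxy is a sidecar, a quick way to confirm it is on the request path is to list the containers in a predictor pod. A minimal check, using the sklearn-iris service created later in this doc; the label selector is an assumption and may differ in your install:

```bash
# List the container names in the predictor pod; a Knative-managed pod
# should show both the user container and queue-proxy.
kubectl get pods -l serving.kubeflow.org/inferenceservice=sklearn-iris \
  -o jsonpath='{.items[0].spec.containers[*].name}'
```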

## Environment Setup

- K8S: v1.14.10-gke.36 (8 nodes, n1-standard)
- Istio: 1.1.6
- Knative: 0.11.2
- KFServing: master (with the fix for kserve#844)

Note that v1.14.10-gke.36 suffers from the CFS throttling bug, while 1.15.11-gke.15 includes the CFS throttling fix.
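
Since the throttling behavior depends on the node version, it is worth confirming what your nodes actually run:

```bash
# Print the kubelet version reported by each node
kubectl get nodes -o custom-columns=NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion
```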

## Benchmarking

### Results on KFServing SKLearn Iris Example

- Create the InferenceService:

  ```bash
  kubectl apply -f ./sklearn.yaml
  ```

- Create the input vegeta configmap:

  ```bash
  kubectl apply -f ./sklearn_vegeta_cfg.yaml
  ```

- Create the benchmark job using vegeta. Note that you can configure pod anti-affinity (see the sketch after this list) so that vegeta runs on a different node from the one the inference pod runs on:

  ```bash
  kubectl create -f ./sk_benchmark.yaml
  ```
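
As a sketch of the anti-affinity mentioned above, the vegeta job's pod template could include a clause like the following. This assumes the predictor pods carry the serving.kubeflow.org/inferenceservice label; adjust the selector to match your deployment:

```yaml
# Hypothetical pod-template fragment for the vegeta benchmark job:
# keep the load generator off the node that hosts the predictor pod.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          serving.kubeflow.org/inferenceservice: sklearn-iris
      topologyKey: kubernetes.io/hostname
```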

#### CC=8: with queue proxy and activator on the request path

Create an InferenceService with ContainerConcurrency (CC) set to 8, which is equal to the number of cores on the node.

```yaml
apiVersion: "serving.kubeflow.org/v1alpha2"
kind: "InferenceService"
metadata:
  name: "sklearn-iris"
spec:
  default:
    predictor:
      parallelism: 8 # CC=8
      sklearn:
        storageUri: "gs://kfserving-samples/models/sklearn/iris"
```
| QPS | Replicas | mean | p50 | p95 | p99 | Success Rate |
| --- | --- | --- | --- | --- | --- | --- |
| 5/s | minReplicas=1 | 6.213ms | 5.915ms | 6.992ms | 7.615ms | 100% |
| 50/s | minReplicas=1 | 5.738ms | 5.608ms | 6.483ms | 6.801ms | 100% |
| 500/s | minReplicas=1 | 4.083ms | 3.743ms | 4.929ms | 5.642ms | 100% |
| 1000/s | minReplicas=1 | 398.562ms | 5.95ms | 2.945s | 3.691s | 100% |
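
The 1000/s tail latencies above reflect KPA scaling out from a single replica under load. To observe the scale-out while the benchmark runs, you can watch the predictor pods (the label selector is an assumption, as above):

```bash
# Watch KPA add predictor replicas under load
kubectl get pods -l serving.kubeflow.org/inferenceservice=sklearn-iris -w
```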

#### Raw Kubernetes Service (without queue proxy and activator on the request path)

- Update the SKLearn Iris InferenceService with the following YAML to use HPA:

  ```yaml
  apiVersion: "serving.kubeflow.org/v1alpha2"
  kind: "InferenceService"
  metadata:
    name: "sklearn-iris"
    annotations:
      autoscaling.knative.dev/class: hpa.autoscaling.knative.dev
      autoscaling.knative.dev/metric: cpu
      autoscaling.knative.dev/target: "80"
  spec:
    default:
      predictor:
        sklearn:
          storageUri: "gs://kfserving-samples/models/sklearn/iris"
  ```

  ```bash
  kubectl apply -f ./sklearn_hpa.yaml
  ```
- Set up a virtual service that routes directly to the private service, bypassing the Knative activator and queue-proxy, and change the benchmark target URL host to sklearn-iris-raw.default.svc.cluster.local (an example vegeta target entry follows the YAML):

  ```yaml
  apiVersion: v1
  kind: Service
  metadata:
    name: sklearn-iris-raw
  spec:
    externalName: cluster-local-gateway.istio-system.svc.cluster.local
    sessionAffinity: None
    type: ExternalName
  ---
  apiVersion: networking.istio.io/v1alpha3
  kind: VirtualService
  metadata:
    name: sklearn-iris-raw
  spec:
    gateways:
    - knative-serving/cluster-local-gateway
    hosts:
    - sklearn-iris-raw.default.svc.cluster.local
    http:
    - match:
      - authority:
          regex: ^sklearn-iris-raw\.default(\.svc(\.cluster\.local)?)?(?::\d{1,5})?$
        gateways:
        - knative-serving/cluster-local-gateway
        uri:
          regex: ^/v1/models/[\w-]+(:predict)?
      route:
      - destination:
          host: sklearn-iris-predictor-default-xt264-private.default.svc.cluster.local # this is the private service to the user container
          port:
            number: 80
        weight: 100
  ```
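
For reference, the vegeta target for the bypass route would look like the following. This is a sketch of a vegeta targets entry; the exact contents of sklearn_vegeta_cfg.yaml and the payload path are assumptions:

```
# Hypothetical vegeta target for the raw route
POST http://sklearn-iris-raw.default.svc.cluster.local/v1/models/sklearn-iris:predict
@/var/vegeta/payload
```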
| QPS | Replicas | mean | p50 | p95 | p99 | Success Rate |
| --- | --- | --- | --- | --- | --- | --- |
| 5/s | Replicas=1 | 2.673ms | 2.381ms | 4.352ms | 5.966ms | 100% |
| 50/s | Replicas=1 | 2.188ms | 2.117ms | 2.684ms | 3.02ms | 100% |
| 500/s | Replicas=1 | 1.376ms | 1.283ms | 1.713ms | 2.205ms | 100% |
| 1000/s | Replicas=1 | 7.969s | 8.658s | 16.669s | 20.307s | 93.72% |

So you can see that the queue-proxy and activator add 2-3 ms of overhead, but in exchange you get KPA and smarter load balancing. For this example the benefit is hard to see at low load because each request takes only 1-2 ms to process; however, the advantage becomes obvious at 1000 requests/s, where KPA reacts faster and performs better than HPA.

### Results on KFServing with TFServing Flowers Example

- Create the InferenceService:

  ```bash
  kubectl apply -f ../docs/samples/tensorflow/tensorflow.yaml
  ```

- Create the input vegeta configmap:

  ```bash
  kubectl apply -f ./tf_vegeta_cfg.yaml
  ```

- Create the benchmark job using vegeta. As before, you can configure pod anti-affinity so that vegeta runs on a different node from the one the inference pod runs on (see below for reading the report):

  ```bash
  kubectl create -f ./tf_benchmark.yaml
  ```
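
Once the job completes, the vegeta report can be read from the job's logs. A minimal sketch, assuming the job defined in tf_benchmark.yaml is named tf-benchmark (the job name is an assumption):

```bash
# Wait for the benchmark job to finish, then print the vegeta report
kubectl wait --for=condition=complete job/tf-benchmark --timeout=600s
kubectl logs job/tf-benchmark
```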

#### CC=0

- Create an InferenceService with the default ContainerConcurrency of 0, which means unlimited concurrency. In this case the activator just passes requests through, so you would still expect requests to queue up in the user container if it is overloaded:

  ```yaml
  apiVersion: "serving.kubeflow.org/v1alpha2"
  kind: "InferenceService"
  metadata:
    name: "flowers-sample"
  spec:
    default:
      predictor:
        tensorflow:
          storageUri: "gs://kfserving-samples/models/tensorflow/flowers"
          resources:
            requests:
              cpu: "4"
              memory: 2Gi
            limits:
              cpu: "4"
              memory: 2Gi
  ```

  ```bash
  kubectl apply -f ./tf_flowers.yaml
  ```
| QPS | Replicas | mean | p50 | p95 | p99 | Success Rate |
| --- | --- | --- | --- | --- | --- | --- |
| 1/s | minReplicas=1 | 110.54ms | 110.343ms | 116.116ms | 117.298ms | 100% |
| 5/s | minReplicas=1 | 133.272ms | 131.242ms | 148.195ms | 153.291ms | 100% |
| 10/s | minReplicas=1 | 946.376ms | 127.961ms | 4.635s | 6.934s | 100% |

#### CC=1

- Create an InferenceService with ContainerConcurrency set to 1. The activator respects the container queue limit of 1, so requests do not get queued on the user pods; instead, the activator routes each request to a pod that has spare capacity (you can verify the setting as shown after the YAML):

  ```yaml
  apiVersion: "serving.kubeflow.org/v1alpha2"
  kind: "InferenceService"
  metadata:
    name: "flowers-sample"
  spec:
    default:
      predictor:
        parallelism: 1 # CC=1
        tensorflow:
          storageUri: "gs://kfserving-samples/models/tensorflow/flowers"
          resources:
            requests:
              cpu: "4"
              memory: 2Gi
            limits:
              cpu: "4"
              memory: 2Gi
  ```
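
To confirm that parallelism propagated to the underlying Knative revision, you can inspect containerConcurrency on the revisions. A sketch; the label selector is an assumption:

```bash
# Print containerConcurrency for the flowers-sample predictor revisions
kubectl get revisions -l serving.kubeflow.org/inferenceservice=flowers-sample \
  -o custom-columns=NAME:.metadata.name,CC:.spec.containerConcurrency
```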
| QPS | Replicas | mean | p50 | p95 | p99 | Success Rate |
| --- | --- | --- | --- | --- | --- | --- |
| 1/s | minReplicas=1 | 103.766ms | 102.869ms | 111.559ms | 116.577ms | 100% |
| 5/s | minReplicas=1 | 117.456ms | 117.117ms | 122.346ms | 126.139ms | 100% |
| 10/s | minReplicas=1 | 702.249ms | 111.289ms | 3.469s | 3.831s | 100% |

So here you can see that with one request at a time, latency is about the same for CC=0 and CC=1. However, as you send more concurrent requests the difference becomes pronounced: with CC=1 the activator takes effect and you observe better tail latency at p95 and p99, thanks to the activator's smarter load balancing compared with random load balancing.

#### Raw Kubernetes Service (without queue proxy and activator)

```yaml
apiVersion: "serving.kubeflow.org/v1alpha2"
kind: "InferenceService"
metadata:
  name: "flowers-sample-hpa"
  annotations:
    autoscaling.knative.dev/class: hpa.autoscaling.knative.dev
    autoscaling.knative.dev/metric: cpu
    autoscaling.knative.dev/target: "60"
spec:
  default:
    predictor:
      tensorflow:
        storageUri: "gs://kfserving-samples/models/tensorflow/flowers"
        resources:
          requests:
            cpu: "4"
            memory: 2Gi
          limits:
            cpu: "4"
            memory: 2Gi
```

Set up a virtual service to bypass the Knative queue proxy and activator, and update the vegeta config target URL to http://flowers-sample-raw.default.svc.cluster.local/v1/models/flowers-sample-hpa:predict (a quick smoke test follows the YAML).

```yaml
apiVersion: v1
kind: Service
metadata:
  name: flowers-sample-raw
  namespace: default
spec:
  externalName: cluster-local-gateway.istio-system.svc.cluster.local
  sessionAffinity: None
  type: ExternalName
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: flowers-sample-raw
spec:
  gateways:
  - knative-serving/cluster-local-gateway
  hosts:
  - flowers-sample-raw.default.svc.cluster.local
  http:
  - match:
    - authority:
        regex: ^flowers-sample-raw\.default(\.svc(\.cluster\.local)?)?(?::\d{1,5})?$
      gateways:
      - knative-serving/cluster-local-gateway
      uri:
        regex: ^/v1/models/[\w-]+(:predict)?
    route:
    - destination:
        host: flowers-sample-hpa-predictor-default-95bbz-private.default.svc.cluster.local # this is the private service to the user container
        port:
          number: 80
      weight: 100
```
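
Before launching the benchmark, you can sanity-check the bypass route from any pod inside the cluster. A minimal sketch; the input.json payload file is an assumption:

```bash
# Hit the raw route that skips queue-proxy and activator
curl -d @./input.json \
  http://flowers-sample-raw.default.svc.cluster.local/v1/models/flowers-sample-hpa:predict
```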
| QPS | Replicas | mean | p50 | p95 | p99 | Success Rate |
| --- | --- | --- | --- | --- | --- | --- |
| 1/s | Replicas=1 | 129.143ms | 112.853ms | 118.143ms | 128.557ms | 100% |
| 5/s | Replicas=1 | 127.947ms | 127.549ms | 132.171ms | 135.801ms | 100% |
| 10/s | Replicas=1 | 5.461s | 5.087s | 12.992s | 14.587s | 100% |

This experiment runs the InferenceService using HPA with a CPU target utilization of 60% (per the annotation above) and calls the Kubernetes Service directly, bypassing the Knative queue proxy and activator. Comparing the tables, KPA reacts faster to the load and performs better than HPA for both low-latency and high-latency requests.