Serving with gRPC

This guide will demonstrate advanced features that BentoML offers for you to get started with gRPC:

  • First-class support for custom gRPC Servicer <guides/grpc:Mounting Servicer>, custom interceptors <guides/grpc:Mounting gRPC Interceptors>, handlers.
  • Seamlessly adding gRPC support to an existing Bento.

This guide will also walk you through the tradeoffs of serving with gRPC, as well as recommendations on scenarios where gRPC might be a good fit.

Requirements: This guide assumes that you have basic knowledge of gRPC and protobuf. If you aren't familiar with gRPC, you can start with the gRPC quick start guide.

For a quick introduction to serving with gRPC, see Intro to BentoML <tutorial:Tutorial: Intro to BentoML>

Get started with gRPC in BentoML

We will be using the example from the quickstart <tutorial:Tutorial: Intro to BentoML> to demonstrate BentoML's capabilities with gRPC.

Requirements

gRPC support was introduced in BentoML version 1.0.6 and is available in all later versions.

Install BentoML with gRPC support with pip:

» pip install -U "bentoml[grpc]"

That's it! You can now serve your Bento with gRPC via bentoml serve-grpc <reference/cli:serve-grpc> without having to modify your current service definition 😃.

» bentoml serve-grpc iris_classifier:latest --production

Using your gRPC BentoService

There are two ways to interact with your gRPC BentoService:

  1. Use tools such as fullstorydev/grpcurl, fullstorydev/grpcui: The server requires reflection <grpc/grpc/blob/master/doc/server-reflection.md> to be enabled for those tools to work. Pass in --enable-reflection to enable reflection:

    » bentoml serve-grpc iris_classifier:latest --production --enable-reflection

    Open a different terminal and use one of the following:

  2. Use one of the below client implementations <guides/grpc:Client Implementation> to send test requests to your BentoService.

Client Implementation

Note

All of the following client implementations are available on GitHub <bentoml/BentoML/tree/main/grpc-client/>.

From another terminal, use one of the following client implementations to send requests to the gRPC server:

Note

gRPC comes with support for multiple languages. In the upcoming sections we will demonstrate two workflows for generating stubs and implementing clients:

  • Using bazel_ to manage and isolate dependencies (recommended)
  • A manual approach using protoc and its language-specific plugins

Python

We will create our Python client in the directory ~/workspace/iris_python_client/:

» mkdir -p ~/workspace/iris_python_client
» cd ~/workspace/iris_python_client

Create a client.py file with the following content:

../../../grpc-client/python/client.py

Go

Requirements: Make sure to install the prerequisites before using Go.

We will create our Golang client in the directory ~/workspace/iris_go_client/:

» mkdir -p ~/workspace/iris_go_client
» cd ~/workspace/iris_go_client

Using bazel (recommended)

Define a WORKSPACE_ file:

WORKSPACE

./snippets/grpc/go/WORKSPACE.snippet.bzl

Followed by defining a BUILD_ file:

BUILD

./snippets/grpc/go/BUILD.snippet.bzl

Using protoc and language-specific plugins

Create a Go module:

» go mod init iris_go_client && go mod tidy

Add the following lines to ~/workspace/iris_go_client/go.mod:

require github.com/bentoml/bentoml/grpc/v1alpha1 v0.0.0-unpublished

replace github.com/bentoml/bentoml/grpc/v1alpha1 v0.0.0-unpublished => ./github.com/bentoml/bentoml/grpc/v1alpha1

The replace directive tells Go where to import our generated stubs from, since we don't host the generated gRPC stubs on pkg.go.dev 😄.

Here is the protoc command to generate the gRPC Go stubs:

» protoc -I. -I thirdparty/protobuf/src  \
         --go_out=. --go_opt=paths=import \
         --go-grpc_out=. --go-grpc_opt=paths=import \
         bentoml/grpc/v1alpha1/service.proto

Then run the following to make sure the generated stubs are importable:

» pushd github.com/bentoml/bentoml/grpc/v1alpha1
» go mod init v1alpha1 && go mod tidy
» popd

Create a client.go file with the following content:

../../../grpc-client/go/client.go

C++

Requirements: Make sure to follow the instructions to install gRPC and Protobuf locally.

We will create our C++ client in the directory ~/workspace/iris_cc_client/:

» mkdir -p ~/workspace/iris_cc_client
» cd ~/workspace/iris_cc_client

Using bazel (recommended)

Define a WORKSPACE_ file:

WORKSPACE

./snippets/grpc/cpp/WORKSPACE.snippet.bzl

Followed by defining a BUILD_ file:

BUILD

./snippets/grpc/cpp/BUILD.snippet.bzl

Using protoc and language-specific plugins

Here is the protoc command to generate the gRPC C++ stubs:

» protoc -I . -I ./thirdparty/protobuf/src \
         --cpp_out=. --grpc_out=. \
         --plugin=protoc-gen-grpc=$(which grpc_cpp_plugin) \
         bentoml/grpc/v1alpha1/service.proto

Create a client.cc file with the following content:

../../../grpc-client/cpp/client.cc

Java

Requirements: Make sure to have JDK>=7.

Optional: follow the instructions <grpc/grpc-java/tree/master/compiler> to install protoc plugin for gRPC Java if you plan to use protoc standalone.

Note

Feel free to use any Java build tool of your choice (Maven, Gradle, Bazel, etc.) to build and run the client.

In this tutorial we will be using bazel_.

We will create our Java client in the directory ~/workspace/iris_java_client/:

» mkdir -p ~/workspace/iris_java_client
» cd ~/workspace/iris_java_client

Create the client Java package (com.client.BentoServiceClient):

» mkdir -p src/main/java/com/client

Using bazel (recommended)

Define a WORKSPACE_ file:

WORKSPACE

./snippets/grpc/java/WORKSPACE.snippet.bzl

Followed by defining a BUILD_ file:

BUILD

./snippets/grpc/java/BUILD.snippet.bzl

Using other build systems

You can't simply run javac by hand to compile the Java class, since there are too many dependencies to resolve.

Provided below is an example of how one can use gradle to build the Java client.

» gradle init --project-dir .

The following build.gradle should be able to help you get started:

../../../grpc-client/java/build.gradle

To build the client, run:

» ./gradlew build

Proceed to create a src/main/java/com/client/BentoServiceClient.java file with the following content:

../../../grpc-client/java/src/main/java/com/client/BentoServiceClient.java

On running protoc standalone (optional)

Here is the protoc command to generate the gRPC Java stubs if you need to use protoc standalone:

» protoc -I . \
         -I ./thirdparty/protobuf/src \
         --java_out=./src/main/java \
         --grpc-java_out=./src/main/java \
         bentoml/grpc/v1alpha1/service.proto

Kotlin

Requirements: Make sure to have the prerequisites to get started with grpc/grpc-kotlin.

Optional: feel free to install Kotlin gRPC codegen <grpc/grpc-kotlin/blob/master/compiler/README.md> in order to generate gRPC stubs if you plan to use protoc standalone.

To bootstrap the Kotlin client, feel free to use either gradle or maven to build and run the following client code.

In this example, we will use bazel_ to build and run the client.

We will create our Kotlin client in the directory ~/workspace/iris_kotlin_client/, followed by creating the client directory structure:

» mkdir -p ~/workspace/iris_kotlin_client
» cd ~/workspace/iris_kotlin_client
» mkdir -p src/main/kotlin/com/client

Using bazel (recommended)

Define a WORKSPACE_ file:

WORKSPACE

./snippets/grpc/kotlin/WORKSPACE.snippet.bzl

Followed by defining a BUILD_ file:

BUILD

./snippets/grpc/kotlin/BUILD.snippet.bzl

Using other build systems

You can't simply compile all the Kotlin files by hand, since there are too many dependencies to resolve.

Provided below is an example of how one can use gradle to build the Kotlin client.

» gradle init --project-dir .

The following build.gradle.kts should be able to help you get started:

../../../grpc-client/kotlin/build.gradle.kts

To build the client, run:

» ./gradlew build

Proceed to create a src/main/kotlin/com/client/BentoServiceClient.kt file with the following content:

../../../grpc-client/kotlin/src/main/kotlin/com/client/BentoServiceClient.kt

On running protoc standalone (optional)

Here is the protoc command to generate the gRPC Kotlin stubs if you need to use protoc standalone:

» protoc -I. -I ./thirdparty/protobuf/src \
         --kotlin_out ./kotlin/src/main/kotlin/ \
         --grpc-kotlin_out ./kotlin/src/main/kotlin \
         --plugin=protoc-gen-grpc-kotlin=$(which protoc-gen-grpc-kotlin) \
         bentoml/grpc/v1alpha1/service.proto

Node.js

Requirements: Make sure to have Node.js installed in your system.

We will create our Node.js client in the directory ~/workspace/iris_node_client/:

» mkdir -p ~/workspace/iris_node_client
» cd ~/workspace/iris_node_client

Initialize the project and use the following package.json:

../../../grpc-client/node/package.json

Install the dependencies with either npm or yarn:

» yarn install --add-devs

Note

If you are on an Apple M1 machine, you might also have to prepend npm_config_target_arch=x64 to the yarn command:

» npm_config_target_arch=x64 yarn install --add-devs

Here is the protoc command to generate the gRPC JavaScript stubs:

» $(npm bin)/grpc_tools_node_protoc \
         -I . -I ./thirdparty/protobuf/src \
         --js_out=import_style=commonjs,binary:. \
         --grpc_out=grpc_js:js \
         bentoml/grpc/v1alpha1/service.proto

Proceed to create a client.js file with the following content:

../../../grpc-client/node/client.js

Swift

Requirements: Make sure to have the prerequisites <grpc/grpc-swift/blob/main/docs/quick-start.md#prerequisites> to get started with grpc/grpc-swift.

We will create our Swift client in the directory ~/workspace/iris_swift_client/:

» mkdir -p ~/workspace/iris_swift_client
» cd ~/workspace/iris_swift_client

We will use Swift Package Manager to build and run the client.

» swift package init --type executable

Initialize the project and use the following Package.swift:

../../../grpc-client/swift/Package.swift

Here is the protoc command to generate the gRPC Swift stubs:

» protoc -I. -I ./thirdparty/protobuf/src \
         --swift_out=Sources --swift_opt=Visibility=Public \
         --grpc-swift_out=Sources --grpc-swift_opt=Visibility=Public \
         --plugin=protoc-gen-grpc-swift=$(which protoc-gen-grpc-swift) \
         bentoml/grpc/v1alpha1/service.proto

Proceed to create a Sources/BentoServiceClient/main.swift file with the following content:

../../../grpc-client/swift/Sources/BentoServiceClient/main.swift

PHP

Requirements: Make sure to follow the instructions <grpc/grpc/blob/master/src/php/README.md> to install grpc via either pecl or from source.

Note

You will also have to symlink the built C++ extension to the PHP extension directory for it to be loaded by PHP.

We will then use bazel_ and composer to build and run the client.

We will create our PHP client in the directory ~/workspace/iris_php_client/:

» mkdir -p ~/workspace/iris_php_client
» cd ~/workspace/iris_php_client

Create a new PHP package:

» composer init

An example composer.json for the client:

../../../grpc-client/php/composer.json

Here is the protoc command to generate the gRPC PHP stubs:

» protoc -I . -I ./thirdparty/protobuf/src \
         --php_out=. \
         --grpc_out=. \
         --plugin=protoc-gen-grpc=$(which grpc_php_plugin) \
         bentoml/grpc/v1alpha1/service.proto

Proceed to create a BentoServiceClient.php file with the following content:

../../../grpc-client/php/BentoServiceClient.php

Bazel instruction for swift, nodejs, python

Then you can proceed to run the client scripts:

Python

» python -m client

Go

Using bazel (recommended)

» bazel run //:client_go

Using protoc and language-specific plugins

» go run ./client.go

C++

Using bazel (recommended)

» bazel run :client_cc

Using protoc and language-specific plugins

Refer to grpc/grpc for instructions on using CMake and other similar build tools.

Note

See the instructions on GitHub <bentoml/BentoML/tree/main/grpc-client/README.md> for working C++ client.

Java

Using bazel (recommended)

» bazel run :client_java

Using other build systems

We will use gradlew to build the client and run it:

» ./gradlew build && \
   ./build/tmp/scripts/bentoServiceClient/bento-service-client

Note

See the instructions on GitHub <bentoml/BentoML/tree/main/grpc-client/README.md> for working Java client.

Kotlin

Using bazel (recommended)

» bazel run :client_kt

Using other build systems

We will use gradlew to build the client and run it:

» ./gradlew build && \
   ./build/tmp/scripts/bentoServiceClient/bento-service-client

Note

See the instructions on GitHub <bentoml/BentoML/tree/main/grpc-client/README.md> for working Kotlin client.

Node.js

» node client.js

Swift

» swift run BentoServiceClient

PHP

» php -d extension=/path/to/grpc.so -d max_execution_time=300 BentoServiceClient.php

Additional language support for client implementation

Ruby

Note: Please check out the gRPC Ruby <grpc/grpc/blob/master/src/ruby/README.md#grpc-ruby> for how to install from source. Check out the examples folder <grpc/grpc/blob/master/examples/ruby/README.md#prerequisites> for Ruby client implementation.

.NET

Note: Please check out the gRPC .NET <grpc/grpc-dotnet/tree/master/examples> examples folder for grpc/grpc-dotnet client implementation.

Dart

Note: Please check out the gRPC Dart <grpc/grpc-dart/tree/master/examples> examples folder for grpc/grpc-dart client implementation.

Rust

Note: Currently there is no official gRPC Rust client implementation. Please check out tikv/grpc-rs as one of the unofficial implementations.

After successfully running the client, proceed to build the bento as usual:

» bentoml build

Containerize your Bento 🍱 with gRPC support

To containerize the Bento with gRPC features, pass --enable-features=grpc to bentoml containerize <reference/cli:containerize> to add the additional gRPC dependencies to your Bento:

» bentoml containerize iris_classifier:latest --enable-features=grpc

--enable-features allows users to containerize any of the existing Bentos with additional features <concepts/bento:Enable features for your Bento> that BentoML provides without having to rebuild the Bento.

Note

--enable-features accepts a comma-separated list of features or multiple arguments.

After containerization, your Bento container can now be used with gRPC:

» docker run -it --rm \
             -p 3000:3000 -p 3001:3001 \
             iris_classifier:6otbsmxzq6lwbgxi serve-grpc --production

Congratulations! You have successfully served, containerized and tested your BentoService with gRPC.


Using gRPC in BentoML

We will dive into some of the details of how gRPC is implemented in BentoML.

Protobuf definition

Let's take a quick look at protobuf definition of the BentoService:

service BentoService {
  rpc Call(Request) returns (Response) {}
}

Expand for the current protobuf definition.

v1alpha1

../../../src/bentoml/grpc/v1alpha1/service.proto

As you can see, BentoService defines a simple rpc Call that sends a Request message and returns a Response message.

A Request message takes in:

  • `api_name`: the name of the API function defined inside your BentoService.
  • oneof `content`: the field can be one of the following types:

    =========================================  ============================================================================
    Protobuf definition                        IO Descriptor
    =========================================  ============================================================================
    Array representation via NDArray           bentoml.io.NumpyNdarray <reference/api_io_descriptors:NumPy \`\`ndarray\`\`>
    Tabular data representation via DataFrame  bentoml.io.PandasDataFrame <reference/api_io_descriptors:Tabular Data with Pandas>
    Series representation via Series           bentoml.io.PandasSeries <reference/api_io_descriptors:Tabular Data with Pandas>
    File-like object via File                  bentoml.io.File <reference/api_io_descriptors:Files>
    google.protobuf.StringValue_               bentoml.io.Text <reference/api_io_descriptors:Texts>
    google.protobuf.Value_                     bentoml.io.JSON <reference/api_io_descriptors:Structured Data with JSON>
    Complex payload via Multipart              bentoml.io.Multipart <reference/api_io_descriptors:Multipart Payloads>
    Compact data format via serialized_bytes   (See below)
    =========================================  ============================================================================

Note

Series is currently not yet supported.

The Response message will then return one of the aforementioned types as result.

Example: In the quickstart guide <tutorial:Creating a Service>, we defined a classify API that takes in a bentoml.io.NumpyNdarray <reference/api_io_descriptors:NumPy \`\`ndarray\`\`>.

Therefore, our Request message would have the following structure:

Python

./snippets/grpc/python/request.py

Go

./snippets/grpc/go/request.go

C++

./snippets/grpc/cpp/request.cc

Java

./snippets/grpc/java/Request.java

Kotlin

./snippets/grpc/kotlin/Request.kt

Node.js

./snippets/grpc/node/request.js

Swift

./snippets/grpc/swift/Request.swift

Array representation via NDArray

Description: NDArray represents a flattened n-dimensional array of arbitrary type. It accepts the following fields:

  • dtype

    The data type of the given input. This is an Enum field that provides a 1-1 mapping from Protobuf data types to NumPy data types:

    =================  ===========  ==========
    pb.NDArray.DType   numpy.dtype  Enum value
    =================  ===========  ==========
    DTYPE_UNSPECIFIED  None         0
    DTYPE_FLOAT        np.float32   1
    DTYPE_DOUBLE       np.double    2
    DTYPE_BOOL         np.bool_     3
    DTYPE_INT32        np.int32     4
    DTYPE_INT64        np.int64     5
    DTYPE_UINT32       np.uint32    6
    DTYPE_UINT64       np.uint64    7
    DTYPE_STRING       np.str_      8
    =================  ===========  ==========
  • shape

    A list of int32 that represents the shape of the flattened array. The bentoml.io.NumpyNdarray <reference/api_io_descriptors:NumPy \`\`ndarray\`\`> descriptor will then reshape the given payload into the expected shape.

    Note that this value always takes precedence over the shape field in the bentoml.io.NumpyNdarray <reference/api_io_descriptors:NumPy \`\`ndarray\`\`> descriptor, meaning the array will be reshaped to this value first if given. Refer to bentoml.io.NumpyNdarray.from_proto for implementation details.

  • string_values, float_values, double_values, bool_values, int32_values, int64_values, uint32_values, uint64_values

    Each of the fields is a list of the corresponding data type. The list is a flattened array, and will be reconstructed together with the shape field into the original payload.

    Per request sent, one message should only contain ONE of the aforementioned fields.

    The interaction between the above fields and dtype is as follows:

    • if dtype is not present in the message:
      • If all of the fields are empty, we return a np.empty.
      • Otherwise, we loop through all of the provided fields, and only allow one field per message.

        If there is more than one field (e.g. string_values and float_values), then we will raise an error, as we don't know how to deserialize the data.

    • otherwise:
      • We will use the provided dtype-to-field map to get the data from the given message.

        ============  =============
        DType         field
        ============  =============
        DTYPE_BOOL    bool_values
        DTYPE_DOUBLE  double_values
        DTYPE_FLOAT   float_values
        DTYPE_INT32   int32_values
        DTYPE_INT64   int64_values
        DTYPE_STRING  string_values
        DTYPE_UINT32  uint32_values
        DTYPE_UINT64  uint64_values
        ============  =============

    For example, if dtype is DTYPE_FLOAT, then the payload is expected to have the float_values field.
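The selection rules above can be sketched in plain Python. This is an illustrative model only, not BentoML's implementation: a dict stands in for the pb.NDArray message, `select_values` and `DTYPE_TO_FIELD` are hypothetical names, and treating DTYPE_UNSPECIFIED as "dtype not present" is an assumption.

```python
# Illustrative only: a plain dict stands in for a pb.NDArray message.
DTYPE_TO_FIELD = {
    "DTYPE_BOOL": "bool_values",
    "DTYPE_DOUBLE": "double_values",
    "DTYPE_FLOAT": "float_values",
    "DTYPE_INT32": "int32_values",
    "DTYPE_INT64": "int64_values",
    "DTYPE_STRING": "string_values",
    "DTYPE_UINT32": "uint32_values",
    "DTYPE_UINT64": "uint64_values",
}


def select_values(message: dict) -> list:
    """Return the flat list of values carried by the message."""
    dtype = message.get("dtype")
    # Assumption: DTYPE_UNSPECIFIED is treated the same as a missing dtype.
    if dtype is not None and dtype != "DTYPE_UNSPECIFIED":
        # dtype present: read only the field that dtype maps to.
        return list(message.get(DTYPE_TO_FIELD[dtype], []))
    # dtype absent: at most one *_values field may be populated.
    populated = [f for f in DTYPE_TO_FIELD.values() if message.get(f)]
    if not populated:
        return []  # stands in for an empty array
    if len(populated) > 1:
        raise ValueError(f"ambiguous payload, got fields: {sorted(populated)}")
    return list(message[populated[0]])
```

For instance, `select_values({"dtype": "DTYPE_FLOAT", "float_values": [5.4, 3.4]})` reads `float_values`, while a message that sets both `string_values` and `float_values` without a dtype raises an error.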

Python API

NumpyNdarray.from_sample(
   np.array([[5.4, 3.4, 1.5, 0.4]])
)

pb.NDArray

ndarray {
  dtype: DTYPE_FLOAT
  shape: 1
  shape: 4
  float_values: 5.4
  float_values: 3.4
  float_values: 1.5
  float_values: 0.4
}

API reference: bentoml.io.NumpyNdarray.from_proto
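To make the role of shape concrete, here is a minimal pure-Python sketch of rebuilding a nested array from the flattened values. The `reshape` helper is hypothetical and assumes row-major order; BentoML itself relies on NumPy for this step.

```python
from math import prod


def reshape(flat: list, shape: list) -> list:
    """Rebuild a nested list from a flattened, row-major list of values."""
    if prod(shape) != len(flat):
        raise ValueError(f"cannot reshape {len(flat)} values into shape {shape}")
    if len(shape) <= 1 or shape[0] == 0:
        return list(flat)
    # Each entry along the first axis owns an equally sized slice.
    step = len(flat) // shape[0]
    return [
        reshape(flat[i * step:(i + 1) * step], shape[1:])
        for i in range(shape[0])
    ]


# The payload from the example above: shape [1, 4] and four float_values.
array = reshape([5.4, 3.4, 1.5, 0.4], [1, 4])
```

Here `array` is a list containing one row of four values, matching the input passed to `from_sample`.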

Tabular data representation via DataFrame

Description: DataFrame represents any tabular data type. Currently we only support the columns orientation since it is best for preserving the input order.

It accepts the following fields:

  • column_names

    A list of string that represents the column names of the given tabular data.

  • columns

    A list of Series where Series represents a series of arbitrary data type. The allowed fields for Series are similar to the ones in `NDArray`:

    • one of [string_values, float_values, double_values, bool_values, int32_values, int64_values, uint32_values, uint64_values]

Python API

PandasDataFrame.from_sample(
    pd.DataFrame({
      "age": [3, 29],
      "height": [94, 170],
      "weight": [31, 115]
    }),
    orient="columns",
)

pb.DataFrame

dataframe {
  column_names: "age"
  column_names: "height"
  column_names: "weight"
  columns {
    int32_values: 3
    int32_values: 29
  }
  columns {
    int32_values: 94
    int32_values: 170
  }
  columns {
    int32_values: 31
    int32_values: 115
  }
}

API reference: bentoml.io.PandasDataFrame.from_proto
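The columns orientation can be illustrated with a small pure-Python sketch that splits a column-oriented mapping into the two parallel lists used by the message. The `to_columns` helper and the dict output are hypothetical stand-ins, not part of the BentoML API.

```python
def to_columns(table: dict) -> dict:
    """Split a column-oriented mapping into parallel name and column lists."""
    names = list(table)  # dicts preserve insertion order, so column order is kept
    lengths = {len(table[name]) for name in names}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same number of rows")
    return {
        "column_names": names,
        "columns": [list(table[name]) for name in names],
    }


payload = to_columns({"age": [3, 29], "height": [94, 170], "weight": [31, 115]})
```

Each entry in `payload["columns"]` lines up with the name at the same index in `payload["column_names"]`, which is how the input order is preserved.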

Series representation via Series

Description: Series portrays a series of values. This can be used for representing Series types in tabular data.

It accepts the following fields:

  • string_values, float_values, double_values, bool_values, int32_values, int64_values

    Similar to NumpyNdarray, each of the fields is a list of the corresponding data type. The list is a 1-D array, and is then passed to pd.Series.

    Each request should only contain ONE of the aforementioned fields.

    The interaction between the above fields and dtype from PandasSeries is as follows:

    • if dtype is not present in the descriptor:
      • If all of the fields are empty, we return an empty pd.Series.
      • Otherwise, we loop through all of the provided fields, and only allow one field per message.

        If there is more than one field (e.g. string_values and float_values), then we will raise an error, as we don't know how to deserialize the data.

    • otherwise:
      • We will use the provided dtype-to-field map to get the data from the given message.

Python API

PandasSeries.from_sample([5.4, 3.4, 1.5, 0.4])

pb.Series

series {
  float_values: 5.4
  float_values: 3.4
  float_values: 1.5
  float_values: 0.4
}

API reference: bentoml.io.PandasSeries.from_proto

File-like object via File

Description: File represents any arbitrary file type. This can be used to send in any file type, including images, videos, audio, etc.

Note

Currently both bentoml.io.File and bentoml.io.Image are using pb.File

It accepts the following fields:

  • content

    A bytes field that represents the content of the file.

  • Document kind once enum was dropped.
  • Demonstrate python API to protobuf representation

Complex payload via Multipart

Description: Multipart represents a complex payload that can contain multiple different fields. It takes a fields map, which is a dictionary of input name to its corresponding bentoml.io.IODescriptor.

Python API

Multipart(
   meta=Text(),
   arr=NumpyNdarray(
      dtype=np.float16,
      shape=[2,2]
   )
)

pb.Multipart

multipart {
   fields {
      key: "arr"
      value {
         ndarray {
            dtype: DTYPE_FLOAT
            shape: 2
            shape: 2
            float_values: 1.0
            float_values: 2.0
            float_values: 3.0
            float_values: 4.0
         }
      }
   }
   fields {
      key: "meta"
      value {
         text {
            value: "nlp"
         }
      }
   }
}

API reference: bentoml.io.Multipart.from_proto

Compact data format via serialized_bytes

The serialized_bytes field in both Request and Response is reserved for pre-established protocol encoding between client and server.

BentoML leverages the field to improve serialization performance between BentoML client and server. Thus the field is not recommended for use directly.

Mounting Servicer

gRPC service multiplexing <guides/grpc:Demystifying the misconception of gRPC vs. REST> enables us to mount additional custom servicers alongside BentoService, and serve them on the same port.

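import bentoml

import route_guide_pb2
import route_guide_pb2_grpc
from servicer_impl import RouteGuideServicer

# iris_clf_runner is the runner created in the quickstart service definition.
svc = bentoml.Service("iris_classifier", runners=[iris_clf_runner])

# Fully qualified service names, used for server reflection.
services_name = [
    v.full_name for v in route_guide_pb2.DESCRIPTOR.services_by_name.values()
]
svc.mount_grpc_servicer(
    RouteGuideServicer,
    # The add_*_to_server function lives in the generated *_pb2_grpc module.
    add_servicer_fn=route_guide_pb2_grpc.add_RouteGuideServicer_to_server,
    service_names=services_name,
)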

Serve your service with bentoml serve-grpc <reference/cli:serve-grpc> command:

» bentoml serve-grpc service.py:svc --reload --enable-reflection

Now your RouteGuide service can also be accessed through localhost:3000.

Note

service_names is REQUIRED here, as this will be used for server reflection <grpc/grpc/blob/master/doc/server-reflection.md> when --enable-reflection is passed to bentoml serve-grpc.

Mounting gRPC Interceptors

Interceptors are a component of gRPC that allows us to intercept and interact with the proto message and service context either before or after the actual RPC call is sent or received by the client or server.

Interceptors are to gRPC what middleware is to HTTP. The most common use cases for interceptors are authentication, tracing <guides/tracing:Tracing>, access logs, and more.

BentoML comes with a set of built-in async interceptors that provide support for access logs, OpenTelemetry, and Prometheus.

The following diagram demonstrates the flow of a gRPC request from client to server:

Interceptor Flow

Since interceptors are executed in the order they are added, user-defined interceptors are executed after the built-in interceptors.

User-defined interceptors shouldn't modify the existing headers and data of the incoming Request.
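The ordering can be modelled with a small asyncio sketch. It is a toy stand-in for the interceptor chain, not the grpc.aio API: `chain`, `tracing`, and `handler` are hypothetical names, and an interceptor here is simply a function that wraps the next handler.

```python
import asyncio


def chain(interceptors, handler):
    """Wrap handler so that the first interceptor in the list runs outermost."""
    for interceptor in reversed(interceptors):
        handler = interceptor(handler)
    return handler


def tracing(name, log):
    """Build a toy interceptor that records when it is entered and exited."""
    def interceptor(next_handler):
        async def wrapped(request):
            log.append(f"enter {name}")
            response = await next_handler(request)
            log.append(f"exit {name}")
            return response
        return wrapped
    return interceptor


async def handler(request):
    # Stands in for the actual RPC implementation.
    return request.upper()


log = []
# Built-in interceptors are added first, user-defined ones afterwards.
call = chain([tracing("built-in", log), tracing("user", log)], handler)
response = asyncio.run(call("ping"))
```

After the call, `log` shows the built-in interceptor entered first and exited last, with the user-defined interceptor nested inside it.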

BentoML currently only supports async interceptors (via grpc.aio.ServerInterceptor, as opposed to grpc.ServerInterceptor). This is because the BentoML gRPC server is an async implementation of a gRPC server.

Note

If you are using grpc.ServerInterceptor, you will need to migrate it over to use the new grpc.aio.ServerInterceptor in order to use this feature.

Feel free to reach out to us at #support on Slack

A toy implementation AppendMetadataInterceptor

from __future__ import annotations

import typing as t
import functools
import dataclasses
from typing import TYPE_CHECKING

from grpc import aio

if TYPE_CHECKING:
    from bentoml.grpc.types import Request
    from bentoml.grpc.types import Response
    from bentoml.grpc.types import RpcMethodHandler
    from bentoml.grpc.types import AsyncHandlerMethod
    from bentoml.grpc.types import HandlerCallDetails
    from bentoml.grpc.types import BentoServicerContext


@dataclasses.dataclass
class Context:
    usage: str
    accuracy_score: float


class AppendMetadataInterceptor(aio.ServerInterceptor):
    def __init__(self, *, usage: str, accuracy_score: float) -> None:
        self.context = Context(usage=usage, accuracy_score=accuracy_score)
        self._record: set[str] = set()

    async def intercept_service(
        self,
        continuation: t.Callable[[HandlerCallDetails], t.Awaitable[RpcMethodHandler]],
        handler_call_details: HandlerCallDetails,
    ) -> RpcMethodHandler:
        from bentoml.grpc.utils import wrap_rpc_handler

        handler = await continuation(handler_call_details)

        if handler and (handler.response_streaming or handler.request_streaming):
            return handler

        def wrapper(behaviour: AsyncHandlerMethod[Response]):
            @functools.wraps(behaviour)
            async def new_behaviour(
                request: Request, context: BentoServicerContext
            ) -> Response | t.Awaitable[Response]:
                self._record.update(
                    {f"{self.context.usage}:{self.context.accuracy_score}"}
                )
                resp = await behaviour(request, context)
                context.set_trailing_metadata(
                    tuple(
                        [
                            (k, str(v).encode("utf-8"))
                            for k, v in dataclasses.asdict(self.context).items()
                        ]
                    )
                )
                return resp

            return new_behaviour

        return wrap_rpc_handler(wrapper, handler)

To add your interceptors to an existing BentoService, use svc.add_grpc_interceptor:

from custom_interceptor import CustomInterceptor

svc.add_grpc_interceptor(CustomInterceptor)

Note

add_grpc_interceptor also supports partial classes as well as interceptors that take multiple arguments:

multiple arguments

from metadata_interceptor import AppendMetadataInterceptor

svc.add_grpc_interceptor(AppendMetadataInterceptor, usage="NLP", accuracy_score=0.867)

partial method

from functools import partial

from metadata_interceptor import AppendMetadataInterceptor

svc.add_grpc_interceptor(partial(AppendMetadataInterceptor, usage="NLP", accuracy_score=0.867))

Recommendations

gRPC is designed to be a high-performance framework for inter-service communication, which makes it a good fit for building microservices. The following are some recommendations we have for using gRPC for model serving:

Demystifying the misconception of gRPC vs. REST

You might stumble upon articles comparing gRPC to REST, and you might get the impression that gRPC is a better choice than REST when building services. This is not entirely true.

gRPC is built on top of HTTP/2, and it addresses some of the shortcomings of HTTP/1.1, such as head-of-line blocking <Head-of-line_blocking> and HTTP pipelining <HTTP_pipelining>. However, gRPC is not a replacement for REST, and it is not automatically the right choice for model serving. gRPC comes with its own set of trade-offs, such as:

  • Limited browser support: It is impossible to call a gRPC service directly from any browser. You will end up using tools such as gRPCUI <fullstorydev/grpcui> to interact with your service, or going through the hassle of implementing a gRPC client in your language of choice.
  • Binary protocol format: While Protobuf <protocolbuffers/protobuf> is efficient to send and receive over the wire, it is not human-readable. This means additional tooling is required for debugging and analyzing protobuf messages.
  • Knowledge gap: gRPC comes with its own concepts and learning curve, which requires teams to invest time in filling those knowledge gaps to use gRPC effectively. This often adds friction and can reduce development agility.
  • Lack of support for additional content types: gRPC depends on protobuf, so its content types are restrictive in comparison to the out-of-the-box support from HTTP+REST.

gRPC on HTTP/2 dives into how gRPC is built on top of HTTP/2, and this article goes into more detail on how HTTP/2 addresses the problems of HTTP/1.1.

For the HTTP/2 specification, see RFC 7540.

Should I use gRPC instead of REST for model serving?

Yes and no.

If your organization is already using gRPC for inter-service communication, using your Bento with gRPC is a no-brainer. You will be able to seamlessly integrate your Bento with your existing gRPC services without having to worry about the overhead of implementing grpc-gateway <grpc-ecosystem/grpc-gateway>.

However, if your organization is not using gRPC, we recommend that you keep using REST for model serving. REST is a well-known and well-understood protocol, so there is no knowledge gap for your team, which increases developer agility and shortens time to market.

Performance tuning

BentoML allows users to tune the performance of gRPC through the api_server.grpc section of bentoml_configuration.yaml <guides/configuration:Configuration>.

A quick overview of the available configuration for gRPC:

api_server:
  grpc:
    host: 0.0.0.0
    port: 3000
    max_concurrent_streams: ~
    maximum_concurrent_rpcs: ~
    max_message_length: -1
    reflection:
      enabled: false
    metrics:
      host: 0.0.0.0
      port: 3001

max_concurrent_streams

Definition: Maximum number of concurrent incoming streams to allow on an HTTP/2 connection.

By default we don't set a limit cap. HTTP/2 connections typically have a limit on the maximum number of concurrent streams on a connection at one time.

Some notes about fine-tuning max_concurrent_streams

Note that a gRPC channel uses a single HTTP/2 connection, and concurrent calls are multiplexed on that connection. When the number of active calls reaches the connection stream limit, any additional calls are queued in the client. Queued calls wait for active calls to complete before being sent. This means that applications with higher load and long-running streams could see performance degradation caused by queuing because of the limit.

Setting a limit cap on the number of concurrent streams will prevent this from happening, but it also means that you need to tune the limit cap to the right number.

  • If the limit cap is too low, you will sooner or later run into the issue mentioned above.
  • Not setting a limit cap is also NOT RECOMMENDED. Too many streams on a single HTTP/2 connection introduce thread contention between streams trying to write to the connection, and packet loss then causes all calls to be blocked.

Remarks: We recommend experimenting with the limit cap, starting with 100, and increasing it if needed.
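The queuing behaviour described in these notes can be simulated with asyncio. This is a simplified model and not gRPC itself: an `asyncio.Semaphore` stands in for the per-connection stream limit, `run_calls` is a hypothetical helper, and the sleep stands in for an RPC round trip.

```python
import asyncio


async def run_calls(num_calls: int, stream_limit: int) -> int:
    """Return the peak number of calls in flight under a stream limit."""
    slots = asyncio.Semaphore(stream_limit)
    in_flight = 0
    peak = 0

    async def call() -> None:
        nonlocal in_flight, peak
        async with slots:  # calls beyond the limit wait here, i.e. they are queued
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stands in for the RPC round trip
            in_flight -= 1

    await asyncio.gather(*(call() for _ in range(num_calls)))
    return peak


# Ten concurrent calls over a connection that allows three streams.
peak = asyncio.run(run_calls(num_calls=10, stream_limit=3))
```

With ten calls and a limit of three, no more than three calls are ever active at once, so the remaining calls wait their turn. A low limit therefore adds latency under load even though every call eventually completes.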

maximum_concurrent_rpcs

Definition: The maximum number of concurrent RPCs this server will service before returning RESOURCE_EXHAUSTED status.

By default this is set to None to indicate no limit, letting gRPC decide the limit.
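Unlike the stream limit, which queues extra calls, this setting rejects them. The following toy admission check sketches that difference. `RpcLimiter` is a hypothetical class for illustration and does not reflect how the gRPC server tracks RPCs internally.

```python
class RpcLimiter:
    """Toy admission check: reject new RPCs once the limit is reached."""

    def __init__(self, maximum_concurrent_rpcs=None):
        self.limit = maximum_concurrent_rpcs  # None means no limit
        self.active = 0

    def try_start(self) -> str:
        """Admit an RPC if there is capacity, otherwise report exhaustion."""
        if self.limit is not None and self.active >= self.limit:
            return "RESOURCE_EXHAUSTED"
        self.active += 1
        return "OK"

    def finish(self) -> None:
        """Mark one admitted RPC as completed."""
        self.active -= 1


limiter = RpcLimiter(maximum_concurrent_rpcs=2)
statuses = [limiter.try_start() for _ in range(3)]
```

The third call is rejected because two RPCs are still active. Once `limiter.finish()` is called, a new call would be admitted again. Clients that may hit this limit typically need retry logic for RESOURCE_EXHAUSTED.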

max_message_length

Definition: The maximum length in bytes of a message that can be received by or sent from the server.

By default this is set to -1 to indicate no limit. Setting message size limits via this option is a way to prevent gRPC from consuming excessive resources. By default, gRPC uses per-message limits to manage inbound and outbound messages.

Some notes about fine-tuning max_message_length

This option sets two values: grpc.max_receive_message_length <grpc/grpc/blob/e8df8185e521b518a8f608b8a5cf98571e2d0925/include/grpc/impl/codegen/grpc_types.h#L153> and grpc.max_send_message_length <grpc/grpc/blob/e8df8185e521b518a8f608b8a5cf98571e2d0925/include/grpc/impl/codegen/grpc_types.h#L159>.

#define GRPC_ARG_MAX_RECEIVE_MESSAGE_LENGTH "grpc.max_receive_message_length"

#define GRPC_ARG_MAX_SEND_MESSAGE_LENGTH "grpc.max_send_message_length"

By default, gRPC limits incoming messages to 4 MB and sets no limit on outgoing messages. We recommend setting this option only if you want to limit the size of outgoing messages. Otherwise, let gRPC determine the limit.
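The same two channel arguments also exist on the client side, so a client that expects large responses may need matching limits. The sketch below only builds the options list. `message_length_options` is a hypothetical helper and the 16 MiB value is an arbitrary example.

```python
def message_length_options(max_bytes: int) -> list:
    """Build gRPC channel arguments for the send and receive size limits."""
    return [
        ("grpc.max_receive_message_length", max_bytes),
        ("grpc.max_send_message_length", max_bytes),
    ]


options = message_length_options(16 * 1024 * 1024)  # 16 MiB, arbitrary example
# With grpcio installed, a client could pass these when opening a channel:
#   channel = grpc.insecure_channel("localhost:3000", options=options)
```

Keeping the limits in one helper makes it easier to apply the same values consistently across clients.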

We also recommend checking out the gRPC performance best practices guide.