docs: Inference graph example #3183

Merged 3 commits on Nov 3, 2022

127 changes: 127 additions & 0 deletions docs/source/guides/graph.rst
@@ -0,0 +1,127 @@
===============
Inference Graph
===============

Many ML problems require an ensemble of models working together to produce a solution. The BentoML architecture natively supports any model inference graph in its
:ref:`Service APIs <concepts/service:Service and APIs>` definition. Users can define parallel and sequential inference graphs with any control flow
by writing simple Python code. In this guide, we will build a text generation and classification service using a model inference graph. The project
source code can be found in the BentoML `inference graph <https://github.com/bentoml/BentoML/tree/main/examples/inference_graph>`_ example.

.. image:: ../_static/img/inference-graph-diagram.png

As illustrated in the diagram above, the service performs the following tasks.

1. Accepts a text input.
2. Passes the input to three text generation models in parallel and receives a list of three generated texts.
3. Passes the list of generated texts to a text classification model iteratively and receives a list of three classification results.
4. Returns the list of generated texts and classification results in a JSON output.

.. code-block:: json
    :caption: Sample JSON Output

    [
        {
            "generated": "I have an idea! Please share with, like and subscribe and leave a comment below!\n\nIf you like this post, please consider becoming a patron of Reddit or becoming a patron of the author.",
            "score": 0.5149753093719482
        },
        {
            "generated": "I have an idea! One that won't break your heart but will leave you gasping in awe. A book about the history of magic. And because what's better than magic? Some. It's a celebration of our ancient, universal gift of awe.\"\n\nThe result was the \"Vox Populi: A Memoir of the Ancient World\" by E.V. Okello (Ace Books), published in 1999.\n\nIn the past 20 years, Okello, professor of history at Ohio State University and author of such titles as \"The American Imagination\" and \"Walking With Elephants",
            "score": 0.502700924873352
        },
        {
            "generated": "I have an idea! I've been wondering what the name of this thing is. What's the point?\" - The Simpsons\n\n\n\"It's bigger, bigger than she needs!\" - SpongeBob SquarePants\n\n\n\"That's a funny thing. It's like my brain is the most gigantic living thing. I just like thinking big.\" - Simpsons\n\n\n\"Ooookay! Here comes Barty-Icarus himself! (pause)\" - A Christmas Tale\n\n\nBackground information Edit\n\nFormal name: Homer's Brain.\n\nHomer's Brain. Special name: Brain.\n\nAppearances Edit",
            "score": 0.536346971988678
        }
    ]


Declare Runners
###############

Create :ref:`Runners <concepts/runner:Using Runners>` for the three text generation models and the one text classification model using the ``to_runner`` function.

.. code-block:: python

    gpt2_generator = bentoml.transformers.get("gpt2-generation:latest").to_runner()
    distilgpt2_generator = bentoml.transformers.get("distilgpt2-generation:latest").to_runner()
    distilbegpt2_medium_generator = bentoml.transformers.get("gpt2-medium-generation:latest").to_runner()
    bert_base_uncased_classifier = bentoml.transformers.get("bert-base-uncased-classification:latest").to_runner()


Create Service
##############

Create a :ref:`Service <concepts/service:Service and APIs>` named ``inference_graph`` and specify the runners created earlier in the ``runners`` argument.

.. code-block:: python

    svc = bentoml.Service(
        "inference_graph",
        runners=[
            gpt2_generator,
            distilgpt2_generator,
            distilbegpt2_medium_generator,
            bert_base_uncased_classifier,
        ],
    )

Define API
##########

First, define an async :ref:`API <concepts/service:Service and APIs>` named ``classify_generated_texts`` that accepts a :ref:`Text <reference/api_io_descriptors:Texts>`
input and returns a :ref:`JSON <reference/api_io_descriptors:Structured Data with JSON>` output. Second, pass the input simultaneously to all three text generation
runners through ``asyncio.gather`` and receive a list of three generated texts. Using ``asyncio.gather`` with the Runners' ``async_run`` method allows the inferences to run
in parallel. Third, pass each generated text to the text classification runner in a loop to get its classification score.
Finally, return the generated texts and classification results as a list of dictionaries.

.. tip::

    Using asynchronous Service and Runner APIs achieves better performance and throughput for IO-intensive workloads.
    See :ref:`Sync vs Async APIs <concepts/service:Sync vs Async APIs>` for more details.


.. code-block:: python

    # Generation parameters, matching the values used in the example project
    MAX_LENGTH = 128
    NUM_RETURN_SEQUENCE = 1

    @svc.api(input=Text(), output=JSON())
    async def classify_generated_texts(original_sentence: str) -> list:
        generated_sentences = [
            result[0]["generated_text"]
            for result in await asyncio.gather(
                gpt2_generator.async_run(
                    original_sentence,
                    max_length=MAX_LENGTH,
                    num_return_sequences=NUM_RETURN_SEQUENCE,
                ),
                distilgpt2_generator.async_run(
                    original_sentence,
                    max_length=MAX_LENGTH,
                    num_return_sequences=NUM_RETURN_SEQUENCE,
                ),
                distilbegpt2_medium_generator.async_run(
                    original_sentence,
                    max_length=MAX_LENGTH,
                    num_return_sequences=NUM_RETURN_SEQUENCE,
                ),
            )
        ]

        results = []
        for sentence in generated_sentences:
            score = (await bert_base_uncased_classifier.async_run(sentence))[0]["score"]
            results.append(
                {
                    "generated": sentence,
                    "score": score,
                }
            )

        return results


Inference Graph Trace
#####################

The following tracing waterfall graph demonstrates the execution flow of the inference graph. Note that the three calls to the text generation
runners happen in parallel without blocking each other, while the calls to the text classification runner happen sequentially.

.. image:: ../_static/img/inference-graph-trace.png
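
To see why the waterfall takes this shape, the sketch below reproduces the same control flow with plain ``asyncio``, using ``asyncio.sleep`` as a stand-in
for the runner calls (the timings and names are illustrative only and are not part of the example project). The gathered "generation" awaits overlap, so the
first stage costs roughly as long as the slowest model, while the "classification" awaits inside the loop add up sequentially.

.. code-block:: python

    import asyncio
    import time


    async def fake_generate(name: str, seconds: float) -> str:
        # Stand-in for a text generation runner call.
        await asyncio.sleep(seconds)
        return f"text from {name}"


    async def fake_classify(text: str) -> float:
        # Stand-in for the text classification runner call.
        await asyncio.sleep(0.3)
        return 0.5


    async def main() -> None:
        start = time.perf_counter()

        # Parallel stage: total time is roughly the slowest call, not the sum.
        texts = await asyncio.gather(
            fake_generate("gpt2", 1.0),
            fake_generate("distilgpt2", 0.6),
            fake_generate("gpt2-medium", 1.2),
        )
        print(f"generation stage: {time.perf_counter() - start:.1f}s")  # ~1.2s

        # Sequential stage: each classification waits for the previous one.
        for text in texts:
            await fake_classify(text)
        print(f"total: {time.perf_counter() - start:.1f}s")  # ~2.1s


    asyncio.run(main())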
1 change: 1 addition & 0 deletions docs/source/guides/index.rst
@@ -17,6 +17,7 @@ into this part of the documentation.
client
server
configuration
graph
logging
metrics
performance
144 changes: 110 additions & 34 deletions examples/inference_graph/README.md
@@ -3,9 +3,10 @@
This is a sample project demonstrating a model inference graph in [BentoML](https://github.com/bentoml)
with Hugging Face Transformers.

In this project, we will download three pretrained models, save them as three text classification
Transformers pipelines, build a text classification service via an HTTP server, and containerize the
service as a docker image for production deployment.
In this project, we will download and save three pretrained text generation models and a pretrained text classification model
to the model store. We will then build a service that accepts a text input, passes the input to the three text generation models,
classifies each generated paragraph with the classification model, and returns all three generated paragraphs with their classification
scores. The service will be served via HTTP and containerized as a Docker image for production deployment.

### Install Dependencies

@@ -16,7 +17,7 @@ pip install -r ./requirements.txt

### Model Training

First step, create and save three text classification pipelines from three different BERT models:
Create and save three text generation models and one text classification model.

```python
import bentoml
@@ -25,15 +26,34 @@ import transformers

if __name__ == "__main__":
    # Create Transformers pipelines from pretrained models
    pipeline1 = transformers.pipeline(task="text-classification", model="bert-base-uncased", tokenizer="bert-base-uncased")
    pipeline2 = transformers.pipeline(task="text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")
    pipeline3 = transformers.pipeline(task="text-classification", model="ProsusAI/finbert")
    generation_pipeline_1 = transformers.pipeline(
        task="text-generation",
        model="gpt2",
    )
    generation_pipeline_2 = transformers.pipeline(
        task="text-generation",
        model="distilgpt2",
    )
    generation_pipeline_3 = transformers.pipeline(
        task="text-generation",
        model="gpt2-medium",
    )

    # Save models to BentoML local model store
    bentoml.transformers.save_model("bert-base-uncased", pipeline1)
    bentoml.transformers.save_model("distilbert-base-uncased-finetuned-sst-2-english", pipeline2)
    bentoml.transformers.save_model("prosusai-finbert", pipeline3)
    classification_pipeline = transformers.pipeline(
        task="text-classification",
        model="bert-base-uncased",
        tokenizer="bert-base-uncased",
    )

    # Save models to BentoML local model store
    m0 = bentoml.transformers.save_model("gpt2-generation", generation_pipeline_1)
    m1 = bentoml.transformers.save_model("distilgpt2-generation", generation_pipeline_2)
    m2 = bentoml.transformers.save_model(
        "gpt2-medium-generation", generation_pipeline_3
    )
    m3 = bentoml.transformers.save_model(
        "bert-base-uncased-classification", classification_pipeline
    )
```

This will save the models in the BentoML local model store; new version tags are automatically
@@ -51,12 +71,12 @@ To verify that the saved model can be loaded correctly, run the following:
```python
import bentoml

pipeline = bentoml.transformers.load_model("bert-base-uncased:latest")
pipeline = bentoml.transformers.load_model("gpt2-generation:latest")

pipeline("You look great today!")
pipeline("I have an idea!")
```

In BentoML, the recommended way of running ML model inference in serving is via Runner, which
In BentoML, the recommended way of running ML model inference in serving is via Runners, which
gives BentoML more flexibility in scheduling the inference computation, batching inference requests,
and taking advantage of hardware resoureces available. Saved models can be loaded as Runner instance as
shown below:
@@ -65,40 +85,93 @@ shown below:
import bentoml

# Create a Runner instance:
bert_runner = bentoml.transformers.get("bert-base-uncased:latest").to_runner()
bert_runner = bentoml.transformers.get("gpt2-generation:latest").to_runner()

# Runner#init_local initializes the model in the current process; this is meant for development and testing only:
bert_runner.init_local()

# This should yield the same result as the loaded model:
bert_runner.run("You look great today!")
bert_runner.run("I have an idea!")
```


### Serving the model

A simple BentoML Service that serves the model saved above look like this:
The service definition below achieves the inference graph logic described above.

First, we create three text generation runners and one text classification runner with the `to_runner` function
from the models we previously saved. Second, we create a `bentoml.Service` named "inference_graph" and pass in
the four runner instances. Lastly, we define an async API with `@svc.api` that accepts a `Text` input and returns a `JSON` output. The API
passes the input simultaneously to all three text generation models through `asyncio.gather` and iteratively passes
the generated paragraphs to the text classification model. The API returns all three generated paragraphs and their
corresponding classification scores as a list of dictionaries.

```python
import asyncio
import bentoml

from bentoml.io import Text, JSON
from statistics import median

bert_runner = bentoml.transformers.get("bert-base-uncased:latest").to_runner()
distilbert_runner = bentoml.transformers.get("distilbert-base-uncased-finetuned-sst-2-english:latest").to_runner()
finbert_runner = bentoml.transformers.get("prosusai-finbert:latest").to_runner()
import bentoml
from bentoml.io import JSON
from bentoml.io import Text

gpt2_generator = bentoml.transformers.get("gpt2-generation:latest").to_runner()
distilgpt2_generator = bentoml.transformers.get(
    "distilgpt2-generation:latest"
).to_runner()
distilbegpt2_medium_generator = bentoml.transformers.get(
    "gpt2-medium-generation:latest"
).to_runner()
bert_base_uncased_classifier = bentoml.transformers.get(
    "bert-base-uncased-classification:latest"
).to_runner()

svc = bentoml.Service(
"inference_graph",
runners=[
gpt2_generator,
distilgpt2_generator,
distilbegpt2_medium_generator,
bert_base_uncased_classifier,
],
)


MAX_LENGTH = 128
NUM_RETURN_SEQUENCE = 1

svc = bentoml.Service("inference_graph", runners=[bert_runner, distilbert_runner, finbert_runner])

@svc.api(input=Text(), output=JSON())
async def classify(input_data: str) -> dict:
    results = await asyncio.gather(
        bert_runner.async_run(input_data),
        distilbert_runner.async_run(input_data),
        finbert_runner.async_run(input_data),
    )
async def classify_generated_texts(original_sentence: str) -> list:
    generated_sentences = [
        result[0]["generated_text"]
        for result in await asyncio.gather(
            gpt2_generator.async_run(
                original_sentence,
                max_length=MAX_LENGTH,
                num_return_sequences=NUM_RETURN_SEQUENCE,
            ),
            distilgpt2_generator.async_run(
                original_sentence,
                max_length=MAX_LENGTH,
                num_return_sequences=NUM_RETURN_SEQUENCE,
            ),
            distilbegpt2_medium_generator.async_run(
                original_sentence,
                max_length=MAX_LENGTH,
                num_return_sequences=NUM_RETURN_SEQUENCE,
            ),
        )
    ]

    results = []
    for sentence in generated_sentences:
        score = (await bert_base_uncased_classifier.async_run(sentence))[0]["score"]
        results.append(
            {
                "generated": sentence,
                "score": score,
            }
        )

    return results
```

@@ -112,13 +185,15 @@ You may also send requests with the `curl` command or any HTTP client, e.g.:

```bash
curl -X 'POST' \
'http://127.0.0.1:3000/classify' \
'http://0.0.0.0:3000/classify_generated_texts' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d 'You look great today!'
-H 'Content-Type: text/plain' \
-d 'I have an idea!'
```
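
Alternatively, you can call the endpoint from Python. The snippet below is a minimal sketch that assumes the third-party `requests` package is installed:

```python
import requests

# Send a plain-text prompt and print the generated paragraphs with their scores.
response = requests.post(
    "http://0.0.0.0:3000/classify_generated_texts",
    headers={"Content-Type": "text/plain"},
    data="I have an idea!",
)
print(response.json())
```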




### Build Bento for deployment

Bento is the distribution format in BentoML which captures all the source code, model files, config
@@ -137,6 +212,7 @@ include:
python:
  packages:
    - transformers
    - torch
```

Next, run `bentoml build` from the current directory to start the Bento build: