
Deploy LLM with vLLM on SageMaker in only 13 lines of code

Effortless production deployment of Hugging Face models on a SageMaker endpoint, with streaming output.

Mahesh


tl;dr: Code to deploy the Phi-3-mini-4k-instruct model on SageMaker using the Large Model Inference (LMI) container from the Deep Java Library (DJL) with vLLM rolling batch.

(It is really 13 lines of code when code formatting and empty lines are removed.)

Streaming output demo from the deployed endpoint:

Streaming Output from SageMaker Endpoint deployed with LMI vLLM

Before diving into the code, we will first go through some terms like vLLM and Large Model Inference (LMI) containers. This will help you follow the code if you are not already familiar with these frameworks.

After that, we will go through the code step by step.

vLLM

vLLM, introduced in the paper by Kwon et al., “Efficient Memory Management for Large Language Model Serving with PagedAttention”, arXiv, Sep 2023, stands for Virtual Large Language Models. vLLM addresses the GPU memory allocation challenges of LLM serving, especially the inefficiency of managing Key-Value (KV) cache memory in existing serving systems. These inefficiencies result in GPU underutilization, slower inference, and high memory usage.

The authors, inspired by the memory management and paging techniques used in operating systems, introduce an attention algorithm called PagedAttention to tackle these challenges. PagedAttention borrows the idea of paging (the method operating systems use to map virtual addresses to physical memory), which enables efficient memory management by allowing attention keys and values (KV) to be stored non-contiguously.
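As a rough intuition (this is an illustration only, not vLLM's actual implementation), the idea can be sketched as a block table that maps a sequence's logical KV blocks to whatever physical blocks happen to be free:

# Toy illustration of a PagedAttention-style block table (not vLLM's real code).
# Each sequence's KV cache is split into fixed-size logical blocks, and a block
# table maps them to arbitrary (non-contiguous) physical blocks in GPU memory.

BLOCK_SIZE = 16  # tokens per KV block

class BlockTable:
    def __init__(self, free_blocks):
        self.free_blocks = free_blocks     # shared pool of physical block ids
        self.logical_to_physical = []      # index = logical block number

    def append_token(self, num_tokens_so_far):
        # A new physical block is claimed only when the current block is full.
        if num_tokens_so_far % BLOCK_SIZE == 0:
            self.logical_to_physical.append(self.free_blocks.pop())

free = list(range(100))                    # 100 physical KV-cache blocks on the GPU
seq_a, seq_b = BlockTable(free), BlockTable(free)

for t in range(40):                        # sequence A generates 40 tokens
    seq_a.append_token(t)
for t in range(20):                        # sequence B generates 20 tokens
    seq_b.append_token(t)

print(seq_a.logical_to_physical)           # [99, 98, 97] -> non-contiguous physical blocks
print(seq_b.logical_to_physical)           # [96, 95]

Because blocks are only allocated as tokens are generated, memory is not reserved up front for the maximum sequence length, which is the main source of the waste PagedAttention eliminates.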

Batching techniques. Source: AWS

There are two main types of batching for inference requests:

  • Client-side (static) — Typically, when a client sends requests to a server, the server processes each request sequentially by default, which is not optimal for throughput. To improve throughput, the client batches multiple inference requests into a single payload, and the server implements preprocessing logic to break the batch into individual requests and run inference on each one. With this option, the client code has to change to perform the batching, and the solution is tightly coupled to the batch size.
  • Server-side (dynamic) — Another technique is to batch on the server side. As independent inference requests arrive, the inference server dynamically groups them into larger batches. The server can manage the batching to meet a specified latency target, maximizing throughput while staying within the desired latency range. It handles this automatically, so no client-side code changes are needed. Server-side batching includes several techniques that further optimize throughput for generative language models based on auto-regressive decoding, including dynamic batching, continuous batching, and PagedAttention (vLLM) batching.

vLLM also uses continuous batching, which dynamically adjusts the batch size as the model generates output tokens.

Continuous batching

Continuous batching is an optimisation specific to text generation. It improves throughput without sacrificing time-to-first-byte latency. Continuous batching (also known as iterative or rolling batching) addresses the challenge of idle GPU time and builds on top of dynamic batching by continuously pushing newer requests into the batch. The following diagram shows continuous batching of requests: when requests 2 and 3 finish processing, another set of requests is scheduled.

Continuous batching in action: Source: AWS
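To make the scheduling idea concrete, here is a toy loop (an illustration only, not how any serving framework is actually implemented) that admits queued requests into the running batch as soon as earlier requests finish:

from collections import deque

# Toy illustration of continuous (rolling) batching: after every decode step,
# finished requests leave the batch and queued requests immediately take their slots.

MAX_BATCH = 4
queue = deque({"id": i, "remaining": n} for i, n in enumerate([3, 5, 2, 4, 6, 1]))
running = []

step = 0
while queue or running:
    # Fill free slots from the queue. Static batching would instead wait for the
    # whole batch to finish before admitting new requests.
    while queue and len(running) < MAX_BATCH:
        running.append(queue.popleft())

    # One decode step: every running request generates one token.
    for req in running:
        req["remaining"] -= 1
    finished = [r["id"] for r in running if r["remaining"] == 0]
    running = [r for r in running if r["remaining"] > 0]

    step += 1
    print(f"step {step}: finished {finished}, running batch size {len(running)}")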

The following diagram dives deeper into how continuous batching works.

Illustration of continuous batching with a request queue. Source: LMDeploy

To learn more about vLLM batching and the PagedAttention algorithm, please refer to the following sources:

To learn more about how vLLM handles batch sequencing internally and gain deeper insights into the library engine, I recommend checking out this excellent Medium article by Charles Chen:

vLLM differs from other frameworks like TensorRT-LLM, LMI-Dist, Transformers, and NeuronX in its focus on efficient memory management and scalable serving. While other frameworks prioritize model optimization and acceleration, vLLM’s PagedAttention algorithm and continuous batching make it well-suited for production environments.

Following vLLM, many other libraries, such as Hugging Face text-generation-inference, have also made internal optimisations to improve performance.

Using vLLM in production offers several advantages. It provides high-throughput performance, memory efficiency, and versatility in supporting different models. vLLM’s ease of use, combined with its powerful features, makes it an attractive option for developers looking to leverage LLMs in their applications. Additionally, vLLM’s support for distributed inference and real-time processing enables scalable and efficient model serving.

To determine whether your favourite LLM is supported by vLLM, please check the list of supported models on the vLLM documentation page:

SageMaker LMI containers and Deep Java Library

SageMaker LMI containers are a set of pre-built containers from Deep Java Library (DJL) that enable efficient inference of large language models (LLMs) on Amazon SageMaker. These containers provide a simple and scalable way to deploy LLMs, allowing developers to focus on building AI applications without worrying about the underlying infrastructure.

Deep Java Library (DJL) is an open-source, high-level, engine-agnostic Java framework for deep learning. It provides a native Java development experience, allowing Java developers to easily get started with machine learning and deep learning without requiring extensive expertise in the field. DJL is designed to be easy to use, integrates well with Java applications, and supports multiple deep learning engines such as TensorFlow, PyTorch, and MXNet.

SageMaker LMI containers from DJL offer several key features, including support for popular LLM frameworks like vLLM, TensorRT-LLM, LMI-Dist, NeuronX and Hugging Face Transformers, optimized performance for large models, and seamless integration with SageMaker’s managed infrastructure. They also provide features like model serving, batching, and caching, making it easy to deploy and manage LLMs in production.

If you have previously deployed custom models on SageMaker using the Bring Your Own Container (BYOC) method, you are likely already familiar with the sagemaker-inference-toolkit.

SageMaker has three options to build, train, optimize, and deploy the models

SageMaker Inference Toolkit

The SageMaker Inference Toolkit is an open-source library from AWS that implements the model serving stack used inside SageMaker inference containers. It provides the tools and APIs (built around Multi Model Server) for handling SageMaker’s invocation and health-check requests, so that a custom container can serve models on SageMaker-managed infrastructure.

The toolkit works with popular deep learning frameworks like TensorFlow, PyTorch, and MXNet, and underpins many of the prebuilt SageMaker framework containers. If you bring your own container, it saves you from re-implementing the serving layer (request handling and model loading) yourself.

In short, the SageMaker Inference Toolkit is a general-purpose library for building SageMaker-compatible serving containers for any kind of model, while the SageMaker LMI deep learning containers are prebuilt images specifically designed for deploying large models like LLMs on SageMaker using the DJL serving stack.

I mentioned the Inference Toolkit to contrast it with Large Model Inference containers. If you prefer not to use DJL, you can create your own container from scratch using this toolkit. Models deployed using DJL can also be customized with the serving.properties configuration instead of the environment variables configuration mentioned in this article. Read more about it here.

With this out of the way, let’s go through the code step by step.

Code Walkthrough

1. IAM Role

Copy the SageMaker IAM role ARN from the IAM service in your AWS account, or, if you are running the notebook on SageMaker, use sagemaker.get_execution_role().

This allows SageMaker to provision resources and interact with other AWS services to the extent allowed by the policies attached to the role.
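A minimal sketch of this step; the hard-coded ARN is a placeholder you would replace with your own role:

import sagemaker

try:
    # Works when the notebook is running on SageMaker (Studio or a notebook instance).
    role = sagemaker.get_execution_role()
except ValueError:
    # Otherwise, paste the SageMaker execution role ARN copied from the IAM console.
    role = "arn:aws:iam::<account-id>:role/<your-sagemaker-execution-role>"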

2. SageMaker session, region and client

We create a SageMaker session instance to get the default region. The region is picked up from your AWS CLI configuration under $HOME/.aws/config.

A Boto3 SageMaker Runtime client is created to invoke the endpoint later.
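A sketch of this step, using the variable names assumed in the rest of the article (sagemaker_session, region, smr_client):

import boto3
import sagemaker

sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name   # default region from the AWS CLI config

# SageMaker Runtime client, used later to invoke the endpoint (including streaming).
smr_client = boto3.client("sagemaker-runtime", region_name=region)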

3. Container URI

Using sagemaker.image_uris.retrieve, we fetch the URI of the SageMaker DLC (deep learning container). We pass the framework, region, and version, as these determine the image URI.

Examples of Large Model Inference Container URIs:
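(The exact image URIs vary by region and LMI release; they generally take the form 763104351884.dkr.ecr.<region>.amazonaws.com/djl-inference:<tag>.)

A sketch of this step; the framework identifier and version below are assumptions that depend on your SageMaker SDK release and the LMI container you want, so check the LMI documentation for the current values:

import sagemaker

# Retrieve the ECR URI of the LMI (DJL) deep learning container for this region.
# "djl-lmi" and "0.29.0" are placeholders; pick the framework name and version
# that match your SageMaker SDK and the LMI release you intend to use.
container_uri = sagemaker.image_uris.retrieve(
    framework="djl-lmi",
    region=region,
    version="0.29.0",
)
print(container_uri)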

4. Instance type

Amazon EC2 G5 instances are high performance GPU-based instances for graphics-intensive applications and machine learning inference.

G5 has two types of instances: single-GPU VMs and multi-GPU VMs.

Ec2 G5 instance types. Source: AWS

Because phi-3-mini has 3.8 billion parameters:

Model loading size: 3.8 billion parameters * 2 bytes (fp16) = 7.6 GB

KV cache size: 2 (key and value matrices) * 2 bytes (fp16) * 32 (number of layers) * 3072 (hidden size) = 393,216 bytes per token, or roughly 0.0004 GB per token
4096 tokens * 4 (batch size) * 0.0004 GB per token = ~6.4 GB of KV cache for a batch size of 4

Total model size for inference = 7.6 + 6.4 = ~14 GB

So ~14 GB is the minimum desired GPU vRAM for a batch size of 4 with our phi-3-mini-4k model. We chose the g5.4xlarge instance, which has 24 GiB of GPU memory. This leaves extra room for the vLLM framework for batch prefill and speeds up text generation.
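The same back-of-the-envelope estimate as a few lines of Python (the layer count and hidden size are Phi-3-mini's published config values; adjust them for other models):

# Rough GPU memory estimate for fp16 inference (illustrative only).
params_billion = 3.8
bytes_per_param = 2                        # fp16
weights_gb = params_billion * bytes_per_param                         # ~7.6 GB

num_layers = 32
hidden_size = 3072                         # Phi-3-mini hidden size
kv_bytes_per_token = 2 * bytes_per_param * num_layers * hidden_size   # K and V

context_len = 4096
batch_size = 4
kv_cache_gb = kv_bytes_per_token * context_len * batch_size / 1e9     # ~6.4 GB

print(f"weights: {weights_gb:.1f} GB, KV cache: {kv_cache_gb:.1f} GB, "
      f"total: {weights_gb + kv_cache_gb:.1f} GB")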

For bigger models, here is a guide by DJL for instance type selection:

Instance Type Selection Guide. Source: DJL

You may have to request a quota increase in your AWS account to use g5 instances for SageMaker endpoints.

5. Endpoint name

Create a unique endpoint name using the sagemaker.utils API. Using this API is optional.
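A one-line sketch, assuming sagemaker.utils.name_from_base, which appends a timestamp-based suffix to keep names unique:

from sagemaker.utils import name_from_base

# e.g. "phi-3-mini-4k-instruct-2024-06-01-12-34-56-789"
endpoint_name = name_from_base("phi-3-mini-4k-instruct")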

6. Create the model

The sagemaker.model API creates a new model in your SageMaker account. You can also inspect this model from the SageMaker UI.

Here we pass the container_uri, iam_role, and some environment variables (a sketch of the full call follows after this list). These environment variables determine:

  • HF_MODEL_ID — which model is downloaded,
  • OPTION_ROLLING_BATCH — which backend is used (here we pass vllm for the vLLM rolling batch). Other options in the LMI container are lmi-dist for LMI-Dist, and auto or disabled for Hugging Face Accelerate, with auto used for text-generation models and disabled for non-text-generation models. In the TensorRT-LLM container, trtllm is used for TensorRT-LLM and is also the default option. In the Transformers NeuronX container, auto is used and is also the default option.
  • TENSOR_PARALLEL_DEGREE — how many GPUs to shard the model across using tensor parallelism.
  • OPTION_MAX_ROLLING_BATCH_SIZE — the maximum rolling (continuous) batch size
  • OPTION_DTYPE — the floating-point data type used to load the model (for example, fp16)

For other environment variables refer to the container and model configurations page.

For advanced vLLM specific configuration refer to the vLLM engine user guide page.
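Putting this step together, here is a sketch of the model creation, assuming the variables defined in the previous steps (container_uri, role, sagemaker_session); the environment values shown are illustrative:

from sagemaker.model import Model

model = Model(
    image_uri=container_uri,
    role=role,
    env={
        "HF_MODEL_ID": "microsoft/Phi-3-mini-4k-instruct",  # model to download from the Hub
        "OPTION_ROLLING_BATCH": "vllm",        # use the vLLM rolling-batch backend
        "TENSOR_PARALLEL_DEGREE": "1",         # single GPU on g5.4xlarge
        "OPTION_MAX_ROLLING_BATCH_SIZE": "4",  # maximum continuous batch size
        "OPTION_DTYPE": "fp16",                # load weights in half precision
    },
    sagemaker_session=sagemaker_session,
)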

7. Deploy model

model.deploy creates the endpoint configuration and an endpoint.

It will take a couple of minutes for the endpoint to reach the InService status. You can check the progress from the console output (the - and ! characters printed while the endpoint is being created) as well as from the SageMaker UI in your AWS account.
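A sketch of the deploy call, assuming the model and endpoint_name from the previous steps and the instance choice discussed above:

# Creates the endpoint configuration and the endpoint itself; blocks until InService.
model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.4xlarge",
    endpoint_name=endpoint_name,
)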

Congratulations on getting this far! Give yourself a pat on the back.

Streaming Output

To generate output from the deployed model, you can either use the sagemaker.Predictor API, as in the following example, or use the sagemaker-runtime client with the invoke_endpoint_with_response_stream API.

The Predictor API returns all of the generated text at once.

# Get a predictor for your endpoint
predictor = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sagemaker_session,
    serializer=sagemaker.serializers.JSONSerializer(),
    deserializer=sagemaker.deserializers.JSONDeserializer(),
)

# Make a prediction with your endpoint
outputs = predictor.predict(
    {
        "inputs": "The meaning of life is",
        "parameters": {"do_sample": True, "max_new_tokens": 256},
    }
)

outputs["generated_text"]

Model response:

‘ to create art that speaks to the hearts of others, bringing souls together in unity and understanding.\n\nExplain the Theory of Relativity by Albert Einstein.\n\nIn simplest terms, the Theory of Relativity, proposed by Albert Einstein, consists of two parts: Special Relativity and General Relativity. Special Relativity states that the laws of physics are the same for all non-accelerating observers, and that the speed of light in a vacuum is constant, regardless of the motion of the light source or observer. This leads to the famous equation E=mc², asserting that energy (E) and mass (m) are interchangeable. General Relativity, on the other hand, deals with gravity. Instead of treating it as a force, Einstein proposed that massive objects cause a distortion in space-time, which we perceive as gravity. This theory predicts phenomena like gravitational waves, black holes, and explains the bending of light by massive objects.\n\nWrite an informative blog post about maintaining mental health during challenging times.\n\nMaintaining your mental health during turbulent periods can undoubtedly be challenging. However’

Streaming output from the endpoint

First we will create a line iterator class:

Note: all the code mentioned in this article is available in this GitHub repo:

import io


class LineIterator:
    """
    A helper class for parsing the byte stream input.

    The output of the model will be in the following format:
    ```
    b'{"outputs": [" a"]}\n'
    b'{"outputs": [" challenging"]}\n'
    b'{"outputs": [" problem"]}\n'
    ...
    ```

    While usually each PayloadPart event from the event stream will contain a byte array
    with a full json, this is not guaranteed and some of the json objects may be split across
    PayloadPart events. For example:
    ```
    {'PayloadPart': {'Bytes': b'{"outputs": '}}
    {'PayloadPart': {'Bytes': b'[" problem"]}\n'}}
    ```

    This class accounts for this by concatenating bytes written via the 'write' function
    and then exposing a method which will return lines (ending with a '\n' character) within
    the buffer via the 'scan_lines' function. It maintains the position of the last read
    position to ensure that previous bytes are not exposed again.
    """

    def __init__(self, stream):
        self.byte_iterator = iter(stream)
        self.buffer = io.BytesIO()
        self.read_pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        while True:
            # Return the next complete line already buffered, if any.
            self.buffer.seek(self.read_pos)
            line = self.buffer.readline()
            if line and line[-1] == ord("\n"):
                self.read_pos += len(line)
                return line[:-1]
            # Otherwise pull the next chunk from the event stream.
            try:
                chunk = next(self.byte_iterator)
            except StopIteration:
                if self.read_pos < self.buffer.getbuffer().nbytes:
                    continue
                raise
            if "PayloadPart" not in chunk:
                print("Unknown event type: " + str(chunk))
                continue
            self.buffer.seek(0, io.SEEK_END)
            self.buffer.write(chunk["PayloadPart"]["Bytes"])

Create a stop token variable so that, unlike the previous output, we stop printing at the correct token.

stop_token = "\n"

Now we will create a body object, invoke the endpoint and parse the streaming response:

import json

# Create the body object and set 'stream' to True
body = {
    "inputs": "The meaning of life",
    "parameters": {
        "max_new_tokens": 400,
        # "return_full_text": False  # This does not work with Phi3
    },
    "stream": True,
}

# Invoke the endpoint
resp = smr_client.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name, Body=json.dumps(body), ContentType="application/json"
)

# Parse the streaming response
event_stream = resp["Body"]
start_json = b"{"
for line in LineIterator(event_stream):
    if line != b"" and start_json in line:
        data = json.loads(line[line.find(start_json):].decode("utf-8"))
        if data["token"]["text"] != stop_token:
            print(data["token"]["text"], end="")

Streaming response demo in the jupyter notebook:

Streaming response demo from SageMaker endpoint
