How to deploy an LLM chatbot

Four ways to productionize an LLM chatbot and bring it to your customers

11 min read · Aug 15, 2023

Numerous libraries are available to transition your LLM chatbot from development to production.

In this article, I will discuss some of the popular methods for achieving this, based on the current trends at the time of writing.

Option 1 — Adding a QA bot to your company’s Discord server

This approach is beneficial for internal feedback and when your customers are also within the Discord ecosystem.

Demo of LLM via a discord bot

Moreover, elements of this solution can also be adapted for a standard chatbot on your website. Simply invoke the API Gateway from your web application instead of Discord.

Option 2 — Hugging ChatUI for a plug-and-play solution

Bring your own model or use a SageMaker endpoint and start chatting with ChatUI.

Demo of HuggingFace ChatUI of custom LLM

Option 3 — Using Gradio

While it offers quicker development, careful consideration is needed before implementing it in a production environment.

Gradio Streaming Chatbot Demo

Option 4 — Streamlit HugChat plugin

Recommended only if you’re already working with Streamlit; otherwise, it’s not an optimal solution solely for a chatbot.

Streamlit chatbot demo powered by HugChat creds

In a production environment, the solution generally comprises three main components:

  1. Converting your company’s knowledge base into embeddings and storing them.
  2. Converting input questions into embeddings and efficiently retrieving nearest neighbor embeddings from a vector store.
  3. Providing context and questions to an LLM and delivering the output to the user.
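As a toy illustration of these three steps, here is a minimal sketch in Python. The bag-of-words “embedding” and the in-memory list are stand-ins for a real embedding model and vector store, and the helper names (embed, retrieve, build_prompt) are my own:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for a real embedding model (e.g. a SageMaker embedding endpoint)
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse "embeddings"
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# 1. Embed the knowledge base and "store" it (a real setup uses a vector store)
knowledge_base = [
    "Refunds are processed within 5 business days.",
    "Support is available 24/7 via chat.",
]
index = [(doc, embed(doc)) for doc in knowledge_base]

def retrieve(question: str) -> str:
    # 2. Embed the question and find the nearest-neighbour document
    q = embed(question)
    return max(index, key=lambda pair: cosine(q, pair[1]))[0]

def build_prompt(question: str) -> str:
    # 3. Provide context + question to the LLM (the prompt template is arbitrary)
    context = retrieve(question)
    return f"Context: {context}\nQuestion: {question}\nAnswer:"
```

In production, `embed` would call an embedding model and `retrieve` would query a vector store; only the overall shape of the pipeline is the point here.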

This article primarily focuses on discussing how to present the output to the user. For the first two steps, I have provided a more detailed article, which you can explore here:

Option 1 — Adding an LLM-powered QA bot to Discord

System Design

1. Create a Discord bot and add it to your server

In another article, I delve into more detail about creating a bot and integrating it into a server. You can find the article here:

a. Navigate to the Discord Developer Portal

b. Click on “New Application”

c. Give it a suitable name and click the “Create” button

Once the app is created, copy the “APPLICATION ID” (APP_ID) and “PUBLIC KEY” (PUBLIC_KEY) for later use.

Under the “Bot” menu item, click the “RESET TOKEN” button and copy the token. Store this as BOT_TOKEN for later use.

Under the “OAuth2” parent menu item, select the “URL Generator” child menu item, check the application.commands checkbox, and copy the “GENERATED URL” using the “Copy” button.

Paste the URL into a browser that is logged in to your Discord account and has access to the server you want to add this bot to.

Adding QA Bot to your Server (Guild)

2. Register commands

We’ll utilize slash commands. If you’re well-versed with the Discord API, feel free to explore alternatives to this approach.

Run the following code from any environment that has Python and the requests package installed.

We are using guild commands because they register faster than global commands; the command will be invoked via /qa input in Discord.

Don’t forget to replace APP_ID, BOT_TOKEN, and SERVER_ID (right-click the server and click “Copy Server ID”) with your values.

Register this command via:
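A minimal sketch of the registration call, assuming the /qa command with a single string input option described above (the build_request helper and placeholder values are my own):

```python
# Sketch: register a /qa guild slash command via the Discord HTTP API.
# Replace the placeholders with the values you copied earlier.
APP_ID = "YOUR_APP_ID"
BOT_TOKEN = "YOUR_BOT_TOKEN"
SERVER_ID = "YOUR_SERVER_ID"

def build_request(app_id: str, guild_id: str, bot_token: str):
    # Guild commands (unlike global ones) show up in the server almost instantly
    url = f"https://discord.com/api/v10/applications/{app_id}/guilds/{guild_id}/commands"
    headers = {"Authorization": f"Bot {bot_token}"}
    payload = {
        "name": "qa",
        "description": "Ask the LLM a question",
        "options": [
            {
                "name": "input",
                "description": "Your question",
                "type": 3,  # 3 = STRING option
                "required": True,
            }
        ],
    }
    return url, headers, payload

if __name__ == "__main__":
    import requests  # the only non-stdlib dependency

    url, headers, payload = build_request(APP_ID, SERVER_ID, BOT_TOKEN)
    response = requests.post(url, headers=headers, json=payload)
    print(response.status_code, response.json())
```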


The response:

Register Command response

3. Create a Lambda to handle interactions (and an API Gateway trigger)

a. Create a new Lambda function with the Python 3.9 runtime, and select the architecture that matches the machine you will build the layer on, because we will be adding a layer to this function.

b. Lambda code:

QA Bot interaction handler lambda code

The code is straightforward if you’ve worked with Discord bot slash commands before. Some functions might appear redundant, but they can be useful if you plan to expand upon this project in the future.

In the lambda_handler function, the t == 1 branch responds to Discord’s ping (PING interactions must be answered with a PONG for endpoint verification).

In the command_handler function, we simply invoke the SageMaker endpoint after parsing the question from the body object.
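A minimal sketch of what this handler can look like, assuming the PyNaCl layer from step 3f, the PUBLIC_KEY/ENDPOINT_NAME environment variables, and the JumpStart Llama 2 chat payload format (the helper names here are my own, and the original’s t variable is spelled out as body["type"]):

```python
import json
import os

def verify_signature(event) -> bool:
    # PyNaCl comes from the layer added in step 3f; imported lazily so the
    # pure helpers below stay importable without it
    from nacl.exceptions import BadSignatureError
    from nacl.signing import VerifyKey

    verify_key = VerifyKey(bytes.fromhex(os.environ["PUBLIC_KEY"]))
    signature = event["headers"]["x-signature-ed25519"]
    timestamp = event["headers"]["x-signature-timestamp"]
    try:
        verify_key.verify((timestamp + event["body"]).encode(), bytes.fromhex(signature))
        return True
    except BadSignatureError:
        return False

def parse_question(body: dict) -> str:
    # /qa has a single required string option named "input"
    return body["data"]["options"][0]["value"]

def command_handler(body: dict) -> dict:
    import boto3  # available in the Lambda runtime

    runtime = boto3.client("sagemaker-runtime")
    payload = {
        "inputs": [[{"role": "user", "content": parse_question(body)}]],
        "parameters": {"max_new_tokens": 256},
    }
    response = runtime.invoke_endpoint(
        EndpointName=os.environ["ENDPOINT_NAME"],
        ContentType="application/json",
        CustomAttributes="accept_eula=true",  # JumpStart Llama 2 requires EULA acceptance
        Body=json.dumps(payload),
    )
    answer = json.loads(response["Body"].read())[0]["generation"]["content"]
    return {"type": 4, "data": {"content": answer}}  # type 4: respond with a message

def lambda_handler(event, context):
    if not verify_signature(event):
        return {"statusCode": 401, "body": "invalid request signature"}
    body = json.loads(event["body"])
    if body["type"] == 1:  # Discord PING -> PONG
        return {"statusCode": 200, "body": json.dumps({"type": 1})}
    return {"statusCode": 200, "body": json.dumps(command_handler(body))}
```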

This code is similar to the code we used in:

c. Add a `sagemaker:InvokeEndpoint` permission to this lambda.

d. Go to the Lambda’s configuration settings and add two environment variables:

i. PUBLIC_KEY — copied while creating the discord bot
ii. ENDPOINT_NAME — yet to be generated

e. Add a REST API Gateway trigger to the lambda:

To keep things simple, “Security” is set to open.

f. Add a PyNaCl layer to this Lambda:

!rm -rf layer/python layer.zip
!mkdir -p layer/python
!pip install -q --target layer/python PyNaCl
!cd layer && zip -q --recurse-paths ../layer.zip python

Upload this layer manually or add it via S3, then attach it to your lambda.

4. Add interactions endpoint URL

Copy the API Endpoint from under triggers,

API Endpoint

and paste it in the “INTERACTIONS ENDPOINT URL” option, under “General Information” in the Discord developer portal of your app.

Interactions Endpoint URL of Discord App

If API gateway works

5. Test dummy command

You can comment out the endpoint-related code in the lambda and test the code.

In the following test, I simply return the user’s question to verify the functionality.

Testing the Bot

6. Deploy LLM to a Sagemaker endpoint

In this example, I will create an endpoint using the SageMaker JumpStart foundation models. Alternatively, you can use any other foundation model, or a model fine-tuned on your dataset, and deploy it to an endpoint.

If you haven’t requested g5 instances before, start by requesting a quota increase for ml.g5.2xlarge for endpoint usage through Service Quotas. Please note that it might take up to 24 hours for the ticket to be resolved.

Next, navigate to your SageMaker Studio, and under “SageMaker JumpStart,” select “Llama-2-7b-chat.”

Llama 2 7B Chat on Sagemaker Jumpstart

Change the deployment configuration as desired, note the “Endpoint Name” value (ENDPOINT_NAME), and click the “Deploy” button.

This model is not available for fine-tuning; use “Llama 2 7b” instead if you want to fine-tune on your own dataset.

After a couple of minutes, the endpoint status should be “InService”.

Sagemaker Jumpstart Endpoint Status

You can also check the endpoint status from the “Endpoints” menu item under “Deployments” in SageMaker Studio.

Endpoint Details

We will use the endpoint name to invoke it in lambda.

7. Modify lambda to invoke sagemaker endpoint and return response

Update the ENDPOINT_NAME environment variable in the Lambda.

Also remove the test code and uncomment the endpoint-related code.

Now, when you use the /qa command in Discord, it will invoke the SageMaker endpoint and return the response from the LLM.

Discord bot responding after invoking Sagemaker Endpoint

Please note that I have NOT accelerated the GIF; the Lambda is capable of responding within Discord’s 3-second interaction window.

After you have finished interacting with the endpoint, please delete it to release resources and halt billing.

Option 2 — HuggingFace ChatUI for plug-and-play

In his video, Abhishek Thakur uses the Hugging Face text-generation-inference server along with chat-ui for a chatbot solution.

If you have a Hugging Face inference endpoint, or intend to serve an LLM from your local machine, begin by configuring and starting an endpoint using the text-generation-inference package.

This process will execute an open-source LLM on your machine, and the resulting endpoint will appear as follows:

You can employ this as an endpoint or opt for the SageMaker endpoint, provided you establish a new Lambda capable of invoking that endpoint and attaching a new API Gateway trigger to the Lambda function.

1. Create a new lambda with python environment.
Edit: You don’t have to create a lambda to invoke sagemaker endpoint, because ChatUI now supports sagemaker endpoint integration natively. Check out the ChatUI README on Sagemaker integration.

2. Function code:
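A minimal sketch of such a function, assuming ChatUI sends TGI-style request bodies and the ENDPOINT_NAME environment variable is set (build_payload is my own helper):

```python
import json
import os

def build_payload(body: dict) -> dict:
    # ChatUI sends TGI-style requests: {"inputs": "...", "parameters": {...}}
    return {"inputs": body["inputs"], "parameters": body.get("parameters", {})}

def lambda_handler(event, context):
    import boto3  # available in the Lambda runtime

    runtime = boto3.client("sagemaker-runtime")
    body = json.loads(event["body"])
    response = runtime.invoke_endpoint(
        EndpointName=os.environ["ENDPOINT_NAME"],
        ContentType="application/json",
        Body=json.dumps(build_payload(body)),
    )
    result = json.loads(response["Body"].read())
    # Pass the model output straight back to ChatUI via API Gateway
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(result),
    }
```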

3. Add the `sagemaker:InvokeEndpoint` permission to this Lambda’s policy.

4. Add ENDPOINT_NAME as an environment variable.

5. Add a new REST API Gateway trigger and note the API Endpoint.

6. Clone the chat-ui repository.

Cloning ChatUI Repo

7. ChatUI requires a running MongoDB instance to store chat history. You can utilize a MongoDB container as outlined in the repository:

docker run -d -p 27017:27017 --name mongo-chatui mongo:latest
MongoDB Container

8. Create a new .env.local file in the ChatUI root directory and add two new entries, MONGODB_URL and MODELS.

The MODELS entry is similar to the one in the .env file, except we add a new endpoints key.

"name": "LLAMA 2 7B Chat, running on sagemaker endpoint",
"endpoints": [{"url": "API_ENDPOINT"}],
"datasetName": "Llama2-7b-chat/qa-sm",
"description": "A good alternative to ChatGPT",
"websiteUrl": "",
"userMessageToken": "<|prompter|>",
"assistantMessageToken": "<|assistant|>",
"messageEndToken": "</s>",
"preprompt": "Below are a series of dialogues between various people and an AI assistant. The AI tries to be helpful, polite, honest, sophisticated, emotionally aware, and humble-but-knowledgeable. The assistant is happy to help with almost anything, and will do its best to understand exactly what is needed. It also tries to avoid giving false or misleading information, and it caveats when it isn't entirely sure about the right answer. That said, the assistant is practical and really does its best, and doesn't let caution get too much in the way of being useful.\n-----\n",

Replace the url with the API Endpoint created above.

Edit: Because ChatUI now supports SageMaker endpoint integration natively, you can check out the ChatUI README section on SageMaker integration.

"endpoints": [
"host" : "sagemaker",
"url": "", // your aws sagemaker url here
"accessKey": "",
"secretKey" : "",
"sessionToken": "", // optional
"weight": 1

9. Run the following commands:

npm install
npm run dev # For dev
npm run prod # For prod

The URL printed by npm run dev will have the ChatUI app running.

ChatUI with Sagemaker Endpoint

Because I couldn’t figure out the response syntax, ChatUI always gave me this error:

ChatUI Error

But the endpoint and Lambda were able to successfully return an output, as per the CloudWatch logs.

CloudWatch Response

Edit: Later on, I was able to figure out the response format of ChatUI for custom endpoints:
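chat-ui’s custom endpoints consume the text-generation-inference response shape, so a TGI-style array like the following is what the Lambda should return in its body (this exact shape is my assumption; the original snippet was not preserved):

```json
[
  {
    "generated_text": "The model's full reply as a single string."
  }
]
```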

Please note that when responding with a Python Lambda, we lose streaming updates, i.e., the whole response appears at once instead of word by word. To achieve streaming, you can either use the response streaming feature of AWS Lambda in the Node.js runtime or use FastAPI’s streaming response.

Option 3 — Gradio chatbot

Gradio is a powerful tool employed for crafting user interfaces for machine learning models. With just a few lines of code, you can easily set up a demo for almost any ML model.

Andrew Ng’s course, “Building Generative AI Applications with Gradio,” emphasizes the rapid creation and demonstration of machine learning applications using Gradio.

In this example, we’ll utilize a Hugging Face endpoint in conjunction with text-generation-inference to establish an endpoint.

Gradio app code:

We will use the gr.Chatbot component in streaming mode.
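A minimal sketch of such an app; query_llm is a stub standing in for a real text-generation-inference call, and the streaming effect comes from yielding progressively longer replies:

```python
def query_llm(prompt: str) -> str:
    # Stub: replace with a call to your text-generation-inference endpoint
    return "This is a stubbed model reply to: " + prompt

def respond(message, chat_history):
    # Yield progressively longer histories so gr.Chatbot renders a streaming effect
    reply = query_llm(message)
    partial = ""
    for ch in reply:
        partial += ch
        yield chat_history + [(message, partial)]

if __name__ == "__main__":
    import gradio as gr  # assumed installed: pip install gradio

    with gr.Blocks() as demo:
        chatbot = gr.Chatbot()
        msg = gr.Textbox(placeholder="Ask the model something")
        msg.submit(respond, [msg, chatbot], chatbot)
    demo.launch()
```

Running the file launches the demo locally; Gradio consumes the generator and updates the chat window on every yield.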

Gradio Demo

Option 4 — Streamlit chatbot

The article How to build an LLM-powered ChatBot with Streamlit is a nice place to start your journey on this subject.

1. Clone this repo:

2. Replace the file’s code with the following:
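A minimal sketch of what the replacement file can look like, assuming the hugchat package’s Login/ChatBot API and Streamlit’s chat elements (the append_message helper and widget choices are my own):

```python
def append_message(history, role, content):
    # Pure helper so the chat history is easy to test outside Streamlit
    history.append({"role": role, "content": content})
    return history

def main():
    import streamlit as st
    from hugchat import hugchat
    from hugchat.login import Login

    st.title("HugChat chatbot")
    if "messages" not in st.session_state:
        st.session_state.messages = []

    # Log in with credentials from .streamlit/secrets.toml; a real app
    # would cache this (e.g. st.cache_resource) instead of logging in per rerun
    sign = Login(st.secrets["EMAIL"], st.secrets["PASS"])
    bot = hugchat.ChatBot(cookies=sign.login().get_dict())

    if prompt := st.chat_input("Ask something"):
        append_message(st.session_state.messages, "user", prompt)
        append_message(st.session_state.messages, "assistant", str(bot.chat(prompt)))

    for m in st.session_state.messages:
        with st.chat_message(m["role"]):
            st.write(m["content"])

if __name__ == "__main__":
    main()
```

Run it with `streamlit run app.py` (assuming the file is named app.py).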

3. Create a secrets.toml file under the .streamlit directory.

You can add EMAIL and PASS to that file to avoid entering your HF credentials every time.
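For reference, the secrets file can look like this (EMAIL and PASS come from the text above; the values are placeholders):

```toml
# .streamlit/secrets.toml
EMAIL = "you@example.com"
PASS = "your-huggingface-password"
```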

Streamlit with HugChat

You can also skip this and enter them at startup.

Streamlit without HF Creds in secrets.toml

If everything is correct, you will be able to chat with the LLM chosen in text-generation-inference.

Streamlit with HugChat demo

Production Setting

Please keep in mind that within a production environment, you’d ideally want the bot to respond solely based on the information contained in your knowledge base, which may include text, PDFs, videos, etc.

In another article of mine, I discuss how to accomplish this. You can also utilize an RDS or another database to store the context and pass it along with each input.

For storing the embeddings of your knowledge base, you can opt for “pgvector” or an embedding database, or consider an ANN library if your corpus is relatively small.


AWS offers a comprehensive example for deploying a chatbot powered by multiple LLMs using AWS CDK on AWS infrastructure.

You can access it here:

It deploys a comprehensive UI built with React that interacts with the deployed LLMs as chatbots. It supports sync requests and streaming modes for hitting LLM endpoints, manages conversation history, can stop model generation mid-stream, and lets you switch between all deployed models for experimentation.

If you have the necessary resources, personally, this is one of the most robust solutions you can implement within your production environment.


This stack contains the necessary resources to set up a chatbot system, including:

a. The ability to deploy one or more large language models through a custom construct, supporting three different techniques:

  • Deploying models from SageMaker Foundation Models by specifying the model ARN.
  • Deploying models supported by the HuggingFace TGI container.
  • Deploying all other models from Hugging Face with custom inference code.

b. Backend resources for the user interface, including chat backend actions and a Cognito user pool for authentication.

c. A DynamoDB-backed system for managing conversation history.

This stack also incorporates “model adapters”, enabling the setup of different parameters and functions for specific models without changing the core logic to perform requests and consume responses from SageMaker endpoints for different LLMs.

You can connect with me on LinkedIn:

My github website: