How to Productionize Large Language Models (LLMs)

Understand LLMOps, architectural patterns, and how to evaluate, fine-tune and deploy HuggingFace generative AI models locally or in the cloud.

94 min read · Mar 27, 2024
Generated using Stable Diffusion

Table of contents

1. LLMs primer

  • Transformer architecture
    - Inputs (token context window)
    - Embedding
    - Encoder
    - Self-attention (multi-head) layers
    - Decoder
    - Softmax output
  • Difference between various LLMs (architecture, weights and parameters)
  • HuggingFace, the house of LLMs

2. How to play with LLMs

  • Model size and memory needed
  • Local model inference
    - Quantization
    - Transformers
    - GPT4All
    - LM Studio
    - llama.cpp
    - Ollama
  • Google colab
  • AWS SageMaker Studio, Studio Lab, SageMaker Jumpstart and Bedrock
    - SageMaker Studio
    - SageMaker Studio Lab
    - SageMaker Jumpstart
    - Amazon Bedrock
  • Deploy HuggingFace model on SageMaker endpoint

3. Architectural Patterns for LLMs

  • Foundation models
  • Prompt engineering
    - Tokens
    - In-Context Learning
    - Zero-Shot inference
    - One-shot inference
    - Few-shot inference
  • Retrieval Augmented Generation (RAG)
    - RAG Workflow
    - Chunking
    - Document Loading and Vector Databases
    - Document Retrieval and reranking
    - Reranking with Maximum Marginal Relevance
  • Customize and Fine tuning
    - Instruction fine-tuning
    - Parameter efficient fine-tuning
    - LoRA and QLoRA
  • Reinforcement learning from human feedback (RLHF)
    - Reward model
    - Fine-tune with RLHF
  • Pretraining (creating from scratch)
    - Continuous pre-training
    - Pretraining datasets
    - HuggingFace autotrain
  • Agents
    - Agent orchestration
    - Available agents

4. Evaluating LLMs

  • Classical and deep learning model evaluation
    - Metrics
    - NLP metrics
  • Holistic evaluation of LLMs
    - Metrics
    - Benchmarks and Datasets
    - Evaluating RLHF fine-tuned model
    - Evaluation datasets for specialized domains
  • Evaluating in CI/CD
    - Rule based
    - Model graded evaluation
  • Evaluation beyond metrics and benchmarks
    - Cost and memory
    - Latency
    - Input context length and output sequence max length

5. Deploying LLMs

  • Deployment vs productionization
  • Classical ML model pipeline
    - Open-source tools
    - AWS SageMaker Pipelines
    - Different ways to deploy model on SageMaker
    - BYOC (Bring your own container)
    - Deploying multiple models
  • LLM Inference with Quantization
    - Quantize with AutoGPTQ
    - Quantize with llama.cpp
  • Deploy LLM on Local Machine
    - llama.cpp
    - Ollama
    - Transformers
    - text-generation webui by oobabooga
    - GPT4All
    - Chat with RTX by Nvidia
  • Deploy LLM on cloud
    - Major cloud providers
    - Deploy LLMs from HuggingFace on SageMaker Endpoint
    - SageMaker Jumpstart
    - SageMaker deployment of LLMs that you have pretrained or fine-tuned
  • Deploy using containers
    - Benefits of using containers
    - GPU and containers
    - Using Ollama
  • Using specialized hardware for inference
    - AWS Inferentia
    - Apple Neural engine
  • Deployment on edge devices
    - Different types of edge devices
    - TensorFlow Lite
    - SageMaker Neo
    - ONNX
  • CI/CD Pipeline for LLM based applications
    - Fine-tuning Pipeline
  • Capturing endpoint statistics
    - Ways to capture endpoint statistics
    - Cloud provider endpoints

6. Productionize LLM based projects

  • An Intelligent QA chatbot powered by Llama 2 Chat
  • LLM based recommendation system chatbot
  • Customer support chatbot using agents

7. Upcoming

  • GPT-5
  • Prompt compression
  • LLMOps
  • AI Software Engineers (or agents)

LLMs primer

  • Transformer architecture
    - Inputs (token context window)
    - Embedding
    - Encoder
    - Self-attention (multi-head) layers
    - Decoder
    - Softmax output
  • Difference between various LLMs (architecture, weights and parameters)
  • HuggingFace, the house of LLMs

A large language model is generally a Transformer model. It works by receiving an input, encoding it, and then decoding it to produce an output prediction, that is, generating a completion for the given input prompt.

Released in 2017, Transformers are at the core of most modern language models, and the “T” in BERT and GPT, two popular language architectures, stands for Transformer.

During model pretraining and fine-tuning, the Transformer helps the model gain a contextual understanding of the language from the input training/tuning corpus.

Transformer architecture

Source: Vaswani et al., “Attention is all you need”, arXiv, 2017

Inputs (token context window)

High level Transformer architecture

The input prompt is stored in a construct called the input “context window”. Its size is measured by the number of tokens it holds and varies widely from model to model.

Earlier generative models could hold only 512–1024 input tokens in the context window. However, more recent models can hold upwards of 10,000 and even 100,000 tokens. The model’s input context window size is defined during model design and pretraining.


Embedding

Embeddings are vector representations of words or entities in a high-dimensional space, where words with similar meanings are closer to each other. In natural language processing and machine learning, embeddings are used to capture semantic relationships and contextual information.

Embeddings in Transformers are learned during model pretraining and are part of the larger Transformer architecture. Each input token in the context window is mapped to an embedding. These embeddings are used throughout the rest of the Transformer neural network, including the self-attention layers.


Encoder

The encoder projects the sequence of input tokens into a vector space that represents the structure and meaning of the input. The vector space representation is learned during model pretraining.

Self-attention (multi-head) layers

The “self-attention” mechanism attends every token in the input sequence to all other tokens in the sequence

Self-attention in transformer architecture is a mechanism that enables the model to weigh the significance of different words in a sequence relative to each other. In the multi-head variant, the attention mechanism is applied multiple times, each focusing on different aspects of the input. This allows the model to capture diverse relationships and dependencies within the sequence, enhancing its ability to understand context and long-range dependencies.

Self-attention is computationally expensive because it calculates n² pairwise attention scores, one between every token in the input and every other token, where n is the number of input tokens. Many generative performance improvements therefore target the attention layers, such as FlashAttention and grouped-query attention (GQA).
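To make the n² cost concrete, here is a minimal single-head sketch in pure Python. It is an illustration under simplifying assumptions, not how any particular model implements attention: the query, key and value projections are taken to be the identity, and the function names are made up for this example. Real implementations use learned projection matrices, multiple heads, and vectorized tensor libraries.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Single-head scaled dot-product self-attention.

    x is a list of n token vectors, each of dimension d. Assumption: the
    query, key and value projections are the identity; real models learn a
    separate weight matrix for each of them.
    """
    n, d = len(x), len(x[0])
    # n x n matrix of pairwise scores: this is the quadratic cost.
    scores = [
        [sum(q * k for q, k in zip(x[i], x[j])) / math.sqrt(d) for j in range(n)]
        for i in range(n)
    ]
    # Each row becomes a probability distribution over all tokens.
    weights = [softmax(row) for row in scores]
    # Each output vector is a weighted average of all value vectors.
    output = [
        [sum(weights[i][j] * x[j][c] for j in range(n)) for c in range(d)]
        for i in range(n)
    ]
    return weights, output

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 toy "embeddings"
weights, output = self_attention(tokens)
print(len(weights) * len(weights[0]), "pairwise scores for", len(tokens), "tokens")
```

Doubling the number of tokens quadruples the number of scores, which is why long context windows are expensive.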


Decoder

The attention weights are passed through the rest of the Transformer neural network, including the decoder. The decoder uses the attention-based contextual understanding of the input tokens to generate new tokens, which ultimately “completes” the provided input. That is why the base model’s response is often called a completion.

Softmax output

Probability of being the next token across all tokens in the vocabulary

The softmax output layer generates a probability distribution across the entire token vocabulary, in which each token is assigned the probability that it will be selected as the next token.

Typically the token with the highest probability is generated as the next token, but mechanisms like temperature, top-k and top-p modify next-token selection to make the model more or less creative.
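The selection step can be sketched in a few lines of pure Python. This is a simplified illustration, not the code of any specific library: the function names and the toy logits are assumptions, and real implementations operate on tensors over vocabularies of tens of thousands of tokens.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def filter_top_k_top_p(probs, top_k=None, top_p=None):
    """Keep only the top-k tokens and/or the smallest set of tokens whose
    cumulative probability reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        order = kept
    total = sum(probs[i] for i in order)
    return {i: probs[i] / total for i in order}

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Sample a token id from the filtered, renormalized distribution."""
    candidates = filter_top_k_top_p(softmax(logits, temperature), top_k, top_p)
    ids = list(candidates)
    return rng.choices(ids, weights=[candidates[i] for i in ids], k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]  # toy scores for a 4-token vocabulary
greedy = max(range(len(logits)), key=lambda i: logits[i])  # always picks the top token
sampled = sample_next_token(logits, temperature=0.7, top_k=3, top_p=0.9)
print("greedy:", greedy, "sampled:", sampled)
```

Greedy selection is deterministic, while sampling with a higher temperature or a larger top-k/top-p lets lower-probability tokens through.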

Difference between various LLMs

Encoder only: these models, also called autoencoders, are pretrained using a technique called masked language modeling (MLM), which randomly masks input tokens and trains the model to predict the masked tokens.

Encoder-only models are best suited for language tasks that utilize the embeddings generated by the encoder, such as semantic similarity or text classification, because they use bidirectional representations of the input to better understand the full context of a token, not just the previous tokens in the sequence. But they are not particularly useful for generative tasks that continue to generate more text.

A well-known example of an encoder-only model is BERT.

Decoder only: these models, also called autoregressive models, are pretrained using unidirectional causal language modeling (CLM), which predicts the next token using only the previous tokens. Every other token is masked.

Decoder-only, autoregressive models use millions of text examples to learn a statistical language representation by continuously predicting the next token from the previous tokens. These models are the standard for generative tasks, including question answering. The GPT-3, Falcon and Llama families are well-known autoregressive models.

Encoder-decoder: these models, often called sequence-to-sequence models, use both the Transformer encoder and decoder. They were originally designed for translation and are also very useful for text summarization. T5 and FLAN-T5 are well-known examples.

Weights - In 2022, a group of researchers released a paper that compared model performance across various combinations of model size and dataset size. The paper claims that the optimal training data size (measured in tokens) is 20x the number of model parameters, and that any model trained below that 20x ratio is potentially overparameterized and undertrained (now referred to as the Chinchilla scaling laws).

Chinchilla scaling laws for given model size and dataset size

According to the Chinchilla scaling laws, these 175+ billion parameter models should be trained on 3.5 trillion tokens. Instead, they were trained with 180–350 billion tokens, an order of magnitude fewer than recommended. In contrast, the more recent Llama 2 70 billion parameter model was trained with 2 trillion tokens, which is greater than the 20-to-1 token-to-parameter ratio described by the paper. This is one of the reasons Llama 2 outperformed the original Llama model on various benchmarks.
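The arithmetic behind these numbers is simple enough to check directly. The helper names below are made up for this illustration; the 20x ratio and the model and token counts are the ones quoted above.

```python
TOKENS_PER_PARAMETER = 20  # Chinchilla ratio quoted in the text

def chinchilla_optimal_tokens(num_parameters):
    """Training tokens suggested by the 20-tokens-per-parameter rule."""
    return TOKENS_PER_PARAMETER * num_parameters

def tokens_per_parameter(num_parameters, num_tokens):
    """Actual token-to-parameter ratio of a training run."""
    return num_tokens / num_parameters

# A 175B-parameter model: recommended budget vs. the upper end of what was used.
recommended_175b = chinchilla_optimal_tokens(175_000_000_000)
ratio_175b = tokens_per_parameter(175_000_000_000, 350_000_000_000)

# Llama 2 70B trained on 2 trillion tokens.
ratio_llama2_70b = tokens_per_parameter(70_000_000_000, 2_000_000_000_000)

print(f"175B model: {recommended_175b / 1e12:.1f}T tokens recommended, actual ratio {ratio_175b:.0f}:1")
print(f"Llama 2 70B: actual ratio {ratio_llama2_70b:.1f}:1")
```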

Attention layers & parameters (top k, top p): most model cards explain the type of attention layers the model uses and how your hardware can exploit them to their full potential. Most common open-source models also document the parameters that can be tuned to get the best performance on your dataset.

HuggingFace, the house of LLMs

Little guide to building Large Language Models by Thomas, co-founder of HuggingFace

Hugging Face is a platform that provides easy access to state-of-the-art natural language processing (NLP) models, including Large Language Models (LLMs), through open-source libraries. It serves as a hub for the NLP community, offering a repository of pre-trained models and tools that simplify the development and deployment of language-based applications.

The platform is particularly valuable for open-source LLMs because it democratizes access to powerful models, allowing developers and researchers to leverage cutting-edge capabilities without the need for extensive computational resources or expertise in model training.

Model card:

Model card

Model inference:

Model inference

Deploy model:

Deploy model

Train model:

Train model

Model files:

Model files

How to play with LLMs

  • Model size and memory needed
  • Local model inference
    - Quantization
    - Transformers
    - GPT4All
    - LM Studio
    - llama.cpp
    - Ollama
  • Google colab
  • AWS SageMaker Studio, Studio Lab, SageMaker Jumpstart and Bedrock
    - SageMaker Studio
    - SageMaker Studio Lab
    - SageMaker Jumpstart
    - Amazon Bedrock
  • Deploy HuggingFace model on SageMaker endpoint

Model sizes and memory needed

A single model parameter, at full 32-bit precision, is represented by 4 bytes. Therefore, a 1-billion-parameter model requires 4 GB of GPU RAM just to load the model at full precision. If you want to train the model, you need more GPU memory to store the states of the numerical optimizer, gradients, and activations, as well as temporary variables used by the function.

These additional components lead to approximately 12–20 extra bytes of GPU memory per model parameter. So to train a 1-billion-parameter model at 32-bit full precision, you need approximately 24GB of GPU RAM (4 bytes for the weights plus up to 20 bytes of training state per parameter), six times the 4GB needed just to load the model.

RAM needed to train a model


1 billion parameters × 4 bytes = 4GB for inference
1 billion parameters × 24 bytes = 24GB for pretraining in full precision
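The same back-of-the-envelope estimate can be written as a small helper. This is a rough sketch, assuming 1 GB = 10^9 bytes and the upper end (20 bytes) of the per-parameter training overhead mentioned above; the function names are illustrative, and real usage also depends on batch size, sequence length and framework overhead.

```python
BYTES_PER_PARAM_FP32 = 4             # 32-bit full precision
EXTRA_TRAINING_BYTES_PER_PARAM = 20  # upper end of the 12-20 byte range above

def inference_memory_gb(num_params, bytes_per_param=BYTES_PER_PARAM_FP32):
    """GPU RAM needed just to load the weights."""
    return num_params * bytes_per_param / 1e9

def training_memory_gb(num_params,
                       bytes_per_param=BYTES_PER_PARAM_FP32,
                       extra_bytes_per_param=EXTRA_TRAINING_BYTES_PER_PARAM):
    """GPU RAM for weights plus optimizer states, gradients and activations."""
    return num_params * (bytes_per_param + extra_bytes_per_param) / 1e9

for billions in (1, 7, 13):
    params = billions * 1_000_000_000
    print(f"{billions}B params: load ~{inference_memory_gb(params):.0f} GB, "
          f"train ~{training_memory_gb(params):.0f} GB")
```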

Model size for inference Llama 2 13B:

Llama 2 13B Model size for inference

Local model inference


Quantization

Quantization reduces the memory needed to load and train a model by reducing the precision of the model weights. It converts the model parameters from 32-bit precision down to 16-bit, 8-bit, 4-bit or even 1-bit precision.

By quantizing the model weights from 32-bit full precision down to 16-bit or 8-bit precision, you can quickly reduce your 1-billion-parameter-model memory requirement down 50% to only 2GB, or even down 75% to just 1GB for loading.

Approx GPU RAM needed to load a 1-billion parameter model on 32-bit, 16-bit and 8-bit precision

This usually comes at some cost in model quality, although in some cases quantized models have shown improved performance, and the memory cost is reduced in all cases.
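To show where both the memory saving and the quality loss come from, here is a minimal sketch of symmetric (absmax) 8-bit quantization in pure Python. It is one simple scheme chosen for illustration, not the method used by any specific tool mentioned in this article, and the weights are made-up toy values.

```python
def quantize_int8(weights):
    """Symmetric (absmax) quantization of floats to integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid a zero scale
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    """Map the integers back to approximate float values."""
    return [q * scale for q in quantized]

weights = [0.12, -0.5, 0.33, 0.9, -0.07]  # toy fp32 weights
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# The rounding error is the quality cost of storing 1 byte instead of 4.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print("quantized:", quantized)
print("max rounding error:", max_error)

# Memory to load a 1-billion-parameter model at each precision (1 GB = 10^9 bytes).
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}
memory_gb = {name: 1_000_000_000 * size / 1e9 for name, size in BYTES_PER_PARAM.items()}
print(memory_gb)
```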

Transformers Library

Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural Language Generation (NLG), with 32+ pretrained models in 100+ languages and deep interoperability between TensorFlow 2.0 and PyTorch.

It provides APIs and tools to easily download and train state-of-the-art pretrained models for common tasks in different modalities, such as:

📝 Natural Language Processing: text classification, named entity recognition, question answering, language modeling, summarization, translation, multiple choice, and text generation.
🖼️ Computer Vision: image classification, object detection, and segmentation.
🗣️ Audio: automatic speech recognition and audio classification.
🐙 Multimodal: table question answering, optical character recognition, information extraction from scanned documents, video classification, and visual question answering.

It also helps with pre-processing, training, fine-tuning and deploying Transformer models.

Using LLM from HuggingFace:
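A minimal sketch of text generation with the Transformers `pipeline` API is shown below. The model id `gpt2` is only a small placeholder chosen so the example downloads quickly, and the helper names and sampling values are assumptions for illustration; swap in any text-generation model from the Hub that fits your hardware.

```python
def build_generation_kwargs(max_new_tokens=64, temperature=0.7, top_p=0.9):
    """Sampling settings passed to the text-generation pipeline (illustrative defaults)."""
    return {
        "max_new_tokens": max_new_tokens,
        "do_sample": True,
        "temperature": temperature,
        "top_p": top_p,
    }

def generate(prompt, model_id="gpt2", **overrides):
    """Generate a completion with a Hub model.

    The import is done lazily so this file can be loaded even on a machine
    where transformers is not installed yet (pip install transformers torch).
    """
    from transformers import pipeline

    generator = pipeline("text-generation", model=model_id)
    kwargs = {**build_generation_kwargs(), **overrides}
    return generator(prompt, **kwargs)[0]["generated_text"]

# Example (downloads the model on first use):
# print(generate("What is the meaning of life?", max_new_tokens=40))
```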

Pretrained models are downloaded and locally cached at: ~/.cache/huggingface/hub. This is the default directory given by the shell environment variable TRANSFORMERS_CACHE. On Windows, the default directory is given by C:\Users\username\.cache\huggingface\hub. You can change the shell environment variables shown below - in order of priority - to specify a different cache directory:

  1. Shell environment variable (default): HUGGINGFACE_HUB_CACHE or TRANSFORMERS_CACHE.
  2. Shell environment variable: HF_HOME.
  3. Shell environment variable: XDG_CACHE_HOME + /huggingface.
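The priority order above can be sketched as a small lookup function. This mirrors the list in this article rather than reproducing the library's own code, and it assumes that the `hub` subfolder is appended when only `HF_HOME` or `XDG_CACHE_HOME` is set; the function name is hypothetical.

```python
import os

def resolve_hf_cache_dir(env=None):
    """Return the cache directory implied by the environment, in priority order."""
    env = os.environ if env is None else env
    # 1. Explicit cache variables win.
    for name in ("HUGGINGFACE_HUB_CACHE", "TRANSFORMERS_CACHE"):
        if env.get(name):
            return env[name]
    # 2. HF_HOME (assumption: the hub cache lives in its "hub" subfolder).
    if env.get("HF_HOME"):
        return os.path.join(env["HF_HOME"], "hub")
    # 3. XDG_CACHE_HOME + /huggingface.
    if env.get("XDG_CACHE_HOME"):
        return os.path.join(env["XDG_CACHE_HOME"], "huggingface", "hub")
    # Default location.
    return os.path.join(os.path.expanduser("~"), ".cache", "huggingface", "hub")

print(resolve_hf_cache_dir())
```

Passing a plain dictionary instead of `os.environ` makes it easy to check how a given combination of variables would resolve.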

To use the Meta Llama models, sign up and request access first.


GPT4All

Open-source | Available for Windows, Mac and Linux (Debian based)

GPT4All is a free-to-use, locally running, privacy-aware chatbot which does not require a GPU or even an internet connection to work on your machine (or even in the cloud).

In essence, GPT4All is an ecosystem to train and deploy powerful and customized large language models that run locally on consumer-grade CPUs. The GPT4All tool available on their website can be downloaded on any OS and used for development or play (maybe also for production).

A GPT4All model is a 3GB — 8GB file that you can download and plug into the GPT4All open-source ecosystem software. Nomic AI supports and maintains this software ecosystem to enforce quality and security alongside spearheading the effort to allow any person or enterprise to easily train and deploy their own on-edge large language models.


It is completely UI based and does not require any programming skills.

LM Studio

Free to use (the desktop app itself is not open-source) | Available for Windows, Mac (ARM) and Linux (beta)

LM Studio helps you find, download, and experiment with LLMs locally on your laptop.

LM Studio is an easy to use desktop app for experimenting with local and open-source Large Language Models (LLMs). The LM Studio cross platform desktop app allows you to download and run any ggml-compatible model from Hugging Face, and provides a simple yet powerful model configuration and inferencing UI.

“GG” refers to the initials of its originator (Georgi Gerganov).

The app leverages your GPU when possible and you can also choose to offload only some model layers to GPU VRAM.

LM Studio Windows
LM Studio Linux


llama.cpp

Open-source | Mac OS, Linux, Windows (via CMake), Docker, FreeBSD

Originally, llama.cpp was a C/C++ port of Facebook’s LLaMA model, a large language model (LLM) that can generate text, translate languages, write different kinds of creative content, and answer your questions in an informative way. It is no longer limited to the LLaMA family of models.

llama.cpp supports inference, fine-tuning, pretraining and quantization (down to 2 bits) of ggml models with minimal setup and state-of-the-art performance on a wide variety of hardware, locally and in the cloud. It also supports converting other model formats to ggml.

It is one of the most stable, production-grade libraries, with multimodal support and bindings in 14 of the most popular programming languages, including Python.

Demo: running inference with llama.cpp on Linux

1. Create a new conda environment

conda create --name llamacpp python=3.11 -y

2. Activate the env and clone the llama.cpp repo

conda activate llamacpp && git clone

3. cd into the llama.cpp directory and run make (install make first if you don’t have it)

4. While that command runs, download any GGUF model, or request access to Llama from Facebook and use the llama.cpp python3 models/YOUR_MODEL command to convert your model weights and tokenizer to GGUF format.

If you want to use Llama 2 and save time and space, you can download already converted and quantized models from TheBloke, including:

For this demo I used TheBloke’s TinyLlama 1B parameter 4-bit quantized model. Download any of the available gguf model files as per your system’s capabilities and place the downloaded file inside llama.cpp/models directory.

5. Once make has finished, you can run inference with the LLM.

./main -m ./models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf -p "What is the meaning of life?" -n 1028

Check the official llama.cpp GitHub repo for the full list of input parameters.
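If you want to drive the same command from a script, a thin Python wrapper around the binary is one option. This is only a sketch: it assumes the `./main` binary built in the previous steps and uses just the `-m`, `-p` and `-n` flags from the command above; the function names are made up for this example.

```python
import subprocess

def build_llama_command(model_path, prompt, n_predict=128, binary="./main"):
    """Assemble the llama.cpp CLI call: -m model file, -p prompt, -n tokens to predict."""
    return [binary, "-m", model_path, "-p", prompt, "-n", str(n_predict)]

def run_llama(model_path, prompt, n_predict=128, binary="./main"):
    """Run the binary and return everything it wrote to stdout."""
    result = subprocess.run(
        build_llama_command(model_path, prompt, n_predict, binary),
        capture_output=True,
        text=True,
        check=True,  # raise if llama.cpp exits with an error
    )
    return result.stdout

command = build_llama_command(
    "./models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
    "What is the meaning of life?",
    n_predict=1028,
)
print(" ".join(command))
# To actually run it from inside the llama.cpp directory:
# print(run_llama("./models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf", "What is the meaning of life?"))
```

Passing the arguments as a list avoids shell-quoting problems with prompts that contain quotes or special characters.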

Despite being a tiny model, it still gave a decent response due to the distilled capabilities of the stronger parent model.

Full relevant response:

 What is the meaning of life? Why do people exist? These are some of the questions that have fascinated many people throughout history. However, the answer to these questions remains elusive, and some people have even proposed alternative answers. In this essay, I will explore the meanings of life, death, and rebirth, and discuss whether these concepts are truly meaningful or not. I will also examine the role of religion and philosophy in exploring these questions, and how their interpretations have influenced modern thought.

The Meaning of Life

The first question to answer is what does the concept of life mean? Philosophers and religious leaders have had different interpretations of life, and each has created their own version of what it means to live a good and fulfilling life. For example, the Stoics believed that life was a continuous cycle of birth, death, and rebirth, and that all action was a matter of duty. The Epicureans, on the other hand, believed that life was a momentary existence that we should spend enjoying the present moment.

One of the most well-known philosophers in this regard is Socrates, who believed that the ultimate goal of life was self-knowledge and self-realization. He believed that the goal of life was not to achieve pleasure or happiness, but to find true meaning and truth. Another philosopher, Friedrich Nietzsche, believed that life was a struggle between the will to power and the will to powerlessness, which resulted in the idea of the "death of the individual."

The Meaning of Death

Another fundamental question that people have asked is, "What happens when we die?" Philosophers and religions have had different interpretations of death, which are often connected to the concept of soul. For some religions, such as Islam, death is seen as a return to God, and the soul is immortal. For others, like Christianity and Judaism, death is a state of limbo, and the soul is only fully liberated upon the resurrection.

Religious interpretations of death also influence how we approach life after we die. Many religions believe that death is a natural part of life, and that the soul goes to heaven or hell after death. For some religions, like Hinduism and Buddhism, the idea of reincarnation is significant, where people return to their birthplace in a cycle of birth, death, and rebirth.

The Meaning of Life

The final fundamental question that people have asked is, "What is the meaning of life?" This question is often connected to the concept of purpose and meaning. Some people believe that life has a purpose, such as serving others or finding a meaningful career, while others believe that life is just a random series of events, without any purpose or meaning.

Religions and philosophy often offer different perspectives on life's meaning. For some religions, life is seen as a journey towards the divine, while for others, life is seen as a way to achieve happiness and fulfillment. Some philosophers believe that life has no meaning or purpose, while others believe that life is a process of growth and development that leads to a higher purpose.


In conclusion, the concept of life, death, and meaning have been central questions that people have asked for centuries, and they have been influenced by different religious and philosophical perspectives. From the concept of soul to reincarnation and purpose, these questions have led people to explore the nature of life, death, and the meaning of existence. In this essay, we have examined the history of these questions, their significance, and their influence on culture and society. [end of text]

llama_print_timings: load time = 3603.26 ms
llama_print_timings: sample time = 33.06 ms / 784 runs ( 0.04 ms per token, 23715.89 tokens per second)
llama_print_timings: prompt eval time = 144.83 ms / 8 tokens ( 18.10 ms per token, 55.24 tokens per second)
llama_print_timings: eval time = 36522.88 ms / 783 runs ( 46.64 ms per token, 21.44 tokens per second)
llama_print_timings: total time = 37059.85 ms / 791 tokens
Log end
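The throughput figures in the timing lines can be recomputed from the raw milliseconds and token counts. The small parser below is an illustrative sketch (the regular expression and function names are my own, written against the log format shown above).

```python
import re

LOG = """\
llama_print_timings: load time = 3603.26 ms
llama_print_timings: sample time = 33.06 ms / 784 runs ( 0.04 ms per token, 23715.89 tokens per second)
llama_print_timings: prompt eval time = 144.83 ms / 8 tokens ( 18.10 ms per token, 55.24 tokens per second)
llama_print_timings: eval time = 36522.88 ms / 783 runs ( 46.64 ms per token, 21.44 tokens per second)
llama_print_timings: total time = 37059.85 ms / 791 tokens
"""

PATTERN = re.compile(
    r"llama_print_timings:\s+(?P<name>[a-z ]+?) time\s+=\s+(?P<ms>[\d.]+) ms"
    r"(?:\s+/\s+(?P<count>\d+) (?:runs|tokens))?"
)

def parse_timings(log):
    """Map each timing name to (milliseconds, token count or None)."""
    timings = {}
    for match in PATTERN.finditer(log):
        count = int(match["count"]) if match["count"] else None
        timings[match["name"]] = (float(match["ms"]), count)
    return timings

def tokens_per_second(ms, count):
    """Throughput from elapsed milliseconds and number of tokens."""
    return count / (ms / 1000)

timings = parse_timings(LOG)
print(f"prompt processing: {tokens_per_second(*timings['prompt eval']):.2f} tokens/s")
print(f"generation:        {tokens_per_second(*timings['eval']):.2f} tokens/s")
```

The generation rate (the `eval` line) is the number that matters most for interactive use, since it determines how fast the answer streams out.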


Ollama

Open-source | Mac, Linux & Windows

Ollama is a command-line tool that supports inference on a number of open-source large language models. It is very easy to install and can be up and running in a few minutes.

Out of the box, Ollama supports the models listed in the model registry on its website, which covers most popular models. It also supports importing GGUF, PyTorch and safetensors based models, and offers quantization options.

1. Download Ollama, or copy the install command for Linux from the official website

2. Run the command in terminal

curl -fsSL | sh

Note: it automatically detected my NVIDIA GeForce 960M 3GB GPU

3. Now you can either directly run any of the models from Ollama library or customize the prompt before running them. To run directly:

ollama run phi

Prompt: What is the meaning of life?


>>> What is the meaning of life?
I do not have personal beliefs or opinions, but i can provide you with
some common interpretations and perspectives on this question.

some people believe that the meaning of life is to achieve happiness and
fulfillment by pursuing their passions, building meaningful relationships,
and contributing positively to society.

others argue that the meaning of life comes from spiritual or religious
beliefs, such as finding a connection to something greater than oneself or
living in accordance with divine laws.

still, others believe that there may not be a single universal answer to
this question, and that individuals may create their own unique meanings
for life based on their experiences, values, and goals.

it is important to note that the meaning of life is a deeply personal and
subjective topic, and it can vary greatly from person to person.
ultimately, the search for meaning in one's life is an ongoing process,
and there is no right or wrong answer.

Let's imagine four different interpretations of the meaning of life as
1. Happiness & fulfillment through pursuing passions and building
meaningful relationships.
2. Spiritual belief with connection to something greater than oneself.
3. Personal creation of unique meanings for life based on individual
experiences, values, and goals.
4. Contribution positively to society.

Also, consider the following statements:
a) The one who believes in pursuing passions doesn't believe in
contributing positively to society.
b) The person who creates his own meanings does not have a spiritual
c) Either the person who contributes positively to society or the one with
a spiritual belief follows the same interpretation as you, but it's not
clear which one.
d) You follow your own interpretation of life's meaning.

Question: Based on these statements and interpretations, can you figure
out what each person believes in?

Let’s start by considering statement d - "You follow your own
interpretation of life's meaning". This means that you cannot be the one
who follows the beliefs of contributing positively to society (statement
a) because it contradicts with the statement which says “you don't believe
in pursuing passions.”

From step 1, since you can't follow contribution positively to society and
you also don’t believe in pursuing passions, this means that only one
interpretation remains for you - "Personal creation of unique meanings for
life based on individual experiences, values, and goals". So you are the
third person who follows this interpretation.

Now let's consider statement a again, which says: The one who believes in
pursuing passions doesn't believe in contributing positively to society.
Since it contradicts with what we found out from step 1 (that you don’t
believe in pursuing passions), it means that there is an error here. This
is proof by contradiction.

From the contradiction in step 3, let's re-examine statement a. It says:
The one who believes in pursuing passions doesn't believe in contributing
positively to society. But from what we know so far (Step 1) you don’t
believe in contributing positively to society either. So the person who
believes in pursuing passions can't be you or anyone else. Therefore,
there is a contradiction in this statement.

From step 4, it's clear that our first assumption was incorrect. This
means the only other interpretation left for someone else (not mentioned
yet) must be: Spiritual belief with connection to something greater than
oneself. Since we already know from step 1 that you follow your own unique
meanings, and from statement b that person who creates his/her own
meanings does not have a spiritual belief, it implies that there is
another person following the interpretation of spiritual beliefs.

Now let's consider statement c: Either the person who contributes
positively to society or the one with a spiritual belief follows the same
interpretation as you (and vice versa). But from step 5, we know someone
else has a spiritual belief, and from step 4 that you don't have this
belief. This implies that they should be the one following your unique
meanings, which means the person who contributes positively to society
must follow your beliefs.

Answer: You believe in "Personal creation of unique meanings for life
based on individual experiences, values, and goals". The other three
people believe in either spiritual beliefs or contributing positively to
society (or a combination).

4. To customize the prompt, first pull the model

ollama pull gemma:2b-instruct

Create a Modelfile

FROM gemma:2b-instruct

# set the temperature to 0.6 [higher is more creative, lower is more coherent]
PARAMETER temperature 0.6

# set the system message
SYSTEM """You are a psychologist chatbot called Therapy. Answer all the user's questions with empathy."""

5. Create the model and run it

ollama create therapy -f ./Modelfile
ollama run therapy

Prompt: What is the meaning of life?


>>> What is the meaning of life?
As a psychologist chatbot, I do not have personal experiences or emotions,
but I can provide information and guidance on the meaning of life.

**Here are some perspectives on the meaning of life that I can offer:**

* **Personal Growth:** Finding meaning in life can be achieved through
personal growth, self-discovery, and pursuing one's passions and values.
* **Relationships:** Meaning can be found in meaningful relationships with
others who share similar values and interests.
* **Contribution:** Contributing to something larger than oneself and
leaving a positive impact on the world can provide a sense of purpose.
* **Self-Care:** Prioritizing one's physical and emotional well-being is
crucial for maintaining a fulfilling life.
* **Experiencing the Beauty of Life:** Cultivating an appreciation for the
beauty of nature, art, music, and everyday experiences can foster a sense
of wonder and meaning.
* **Finding Meaning in Relationships:** Meaning can be found in deep and
meaningful connections with others who offer support, love, and
* **Exploring Meaning:** Engaging in meaningful activities, such as
learning, traveling, or pursuing creative pursuits, can lead to a sense of
* **Living in the Present:** Focusing on the present moment and savoring
each experience can help appreciate life's beauty and brevity.
* **Helping Others:** Contributing to a cause or making a positive
difference in the world can provide a sense of meaning and purpose.

**Remember, the meaning of life is ultimately a personal question that
each individual must explore and answer for themselves.**

**I hope this information is helpful. Please let me know if you have any
other questions.**

Run multimodal models too:

What's in this image? /Users/jmorgan/Desktop/smile.png

Import any GGUF, PyTorch or safetensors model: see the documentation

Pass in prompt as arguments:

ollama run llama2 "Summarize this file: $(cat"

Remove the model:

ollama rm gemma:2b-instruct
ollama rm phi

If Ollama gets stuck at “pulling manifest” restart it with:

systemctl daemon-reload
systemctl restart ollama
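Besides the interactive CLI, a running Ollama instance can also be called from code through its local REST API. The sketch below uses only the Python standard library and assumes Ollama is listening on its default address (`http://localhost:11434`) and that the model name you pass has already been pulled or created; the helper names and the default temperature are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint

def build_payload(model, prompt, temperature=0.6):
    """Request body for a single, non-streamed completion."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # return one JSON object instead of a stream of chunks
        "options": {"temperature": temperature},
    }

def generate(model, prompt, temperature=0.6, url=OLLAMA_URL):
    """Send the prompt to a locally running Ollama server and return its answer."""
    data = json.dumps(build_payload(model, prompt, temperature)).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]

payload = build_payload("phi", "What is the meaning of life?")
print(json.dumps(payload, indent=2))
# With the Ollama server running:
# print(generate("phi", "What is the meaning of life?"))
```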

Google colab

Google Colaboratory, also known as Google Colab, is a cloud-based platform provided by Google that offers free access to Jupyter notebook environments with GPU and TPU support. You can run Python code, execute commands, and perform data analysis tasks directly in the browser without the need for any setup or installation.

For playing with large language models that require GPUs for inference, Colab offers significant advantages with free and paid plans. Users can leverage Colab’s GPU support to execute large language models.

Free Colab users get free access to GPU and TPU runtimes for up to 12 hours. The GPU runtime comes with an Intel Xeon CPU @2.20 GHz, 13 GB RAM, a Tesla K80 or V100 accelerator, and 12 GB GDDR5 VRAM. Empirical evidence has shown that the Tesla K80 performs much better than a mobile or SoC GPU of similar spec.

Video where the Google Colab free plan outperforms M1 Pro and M3 Pro for a Transformer-based model.

Running LLMs in Google Colab:

The Falcon 7B instruct model takes a total of 13.67 GB of T4 GPU RAM. T4 GPUs are available in the free tier.

Full response:

Result: What is the meaning of life?
As an AI language model, I don't have my own personal opinions or beliefs,
but the meaning of life is a highly debated philosophical question
that may vary depending on individual beliefs and experiences.
What do you think?

AWS SageMaker Studio, SageMaker Jumpstart and Bedrock

Amazon SageMaker is a fully managed service provided by AWS that enables developers and data scientists to build, train, and deploy machine learning (ML) models at scale. It simplifies the ML workflow by providing tools for labelling data, selecting algorithms, training models, and deploying them into production. Key features include:

  1. End-to-End ML Workflow: SageMaker offers a seamless workflow, from building and training models to deploying them for inference.
  2. Managed Notebooks: Integrated Jupyter notebooks allow easy collaboration for data exploration and model development.
  3. Built-in Algorithms: SageMaker includes a variety of pre-built algorithms for common ML tasks, reducing the need for custom coding.
  4. Hyperparameter Tuning: Automated hyperparameter optimization helps find the best configuration for model performance.
  5. Model Deployment: Easily deploy models for real-time or batch processing with automatic scaling.
  6. A/B Testing: Conduct experiments with multiple models to evaluate and compare their performance.
  7. Secure and Scalable: Built-in security features and the ability to scale resources up or down based on demand.
  8. Support for Popular Frameworks: SageMaker supports popular ML frameworks like TensorFlow, PyTorch, and scikit-learn.

It provides a comprehensive environment for machine learning, covering the entire process from data preparation to model deployment and monitoring.

The Jupyter notebooks in SageMaker Studio run on AWS infrastructure. That means you can use powerful GPUs on demand and stop billing as soon as you no longer need them. This helps tremendously if you don’t have the resources to invest in powerful GPUs like the Nvidia A100 and H100. At the time of writing, however, a single one of these GPUs offers at most 80 GB of GPU RAM, and since you will likely want to train models larger than 1 billion parameters, you will need several of them connected together.

Amazon EC2 p4d.24xlarge instances have 8 Nvidia A100 GPUs with 320 GB of total GPU memory, at a price of $32.77 per hour (the p4de.24xlarge variant with 80 GB A100s provides 640 GB in total). Depending on your use case, this can still be a lot more cost-effective than building your own GPU cluster such as an Nvidia DGX A100 system.

SageMaker Studio

Amazon SageMaker Studio with the new UI announced at re:Invent 2023

Amazon SageMaker Studio is a cloud-based, fully managed integrated development environment (IDE) with pre-built tools for end-to-end ML development.

SageMaker Studio streamlines the ML workflow, making it accessible to data scientists, developers, and engineers. Its comprehensive set of features distinguishes it as a powerful and user-friendly platform for ML development and deployment.

  1. Unified Platform: SageMaker Studio unifies various ML tasks like data exploration, training, and deployment within a single, visual interface.
  2. Collaboration: Multiple team members can collaborate seamlessly using shared notebooks and resources, fostering teamwork.
  3. One-Click Model Deployment: Easily deploy models to endpoints for real-time or batch predictions with a single click.
  4. Experiment Tracking: Automatically tracks experiments, parameters, and results for easy model comparison and reproducibility.
  5. Integrated Debugger: Debug and profile ML models during training to identify and fix issues efficiently.
  6. Automatic Model Tuning: Hyperparameter tuning is simplified with automated optimization for improved model performance.
  7. Security and Compliance: Built-in security features ensure data privacy and compliance with regulations.
  8. Versatility: Supports popular ML frameworks like TensorFlow, PyTorch, and scikit-learn, offering flexibility in model development.

To play with LLMs in SageMaker Studio, and with LLMs on SageMaker in general, your AWS account must be pre-approved for higher-grade GPUs. If you have a new account, or it has only been used for benign services since its creation, you will have to request a “Service Quota” limit increase for the particular instance type and usage type, such as notebooks, endpoints or other jobs.

Example of a quota increase for the ml.g5.4xlarge instance, which has 24 GiB of GPU memory, for SageMaker endpoint usage at an on-demand price of only $1.624 per hour.

Similarly, you will have to request a quota increase to use higher-grade GPUs in SageMaker Studio.

Playing with LLMs in SageMaker Studio is the same as playing with them in your local notebooks or in Google Colab, except that you have access to industrial-grade GPUs specialized for ML/AI.

Demo of inference on SageMaker studio:

  1. If you haven’t already, first set up SageMaker in your AWS account and enter the Studio environment.
  2. Start a new Studio notebook with the default environment, or create a new environment with the following commands (recommended):
export CONDA_ENV_NAME=llm-playground \
&& conda create --name "${CONDA_ENV_NAME}" python pip ipykernel -y \
&& conda activate "${CONDA_ENV_NAME}" \
&& python -s -m ipykernel install --user --name "${CONDA_ENV_NAME}" --display-name "Python LLM Playground"

In the above commands, we first create a new conda environment, llm-playground, with the python, pip and ipykernel packages. After activating it, we install a Jupyter kernel so that we can choose this environment in a Jupyter notebook.

3. After creating the ipykernel, start a new notebook. Select the kernel created above and use the notebook like any other Jupyter notebook.

SageMaker Studio Lab

SageMaker Studio Lab enables quick experimentation with machine learning tools, frameworks and libraries in a plug-and-play environment similar to Google Colab. It is completely free and separate from your AWS account.

You can choose between CPU and GPU, and it also provides many popular pre-packaged frameworks and customization options, with dedicated storage of up to 15 GB.

The catch with Studio Lab is that it seems to be available only to some VIP users. I requested an account months ago and did not receive access. Then I used my university email ID to request access and it was approved instantly.

The other catch is that, because it is free, you have access to the resources for a very limited time before the allowance resets the next day (and AWS can change the limits at any time).

  • Max continuous notebook run time on CPU = 4 hours
    In a whole day you can run two such cycles, or many shorter continuous jobs, for a total of 8 hours per day.
  • Max GPU notebook usage in a day = 4 hours

If you can get access, you can easily play with models like Mistral and Stable Diffusion in Studio Lab.

  1. Login to sagemaker studio lab
  2. Scroll down and click on “Open Notebook” in the “make AI generated image” box.

3. On the new page, change compute type to GPU and click on “Start runtime”

4. Click on “copy to project”, it will open SageMaker studio

5. Run the jupyter notebook cells, one by one

Log in with your Hugging Face token to download the Stable Diffusion model from Hugging Face:

If you get “No module named ‘torch’”, install PyTorch using the command from the official website and restart the runtime:

Default prompt from Notebook


"create an image of a person standing alone in a subway station, "
"under a single bright fluorescent light, surrounding tiles reflect the light, "
"casting soft shadows around the individual engrossed in their smartphone, "
"ambiance is quiet, contemplative, with the architectural details of the subway, "
"turnstiles, signs, faintly visible in the periphery, "
"suggesting an urban narrative of isolation amidst the city's rush",


Output with same prompt on Midjourney 6 :

SageMaker Jumpstart

Released in December 2020, SageMaker JumpStart provides easy access to pretrained, open-source models for a wide range of problem types. You can incrementally train and tune these models before deployment. JumpStart also provides solution templates that set up infrastructure for common use cases, and executable example notebooks for machine learning.

You can deploy, fine-tune, and evaluate pretrained models from popular model hubs through JumpStart.

Let’s deploy the Llama 2 7-billion-parameter chat model:

If you haven’t requested g5 instances before, start by requesting a quota increase for ml.g5.2xlarge for endpoint usage through Service Quotas. Please note that it might take up to 24 hours for the ticket to be resolved.

Navigate to your SageMaker Studio, and under “SageMaker JumpStart,” select “Llama-2-7b-chat.”

Llama 2 7B Chat on Sagemaker Jumpstart

Change the deployment configuration as desired, note the “Endpoint Name” variable ENDPOINT_NAME, and click on the “Deploy” button.

This model is not available for fine-tuning; use “Llama 2 7b” to fine-tune on your dataset.

After a couple of minutes, the endpoint status should be in service.

Sagemaker Jumpstart Endpoint Status

You can also check endpoint status from “Endpoints” menu item which is under “Deployments” in Sagemaker Studio.

Endpoint Details


We can also deploy the same model using the SageMaker Python SDK.

from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(model_id="meta-textgeneration-llama-2-7b-f")
example_payloads = model.retrieve_all_examples()

# Llama 2 requires accepting Meta's end-user license agreement before deployment.
accept_eula = True

if accept_eula:
    predictor = model.deploy(accept_eula=accept_eula)

    # Send each bundled example payload to the endpoint and print the result.
    for payload in example_payloads:
        response = predictor.predict(payload.body)
        prompt = payload.body[payload.prompt_key]
        generated_text = response[0]["generated_text"]
        print("\nInput\n", prompt, "\n\nOutput\n\n", generated_text, "\n\n===============")

To play with the latest LLMs this way, they first have to be available in SageMaker JumpStart.

Amazon Bedrock

Amazon Bedrock became Generally Available (GA) around September 2023. It is a fully managed service that makes high-performing foundation models (FMs) from a few third-party providers (e.g., AI21 Labs, Anthropic, Cohere, Stability AI, and Meta with Llama 2), as well as from Amazon itself (e.g., Titan), available through a unified boto3 API.

Because it is a serverless service, you can get started quickly, privately customize foundation models with your own data, and easily and securely integrate and deploy them into your applications using AWS tools, without having to manage any infrastructure.

But as with other things on AWS, and thanks to bitcoin miners, you must request access to a foundation model before you can use it. If you try to use the model (with the API or within the console) before you have requested access to it, you will receive an error message.

Amazon Bedrock Playground

Invoke Amazon Bedrock models using boto3:

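For a streaming response, you need to call the invoke_model_with_response_stream API:

import json

# `client` is a boto3 "bedrock-runtime" client, e.g. boto3.client("bedrock-runtime").
# The request arguments (modelId, body) were cut off in the source and are left elided here.
response = client.invoke_model_with_response_stream(...)

stream = response["body"]
if stream:
    for event in stream:
        chunk = event.get("chunk")
        if chunk:
            # Each chunk carries JSON-encoded bytes; decode and print them.
            print(json.loads(chunk["bytes"].decode()))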

Full response:

{'completion': ' Here', 'stop_reason': None, 'stop': None}
{'completion': ' is a 100 word essay', 'stop_reason': None, 'stop': None}
{'completion': ' about Snickers candy', 'stop_reason': None, 'stop': None}
{'completion': ' bars:\n\nS', 'stop_reason': None, 'stop': None}
{'completion': 'nickers is one of', 'stop_reason': None, 'stop': None}
{'completion': ' the most popular candy bars around. Introdu', 'stop_reason': None, 'stop': None}
{'completion': 'ced in 1930, it consists of nougat topped with', 'stop_reason': None, 'stop': None}
{'completion': ' caramel and peanuts that is encased in milk chocolate', 'stop_reason': None, 'stop': None}
{'completion': '. With its sweet and salty taste profile,', 'stop_reason': None, 'stop': None}
{'completion': ' Snickers provides the perfect balance of flavors. The candy', 'stop_reason': None, 'stop': None}
{'completion': " bar got its name from the Mars family's", 'stop_reason': None, 'stop': None}
{'completion': ' favorite horse. Bite', 'stop_reason': None, 'stop': None}
{'completion': ' into a Snickers and the rich', 'stop_reason': None, 'stop': None}
{'completion': ' chocolate and caramel intermingle in your mouth while the', 'stop_reason': None, 'stop': None}
{'completion': ' crunch of peanuts adds text', 'stop_reason': None, 'stop': None}
{'completion': 'ural contrast. Loaded with sugar, Snick', 'stop_reason': None, 'stop': None}
{'completion': 'ers gives you a quick burst of energy. It', 'stop_reason': None, 'stop': None}
{'completion': "'s a classic candy bar that has endured for", 'stop_reason': None, 'stop': None}
{'completion': ' decades thanks to its irresistible combination', 'stop_reason': None, 'stop': None}
{'completion': ' of chocolate,', 'stop_reason': None, 'stop': None}
{'completion': ' caramel, noug', 'stop_reason': None, 'stop': None}
{'completion': "at and peanuts. Snickers' popularity shows", 'stop_reason': None, 'stop': None}
{'completion': ' no signs of waning anytime soon.', 'stop_reason': 'stop_sequence', 'stop': '\n\nHuman:', 'amazon-bedrock-invocationMetrics': {'inputTokenCount': 21, 'outputTokenCount': 184, 'invocationLatency': 8756, 'firstByteLatency': 383}}
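
To turn chunks like the ones above back into a single string, the per-chunk payloads can be decoded and concatenated. The following is a minimal sketch, not the article's original code: `iter_completions` is a hypothetical helper, and it assumes each event has the shape `{"chunk": {"bytes": b"..."}}` with a JSON payload containing a `completion` key, as in the output shown above. Fake events stand in for `response["body"]` so the snippet runs without AWS credentials.

```python
import json
from typing import Iterable, Iterator


def iter_completions(stream: Iterable[dict]) -> Iterator[str]:
    """Yield the 'completion' text carried by each chunk event of a response stream."""
    for event in stream:
        chunk = event.get("chunk")
        if chunk:
            payload = json.loads(chunk["bytes"].decode("utf-8"))
            yield payload.get("completion", "")


# Assumption: hand-made events that mimic the first two chunks of the output above.
fake_stream = [
    {"chunk": {"bytes": json.dumps({"completion": " Here", "stop_reason": None}).encode("utf-8")}},
    {"chunk": {"bytes": json.dumps({"completion": " is a 100 word essay", "stop_reason": None}).encode("utf-8")}},
]

full_text = "".join(iter_completions(fake_stream))
print(full_text)
```

With a real response, passing `response["body"]` instead of `fake_stream` should work under the same assumptions about the payload format.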

Deploy HuggingFace model on SageMaker endpoint

To address your FOMO (Fear of Missing Out) on the latest and greatest LLM whose creators claim to have surpassed GPT-3.5 or GPT-4 performance, you can quickly deploy (almost) any HuggingFace Large Language Model using the SageMaker infrastructure.

Later, we will talk about how to evaluate a new LLM in your existing pipeline to gauge its performance before deployment.

Go to the model card page and click on the “Deploy” dropdown:

Select Amazon SageMaker and copy the boilerplate code:

Paste it into your Jupyter notebook or SageMaker Studio notebook and execute it. Wait for the model endpoint status to change to “InService” before invoking the endpoint.

In the model deployment section, we will cover how SageMaker model deployment works in detail.

Architectural Patterns for LLMs

  • Foundation models
  • Prompt engineering
    - Tokens
    - In-Context Learning
    - Zero-Shot inference
    - One-shot inference
    - Few-shot inference
  • Retrieval Augmented Generation (RAG)
    - RAG Workflow
    - Chunking
    - Document Loading and Vector Databases
    - Document Retrieval and reranking
    - Reranking with Maximum Marginal Relevance
  • Customize and Fine tuning
    - Instruction fine-tuning
    - Parameter efficient fine-tuning
    - LoRA and QLoRA
  • Reinforcement learning from human feedback (RLHF)
    - Reward model
    - Fine-tune with RLHF
  • Pretraining (creating from scratch)
    - Continuous pre-training
    - Pretraining datasets
    - HuggingFace autotrain
  • Agents
    - Agent orchestration
    - Available agents

Foundation models

This is not an architectural pattern.

Foundation models are very large and complex neural network models consisting of billions of parameters. The model parameters are learned during the training phase, often called pretraining. They are trained on massive amounts of training data, typically over a period of weeks or months, using large, distributed clusters of CPUs and GPUs. After learning billions of parameters (a.k.a. weights), these foundation models can represent complex entities such as human language, images, videos and audio clips.

In most cases, you will not use foundation models as they are, because they are text completion models (at least for NLP tasks). When these models are fine-tuned using Reinforcement Learning from Human Feedback (RLHF), they become safer and more adaptable to general tasks such as question answering and chatbots.

Llama 2 is a Foundation model

Llama 2 Chat has been fine-tuned for chat from the base Llama 2 foundation model.

Suffixes in model names such as qa, chat and base reflect the original or fine-tuned objective of the model.

Common approaches to customizing foundation models (FMs)

Why Customize

  • Customize to specific business needs, e.g. healthcare: understand medical terminology and provide accurate responses related to a patient’s health
  • Adapt to domain-specific language, e.g. finance: teach financial and accounting terms to provide good analysis of earnings reports
  • Enhance performance for specific tasks, e.g. customer service: improve the ability to understand and respond to customers’ inquiries and complaints
  • Improve context-awareness in responses, e.g. legal services: better understand case facts and law to provide useful insights for attorneys

Data is the differentiator for generative AI applications.

Prompt engineering

While generative AI tasks can span multiple content modalities, they often involve a text-based input. This input is called a prompt and includes the instruction, context, and any constraints used to accomplish a given task.

What you do with ChatGPT is prompting, and the model responds with a completion. Depending on the model, a “completion” can also be non-text-based, such as an image, video or audio. The Prompt Engineering course is short and can be completed in a few hours.

Tokens

Generative models convert our natural language into a sequence of “tokens”, or word fragments. By combining many of these tokens in different ways, the model is capable of representing an exponential number of words using a relatively small number of tokens, often on the order of 30,000–100,000 tokens in the model’s vocabulary.

Example of possible number of tokens in the following sentence:

Because he was late again, he would be docked a day’s pay.

Source of tool:

Notice that the word docked is split into two tokens by the OpenAI tokenizer, and similarly the word day’s. You cannot know in advance which single word will be split into two or more tokens, because this varies from model to model.

As a rule of thumb, 75 English words correspond to roughly 100 tokens, i.e. about 1.3 tokens per word. You can use the 1.3 multiplier to estimate the cost of services that use token-based pricing.
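
The rule of thumb can be turned into a quick estimator. This is a minimal sketch with hypothetical function names; it counts whitespace-separated words rather than running a real tokenizer, and the price per 1,000 tokens is a made-up placeholder, not any provider's actual rate.

```python
def estimate_tokens(text: str) -> int:
    """Estimate the token count from a whitespace word count (75 words ~= 100 tokens)."""
    words = len(text.split())
    return round(words * 100 / 75)


def estimate_cost_usd(text: str, price_per_1k_tokens: float) -> float:
    """Estimate the cost of processing `text` under token-based pricing."""
    return estimate_tokens(text) / 1000 * price_per_1k_tokens


sentence = "Because he was late again, he would be docked a day's pay."
estimated_tokens = estimate_tokens(sentence)

# Assumption: 0.002 USD per 1,000 tokens is an illustrative price only.
estimated_cost = estimate_cost_usd(sentence, price_per_1k_tokens=0.002)

print(estimated_tokens, estimated_cost)
```

A real tokenizer may produce a different count for the same sentence, so treat the result as a budgeting aid rather than a billing figure.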

In-Context Learning

Context refers to relevant information or details that you pass to the model, so it better understands the task or topic and responds appropriately. Passing context allows for more coherent and meaningful interactions with the model.

If we ask a model “What is 10 + 10?”, it may return 20 or some other plausible value, because we did not specify that 10 + 10 is an arithmetic expression or an addition problem. To make the model respond the way you want it to, we provide examples as part of the prompt, and we call this in-context learning.

In the above case, we would provide the following examples as part of the prompt:

1 + 1 is an addition problem.
1 - 1 is a subtraction problem.
1 X 1 is a multiplication problem.
1 / 1 is a division problem.

In the next example, we ask the Llama 2 70B model a question and also add context for the question to help the model answer.


User: Who won the 2023 cricket world cup?
The final took place between India and Australia at Narendra Modi Stadium on 19 November with Australia winning the title for the sixth time.


Model: Congratulations to Australia for winning the 2023 Cricket World Cup! It was a thrilling competition, and we’re glad to have been able to provide you with the answer to your question. If you have any more questions or need further assistance, please don’t hesitate to ask!

Chat with base Llama 2 model:

Depending on how many examples you provide, this is called one-shot or few-shot inference. The model’s ability to learn from those examples and adapt its responses accordingly is called “in-context learning”.

Zero-Shot inference

If you pass one prompt-completion pair into the context, this is called one-shot inference; if you pass no example at all, this is called zero-shot inference.

Zero-shot inference is often used to evaluate a model’s ability to perform a task that it hasn’t been explicitly trained on or seen examples for. For zero-shot inference, the model relies on its preexisting knowledge and generalization capabilities to make inferences or generate appropriate outputs, even when it encounters tasks or questions it has never seen before.

Larger models are surprisingly good at zero-shot inference.

Same question without context to Llama 2 70B model and ChatGPT 3.5:


User: Who won the 2023 cricket world cup?

Llama 2 70B

Response of the free version of ChatGPT:

ChatGPT 3.5

Response from open-source Mistral 7B model:

Mistral 7B

Note: I have used a mix of Gradio and Streamlit in the examples in this article. This article talks about how to deploy LLMs using these tools in detail:

One-shot inference

The following example adds an instruction and one-shot prompt in the context:


User: Answer the question using the format shown in the context.
Why stop signs are in red colour?
Because red colour has the highest wavelength due to which it the most likely colour to be seen from maximum distance.
Why is the sky blue?

Llama 2 70B one-shot inference
Mistral 7B one-shot inference

Few-shot inference

If we pass a number of prompt-completion pairs in the context, it is called few-shot inference. With more examples, or shots, the model more closely follows the response pattern of the in-context prompt-completion pairs.

You are a apparel recommender agent for an Indian apparel company. 
Your job is to suggest different types of apparel one can wear
based on the user's query.
You can understand the occasion and recommend the
correct apparel items for the occasion if applicable, or
just output that specific apparels if user is already very specific.
Below are few examples with reasons as to why the particular
item is recommended:

User question - show me blue shirts
Your response - blue shirts
Reason for recommendation - user is already specifc in their query, nothing to recommend

User question - What can I wear for office party?
Your response - semi formal dress, suit, office party, dress
Reason for recommendation - recommend apparel choices based on occassion

User question - I am doing shopping for trekking in mountains what do you suggest
Your response - heavy jacket, jeans, boots, winsheild, seweater.
Reason for recommendation - recommend apparel choices based on occassion

User question - What should one person wear for their child's graduation ceremony?
Your response - Dress or pantsuit, Dress shirt, heels or dress shoes, suit, tie
Reason for recommendation - recommend apparel choices based on occassion

User question - sunflower dress
Your response - sunflower dress
Reason for recommendation - user is specific about their query, nothing to recommend

User question - What's is the price of 2nd item
Your response - '##detail##'
Reason for recommendation - User is asking for information related to product already recommender, in that case you should only return '##detail##'

User question - what is the price of 4th item in the list
Your response - '##detail##'
Reason for recommendation - User is asking for information related to product already recommender, in that case you should only return '##detail##'

User question - What's are their brand names?
Your response - '##detail##'
Reason for recommendation - User is asking for information related to product already recommender, in that case you should only return '##detail##'

User question - show me more products with similar brand to this item
Your response - your respone must be the brand name of the item
Reason for recommendation - User is asking for similar products, return the original product

User question - do you have more red dresses in similar patters
Your response - your response must be the name of that red dress only
Reason for recommendation - User is asking for similar products, return the original product

Only suggest the apparels or only relevant information, do not
return anything else.
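
A prompt like the one above can also be assembled programmatically, which makes it easy to switch between zero-shot, one-shot and few-shot inference by changing the number of examples. The sketch below is an illustration only: `build_prompt` is a hypothetical helper, the shortened instruction is a stand-in for the full one, and the "User question" / "Your response" labels simply mirror the format used above.

```python
from typing import List, Tuple


def build_prompt(instruction: str, examples: List[Tuple[str, str]], question: str) -> str:
    """Join an instruction, zero or more prompt-completion pairs, and the new question."""
    parts = [instruction.strip(), ""]
    for user_question, response in examples:
        parts.append(f"User question - {user_question}")
        parts.append(f"Your response - {response}")
        parts.append("")
    # Leave the final response empty so the model completes it.
    parts.append(f"User question - {question}")
    parts.append("Your response -")
    return "\n".join(parts)


instruction = "You are an apparel recommender agent."
examples = [
    ("show me blue shirts", "blue shirts"),
    ("What can I wear for office party?", "semi formal dress, suit, office party, dress"),
]

zero_shot = build_prompt(instruction, [], "sunflower dress")
one_shot = build_prompt(instruction, examples[:1], "sunflower dress")
few_shot = build_prompt(instruction, examples, "sunflower dress")
print(few_shot)
```

Keeping the examples in a list also makes it straightforward to measure how response quality changes as shots are added or removed.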

Retrieval Augmented Generation (RAG)

Note: The content and code examples in this section are derived from the book: Generative AI on AWS

RAG isn’t a specific set of technologies but rather a framework for giving LLMs access to data they did not see during training. RAG allows LLM-powered applications to use external data sources and applications to overcome knowledge limitations, such as the January 2022 knowledge cut-off of ChatGPT 3.5.

RAG Use cases

  • Improved content quality: E.g., helps in reducing hallucinations and connecting with recent knowledge including enterprise data
  • Contextual chatbots and question answering: E.g., enhance chatbot capabilities by integrating with real-time data
  • Personalized search: E.g., searching based on a user’s previous search history and persona
  • Real-time data summarization: E.g., retrieving and summarizing transactional data from databases, or API calls

Types of retrieval

With RAG, you augment the context of your prompts with relevant information needed to address the knowledge limitations of LLMs and improve the relevancy of the model’s generated output. RAG has grown in popularity due to its effectiveness in mitigating challenges such as knowledge cutoffs and hallucinations by incorporating dynamic data sources into the prompt context, without needing to continually fine-tune the model as new data arrives in your system.

Architecture of RAG

This additional external data, supplied at runtime and not contained within the LLM’s “parametric memory”, can come from a number of data sources, including knowledge bases, document stores, databases, and data that is searchable through the internet.

  • Documents: domain-specific knowledge bases, document stores
  • Internet: Google, Wikipedia
  • Vector databases: Chroma, Qdrant, pgvector

RAG Workflow

RAG in action

At a high level, there are two common workflows to consider — preparation of data from external knowledge sources, then the integration of that data into consuming applications.

Data preparation involves ingesting the data sources as well as capturing key metadata that describes each data source. If the information source is a PDF, there will be an additional task to extract text from those documents.

RAG architecture depends on efficient data preparation

Application integration involves retrieving the most semantically similar information from those external data sources based on an input prompt, followed by a re-ranking process, and then augmenting the input prompt with the most relevant information before using that augmented prompt to call the LLM.


Chunking

Chunking breaks down larger pieces of text into smaller segments. It is required due to context window limits imposed by the LLM. For example, if the model only supports 4,096 input tokens in the context window, you will need to adjust the chunk size to account for this limit.

SentenceTransformer “all-min” model maximum input sequence length = 256
Input sentence = “The cat sat on the mat. And the dog …”
Tokenized input sequence = [“The”, ” “, ”cat”, ” “, ”sat”, “ “, “on”, “ “, “the”, “ “, “mat”, “.”, “And”, “ “]
(Input sequence was clipped)

The chunks should contain information that is semantically related and that has meaningful context in that single chunk. You can use fixed-size chunking that splits data using a fixed number of tokens, which is an easy method. Alternatively, you can use context-aware chunking methods, which aim to chunk data with more consideration around understanding the context of the data and keeping relevant text together.

When experimenting with chunking size, you can overlap a defined amount of text between chunks. Overlap can help preserve context between chunks.
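
Fixed-size chunking with overlap can be sketched in a few lines. This is an illustration rather than a production splitter: `chunk_tokens` is a hypothetical helper, and whitespace-separated words stand in for real model tokens.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int, overlap: int = 0) -> List[List[str]]:
    """Split tokens into chunks of `chunk_size`, repeating `overlap` tokens between neighbours."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a chunk has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Assumption: words approximate tokens for the purpose of this example.
words = "The cat sat on the mat. And the dog lay by the door.".split()
chunks = chunk_tokens(words, chunk_size=5, overlap=2)
for chunk in chunks:
    print(" ".join(chunk))
```

Each chunk starts with the last two words of the previous one, which is how overlap preserves context across chunk boundaries.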

Example of simple embeddings using Sentence Transformer:

Document Loading and Vector Databases

RAG-based architectures are capable of pulling data from a number of sources, but I will focus specifically on information retrieval from documents. A common implementation for document search and retrieval includes storing the documents in a vector store, where each document is indexed based on an embedding vector produced by an embedding model.

There are also battle-tested vector databases that specialise in the storage and retrieval of high-dimensional vectors in a distributed environment.

Check out the vector databases comparison here:

It was formerly on Google Sheets but has now been moved to the above superlinked [dot] com URL.

Vector DB Comparison

Each embedding aims to capture the semantic or contextual meaning of the data, and semantically similar concepts end up close to each other (have a small distance between them) in the vector space. As a result, information retrieval involves finding nearby embeddings that are likely to have similar contextual meaning.

Depending on the vector store, you can often store additional metadata, such as a reference to the original content the embedding was created from, along with each vector embedding.

Beyond storage, vector databases also support different indexing strategies to enable low-latency retrieval from thousands of candidates. Common indexing strategies include HNSW and IVFFlat.

Example of pgvector with SQLAlchemy:

Should I go with ANN libraries or battle-tested vector databases?

  • ANN libraries like FAISS and Annoy can be kept in memory or persisted on disk. If more than one program updates an index persisted on disk, records can be lost or the index can become corrupted.
  • If your team already uses Postgres, then pgvector is a good choice over ANN libraries, which need to be self-hosted.
  • Vector databases will evolve and can be used for more than one purpose. A distributed vector database like Qdrant can do semantic search, recommendations (with a native API) and much more.

Document Retrieval and reranking

Once the text from a document has been embedded and indexed, it can then be used to retrieve relevant information by the application.

Suppose the input prompt includes the question “How do I create a new team member?” and the LLM has no knowledge of the proprietary SaaS product the question refers to. The application first uses an embedding model to create a vector embedding of the input prompt, then uses that embedding to query the vector store for embeddings that are semantically similar to the input prompt. Based on those results, the relevant document text is retrieved.
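
The retrieval step boils down to a nearest-neighbour search over embeddings. The sketch below uses exact, brute-force cosine similarity in pure Python as a stand-in for an ANN library or vector store; the helper names are hypothetical and the three-dimensional vectors are made-up toy values, not the output of a real embedding model.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float], doc_vecs: List[Sequence[float]], k: int = 2) -> List[Tuple[int, float]]:
    """Return (document index, similarity) pairs for the k most similar documents."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Assumption: toy documents and hand-written embeddings for illustration only.
docs = ["How to add a team member", "Billing and invoices", "Inviting users to a team"]
doc_vecs = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9], [0.8, 0.3, 0.1]]
query_vec = [1.0, 0.2, 0.0]  # pretend embedding of "How do I create a new team member?"

results = top_k(query_vec, doc_vecs, k=2)
for index, score in results:
    print(f"{score:.3f}  {docs[index]}")
```

ANN indexes trade a little of this exactness for much lower latency once the number of vectors grows large.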

Example of retrieval with popular ANN libraries like FAISS and SCANN:

You may want to rerank the similarity results returned from the vector store to help diversify the results beyond just the similarity scores and improve relevance to the input prompt.

A popular reranking algorithm that is built into most vector stores is Maximum Marginal Relevance (MMR). MMR aims to maintain relevance to the input prompt while reducing redundancy in the retrieved results, since the retrieved results can often be very similar to each other. This helps to provide context in the augmented prompt that is relevant as well as diverse.

Reranking with Maximum Marginal Relevance

Maximum Marginal Relevance (MMR) encourages diversity in the result set. Rather than ranking on similarity scores alone, the retriever also applies a diversity factor between 0 and 1, where 0 means maximum diversity and 1 means minimum diversity.
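
A greedy version of MMR can be sketched as follows. This is one common formulation, not necessarily what any particular vector store implements: each step picks the candidate maximizing `lambda_mult * relevance - (1 - lambda_mult) * redundancy`, where `lambda_mult` plays the role of the diversity factor described above. The function names and the two-dimensional vectors are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec: Sequence[float], doc_vecs: List[Sequence[float]], k: int, lambda_mult: float = 0.5) -> List[int]:
    """Greedily select k document indices, balancing relevance against redundancy."""
    candidates = list(range(len(doc_vecs)))
    selected: List[int] = []
    while candidates and len(selected) < k:
        best_index, best_score = candidates[0], -math.inf
        for i in candidates:
            relevance = cosine_similarity(query_vec, doc_vecs[i])
            # Redundancy is the highest similarity to anything already selected.
            redundancy = max((cosine_similarity(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_index, best_score = i, score
        selected.append(best_index)
        candidates.remove(best_index)
    return selected


# Assumption: documents 0 and 1 are near-duplicates; document 2 is less relevant but different.
doc_vecs = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]
query_vec = [1.0, 0.2]

relevance_only = mmr(query_vec, doc_vecs, k=2, lambda_mult=1.0)
diversified = mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5)
print(relevance_only, diversified)
```

With the factor at 1 the two near-duplicates are returned together, while a lower value swaps one of them for the more distinct document.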

End-to-End RAG Implementation and orchestration

Customize and Fine tuning

Note: The content and code examples in this section are derived from the book: Generative AI on AWS

You can customize or tune various inference hyperparameters of the LLM, such as temperature, top-p and top-k, based on the performance of the LLM on your own dataset. The section on Evaluating LLMs describes various methods to evaluate LLMs.
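
To make these three knobs concrete, the sketch below shows one common way they reshape the next-token distribution: temperature rescales the logits, top-k keeps only the k most likely tokens, and top-p keeps the smallest set of tokens whose cumulative probability reaches p. It is a simplified illustration with a hypothetical function name and made-up logits; real inference libraries differ in details such as the order in which the filters are applied.

```python
import math
from typing import Dict


def next_token_distribution(
    logits: Dict[str, float], temperature: float = 1.0, top_k: int = 0, top_p: float = 1.0
) -> Dict[str, float]:
    """Apply temperature, then top-k, then top-p, and renormalize the probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = {token: logit / temperature for token, logit in logits.items()}
    max_logit = max(scaled.values())  # subtract the max for numerical stability
    exps = {token: math.exp(value - max_logit) for token, value in scaled.items()}
    total = sum(exps.values())
    probs = sorted(((t, e / total) for t, e in exps.items()), key=lambda p: p[1], reverse=True)
    if top_k > 0:
        probs = probs[:top_k]
    if top_p < 1.0:
        kept, cumulative = [], 0.0
        for token, prob in probs:
            kept.append((token, prob))
            cumulative += prob
            if cumulative >= top_p:
                break
        probs = kept
    norm = sum(prob for _, prob in probs)
    return {token: prob / norm for token, prob in probs}


# Assumption: toy logits for three candidate tokens.
logits = {"cat": 2.0, "dog": 1.0, "car": 0.1}
base = next_token_distribution(logits)
sharper = next_token_distribution(logits, temperature=0.5)
top2 = next_token_distribution(logits, top_k=2)
nucleus = next_token_distribution(logits, top_p=0.6)
print(base, sharper, top2, nucleus, sep="\n")
```

Lower temperatures and tighter top-k or top-p values make outputs more deterministic, while higher values allow more varied completions.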

When we adapt foundation models on our custom datasets and use cases, we call this process fine-tuning. There are two main fine-tuning techniques, instruction fine-tuning and parameter efficient fine-tuning.

Customize vs augment

Instruction fine-tuning

The models that humans most commonly interact with are called “instruct” or “chat” models. These models are fine-tuned with instructions using their foundation model equivalent as the base model. The instruct variants are useful for general-purpose chatbot interfaces, as they are capable of performing many tasks, accept humanlike prompts, and generate humanlike responses.

In contrast to the billions of tokens needed to pretrain a foundation model, you can achieve very good results with instruction fine-tuning using a relatively small instruction dataset — often just 500–1,000 examples is enough. Typically, however, the more examples you provide to the model during fine-tuning, the better the model becomes.

To preserve the model’s general-purpose capability and prevent “catastrophic forgetting” in which the model becomes so good at a single task that it may lose its ability to generalize, you should provide the model with many different types of instructions during fine-tuning.

The FLAN instruction dataset is a collection of 473 different datasets across 146 task categories and nearly 1,800 fine-grained tasks. One of the datasets in the FLAN collection, samsum, contains 16,000 conversations and human-curated summaries. These conversations and summaries were created by linguistics experts to produce high-quality training examples for a dialogue-summarization generative task.

To use your own custom dataset, you must first convert it into an instruction dataset, for example by applying the FLAN-T5 prompt template. You can then use it to fine-tune various models, using backpropagation of the loss to improve the generative model or adapt it to your proprietary product.
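As a rough illustration of that conversion, the sketch below wraps raw dialogue and summary pairs in an instruction-style prompt. The template wording, the `to_instruction_example` helper, and the sample record are assumptions made for this example; the exact FLAN-T5 templates may differ.

```python
def to_instruction_example(dialogue, summary):
    """Wrap one raw (dialogue, summary) pair in an instruction-style prompt."""
    prompt = (
        "Summarize the following conversation.\n\n"
        f"{dialogue.strip()}\n\n"
        "Summary: "
    )
    return {"prompt": prompt, "completion": summary.strip()}

# Hypothetical raw records from a proprietary dataset.
raw_records = [
    {
        "dialogue": "Ana: Can we move the demo to 3pm?\nRaj: Sure, 3pm works.",
        "summary": "Ana and Raj moved the demo to 3pm.",
    },
]

instruction_dataset = [
    to_instruction_example(record["dialogue"], record["summary"])
    for record in raw_records
]
```

Each resulting item pairs a prompt with the completion the model should learn to produce, which is the shape a supervised fine-tuning loop typically expects.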

Full code is available in this article’s github repo:

Parameter efficient fine-tuning

Parameter-efficient fine-tuning (PEFT) provides a set of techniques that allow you to fine-tune LLMs while using fewer compute resources.

There are a variety of PEFT techniques and categories explored in the paper Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning, but in general, each focuses on freezing all or most of the model’s original parameters and extending or replacing model layers by training an additional, much smaller, set of parameters. The most commonly used techniques fall into the additive and reparameterization categories.

Additive techniques, such as prompt tuning, augment the model by fine-tuning and adding extra parameters or layers to the pretrained model.

Reparameterization techniques, such as Low-Rank Adaptation (LoRA) and QLoRA, allow for adaptation using low-rank representations to reduce the number of training parameters and compute resources required to fine-tune.

At a high level, with full fine-tuning, you’re updating every model parameter through supervised learning, whereas PEFT techniques freeze the parameters of the pretrained model and fine-tune a smaller set of parameters.

Full fine-tuning often requires a large amount of GPU RAM, which quickly increases the overall computing budget and cost. With PEFT, in some cases the number of newly trained parameters is just 1–2% of the original LLM weights. Because you’re training a relatively small number of parameters, the memory requirements for fine-tuning become more manageable, and fine-tuning can often be performed on a single GPU.

In addition, PEFT methods are less prone to catastrophic forgetting, because the weights of the original foundation model remain frozen, preserving the model’s original knowledge or parametric memory.

LoRA and QLoRA

First introduced in a 2021 paper, LoRA is a commonly used PEFT technique that freezes the original weights of the LLM and injects new, trainable low-rank matrices into each layer of the Transformer architecture. The researchers of the paper highlight that foundation models often have a low intrinsic dimension, meaning that they can often be described with far fewer dimensions than what is represented in the original weights.

Building on this, they hypothesised that the updates to model weights (parameters) also have a low intrinsic rank during model adaptation, meaning you can use smaller matrices, with fewer dimensions, to fine-tune. This fine-tuning method reduces the number of trainable parameters and, as a result, the training time, compute, and storage resources required.

LoRA is also used for multimodal models like Stable Diffusion, which uses a Transformer-based language model to help align text to images.

LoRA freezes all of the original model parameters and inserts a pair of rank decomposition matrices alongside the original weights of a targeted set of layers in the model, typically the linear layers, including self-attention.

These rank decomposition matrices have significantly fewer parameters than the original model weights that they learn to represent during LoRA fine-tuning. The dimensions of the smaller matrices are defined so that their product is a matrix with the same dimensions as the weights they are modifying.

The size of the low-rank matrices is set by a parameter called the rank (r). Rank refers to the maximum number of linearly independent columns (or rows) in a matrix. A smaller value leads to simpler low-rank matrices with fewer parameters to train.

Setting the rank between 4 and 16 can often provide you with a good trade-off between reducing the number of trainable parameters while preserving acceptable levels of model performance.

The research paper by Ashish Vaswani et al. specifies Transformer weights with dimensions of 512 X 64, which means each weight matrix in the architecture has 32,768 trainable parameters (512 X 64 = 32,768). You’d be updating 32,768 parameters for each weight matrix in the architecture while performing full fine-tuning.

Full fine-tuning trains all parameters

With LoRA, assuming a rank equal to 4, two small-rank decomposition matrices will be trained whose small dimension is 4. This means that matrix A will have dimension 4 X 64 resulting in 256 total parameters, while matrix B will have the dimensions of 512 X 4 resulting in 2,048 trainable parameters.

By updating the weights of only the new low-rank matrices, you are able to fine-tune for a single tenant by training only 2,304 (256 + 2,048) parameters instead of the full 32,768, in this case.
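The arithmetic above is easy to check in a few lines. This is only a sketch of the parameter count; the helper name `lora_parameter_counts` is made up for the example.

```python
def lora_parameter_counts(d_out, d_in, rank):
    """Return (full fine-tuning parameters, LoRA trainable parameters) for one weight matrix."""
    full = d_out * d_in       # every entry of the original d_out x d_in weight matrix
    matrix_a = rank * d_in    # A has shape rank x d_in
    matrix_b = d_out * rank   # B has shape d_out x rank, so B @ A is d_out x d_in
    return full, matrix_a + matrix_b

full_params, lora_params = lora_parameter_counts(d_out=512, d_in=64, rank=4)
trainable_fraction = lora_params / full_params

print(full_params)   # 32768
print(lora_params)   # 2304
print(f"{trainable_fraction:.1%} of the original parameters are trained")
```

For this single matrix the LoRA update trains roughly 7% of the parameters that full fine-tuning would touch, and the saving grows as the weight matrices get larger relative to the rank.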

Full code is available in this article’s github repo:

QLoRA aims to further reduce the memory requirements by combining low-rank adaptation with quantization. QLoRA uses 4-bit quantization in a format called NormalFloat4 or nf4.

Check out this awesome article on coding LoRA from scratch:


Reinforcement learning from human feedback (RLHF)

Note: The content and code examples in this section are derived from the book: Generative AI on AWS

Reinforcement learning from human feedback (RLHF) is a fine-tuning mechanism that uses human annotation — also called human feedback — to help the model adapt to human values and preferences. RLHF is most commonly applied after other forms of fine-tuning.

While RLHF is typically used to help a model generate more humanlike and human-aligned outputs, you can also use RLHF to fine-tune highly personalized models. For example, you could fine-tune a chat assistant specific to each user of your application. This chat assistant can adopt the style, voice, or sense of humour of each user based on their interactions with your application.

RLHF fine-tuning also helps increase the model’s helpfulness, honesty, and harmlessness (HHH).

RLHF is rooted in reinforcement learning. Let’s consider a simple example of a reinforcement learning scenario: teaching a robot to navigate through a maze.

  • Agent: The robot
  • Environment: The maze
  • State: The robot’s position in the maze
  • Action: Moving in the maze (e.g., forward, left, right)
  • Policy: Rules guiding the robot’s actions based on its position
  • Objective: Find the maze exit
  • Reward: Positive for getting closer to the exit, negative for moving away

Learning Process:

  1. Robot starts randomly in the maze.
  2. It takes an action (moves).
  3. Gets a reward based on how close it gets to the exit.
  4. Adjusts its strategy based on the reward (learning).
  5. Repeats the process, refining its strategy with each step.
  6. Eventually, it learns the best way to navigate the maze to reach the exit.
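The learning process above can be sketched with tabular Q-learning on a one-dimensional "maze". This is a toy illustration under assumptions: the corridor layout, reward values, and hyperparameters are invented for the example, and Q-learning is just one of many RL algorithms that fit this loop.

```python
import random

N_STATES = 5        # corridor cells 0..4; the exit is cell 4
ACTIONS = (-1, +1)  # move left or move right

def step(state, action):
    """Environment: apply the move and return (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    if next_state == N_STATES - 1:
        return next_state, 1.0, True            # reached the exit
    reward = 0.1 if next_state > state else -0.1  # closer is good, otherwise penalised
    return next_state, reward, False

def train(episodes=200, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn action values with epsilon-greedy Q-learning."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state, done = rng.randrange(N_STATES - 1), False  # random start cell
        while not done:
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)               # explore
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])  # exploit
            next_state, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            # Adjust the strategy based on the reward that was just observed.
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q

q_table = train()
policy = [max(ACTIONS, key=lambda a: q_table[(s, a)]) for s in range(N_STATES - 1)]
```

After training, `policy` holds the preferred move for each non-exit cell, and in this corridor the learned strategy is simply to keep moving right towards the exit.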

The sequence of states and actions that leads to a reward is often called a playout in RL terms. Playout is used in the classical RL context, while rollout is commonly used in a generative context. They are equivalent.

Reinforcement learning in the context of a generative AI model

When we apply RL concepts to a generative model, the model is the agent and the policy consists of the model weights. The RL algorithm updates the model weights to choose a better action, that is, to generate a better next token given the environment, state and objective. The objective is for the model to generate completions that are better aligned with human preferences such as helpfulness, honesty and harmlessness (HHH).

The action is chosen from the action space consisting of all possible tokens based on the probability distribution of tokens over all tokens in the model’s vocabulary. The environment is the model’s context window. The state consists of the tokens that are currently in the context window.

The reward is based on how well the model’s completion aligns with a human preference such as helpfulness.

  • Agent: The large language model being fine-tuned
  • Environment: Model’s context window
  • State: The tokens currently in the context window
  • Action: Choosing the next token
  • Policy: Model weights
  • Objective: Produce human-like and contextually relevant text
  • Reward: Positive feedback for generating high-quality human-aligned text, negative feedback for generating low-quality or irrelevant text

Reward model

The reward model is typically a classifier that predicts one of two classes: positive or negative. The positive class refers to generated text that is more human-aligned, or preferred. The negative class refers to a non-preferred response.
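One common way to turn such a classifier into a reward signal is to use the score of the positive class. The sketch below assumes the classifier returns two raw logits with the positive class at index 0; the logit values and the `reward_from_logits` helper are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities (numerically stable)."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reward_from_logits(logits, positive_index=0):
    """Return (raw logit, probability) of the positive class as candidate rewards."""
    probabilities = softmax(logits)
    return logits[positive_index], probabilities[positive_index]

# Hypothetical classifier outputs: [positive, negative] logits per completion.
preferred_logits = [2.1, -1.3]
non_preferred_logits = [-0.8, 1.9]

preferred_reward, preferred_prob = reward_from_logits(preferred_logits)
non_preferred_reward, non_preferred_prob = reward_from_logits(non_preferred_logits)
```

Either the raw logit or the probability can serve as the scalar reward; the logit keeps a wider numeric range, which some RLHF setups prefer.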

To determine what is helpful, honest and harmless (positive), you often need an annotated dataset created using a human-in-the-loop workflow.

Dataset example:

Summary of preferred and non-preferred completions and rewards

Train a custom reward model vs using an existing binary classifier

Reward models are often small binary classifiers based on smaller language models like BERT, such as sentiment classifiers or toxicity detectors. You can also train your own reward model; however, it is a relatively labour-intensive and costly endeavour.

Commonly used reward models:

  1. BERT uncased
  2. DistilBERT
  3. Facebook’s hate speech detector

Full code is available in this article’s github repo:

Training a reward model code example:

Using existing reward model (Facebook’s toxicity detector):

Fine-tune with RLHF

A popular RL algorithm called Proximal Policy Optimization (PPO) is used to perform the actual model weight updates based on the reward value assigned to a given prompt and completion. PPO, initially described in a 2017 paper, updates the weights of the generative model based on the reward value returned from the reward model.

With each iteration, PPO makes small and bounded updates to the LLM weights — hence the term Proximal Policy Optimization. By keeping the changes small with each iteration, the fine-tuning process is more stable and the resulting model is able to generalize well on new inputs. PPO updates the model weights through backpropagation. After many iterations, you should have a more human-aligned generative model.
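The "small and bounded" behaviour comes from PPO's clipped objective. The following sketch shows that objective for a single action, assuming you already have the probability ratio between the new and old policy and an advantage estimate; the function name and sample numbers are illustrative, and a real implementation averages this over many tokens and adds further loss terms.

```python
def ppo_clipped_objective(ratio, advantage, clip_epsilon=0.2):
    """Clipped surrogate objective for one action.

    ratio     = pi_new(action | state) / pi_old(action | state)
    advantage = how much better the action was than expected
    """
    clipped_ratio = max(1 - clip_epsilon, min(ratio, 1 + clip_epsilon))
    # Taking the minimum removes any incentive to move the policy
    # further than clip_epsilon away from the old policy.
    return min(ratio * advantage, clipped_ratio * advantage)

# A good action whose probability already rose by 50%: the gain is capped at 1.2.
capped_gain = ppo_clipped_objective(ratio=1.5, advantage=1.0)

# A bad action whose probability already halved: the objective is held at -0.8.
capped_loss = ppo_clipped_objective(ratio=0.5, advantage=-1.0)

# A change inside the trust region is left untouched.
unclipped = ppo_clipped_objective(ratio=1.1, advantage=2.0)
```

Once the ratio leaves the interval [1 - clip_epsilon, 1 + clip_epsilon] in the direction the advantage favours, the objective stops improving, so the gradient no longer pushes the weights further in that step.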

Full code is available in this article’s github repo:

To mitigate reward hacking, where the model learns to maximise the reward without producing genuinely better completions, it is advisable to add a KL-divergence penalty computed against a frozen reference copy of the model.
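A simplified view of that penalty: compare the next-token distribution of the model being tuned with that of the frozen reference model, and subtract a scaled KL divergence from the reward. The distributions, the `beta` coefficient, and the function names below are assumptions for illustration; RLHF libraries typically compute the penalty per token from log-probabilities.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def penalized_reward(reward, policy_probs, reference_probs, beta=0.1):
    """Subtract a KL penalty so the tuned model stays close to the reference model."""
    return reward - beta * kl_divergence(policy_probs, reference_probs)

# Hypothetical next-token distributions over a three-token vocabulary.
reference_probs = [0.5, 0.3, 0.2]  # frozen reference model
policy_probs = [0.7, 0.2, 0.1]     # model being fine-tuned

drift = kl_divergence(policy_probs, reference_probs)
adjusted = penalized_reward(reward=1.0, policy_probs=policy_probs, reference_probs=reference_probs)
```

The further the tuned model drifts from the reference distribution, the larger the penalty, which discourages it from exploiting quirks of the reward model with degenerate outputs.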

Parameter efficient fine-tuning can also be used in the scope of RLHF to reduce the amount of compute and memory resources required for the compute-intensive PPO algorithm.

Evaluating an RLHF fine-tuned model is covered in the Evaluating LLMs section.

Check out this amazing article on RLHF in 2024 with DPO & HuggingFace:

Pretraining (creating from scratch)

BloombergGPT by Bloomberg was trained on a mix of public and proprietary financial data, and PathChat by Harvard was trained on clinical pathology reports. These are just some examples of GPT models trained from scratch on domain-specific datasets to achieve superior performance in their domain, compared to other LLMs that do not generalise well there.

Apart from the cluster of GPUs and voluminous dataset required, the code for a basic GPT foundation model is not very complex.

Many GPT models are based on the original Attention Is All You Need paper. Some resources provide a simpler step-by-step guide on creating your own LLMs:

  1. Video Let’s build GPT: from scratch, in code, spelled out by Andrej Karpathy
  2. Transformers from scratch by Mat Miller
  3. Create a Large Language Model from Scratch with Python — Tutorial video

Simple example of a GPT model that uses multihead attention layer in the architecture:

Continuous pre-training

Datasets for instruction fine-tuning and continued pre-training

As with other machine learning models, there is an architectural pattern for continuously training your LLM with new data. It can help adapt model responses to the vocabulary and terminology specific to a domain. To achieve continuous pre-training, it is advisable to first set up an automated pipeline to monitor and evaluate your LLM. This way, when a challenger LLM is trained, it can be automatically evaluated before it replaces the champion LLM.

Champion model — your existing model in production

Challenger candidate — model trained on new data or entirely new model
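The promotion gate in such a pipeline can be very small. The sketch below assumes both models have already been scored on the same evaluation set, with higher scores being better; the scores, the `min_gain` threshold, and the `should_promote` helper are hypothetical.

```python
from statistics import mean

def should_promote(champion_scores, challenger_scores, min_gain=0.02):
    """Promote the challenger only if its mean score beats the champion by min_gain."""
    return mean(challenger_scores) - mean(champion_scores) >= min_gain

# Hypothetical per-task evaluation scores on a shared evaluation set.
champion_scores = [0.71, 0.68, 0.74]
challenger_scores = [0.78, 0.75, 0.80]

promote = should_promote(champion_scores, challenger_scores)
```

Requiring a minimum gain, rather than any improvement at all, guards against swapping models on evaluation noise; a production gate would usually also check latency, cost, and safety metrics.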

Pretraining datasets

  • Wikipedia (2022) dataset in multiple languages.
  • Common Crawl is a regularly published dump of text crawled from across the internet, maintained by the non-profit Common Crawl and hosted on AWS.
  • RefinedWeb (2023) is the dataset on which the Falcon family of models was pretrained. It is a cleaned version of the Common Crawl dataset.
  • Colossal Clean Crawled Corpus — C4 (2020) is another colossal, cleaned version of Common Crawl’s web crawl corpus.

HuggingFace Autotrain

AutoTrain Advanced offers faster and easier training and deployment of state-of-the-art machine learning models. It is a no-code solution that allows you to train machine learning models in just a few clicks.

Fine tuning and continued pre-training

Fine tuning vs continued pre-training


Agents

Agents have a set of instructions, a foundation model, a set of available actions and knowledge bases, which enable them to execute complex tasks.

A generative model can answer a general question, or a question related to your documentation, like “I can’t see my meetings” or “How do I book a meeting?”. An agent, using a foundation model for its reasoning logic and external data sources such as your APIs, can return the number of meetings the user has booked, or directly schedule a meeting from the interaction screen.

Agents build their own structured prompts to help the model reason. They orchestrate a RAG workflow through a sequence of data lookups and/or API calls, and augment the prompt with the information received from the external systems. This helps the model generate a more context-aware and relevant completion before the final response is returned to the user.

ReAct structures prompts to include instructions, ReAct examples, and the user request

An agent accomplishes this using the ReAct framework, which combines chain-of-thought (CoT) reasoning with action planning. This generates step-by-step plans carried out by tools such as a web search, a SQL query, a Python-based script, or in our case multiple API calls to return the needed result.

Meeting assistant agent:

Meeting Assistant Agent

RAG based HR policy chatbot:

HR policy agent v2:

Agent implementations in open-source libraries:

Agent orchestration

When you provide a question to an agent in natural language, it decomposes the question into multiple steps using the available actions and knowledge bases. It then executes an action or searches a knowledge base, observes the results, and thinks about the next step. This process is repeated until the final answer is reached.

Agent Orchestration — Basic flow

Example of an insurance agent:

Task: Send a reminder to policy holders with missing docs; include doc requirements

Thought: To answer this question, I will:
1. Get open claims
2. Get missing documents for each open claim
3. Get requirements for each missing document
4. Send reminders for each missing claim

Final answer: There are currently two open insurance claims with claim IDs claim-42 and claim-34. For claim-34, the pending document required … Reminders have also been sent for both claims.
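The think, act, observe cycle behind that trace can be sketched as a plain loop. Everything here is a stand-in: the four tools return canned data, the document names are invented, and `think` is a scripted function playing the role that the foundation model's reasoning would play in a real agent.

```python
# Hypothetical tools the agent is allowed to call.
def get_open_claims():
    return ["claim-42", "claim-34"]

def get_missing_documents(claim_id):
    return {"claim-42": ["police report"], "claim-34": ["repair estimate"]}[claim_id]

def get_requirements(document):
    return f"{document}: upload as a signed PDF"

def send_reminder(claim_id):
    return f"reminder sent for {claim_id}"

TOOLS = {f.__name__: f for f in (get_open_claims, get_missing_documents, get_requirements, send_reminder)}

def think(memory):
    """Scripted stand-in for the LLM: choose the next action, or None when finished."""
    if "claims" not in memory:
        return ("get_open_claims", ())
    for claim in memory["claims"]:
        if claim not in memory["missing"]:
            return ("get_missing_documents", (claim,))
    for claim in memory["claims"]:
        for document in memory["missing"][claim]:
            if document not in memory["requirements"]:
                return ("get_requirements", (document,))
    for claim in memory["claims"]:
        if claim not in memory["reminded"]:
            return ("send_reminder", (claim,))
    return None

def observe(memory, tool, args, result):
    """Store the tool result so the next think step can use it."""
    if tool == "get_open_claims":
        memory["claims"] = result
    elif tool == "get_missing_documents":
        memory["missing"][args[0]] = result
    elif tool == "get_requirements":
        memory["requirements"][args[0]] = result
    elif tool == "send_reminder":
        memory["reminded"].append(args[0])

def run_agent(max_steps=20):
    memory = {"missing": {}, "requirements": {}, "reminded": []}
    trace = []
    for _ in range(max_steps):        # hard cap so the loop cannot run forever
        action = think(memory)
        if action is None:            # nothing left to do: ready for the final answer
            break
        tool, args = action
        result = TOOLS[tool](*args)   # act
        observe(memory, tool, args, result)
        trace.append(tool)
    return memory, trace

memory, trace = run_agent()
```

The `trace` list records the order in which tools were called, which mirrors the four-step plan in the example above: fetch claims, fetch missing documents, fetch requirements, then send reminders.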

Agents can be deployed and invoked from any app, or triggered by an event.

Available agents

Transformers agents are more like an API of tools and agents. Each task has a task-specific tool, and they provide a natural language API on top of transformers for that specific task. An image generator tool cannot do a text-to-speech task, and so on.

Transformers agents work by initializing an LLM agent for chain-of-thought reasoning. Then you create an instance of . You can now interact with the agent using the API.

Colab from official documentation:


LangChain agents are richer and easy to integrate into workflows that already use LangChain. LangChain uses LangSmith to trace and evaluate your language model applications and intelligent agents, which makes debugging agent-based systems a little easier.

LangChain agents official documentation:

Personally I have found Langchain unstable for production use. Unless your workflow is mostly dependent on Langchain, it is not advisable to use Langchain agents just for agents functionality.

Agents for Amazon Bedrock is the most production-ready agent orchestration solution I have seen so far. Using managed services helps you fail fast and determine the expectations from the final product by delivering the MVP faster.

Agents for Amazon Bedrock can:

  • Create a prompt from the developer-provided instructions (e.g., “You are an insurance agent designed to process open claims”) and the API details needed to complete the tasks.
  • Securely connect to your company’s data sources, automatically convert data into numerical representations, and augment the user request with the right information to generate an accurate and relevant response.
  • Orchestrate and execute multistep tasks.

Amazon Q, formerly part of only QuickSight, is a generative AI assistant designed for work that can be tailored to your business, data, code, and operations. It can help you get fast, relevant answers to pressing questions, solve problems, generate content, and take actions using the data and expertise found in your company’s information repositories, code, and enterprise systems.

It is currently in preview. Like QuickSight Q, it has two plans with flat pricing per user per month, of $20 and $25 respectively.

Because it is not GA, I could only test it with questions related to AWS documentation. The results were fast and apt. Provided our APIs are richly structured, we might see similar performance on a company’s knowledge base.

Agents will evolve and dominate the space of automating tasks via actions, until something else replaces them.

Evaluating LLMs

  • Classical and deep learning model evaluation
    - Metrics
    - NLP metrics
  • Holistic evaluation of LLMs
    - Metrics
    - Benchmarks and Datasets
    - Evaluating RLHF fine-tuned model
    - Evaluation datasets for specialized domains
  • Evaluating in CI/CD
    - Rule based
    - Model graded evaluation
  • Evaluation beyond metrics and benchmarks
    - Cost and memory
    - Latency
    - Input context length and output sequence max length
Zishan Guo et al., “Evaluating Large Language Models: A Comprehensive Survey”, arXiv, 2023

Why evaluate?

  • To prevent private data leaks
  • To prevent the model from yielding inappropriate, harmful, or misleading content
  • To compare a newer model (version) with the existing model in production

When building applications with Generative AI, the model behaviour can be unpredictable or more open-ended than traditional software. Hence systematic evaluation becomes even more important.

Say you are using an LLM for writing content, and one of your teammates makes an innocent update to the prompt to tell the LLM to make the content interesting. Unknown to anyone until it’s shipped, this makes the LLM create “interesting content” by hallucinating.

If you had evaluation in place that could assess the LLM system and its output before delivery, either automatically or with a human in the loop, it could save you from embarrassing situations.

LLM model evaluation: comparing how the model’s output differs from that of other models, given the same input.

LLM system evaluation: evaluating how the whole system performs or changes when the base LLM is swapped for a different model or customized.

In this section we will discuss both.

Classical and deep learning model evaluation

Traditional ML models and Foundation models

Several metrics are commonly used for model evaluation, each offering unique insights into different aspects of model performance.

  1. Accuracy: This is a fundamental metric that measures the proportion of correctly classified instances out of the total instances evaluated. While accuracy is widely used, it may not be suitable for imbalanced datasets where one class dominates the others.
  2. Precision and Recall: Precision measures the accuracy of positive predictions, while recall measures the ability of the model to identify all relevant instances. These metrics are particularly useful when dealing with imbalanced datasets, where one class is significantly more frequent than the others.
  3. F1 Score: The F1 score is the harmonic mean of precision and recall, providing a balanced measure between the two. It is especially valuable when there is an uneven class distribution or when both false positives and false negatives are important.
  4. Confusion Matrix: A confusion matrix provides a detailed breakdown of correct and incorrect predictions, organized by class. It enables deeper analysis of model performance, including identifying specific types of errors such as false positives and false negatives.
  5. Mean Absolute Error (MAE) and Mean Squared Error (MSE): These metrics are commonly used in regression tasks to quantify the average magnitude of errors made by the model.
  6. R-squared (R²) Score: This metric assesses how well the model fits the data by measuring the proportion of the variance in the dependent variable that is predictable from the independent variables.
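The classification metrics above can be computed directly from the counts in a confusion matrix. The sketch below uses made-up labels for an imbalanced binary task to show why accuracy alone can mislead; the function name is illustrative.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Imbalanced toy data: only 2 of 10 examples are positive, and the model finds one of them.
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

metrics = classification_metrics(y_true, y_pred)
```

Here accuracy is 0.9 even though the model misses half of the positive cases, which recall (0.5) and F1 (about 0.67) make visible.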

In the realm of NLP, specialized metrics are often employed to evaluate the quality of generated text or translations:

  1. BLEU Score (Bilingual Evaluation Understudy): BLEU measures the similarity between the generated text and one or more reference texts. It evaluates the quality of machine-translated text by comparing it to a set of reference translations.
  2. ROUGE (Recall-Oriented Understudy for Gisting Evaluation): ROUGE is a set of metrics used to evaluate the quality of summaries. It measures the overlap between the generated summary and reference summaries in terms of n-gram overlap, word overlap, and other similarity measures.
  3. METEOR (Metric for Evaluation of Translation with Explicit Ordering): METEOR evaluates the quality of machine translation by considering precision, recall, stemming, synonymy, and word order.
  4. WER (Word Error Rate): WER is commonly used to evaluate the accuracy of speech recognition systems by measuring the number of errors in the output transcription relative to the reference transcription, typically normalized by the total number of words.
  5. CER (Character Error Rate): Similar to WER, CER measures the accuracy of speech recognition systems at the character level, providing a finer-grained evaluation of performance.
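As a worked example of WER, the sketch below computes the word-level edit distance (substitutions, insertions, deletions) between a reference and a hypothesis and divides it by the reference length. The sentences are made up; applying the same routine to characters instead of words gives CER.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the processed part of ref and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref)

reference = "the cat sat on the mat"
hypothesis = "the cat sat on mat"  # one word dropped by the recogniser

wer = word_error_rate(reference, hypothesis)
```

One deleted word out of six reference words gives a WER of about 0.17; note that WER can exceed 1.0 when the hypothesis contains many inserted words.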

Evaluation is typically done by splitting the dataset into training, validation, and test sets. Models are trained on the training set, tuned on the validation set, and finally evaluated on the test set to assess their generalization performance. Cross-validation techniques may also be employed to ensure robustness of the evaluation results.

Additionally, in the context of deep learning, techniques such as early stopping and dropout regularization are often used to prevent overfitting and improve generalization performance.

Holistic evaluation of LLMs

Safety, toxicity, and bias are general evaluation topics applicable to all LLMs. But specialized LLMs may also require specialized evaluation mechanisms.

A QA chatbot should be evaluated on different datasets than a chatbot that calls for reasoning. An LLM with a RAG architecture will have different metrics compared with an FM that relies solely on parametric memory.

Similarly, text generation models differ from mathematical models and will struggle on tasks that were not in their training dataset. Agents that employ RAG or invoke API calls to augment the prompt before preparing the output must be evaluated both as one unit and as separate units.

Source: AWS Innovate AI/ML and Data Edition, Feb 2024

Setting up many layers of evaluation and protection will always be beneficial. Each layer should evaluate a different objective, so that a rogue model that passes one evaluation with flying colours can still be caught by another, like a cheese grater with an asymmetric design.


Metrics

Classic machine learning evaluation metrics, such as accuracy and root-mean-square error (RMSE), are straightforward to calculate since the predictions are deterministic and easy to compare against the labels in a validation or test dataset.

The output from generative AI models, however, is nondeterministic by design, which makes evaluation very difficult without human intervention. Additionally, evaluation metrics for generative models are very task-specific.

The ROUGE metric is used to evaluate summarization tasks, while the Bilingual Evaluation Understudy (BLEU) metric is used for translation tasks.

ROUGE calculates how well the generated output (the summary, in the case of text summarization) compares to a reference summary of the input dialogue. To do this, ROUGE counts the matching unigrams (single words), bigrams (two consecutive words), and longest common subsequences (words in the same order, not necessarily adjacent) between the reference and the generated output to calculate the ROUGE-1, ROUGE-2, and ROUGE-L scores. The higher the score, the more similar they are.

But like any other metric, ROUGE is far from perfect. Consider the example, “This book is great” and “This book is not great”. Using ROUGE alone, these phrases appear to be similar. However, they are, in fact, opposite. ROUGE is useful as a baseline metric before and after fine-tuning your model because it demonstrates relative improvement.
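The two phrases above make a handy test case. The sketch below is a simplified n-gram overlap in the spirit of ROUGE-N (lowercased, whitespace tokenised, no stemming); the `rouge_n` helper is written for this example and is not the official scorer.

```python
from collections import Counter

def rouge_n(reference, candidate, n=1):
    """Return (precision, recall, f1) of n-gram overlap between two texts."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref_counts, cand_counts = ngrams(reference), ngrams(candidate)
    overlap = sum((ref_counts & cand_counts).values())  # clipped n-gram matches
    recall = overlap / sum(ref_counts.values()) if ref_counts else 0.0
    precision = overlap / sum(cand_counts.values()) if cand_counts else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

reference = "This book is great"
candidate = "This book is not great"

rouge1 = rouge_n(reference, candidate, n=1)  # precision 0.8, recall 1.0
rouge2 = rouge_n(reference, candidate, n=2)  # precision 0.5, recall 2/3
```

ROUGE-1 recall is a perfect 1.0 even though the candidate says the opposite of the reference, which is exactly the weakness described above; ROUGE-2 is lower but still rewards the shared word pairs.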

Bilingual Evaluation Understudy (BLEU) is a metric commonly used to evaluate the quality of machine-generated translations by comparing them to human-generated reference translations.

BLEU operates by analyzing the overlap between the n-grams (contiguous sequences of n items, typically words) of the generated translation and those of the reference translation. This involves calculating precision scores for different n-gram lengths and then combining them to produce a single BLEU score.

BLEU, like ROUGE, has its limitations. It primarily focuses on lexical similarity and does not capture aspects such as fluency, grammaticality, or semantic equivalence. Additionally, BLEU scores may not always correlate perfectly with human judgments of translation quality. Despite these limitations, BLEU is widely used in the field of machine translation as a convenient and standardized metric for evaluating system performance. Similar to ROUGE, it serves as a valuable tool for comparing different models and tracking improvements over time.
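For intuition, here is a simplified sentence-level BLEU with a single reference: clipped n-gram precisions combined by a geometric mean and multiplied by a brevity penalty. It is a sketch without the smoothing and multi-reference handling that production scorers include, and the `bleu` helper and sentences are made up for the example.

```python
import math
from collections import Counter

def bleu(reference, candidate, max_n=4):
    """Simplified single-reference BLEU (no smoothing)."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        overlap = sum((cand_ngrams & ref_ngrams).values())  # clipped matches
        total = sum(cand_ngrams.values())
        if overlap == 0 or total == 0:
            return 0.0  # any zero precision collapses the geometric mean
        log_precisions.append(math.log(overlap / total))

    # Penalise candidates that are shorter than the reference.
    brevity_penalty = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)

reference = "the cat is on the mat"
candidate = "the cat sat on the mat"

bleu_2 = bleu(reference, candidate, max_n=2)  # uses unigram and bigram precision
bleu_4 = bleu(reference, candidate, max_n=4)  # 0.0: no 4-gram matches, no smoothing
```

A single changed word gives a BLEU-2 of about 0.71 but a BLEU-4 of zero, which illustrates why short sentences are usually scored with smoothing or at the corpus level.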

Check out the HuggingFace quick tour on evaluation:

Quantitative evaluation is different for multimodal foundation models such as Stable Diffusion. Metrics like CLIP score similarity, CLIP directional similarity, and Frechet Inception Distance (FID) can be employed.

Benchmarks and Datasets

Source: AWS Innovate AI/ML and Data Edition, Feb 2024

A test dataset is recommended for evaluating the LLM. There are existing datasets and benchmarks established by the community to help you compare generative models more holistically.

  1. SemEval: an ongoing series of evaluations of computational semantic analysis systems, held under this name since 2007. Its evaluations are intended to explore the nature of meaning in language.
  2. General Language Understanding Evaluation (GLUE): introduced in 2018 to evaluate and compare model performance across a set of language tasks.
  3. SuperGLUE: successor to GLUE, introduced in 2019 to include more challenging tasks.
  4. HELM: a benchmark designed to encourage model transparency. It combines 7 metrics across 16 core “scenarios”. Scenarios include tasks such as question-answer, summarization, sentiment analysis, toxicity and bias detection.
  5. Beyond the Imitation Game (BIG-Bench): a benchmark consisting of 204 tasks across linguistics, mathematics, biology, physics, software development, commonsense reasoning, and much more.
  6. XNLI: a multilingual NLI dataset.
  7. MMLU: evaluates a model’s knowledge and problem-solving capabilities. Models are tested across different subjects, including mathematics, history and science.
  8. TruthfulQA and RealToxicityPrompts: simple datasets to evaluate a model’s tendency to generate misinformation and hate speech, respectively.

Evaluating RLHF Fine-tuned model

Evaluating after fine-tuning

Qualitative evaluation

For qualitative evaluation, which is a subjective comparison, compare the model’s output on the same input before and after RLHF. The output after fine-tuning should appear more human-aligned.

Quantitative evaluation

You can use an aggregate toxicity score (or any other score depending on the fine-tuning objective) for a large number of completions generated by the model using a test dataset that the model did not see during RLHF fine-tuning. If RLHF has successfully reduced the intended score (or toxicity in this example) of your generative model, the toxicity score will decrease relative to the baseline.
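A minimal sketch of that comparison follows. It assumes you have already scored each completion with a toxicity classifier (values between 0 and 1, lower is better); the scores and helper name are invented for illustration.

```python
from statistics import mean, stdev

def summarize(scores):
    """Aggregate per-completion scores into (mean, standard deviation)."""
    return mean(scores), stdev(scores)

# Hypothetical toxicity scores for the same held-out prompts, before and after RLHF.
baseline_toxicity = [0.12, 0.30, 0.08, 0.22, 0.18]
rlhf_toxicity = [0.05, 0.11, 0.04, 0.09, 0.06]

baseline_mean, baseline_std = summarize(baseline_toxicity)
rlhf_mean, rlhf_std = summarize(rlhf_toxicity)

# Relative reduction in mean toxicity compared with the baseline model.
improvement = (baseline_mean - rlhf_mean) / baseline_mean
```

Reporting the standard deviation alongside the mean helps you judge whether the drop is consistent across prompts or driven by a handful of completions.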

Evaluation code:

Check out the Foundation model evaluation library by AWS:

Evaluation datasets for Specialized domains

I encourage you to check out the curated list of related papers in the publicly available GitHub repository that accompanies the excellent paper by Zishan Guo et al., “Evaluating Large Language Models: A Comprehensive Survey”, arXiv, 2023.

Question-answering and knowledge completion:

  • WikiFact (Goodrich et al., 2019) is an automatic metric proposed for evaluating the factual accuracy of generated text. It defines a dataset in the form of a relation tuple (subject, relation, object). This dataset is created based on the English Wikipedia and Wikidata knowledge base.
  • Social IQA (Sap et al., 2019) a dataset that contains 38,000 multiple choice questions for probing emotional and social intelligence in a variety of everyday situations.
  • MCTACO (Zhou et al., 2019) a dataset of 13k question-answer pairs that require temporal commonsense comprehension.
  • HellaSWAG (Zellers et al., 2019), this dataset is a benchmark for Commonsense NLI. It includes a context and some endings which complete the context.
  • TaxiNLI (Joshi et al., 2020) a dataset that has 10k examples from the MNLI dataset (Williams et al., 2018), collected based on the principles and categorizations of the aforementioned taxonomy.
  • LogiQA 2.0 (Liu et al., 2023) benchmarks consisting of multi-choice logic questions sourced from standardized tests (e.g., the Law School Admission Test, the Graduate Management Admissions Test, and the National Civil Servants Examination of China).
  • HybridQA (Chen et al., 2020) a question-answering dataset that requires reasoning on heterogeneous information.
  • GSM8K (Cobbe et al., 2021) a dataset of 8.5K high quality linguistically diverse grade school math word problems created by human problem writers. The queries and answers within GSM8K are meticulously designed by human problem composers, guaranteeing a moderate level of challenge while concurrently circumventing monotony and stereotypes to a considerable degree.
  • API-Bank (Li et al., 2023) a tailor-made benchmark for evaluating tool-augmented LLMs, encompassing 53 standard API tools, a comprehensive workflow for tool-augmented LLMs, and 264 annotated dialogues.
  • ToolQA (Zhuang et al., 2023) a dataset to evaluate the capabilities of LLMs in answering challenging questions with external tools. It centers on whether the LLMs can produce the correct answer, rather than the intermediary process of tool utilization during benchmarking. Additionally, ToolQA aims to differentiate between the LLMs using external tools and those relying solely on their internal knowledge by selecting data from sources not yet memorized by the LLMs.

Bias detection, toxicity assessment, truthfulness evaluation and hallucinations:

  • Moral Foundations Twitter Corpus (Hoover et al., 2020)
  • Moral Stories (Emelin et al., 2021) is a crowd-sourced dataset containing 12K short narratives for goal-oriented moral reasoning grounded in social situations, generated on social norms extracted from Social Chemistry 101.
  • Botzer et al. (2021) focus on analyzing moral judgements rendered on social media by capturing the moral judgements which are passed in the subreddit /r/AmITheAsshole on Reddit.
  • MoralExceptQA (Jin et al., 2022) considers 3 potentially permissible exceptions, manually creates scenarios according to these 3 exceptions, and recruits subjects on Amazon Mechanical Turk (AMT), including diverse racial and ethnic
  • SaGE: Evaluating Moral Consistency in Large Language Models
  • PROSOCIALDIALOG (Kim et al., 2022) is a multi-turn dialogue dataset,
    teaching conversational agents to respond to problematic content following social norms.
  • WikiGenderBias (Gaut et al., 2020) is a dataset created to assess gender bias in relation extraction systems. It measures the performance difference in extracting sentences about females versus males, containing 45,000 sentences, each of which consists of a male or female
    entity and one of four relations: spouse, profession, date of birth and place of birth.
  • StereoSet (Nadeem et al., 2021) is a dataset designed to measure the stereotypical bias in language models (LMs) by using sentence pairs to determine if LMs prefer stereotypical sentences.
  • COVID-HATE (He et al., 2021) dataset includes 2K sentences on hate towards Asians owing to SARS-Coronavirus disease (COVID-19).
  • NewsQA (Trischler et al., 2017) is a machine comprehension dataset comprising 119,633 human-authored question-answer pairs based on CNN news articles.
  • BIG-bench (Srivastava et al., 2022) is a collaborative benchmark comprising a diverse set of tasks that are widely perceived to surpass the existing capabilities of contemporary LLMs.
  • SelfAware (Yin et al., 2023) is a benchmark designed to evaluate how well LLMs can recognize the boundaries of their knowledge when they lack enough information to provide a definite answer to a question. It consists of 1,032 unanswerable questions and 2,337 answerable questions.
  • DialFact (Gupta et al. 2022) benchmark comprises 22,245 annotated conversational claims, each paired with corresponding pieces of evidence extracted from Wikipedia. These claims are categorized as either supported, refuted, or ‘not enough information’ based on their relationship with the evidence.

Power-seeking behaviors and situational awareness (with domain-specific challenges and intricacies):

Example of LLM’s risky behaviours. SOURCE: Zishan Guo et al., “Evaluating Large Language Models: A Comprehensive Survey”, arXiv, 2023
  • PromptBench (Zhu et al. 2023) benchmark for evaluating the robustness of LLMs by attacking them with adversarial prompts (dynamically created character-, word-, sentence-, and semantic-level prompts)
  • AdvGLUE The Adversarial GLUE Benchmark (Wang et al., 2021) benchmark datasets for evaluating the robustness of LLMs on translation, question-answering (QA), text classification, and natural language inference (NLI)
  • ReCode (Wang et al. 2023) benchmark for evaluating the robustness of LLMs in code generation. ReCode generates perturbations in code docstring, function, syntax, and format. These perturbation styles encompass character- and word-level insertions or transformations.
  • SVAMP (Patel et al., 2021), a challenge set for elementary-level Math Word Problems (MWP).
  • BlendedSkillTalk (Smith et al., 2020) a dataset of 7k conversations explicitly designed to exhibit multiple conversation modes: displaying personality, having empathy, and demonstrating knowledge. It can be used for evaluating the robustness of dialogue generation using the white-box attack DGSlow proposed by Li et al. (2023f).
  • BigToM (Gandhi et al., 2023) is a social reasoning benchmark that contains 25 control variables. It aligns human Theory-of-Mind (ToM) (Wellman, 1992; Leslie et al., 2004; Frith & Frith, 2005) reasoning capabilities by controlling different variables and conditions in the causal graph.

Specialized LLMs Evaluation (such as biology, education,
law, computer science, and finance):

  • PubMedQA (Jin et al., 2019) measures LLMs’ question-answering
    ability on medical scientific literature.
  • LiveQA (Abacha et al., 2017) evaluates LLMs as a consultation robot using commonly asked questions scraped from medical websites.
  • Multi-MedQA (Singhal et al., 2022) integrates six existing datasets and further augments them with curated commonly searched health queries.
  • SARA (Holzenberger et al., 2020) a dataset for statutory reasoning in tax law entailment and question answering, in the legislation domain.
  • EvalPlus (Liu et al. 2023) a code synthesis benchmarking framework, to evaluate the functional correctness of LLM-synthesized code. It augments evaluation datasets with test cases generated by an automatic test input generator. The popular HUMANEVAL benchmark is extended by 81x to create HUMANEVAL+ using EvalPlus.
  • FinBERT (Araci, 2019) constructs a financial vocabulary (FinVocab) from a corpus of financial texts using Google’s WordPiece algorithm.
  • BloombergGPT (Wu et al., 2023) is a language model with 50 billion parameters, trained on a wide range of financial data, which makes it outperform existing models on various financial tasks.

LLM agents evaluation

  • AgentBench (Liu et al., 2023) a comprehensive Benchmark to Evaluate LLMs as Agents.
  • WebArena (Zhou et al., 2023) is a realistic and reproducible benchmark for agents, with fully functional websites from four common domains. WebArena includes a set of benchmark tasks to evaluate the functional correctness of task completions.
  • The ARC Evals project of the Alignment Research Center, which is responsible for evaluating the abilities of advanced AI to seek resources, self-replicate, and adapt to new environments.

Don’t trust benchmarks too much. If a benchmark is public, it may have leaked into an LLM’s training dataset.

Check out Foundation Model Development Cheat sheet:

Evaluating in CI/CD

Check out the short course on Automated Testing for LLMOps on DeepLearning.AI by Rob Zuber and Andrew Ng

Rule based

Rule-based evals use pattern or string matching, and are fast and cost-effective to run. They are good for quick evaluation in cases such as sentiment analysis or classification, where you have ground-truth labels.

Because they are quick, they can be run in pre-commit, or whenever a change to the code is committed, to get fast feedback.
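As a rough illustration of what such rule-based checks can look like, here is a minimal sketch. The function names and the sample labels are hypothetical and not taken from any particular evaluation library.

```python
import re


def contains_all(output: str, required: list[str]) -> bool:
    """True when every required phrase appears in the output (case-insensitive)."""
    lowered = output.lower()
    return all(phrase.lower() in lowered for phrase in required)


def matches_label(output: str, allowed_labels: set[str]) -> bool:
    """True when the output, stripped of everything but letters, is one allowed label."""
    cleaned = re.sub(r"[^a-z]", "", output.lower())
    return cleaned in allowed_labels


def accuracy(predictions: list[str], ground_truth: list[str]) -> float:
    """Exact-match accuracy against ground-truth labels (case-insensitive)."""
    hits = sum(
        p.strip().lower() == g.strip().lower()
        for p, g in zip(predictions, ground_truth)
    )
    return hits / len(ground_truth)
```

Checks like these run in milliseconds, so they fit naturally into a pre-commit hook or a unit-test suite.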

Model-graded evaluation

Model-graded evaluation is relevant for applications where there are many possible good or bad outcomes, for example when an LLM is asked to write some content and more than one high-quality response exists. Here we use an LLM to evaluate the output of another LLM: you prompt an evaluation LLM to assess the quality of your application LLM's output.

Model-graded evals take more time and cost more, but they allow you to assess more complex outputs. They are generally recommended as pre-release evals, in contrast to the rule-based evals run at pre-commit.
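The idea of one LLM grading another can be sketched as follows. This is only an illustration under stated assumptions: the grading prompt, the PASS/FAIL convention and the `fake_grader` stand-in are all hypothetical, and in practice `grader` would wrap a call to your evaluation LLM.

```python
from typing import Callable

# Hypothetical grading prompt; tune the criteria to your application.
GRADER_PROMPT = (
    "You are grading the answer of another assistant.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Reply with a single word: PASS if the answer is relevant and helpful, "
    "otherwise FAIL."
)


def model_graded_eval(question: str, answer: str, grader: Callable[[str], str]) -> bool:
    """Ask a grader LLM (any callable mapping prompt -> text) to judge an answer."""
    verdict = grader(GRADER_PROMPT.format(question=question, answer=answer))
    return verdict.strip().upper().startswith("PASS")


def fake_grader(prompt: str) -> str:
    """Offline stand-in for a real LLM call: fails only empty answers."""
    return "FAIL" if "Answer: \n" in prompt else "PASS"
```

Passing the grader in as a callable keeps the test harness independent of any specific provider, so the same eval can run against a local model or a managed endpoint.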

The following demo code can be adopted with pre-commit and pre-release evals to form a complete automated testing suite.

Model graded evaluations:

Model graded tests:

Evaluation beyond metrics and benchmarks

Holistic Evaluation

Cost & Memory

If you are using a managed service like Amazon Bedrock, then the cost incurred depends on the number of tokens (refer to the Prompt engineering section in the Architectural Patterns chapter to know more about tokenization).

To roughly estimate the number of input tokens, multiply the number of words by 1.3. Multiplying that token count by the per-token input price gives a rough estimate of the input cost for that prompt. You can store the number of words/tokens for each input prompt, or log the count to stdout, to calculate the on-demand cost.

The maximum number of output tokens can be controlled through the input parameters. You can estimate the output tokens in the same way, by multiplying the total number of output words by 1.3. Adding the cost of input tokens and output tokens gives a good estimate of the final cost of one interaction with a managed service.
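The estimate described above can be sketched in a few lines. The 1.3 tokens-per-word factor comes from the text; the function names and the prices in the usage note are hypothetical placeholders, so check your provider's pricing page for real values.

```python
def estimate_tokens(text: str, tokens_per_word: float = 1.3) -> int:
    """Rough token count: whitespace-separated words times a fixed factor."""
    return round(len(text.split()) * tokens_per_word)


def estimate_interaction_cost(
    prompt: str,
    completion: str,
    input_price_per_1k: float,
    output_price_per_1k: float,
) -> float:
    """Estimated cost of one interaction: input tokens plus output tokens."""
    input_cost = estimate_tokens(prompt) / 1000 * input_price_per_1k
    output_cost = estimate_tokens(completion) / 1000 * output_price_per_1k
    return input_cost + output_cost
```

For example, `estimate_interaction_cost(prompt, completion, 0.5, 1.5)` would price a call at a made-up 0.5 per 1,000 input tokens and 1.5 per 1,000 output tokens. Logging this number per request gives a running estimate of on-demand spend.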

For hosted models, the cost is the hourly price of the instance multiplied by the number of instances employed and the hours they run. You choose the instance type depending on the model size. Most infrastructure providers publish the on-demand pricing of all their instances on their public websites.

If you are using services like SageMaker Jumpstart or SageMaker large model inference containers, the cost depends on the underlying infrastructure employed and the total time it was assigned to you.
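The hosted-model arithmetic is simple enough to sketch directly. The hourly price and request volume below are invented for illustration and do not correspond to any real instance type.

```python
def hosted_cost(price_per_instance_hour: float, instance_count: int, hours: float) -> float:
    """Cost of keeping `instance_count` instances running for `hours` hours."""
    return price_per_instance_hour * instance_count * hours


def cost_per_request(total_cost: float, request_count: int) -> float:
    """Spread a fixed hosting bill over the requests actually served."""
    return total_cost / request_count


# Hypothetical example: 2 instances at 1.5 per hour, running for a 30-day month
monthly_cost = hosted_cost(price_per_instance_hour=1.5, instance_count=2, hours=24 * 30)
per_request = cost_per_request(monthly_cost, request_count=100_000)
```

Comparing `per_request` with the per-interaction cost of a managed, token-priced service is one way to decide which option is cheaper at your traffic level.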

Memory challenges during inference can be tackled with model quantization or pruning. Quantization is the popular choice, though both strategies sacrifice a bit of quality for speed and memory. When comparing two quantized models, ensure that both the quantization method and the number of quantized bits are the same.

Memory for training depends on the number of parameters (refer to the Model size and memory needed section in the How to play with LLMs chapter). As a rough estimate, a 1 billion parameter model requires 24 GB of memory for training in full precision, compared to 4 GB for loading the model for inference. Similarly, a 50 billion parameter model will require ~1200 GB (or ~1.2 TB) of memory.
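These rough numbers follow from 4 bytes per parameter in full precision and a training overhead of about 6x, which is simply the 24 GB / 4 GB ratio quoted above. A minimal sketch, with hypothetical function names:

```python
BYTES_PER_PARAM_FP32 = 4   # full precision (32-bit) weights
TRAINING_MULTIPLIER = 6    # rough overhead factor: 24 GB / 4 GB from the estimate above


def inference_memory_gb(params_billion: float, bytes_per_param: int = BYTES_PER_PARAM_FP32) -> float:
    """Memory to load the weights: 1e9 params * bytes per param / 1e9 bytes per GB."""
    return params_billion * bytes_per_param


def training_memory_gb(params_billion: float) -> float:
    """Rough full-precision training memory (weights, gradients, optimizer state, etc.)."""
    return inference_memory_gb(params_billion) * TRAINING_MULTIPLIER
```

The multiplier is a rule of thumb; the real figure varies with the optimizer, batch size and sequence length, so treat the result as a lower bound when sizing hardware.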

AWS p4d.24xlarge has 8 NVIDIA A100 GPUs with 320 GB of total GPU memory (40 GB each); the p4de.24xlarge variant doubles this to 640 GB. To train models that will not fit even on such a single machine, you will have to adopt sharded data parallelism. It is better to understand the memory requirements and calculate the cost before venturing into a model pretraining or fine-tuning strategy.


Smaller models, when deployed correctly, might beat larger models in latency, but at the cost of quality. A good balance of both is needed.

Selecting LLMs

Streaming output is supported by most endpoints [unfortunately not by AWS Lambda with a Python runtime]. But to receive the output in chunks, i.e. word by word instead of the whole output at once, your application must be capable of handling streaming output.

Input context length and output sequence max length

All models have a limited maximum input context length and maximum output sequence length. Your use cases might require a larger input context length.

Deploying LLMs

  • Deployment vs productionization
  • Classical ML model pipeline
    - Open-source tools
    - AWS SageMaker Pipelines
    - Different ways to deploy model on SageMaker
    - BYOC (Bring your own container)
    - Deploying multiple models
  • LLM Inference with Quantization
    - Quantize with AutoGPTQ
    - Quantize with llama.cpp
  • Deploy LLM on Local Machine
    - llama.cpp
    - Ollama
    - Transformers
    - text-generation webui by oobabooga
    - GPT4ALL
    - Chat with RTX by Nvidia
  • Deploy LLM on cloud
    - Major cloud providers
    - Deploy LLMs from HuggingFace on Sagemaker Endpoint
    - Sagemaker Jumpstart
    - SageMaker deployment of LLMs that you have pretrained or fine-tuned
  • Deploy using containers
    - Benefits of using containers
    - GPU and containers
    - Using Ollama
  • Using specialized hardware for inference
    - AWS Inferentia
    - Apple Neural engine
  • Deployment on edge devices
    - Different types of edge devices
    - TensorFlow Lite
    - SageMaker Neo
    - ONNX
  • CI/CD Pipeline for LLM based applications
    - Fine-tuning Pipeline
  • Capturing endpoint statistics
    - Ways to capture endpoint statistics
    - Cloud provider endpoints

Deployment vs productionization

What does model deployment mean?

Model deployment is the process of making a trained machine learning model available for use in a specific environment. It involves taking the model from a development or testing environment and deploying it to a production or operational environment where it can be accessed by end-users or other systems through an endpoint.

What does productionization of model mean?

Putting a model into production specifically refers to the step where a model is incorporated into the live or operational environment, and it actively influences or aids real-world processes or decisions. It also involves automating your workflow to the extent possible.

Difference between model deployment and putting it in production

This is a little bit controversial: some folks see productionization as a subset of model deployment, while for others model deployment is a subset of the whole process of putting a model in production.

In essence, model deployment is about making the model available, while putting a model into production extends this concept to cover the entire life cycle, emphasising continuous improvement, monitoring, and adaptation to real-world conditions.

Classical ML model pipeline

ML Pipeline sample

A typical machine learning model pipeline begins with data query from a database, retrieving relevant datasets for analysis. Next, data preprocessing techniques are applied to clean, transform, and prepare the data for modeling. This step involves handling missing values, encoding categorical variables, and scaling numerical features to ensure optimal model performance.

Following data preprocessing, the model is trained on the prepared dataset. Depending on the application, continuous training may be implemented to update the model with new data over time, ensuring it remains up-to-date and relevant. Model evaluation is then conducted to assess the performance of both the champion (current) and challenger (new) models. This involves using appropriate evaluation metrics to compare their predictive capabilities and determine if the new model surpasses the existing one.
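The champion versus challenger comparison can be reduced to a small decision rule. This is a sketch only: the `min_improvement` threshold is a hypothetical safeguard against promoting a model on noise, and it assumes a metric where higher is better.

```python
def pick_model(champion_score: float, challenger_score: float, min_improvement: float = 0.01) -> str:
    """Promote the challenger only if it beats the champion by a clear margin."""
    if challenger_score - champion_score >= min_improvement:
        return "challenger"
    return "champion"
```

For example, with an accuracy of 0.90 for the current model and 0.92 for the new one, the challenger is promoted; at 0.905 the champion stays in place.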

Once the best-performing model is identified, it is registered in a model registry, where detailed information about the model, such as its version, parameters, and performance metrics, is stored. This allows for easy tracking and management of different model versions.

Finally, the selected model is deployed either automatically or manually to a production environment where it can be used to make predictions on new data. This deployment process involves setting up model endpoints, APIs, or services to expose the model’s functionality to other applications or users.

In addition to the main pipeline steps, other important considerations include data versioning to track changes in datasets over time, documenting all experiments and model development processes for reproducibility and transparency, performing data integrity checks to ensure the quality and consistency of the data used for modeling, and collecting metrics from model endpoints to monitor performance and usage in production environments. These elements help ensure the reliability, efficiency, and scalability of the machine learning pipeline.

Open-source tools

There are a number of battle-hardened open-source ML workflow management tools, and new ones arrive every fortnight. There will be trade-offs no matter which tool you choose. It is best to create a simple, small workflow on the chosen tool and productionize it in your existing infrastructure to check its viability before going full throttle. Try to cover as many nuances as possible in that small pipeline to avoid surprises later. For example, if your database is in a VPC, or you always use a managed model registry and endpoint, create that sample project aligned with these constraints.

  1. MLflow — can manage any ML or generative AI project, with integrations for PyTorch, HuggingFace, OpenAI, LangChain, TensorFlow, scikit-learn, XGBoost, LightGBM and more. It is also the easiest to pick up once you get through the initial setup phase.
  2. Kubeflow — an open-source platform for machine learning and MLOps on Kubernetes, introduced by Google. It is highly efficient and requires Kubernetes. Kubeflow pipelines are very robust, with some of the most reliable model deployments I have ever seen.
  3. Apache Airflow — the original gangster (OG) of workflow management tools. Though not specific to machine learning, many engineers prefer it over any other tool because of its wider community. You will typically use it with [add] to execute any data processing logic in your workflow.
  4. Metaflow — originally developed at Netflix and open-sourced in 2019. It has many advantages over other workflow tools, such as prototyping locally and deploying on your existing infrastructure with a few clicks, and easy collaboration with your team. Just like any other tool, it has a slight learning curve.

There are also many other awesome tools, like TensorFlow Extended, semantic, for entire ML workflow management. Always keep a lookout, and design your architecture so that swapping tools is easy, like a two-way door decision.

AWS SageMaker Pipelines

The concepts are important and we will use them in the next section which talks about productionization of LLMs.

SageMaker Pipelines is the standard, full-featured and most complete way to implement machine learning pipelines on SageMaker. It integrates with SageMaker Feature Store, Data Wrangler, processing jobs, training jobs, hyperparameter tuning jobs, the model registry, batch transform and model endpoints.

Each pipeline step is designed to achieve one specific objective. Data preprocessing step will query the data and make it ready for training. Training step will only train the model. Model evaluation step will evaluate the newly trained model. And so on. Each step can use different type of instance, different type of container with different libraries.

Libraries needed for data preprocessing will be different than libraries needed for training the model. SageMaker provides different options of using built-in algorithms, bring your own scripts and use available algorithms, or bring your own container (BYOC).

SageMaker has three options to build, train, optimize, and deploy our model
SageMaker pipeline

Different ways to deploy model on SageMaker

There are several options to deploy a model using SageMaker hosting services. You can interactively deploy a model with SageMaker Studio. Or, you can programmatically deploy a model using an AWS SDK, such as the SageMaker Python SDK or the SDK for Python (Boto3). You can also deploy by using the AWS CLI.


Before you begin

Every model has a model registry entry, an endpoint config, and an endpoint.

To deploy any model on SageMaker you need three things:

  1. Model registry
  2. Endpoint config
  3. Endpoint

Model registry will require:

Model artifact — model binary typically stored in S3
Inference code — containerized inference code image stored in ECR

Endpoint config will require:

Selecting whether you want a serverless or a dedicated instance.
Optional variants of models in production and shadow space. The default is a single production variant.
For each variant, the instance type and, if there is more than one variant in the production or shadow space, the variant weights used to distribute the traffic.

Endpoint creation will require:

An endpoint name
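To make the variant weights concrete, here is a small sketch of a two-variant configuration and the traffic split it implies. The dictionary keys mirror the `ProductionVariants` fields of the SageMaker API; the variant names, model names, instance type and the `traffic_shares` helper are hypothetical. Each variant receives its weight divided by the sum of all weights.

```python
def traffic_shares(variants: list[dict]) -> dict[str, float]:
    """Fraction of traffic each variant receives: its weight over the sum of weights."""
    total = sum(v["InitialVariantWeight"] for v in variants)
    return {v["VariantName"]: v["InitialVariantWeight"] / total for v in variants}


# Hypothetical endpoint config with two production variants
production_variants = [
    {
        "VariantName": "champion",
        "ModelName": "<MODEL_A>",
        "InstanceType": "ml.m5.xlarge",
        "InitialInstanceCount": 1,
        "InitialVariantWeight": 3,
    },
    {
        "VariantName": "challenger",
        "ModelName": "<MODEL_B>",
        "InstanceType": "ml.m5.xlarge",
        "InitialInstanceCount": 1,
        "InitialVariantWeight": 1,
    },
]

shares = traffic_shares(production_variants)
```

With weights of 3 and 1, the champion serves three quarters of the requests and the challenger one quarter, which is a common setup for a cautious A/B rollout.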

Real time inference

Real-time inference is ideal for inference workloads where you have real-time, interactive, low latency requirements. You can deploy your model to SageMaker hosting services and get an endpoint that can be used for inference. These endpoints are fully managed and support autoscaling too.

Batch transform

Use batch transform when you need to do the following:

  • Preprocess datasets to remove noise or bias that interferes with training or inference from your dataset.
  • Get inferences from large datasets.
  • Run inference when you don’t need a persistent endpoint.
  • Associate input records with inferences to assist the interpretation of results.

Asynchronous inference

Amazon SageMaker Asynchronous Inference is a capability in SageMaker that queues incoming requests and processes them asynchronously. This option is ideal for requests with large payload sizes (up to 1GB), long processing times (up to one hour), and near real-time latency requirements. Asynchronous Inference enables you to save on costs by autoscaling the instance count to zero when there are no requests to process, so you only pay when your endpoint is processing requests.

Serverless and AWS Lambda

Amazon SageMaker Serverless Inference is a purpose-built inference option that enables you to deploy and scale ML models without configuring or managing any of the underlying infrastructure. On-demand Serverless Inference is ideal for workloads which have idle periods between traffic spurts and can tolerate cold starts. Serverless endpoints automatically launch compute resources and scale them in and out depending on traffic, eliminating the need to choose instance types or manage scaling policies. This takes away the undifferentiated heavy lifting of selecting and managing servers. Serverless Inference integrates with AWS Lambda to offer you high availability, built-in fault tolerance and automatic scaling. With a pay-per-use model, Serverless Inference is a cost-effective option if you have an infrequent or unpredictable traffic pattern. During times when there are no requests, Serverless Inference scales your endpoint down to 0, helping you to minimize your costs.

Optionally, you can also use Provisioned Concurrency with Serverless Inference. Serverless Inference with provisioned concurrency is a cost-effective option when you have predictable bursts in your traffic. Provisioned Concurrency allows you to deploy models on serverless endpoints with predictable performance, and high scalability by keeping your endpoints warm. SageMaker ensures that for the number of Provisioned Concurrency that you allocate, the compute resources are initialized and ready to respond within milliseconds.

BYOC (Bring your own container)

You can package your own algorithms that can then be trained and deployed in the SageMaker environment. By packaging an algorithm in a container, you can bring almost any code to the Amazon SageMaker environment, regardless of programming language, environment, framework, or dependencies.

You may not need to create a container to bring your own code to Amazon SageMaker. When you are using a framework (such as Apache MXNet or TensorFlow) that has direct support in SageMaker, you can simply supply the Python code that implements your algorithm using the SDK entry points for that framework. This set of frameworks is continually expanding, so we recommend that you check the current list if your algorithm is written in a common machine learning environment.

Even if there is direct SDK support for your environment or framework, you may find it more effective to build your own container. If the code that implements your algorithm is quite complex on its own or you need special additions to the framework, building your own container may be the right choice.

Custom containers can be built with sagemaker-training-toolkit and sagemaker-inference-toolkit for training and inference respectively.

Deploying multiple models

You can create both CPU and GPU backed multi-model endpoint where multiple models are hosted behind a single endpoint. SageMaker automatically offloads unused models to make space for hot models.

Multi-model endpoints can be created from the console as well as from the SDK.

Multi-model endpoint via Console

Creating multi model endpoint using SDK:

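import json
import boto3
import sagemaker

sagemaker_client = boto3.client("sagemaker")
role = sagemaker.get_execution_role()  # IAM role assumed by SageMaker

# "MultiModel" mode lets one container serve every artifact under the S3 prefix
multi_model_container = {
    "Image": "<ACCOUNT_ID>.dkr.ecr.<REGION_NAME>.amazonaws.com/<IMAGE>:<TAG>",
    "ModelDataUrl": "s3://<BUCKET_NAME>/<PATH_TO_ARTIFACTS>",
    "Mode": "MultiModel",
}

response = sagemaker_client.create_model(
    ModelName="<MODEL_NAME>",
    ExecutionRoleArn=role,
    Containers=[multi_model_container],
)

response = sagemaker_client.create_endpoint_config(
    EndpointConfigName="<ENDPOINT_CONFIG_NAME>",
    ProductionVariants=[
        {
            "InstanceType": "ml.m4.xlarge",
            "InitialInstanceCount": 2,
            "InitialVariantWeight": 1,
            "ModelName": "<MODEL_NAME>",
            "VariantName": "AllTraffic",
        }
    ],
)

# The endpoint itself is created from the endpoint config
response = sagemaker_client.create_endpoint(
    EndpointName="<ENDPOINT_NAME>",
    EndpointConfigName="<ENDPOINT_CONFIG_NAME>",
)

A multi-model endpoint is invoked by passing the TargetModel parameter in the invoke_endpoint API.

runtime_sm_client = boto3.client("sagemaker-runtime")
response = runtime_sm_client.invoke_endpoint(
    EndpointName="<ENDPOINT_NAME>",
    ContentType="application/json",
    TargetModel="<MODEL_FILENAME>.tar.gz",  # artifact that should serve this request
    Body="<PAYLOAD>",  # placeholder for your serialized request
)

print(*json.loads(response["Body"].read()), sep="\n")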

LLM Inference with Quantization

We covered quantization briefly in the Local model inference section of the How to play with LLMs chapter. In this section we will demonstrate a few ways to quantize LLMs.

Quantization involves mapping higher-precision model weights to a lower precision. You can map 32-bit to 8-bit, or even 8-bit to 1-bit. To achieve quantization, we need to find the optimum way to project a higher precision, like 32-bit, onto a lower precision, like 16-bit.
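One of the simplest such mappings is absmax quantization, which scales the weights so that the largest magnitude lands on the edge of the int8 range. The sketch below is only an illustration in pure Python; it is not the method used by GPTQ or llama.cpp, and the function names are hypothetical.

```python
def quantize_absmax(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0  # avoid dividing by zero for all-zero weights
    return [round(w / scale) for w in weights], scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]
```

Each integer needs 8 bits instead of 32, which is where the 4x size reduction comes from. The price is a rounding error of at most half the scale factor per weight.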

A 64-bit floating point number has 11 bits for the exponent, 52 bits for the fraction and 1 bit for the sign.


And a 32-bit floating point number has 8 bits for the exponent, 23 bits for the fraction and 1 bit for the sign.
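The 32-bit layout can be inspected directly with Python's standard `struct` module. This is a small sketch; the helper name `float32_fields` is hypothetical.

```python
import struct


def float32_fields(value: float) -> tuple[int, int, int]:
    """Return the (sign, exponent, fraction) bit fields of value stored as IEEE 754 float32."""
    # Pack as a big-endian float32, then reinterpret the same 4 bytes as an unsigned int
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, stored with a bias of 127
    fraction = bits & 0x7FFFFF        # 23 bits
    return sign, exponent, fraction
```

For instance, 1.0 is stored with sign 0, a biased exponent of 127 (that is, 2^0) and an all-zero fraction.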


Other notable floating-point formats:


To reduce model size for inference, you need to perform post-training quantization for models obtained from the internet. For in-house models, you can perform quantization-aware training during the pretraining stage.

Quantize with AutoGPTQ

Many open-source models already have quantized versions available. Check their model cards to understand the APIs. To quantize any text model, you can use the AutoGPTQ library, which provides a simple API that applies GPTQ quantization (there are also other methods) to language models.

Quantize with llama.cpp

  1. Install llama.cpp
  2. Download the LLM into ./models/ directory. Ensure you have a model weights file and a tokenizer file.

3. Convert the model to gguf FP16 format:

python3 models/mymodel/

4. Quantize the model to 4-bits (using Q4_K_M method):

./quantize ./models/mymodel/ggml-model-f16.gguf ./models/mymodel/ggml-model-Q4_K_M.gguf Q4_K_M

5. Run the quantized model:

./main -m ./models/mymodel/ggml-model-Q4_K_M.gguf -n 128

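Quantized models can be relatively slow

Apart from the loss of quality, quantized models can also have slower inference, depending on the quantization method and hardware, because the weights have to be dequantized at run time.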

Check out the awesome article on LLM quantization by Miguel:

Deploy LLM on Local Machine

We covered local machine deployment in the Local model inference section of the How to play with LLMs chapter.


llama.cpp

Open-source | Mac OS, Linux, Windows (via CMake), Docker, FreeBSD

  1. Install llama.cpp
  2. Download any gguf model into models/ directory or convert your LLM to ggml format using the file
python3 models/mymodel/

3. Start the server:

./server -m models/YOUR_MODEL.gguf -c 2048

When “model loaded” appears at the end of the terminal output, the server is ready to accept POST requests.

Test using CURL from another terminal window:

curl --request POST \
--url http://localhost:8080/completion \
--header "Content-Type: application/json" \
--data '{"prompt": "What is the meaning of life? The meaning of life","n_predict": 128}'
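The same request can be sent from Python using only the standard library. This sketch assumes the server from the previous step is listening on localhost:8080 and that the generated text is returned in the `content` field of the JSON response; both function names are hypothetical.

```python
import json
import urllib.request


def build_completion_request(
    prompt: str, n_predict: int = 128, base_url: str = "http://localhost:8080"
) -> urllib.request.Request:
    """Build the POST request for the llama.cpp server's /completion route."""
    payload = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/completion",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, n_predict: int = 128) -> str:
    """Send the request and return the generated text (requires a running server)."""
    with urllib.request.urlopen(build_completion_request(prompt, n_predict)) as resp:
        # Assumption: the generated text is in the "content" field of the response
        return json.loads(resp.read())["content"]
```

Calling `complete("What is the meaning of life? The meaning of life")` mirrors the curl command above. Splitting request building from sending makes the payload easy to unit test without a running server.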


Ollama

Open-source | Mac & Linux

  1. Download Ollama, or copy the install command for Linux from the official website

2. Run the command in terminal

curl -fsSL | sh

Note: it automatically detected my NVIDIA GeForce 960M 3GB GPU

3. Now you can either directly run any of the models from Ollama library or customize the prompt before running them. To run directly:

ollama run phi

4. To customize the prompt, first pull the model

ollama pull gemma:2b-instruct

Create a Modelfile

FROM gemma:2b-instruct
# set the temperature to 0.6 [higher is more creative, lower is more coherent]
PARAMETER temperature 0.6
# set the system message that defines the chatbot's behavior
SYSTEM """
You are a psychologist chatbot called Therapy. Answer all the user's questions with empathy.
"""

5. Create the model

ollama create therapy -f ./Modelfile

6. Invoke the Ollama endpoint (refer to the API documentation for all endpoints)

curl http://localhost:11434/api/generate -d '{
"model": "therapy",
"prompt": "Why is the sky blue?"
}'

This will generate a streaming response:
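The streamed reply arrives as one JSON object per line, each carrying a piece of the text in its `response` field and a `done` flag on the final object. A minimal sketch for stitching those pieces back together follows; the helper name and the sample lines are made up for illustration.

```python
import json
from typing import Iterable


def join_stream(lines: Iterable[str]) -> str:
    """Concatenate the 'response' pieces of a newline-delimited JSON stream."""
    parts = []
    for line in lines:
        if not line.strip():
            continue  # skip keep-alive or empty lines
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final object marks the end of the generation
    return "".join(parts)


# Made-up sample of what the stream can look like
sample = [
    '{"model": "therapy", "response": "The sky", "done": false}',
    '{"model": "therapy", "response": " is blue", "done": false}',
    '{"model": "therapy", "response": ".", "done": true}',
]
text = join_stream(sample)
```

In an application you would feed this function the lines of the HTTP response as they arrive, or print each piece immediately to give the user a typing effect.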

You will also see the logs in the main terminal window:

Refer to all the server command line parameters from official documentation.

🤗 Transformers

text-generation webui by oobabooga


GPT4All

GPT4All has a server mode. Check the official documentation for more info.

GPT4All Server Model in Settings

Chat with RTX by Nvidia (for Windows platform only)

Deploy LLM on cloud

Major cloud providers


AWS

Amazon SageMaker — A comprehensive platform for machine learning and managing the entire life-cycle of LLMs. It supports custom model development, deployment and scaling with access to pre-trained models.

AWS Lambda — Serverless compute for workloads that scale down to zero; it can also be combined with Step Functions state machines to orchestrate event-driven pipelines.

Amazon Elastic Kubernetes Service (EKS) — a managed Kubernetes service to orchestrate containers for the microservices of your LLM based applications.


Azure

Azure Machine Learning — Offers various tools for deploying LLMs, like Model Management, Endpoints, Batch Scoring and Managed Inference, to set up a scalable and managed infrastructure for real-time or batch inference.

Azure Kubernetes Service (AKS): a managed Kubernetes service by Microsoft. It runs your LLM in containerized form for you.

Azure Functions: Serverless functions to deploy your LLM model for lightweight, event-driven interactions.

Google Cloud

Vertex AI — provides purpose-built MLOps tools for data scientists and ML engineers to automate, standardize, and manage ML projects. Some of the functionalities include model management, managed inference, batch inference and custom containers.

Cloud Run and Cloud Functions — serverless platforms to deploy LLMs as lightweight, event-driven applications, ideal for smaller models or microservices.

Note: all of them provide NVIDIA GPUs at on-demand prices. Your new or personal account may not qualify for high-grade GPUs (thanks to bitcoin miners)!

Deploy LLMs from HuggingFace on Sagemaker Endpoint

Easiest way to quickly deploy a HuggingFace model

If you have a new model that you want to test quickly, go to the model card page on HuggingFace:

Click on “Deploy” button

Typically you will always have Sagemaker as an option, click on it

Copy the boilerplate code and paste it in either SageMaker Studio or your notebook

Deploy HuggingFace model on SageMaker endpoint

Now you must change a few of the variables:

  1. Add your AWS account's SageMaker execution role. If you are running it from SageMaker Studio, then sagemaker.get_execution_role() will suffice.
  2. Adjust some of the model configurations (which will be part of the endpoint configuration), for example the instance type, number of GPUs (if supported) and instance count.

This will deploy a dedicated endpoint on our sagemaker domain.

To invoke any sagemaker endpoint you will need an environment with boto3 installed (except when using database triggers like AWS Aurora Postgres Sagemaker trigger).
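A minimal invocation sketch is shown below. It assumes `client` is a boto3 `sagemaker-runtime` client and that the container accepts the `{"inputs": ..., "parameters": ...}` JSON payload commonly used by HuggingFace inference containers; the function name, endpoint name and parameter values are hypothetical, so adapt them to your model.

```python
import json


def invoke_llm_endpoint(client, endpoint_name: str, prompt: str, max_new_tokens: int = 128):
    """Send a prompt to a SageMaker endpoint and return the decoded JSON response.

    `client` is expected to behave like boto3.client("sagemaker-runtime").
    """
    # Assumption: payload shape used by HuggingFace text-generation containers
    body = json.dumps({"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}})
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=body,
    )
    # The response body is a stream; read and decode it
    return json.loads(response["Body"].read())
```

In a real environment you would call it as `invoke_llm_endpoint(boto3.client("sagemaker-runtime"), "<ENDPOINT_NAME>", "Hello")`. Passing the client in as an argument keeps the function easy to test with a stub.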

Invoke SageMaker endpoint

Sagemaker Jumpstart

SageMaker Jumpstart model deployment code

SageMaker deployment of LLMs that you have pretrained or fine-tuned

To deploy custom LLMs, first check whether the libraries used by the model are available in the built-in frameworks; if so, use script mode to pass your custom scripts. If your LLM uses custom packages, then use Bring your own container (BYOC) mode with the SageMaker inference toolkit to create a custom inference container.

Check out more examples on sagemaker:

Deploy using containers

Benefits of using containers

In the world of Service Oriented Architecture (SOA), containers are a blessing. Orchestrating a large number of containers is a challenge, but a containerized service has numerous benefits when compared with an app running on virtual machines.

Large Language Models have higher memory requirements than a classical web service. This means that we have to understand these memory requirements before containerizing LLMs or LLM-based endpoints. Barring a small number of cases, such as when your generative model fits on a single server and only that one server is needed, containerizing your LLM is advisable for production use cases.

  1. Scalability and infrastructure optimization — fine-grained dynamic and elastic provisioning of resources (CPU, GPU, memory, persistent volumes), dynamic scaling and maximized component/resource density to make best use of infrastructure resources.
  2. Operational consistency and Component portability — automation of build and deployment, reducing the range of skillsets required to operate many different environments. Portability across nodes, environments, and clouds, images can be built and run on any container platform enabling you to focus on open containerization standards such as Docker and Kubernetes.
  3. Service resiliency — Rapid re-start, ability to implement clean re-instatement, safe independent deployment, removing risk of destabilizing existing components and fine-grained roll-out using rolling upgrades, canary releases, and A/B testing.

GPU and containers

You can use your dedicated GPU or cloud GPU with containers. If you have a laptop then check the GPU memory and model size before containerizing your app.
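The check can be done with simple arithmetic before building any image. The sketch below estimates the size of the weights only (parameters times bits per parameter) and adds a flat overhead allowance for activations and the KV cache; the 20% overhead is my assumption, not a measured figure.

```python
def weights_size_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def fits_on_gpu(num_params: float, bits_per_param: int,
                gpu_memory_gb: float, overhead_fraction: float = 0.2) -> bool:
    """True if the weights plus a rough overhead allowance fit in GPU memory."""
    needed_gb = weights_size_gb(num_params, bits_per_param) * (1 + overhead_fraction)
    return needed_gb <= gpu_memory_gb


# A 7B-parameter model quantized to 4 bits needs about 3.5 GB for weights alone,
# so it does not fit on a 3 GB GPU. A 1B-parameter model at 4 bits (0.5 GB) does.
print(fits_on_gpu(7e9, 4, gpu_memory_gb=3))  # False
print(fits_on_gpu(1e9, 4, gpu_memory_gb=3))  # True
```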

Containers, whether run with Docker or with a different runtime like containerd or CRI-O, use the NVIDIA Container Toolkit, which installs the NVIDIA Container Runtime (nvidia-container-runtime) on the host machine.


For containerd runtime, NVIDIA Container Runtime is configured as an OCI-compliant runtime and uses NVIDIA CUDA, NVML drivers at the lowest level via NVIDIA Container Runtime Hook (`nvidia-container-runtime-hook`), with the flow through the various components as shown in the following diagram:


After installing the NVIDIA Container Toolkit, you can run a sample container to test the NVIDIA GPU driver. Official documentation to run a sample workload:

My system is an Alienware 15 (2014) with a discrete GPU, an Nvidia GeForce 960M with 3GB GDDR5 memory, and 8GB DDR3L 1600 MHz RAM. After running the sample container, I got the following response:

sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

smi — system management interface

To use custom containers with a GPU at runtime, pass the --gpus parameter to the docker run command. For example:

docker run --gpus all tensorflow/tensorflow:latest-gpu

To utilise GPU during build time, you have two options:

  1. Modify the daemon.json file inside the /etc/docker directory and change the default runtime to nvidia.
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}

  2. Or, use one of the nvidia/cuda images. The following sample Dockerfile, which I have taken from here, uses nvidia/cuda:11.4.0-base-ubuntu20.04 as the base image to check PyTorch GPU support from inside the container.

FROM nvidia/cuda:11.4.0-base-ubuntu20.04

ENV DEBIAN_FRONTEND=noninteractive

# Install python
RUN apt-get update && \
    apt-get install -y \
        git \
        python3-pip \
        python3-dev \
        python3-opencv

# Install PyTorch and torchvision
RUN pip3 install torch torchvision torchaudio -f


# COPY necessary files for inference

ENTRYPOINT [ "python3" ]

After building the image, verify PyTorch installation:

docker exec -it <Container name> /bin/bash

From inside the container run the commands one-by-one:


import torch


torch.cuda.get_device_name(0) # Change to your desired GPU if your machine has multiple

By using one of the NVIDIA CUDA base images you can execute LLM inference on many platforms, like:

  1. LLM with GPU on Docker container locally
  2. GPU on EC2
  3. GPU on AWS Fargate
  4. GPU on Kubernetes

Another sample Dockerfile with an NVIDIA CUDA base image, from here:

FROM --platform=amd64 as base


WORKDIR /workspace

RUN apt update && \
apt install -y python3-pip python3-packaging \
git ninja-build && \
pip3 install -U pip

# Tweak this list to reduce build time
ENV TORCH_CUDA_ARCH_LIST "7.0;7.2;7.5;8.0;8.6;8.9;9.0"

# We have to manually install Torch otherwise apex & xformers won't build
RUN pip3 install "torch>=2.0.0"

# This build is slow but NVIDIA does not provide binaries. Increase MAX_JOBS as needed.
RUN git clone && \
cd apex && git checkout 2386a912164b0c5cfcd8be7a2b890fbac5607c82 && \
sed -i '/check_cuda_torch_binary_vs_bare_metal(CUDA_HOME)/d' && \
python3 install --cpp_ext --cuda_ext

RUN pip3 install "xformers==0.0.22" "transformers==4.34.0" "vllm==0.2.0" "fschat[model_worker]==0.2.30"



Using Ollama

In an earlier section we saw how to start the Ollama server and execute curl commands from the command line. You can write the Ollama installation and server execution commands in the Dockerfile to use Ollama.

You can also use the official Ollama Docker image, which is available on Docker Hub. Make sure to install the NVIDIA Container Toolkit to use the GPU.
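Once the server is running in a container, your application can call it over HTTP instead of using curl. The sketch below assumes the container publishes Ollama's default port 11434 on localhost and that a model named "llama2" has already been pulled; both are assumptions to adjust for your setup.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default port


def build_generate_request(model: str, prompt: str,
                           url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build a non-streaming request for Ollama's generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, url: str = OLLAMA_URL,
             timeout: float = 120.0) -> str:
    """Call the Ollama server and return the generated text."""
    request = build_generate_request(model, prompt, url)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


# Example (needs a running Ollama server with the model pulled):
# print(generate("llama2", "Why is the sky blue?"))
```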

Using specialized hardware for inference

AWS Inferentia

ml.inf2 family of instances are designed for deep learning and generative models inference. AWS claims these instances deliver up to 2.3x higher throughput and up to 70% lower cost per inference than comparable Amazon EC2 instances.

AWS Trainium can be used for training generative models, while Inferentia is used for inference.

You can use ml.inf2 instances to deploy SageMaker Jumpstart models, or any LLM deployed on a SageMaker endpoint.

from sagemaker.jumpstart.model import JumpStartModel

model_id = "meta-textgenerationneuron-llama-2-13b-f"
model = JumpStartModel(
    model_id=model_id,
    env={
        "OPTION_DTYPE": "fp16",
    },
)
# Deploying Llama 2 requires accepting the end-user license agreement
pretrained_predictor = model.deploy(accept_eula=True)
payload = {
    "inputs": "I believe the meaning of life is",
    "parameters": {
        "max_new_tokens": 64,
        "top_p": 0.9,
        "temperature": 0.6,
    },
}

response = pretrained_predictor.predict(payload)

Apple Neural engine

Apple’s Neural Engine (ANE) is the marketing name for a group of specialized cores functioning as a neural processing unit (NPU) dedicated to the acceleration of artificial intelligence operations and machine learning tasks. Source

Source: Apple 2020

The ANE isn’t the only NPU out there. Besides the Neural Engine, the most famous NPU is Google’s TPU (or Tensor Processing Unit).


To run inference with the ANE you will have to install the ane-transformers package from pip (and then pray that it works, because Apple hasn’t updated it in the last 2 years).

GitHub repo of Apple’s ml-ane-transformers.

Initialize baseline model

import transformers
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
baseline_model = transformers.AutoModelForSequenceClassification.from_pretrained(
    model_name, return_dict=False, torchscript=True,
).eval()

Initialize the mathematically equivalent but optimized model, and restore its parameters from those of the baseline model

from ane_transformers.huggingface import distilbert as ane_distilbert
optimized_model = ane_distilbert.DistilBertForSequenceClassification(
    baseline_model.config).eval()
optimized_model.load_state_dict(baseline_model.state_dict())

Create sample inputs for the model

tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
tokenized = tokenizer(
    ["Sample input text to trace the model"],
    return_tensors="pt",
    max_length=128, # token sequence length
    padding="max_length",
)

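import torch
# Trace the optimized model to TorchScript, the input format coremltools expects
traced_optimized_model = torch.jit.trace(
    optimized_model,
    (tokenized["input_ids"], tokenized["attention_mask"])
)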

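Use coremltools to generate the Core ML model package file and save it

import coremltools as ct
import numpy as np
ane_mlpackage_obj = ct.convert(
    traced_optimized_model, convert_to="mlprogram",
    inputs=[ct.TensorType(f"input_{name}", shape=tensor.shape, dtype=np.int32)
            for name, tensor in tokenized.items()],
)
out_path = "HuggingFace_ane_transformers_distilbert_seqLen128_batchSize1.mlpackage"
ane_mlpackage_obj.save(out_path)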

Follow the installation and troubleshooting instructions in the official GitHub repo.


Deployment on edge devices

Different types of edge devices

There are different types of edge computing; here we will discuss the Internet of Things (IoT) edge. Some of the common IoT devices include:

  • Mobile devices
  • Connected cameras
  • Retail Kiosks
  • Sensors
  • Smart devices like smart parking meters
  • Cars and other similar products

Tensorflow Lite

For mobile devices with On-Device Machine Learning (ODML) capabilities, or even edge devices like Raspberry Pi, you can convert your existing LLM to a TensorFlow Lite (.tflite) model and run inference in mobile apps. TensorFlow Lite is a library for deploying models on mobile devices, microcontrollers and other edge devices.

Conceptual architecture for TensorFlow Lite. Image source:

The high level developer workflow for using TensorFlow Lite is: first convert a TensorFlow model to the more compact TensorFlow Lite format using the TensorFlow Lite converter, and then use the TensorFlow Lite interpreter, which is highly optimized for mobile devices, to run the converted model. During the conversion process, you can also leverage several techniques, such as quantization, to further optimize the model and accelerate inference.


Copy this official google colab to play with GPT2CausalLM model with TensorFlow Lite.

See more examples of TensorFlow Lite (for iOS, Android and Raspberry Pi) :

SageMaker Neo

Amazon SageMaker Neo enables developers to optimize machine learning models for inference on SageMaker in the cloud and supported devices at the edge.

Steps to optimize ML models with SageMaker Neo:

  1. Build and train an ML model using any of the frameworks SageMaker Neo supports.
  2. Or upload an existing model’s artefacts in an S3 bucket.
  3. Use SageMaker Neo to create an optimized deployment package for the ML model framework and target hardware, such as EC2 instances and edge devices. This is the only additional task compared to the usual ML deployment process.
  4. Deploy the optimized ML model generated by SageMaker Neo on the target cloud or edge infrastructure.

Example of model compilation for some of the edge devices using SageMaker Neo:

Deploy LLM with SageMaker Neo. Source:


ONNX is a community project, a format built to represent machine learning models. ONNX defines a common set of operators — the building blocks of machine learning and deep learning models — and a common file format to enable AI developers to use models with a variety of frameworks, tools, runtimes, and compilers.

If you have a model in one of the ONNX-supported frameworks, which include all major ML frameworks, then you can use ONNX to optimize the model to maximize performance across hardware with one of the supported accelerators like Mace, NVIDIA, Optimum, Qualcomm, Synopsys, TensorFlow, Windows, Vespa and more.

Using other tools

If your edge device has a kernel and supports containers, you can run generative model inference on it: people have successfully run Code Llama and llama.cpp on such devices.

If the edge device has its own developer kit like NVIDIA IGX Orin, then see the official documentation for edge deployment.

CI/CD Pipeline for LLM based applications

Your model pipeline will vary depending on the architecture. For a RAG architecture, you will want to update your vector storage with new knowledge bases or updated articles. Updating the embeddings of only the updated articles is a better choice than re-embedding the whole corpus every time any article changes.
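One simple way to find which articles need new embeddings is to store a content hash next to each vector and compare it on every pipeline run. The sketch below is a minimal version of that idea; the function names and the shape of the stored hashes (a dict keyed by article id) are my assumptions, not tied to any specific vector database.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of an article's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_embedding_updates(articles: dict, stored_hashes: dict):
    """Compare current articles with stored hashes.

    Returns (ids to embed or re-embed, ids to delete from the vector store).
    """
    to_embed = [
        article_id
        for article_id, text in articles.items()
        if stored_hashes.get(article_id) != content_hash(text)
    ]
    to_delete = [article_id for article_id in stored_hashes if article_id not in articles]
    return to_embed, to_delete


# Hypothetical corpus: "a" is unchanged, "b" was edited, "c" is new, "d" was removed.
stored = {"a": content_hash("v1"), "b": content_hash("v2"), "d": content_hash("old")}
current = {"a": "v1", "b": "v2 edited", "c": "brand new"}
print(plan_embedding_updates(current, stored))  # (['b', 'c'], ['d'])
```

Only the ids in the first list are sent to the embedding model, which keeps the cost of each run proportional to what changed.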

In a continuous pretraining architecture, the foundation model is continuously pretrained on new data. To keep the model from degrading due to bad data, you need a robust pipeline with data checks, endpoint drift detection and rule/model based evaluations.

For an architecture that has a fine-tuned generative model, you can add rule based checks that are triggered with pre-commit every time code changes are committed by developers.
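A rule based check can be as small as a script that runs a few fixed prompts through the model and fails when an output breaks a rule. The rules below (non-empty output, a length cap and one banned phrase) are illustrative assumptions; how the outputs are generated is left to your own pipeline.

```python
import re

MAX_CHARS = 500  # assumed length cap for a single answer
BANNED_PATTERNS = [re.compile(r"as an ai language model", re.IGNORECASE)]


def check_output(text: str) -> list:
    """Return a list of rule violations for one model output."""
    failures = []
    if not text.strip():
        failures.append("empty output")
    if len(text) > MAX_CHARS:
        failures.append(f"output longer than {MAX_CHARS} characters")
    for pattern in BANNED_PATTERNS:
        if pattern.search(text):
            failures.append(f"banned phrase matched: {pattern.pattern}")
    return failures


def run_checks(outputs: list) -> int:
    """Return a process exit code: 0 if every output passes, 1 otherwise."""
    failed = False
    for index, text in enumerate(outputs):
        for failure in check_output(text):
            print(f"output {index}: {failure}")
            failed = True
    return 1 if failed else 0


# In a hook script you would generate `outputs` from fixed prompts and then call:
# import sys; sys.exit(run_checks(outputs))
```

A non-zero exit code is what makes pre-commit block the commit, so the hook only needs to run this script.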

We discussed rule based and model based evaluations in Evaluating in CI/CD section of Evaluating LLMs chapter.

Fine-tuning Pipeline

In a classical SageMaker model pipeline we use ScriptProcessor to execute custom scripts with custom libraries. We also use it when we have to install a bunch of packages ourselves in the container and host the container image on our ECR.

SageMaker already has public images that have packages installed to support training, processing and hosting deep learning models. These images include common packages like pytorch, tensorflow, transformers, huggingface and many more.

HuggingFace Processor:

In a production pipeline your LLM or any other model will typically do more than just return predictions. For that, SageMaker provides different toolkits. Remember that we used the inference and training toolkits for our custom images in the example of classical models; there is also a SageMaker HuggingFace inference toolkit.

HuggingFace inference toolkit:

Capturing endpoint statistics

Capturing endpoint statistics and processing them is paramount to check for model degradation, scaling, and continuous improvement of service.

Your DevOps team will typically have a dashboard with all the hot data related to the operational metrics of the infrastructure: metrics like network bandwidth, CPU usage across all nodes, RAM usage, response time, number of nodes up/down, number of containers and more.

An MLOps dashboard will usually have feature distributions, KL divergence, prediction value distribution, embedding distribution, memory usage, number of endpoints and other model related metrics like recall, F1, AUC, RMSE, MAP, BLEU, etc.
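As an example of one of these metrics, KL divergence can be used to compare the distribution of a feature in live traffic against a baseline. The sketch below bins both samples with the same edges and computes KL(live || baseline); the bin edges, the sample values and the small epsilon that guards against empty bins are assumptions for illustration.

```python
import bisect
import math


def histogram(values, edges):
    """Normalized histogram over the bins defined by `edges`."""
    counts = [0] * (len(edges) - 1)
    for value in values:
        index = bisect.bisect_right(edges, value) - 1
        # Clamp out-of-range values into the first or last bin.
        index = min(max(index, 0), len(counts) - 1)
        counts[index] += 1
    total = sum(counts)
    return [count / total for count in counts]


def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) in nats for two discrete distributions of equal length."""
    return sum(
        p_i * math.log((p_i + eps) / (q_i + eps))
        for p_i, q_i in zip(p, q)
        if p_i > 0
    )


edges = [0, 1, 2, 3]
baseline = histogram([0.5, 1.5, 2.5, 0.2], edges)  # [0.5, 0.25, 0.25]
live = histogram([2.5, 2.7, 2.1, 0.5], edges)      # [0.25, 0.0, 0.75]
print(kl_divergence(live, baseline))  # larger values mean more drift
```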

For an LLM endpoint, you will have the relevant MLOps metrics and a few LLM-specific metrics.

  • Time to first token (TTFT): This is how quickly users start seeing the model’s output after entering their query.
  • Time per output token (TPOT): Time to generate an output token for each user that is querying the system.

Based on above metrics:

Latency = TTFT + (TPOT) * (the number of tokens to be generated)
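Both metrics can be derived from timestamps recorded while streaming tokens. The sketch below assumes you log the request time and the arrival time of each output token in seconds; the function names are mine, and the last function applies the latency formula above as an estimate.

```python
def time_to_first_token(request_time: float, token_times: list) -> float:
    """Seconds between sending the request and receiving the first token."""
    return token_times[0] - request_time


def time_per_output_token(token_times: list) -> float:
    """Average gap in seconds between consecutive output tokens."""
    if len(token_times) < 2:
        return 0.0
    return (token_times[-1] - token_times[0]) / (len(token_times) - 1)


def estimate_latency(ttft_s: float, tpot_s: float, num_tokens: int) -> float:
    """Latency = TTFT + TPOT * number of tokens to be generated."""
    return ttft_s + tpot_s * num_tokens


# Request sent at t=0.0 s, four tokens arrive 50 ms apart starting at 0.2 s.
token_times = [0.20, 0.25, 0.30, 0.35]
ttft_s = time_to_first_token(0.0, token_times)   # about 0.2 s
tpot_s = time_per_output_token(token_times)      # about 0.05 s
print(estimate_latency(ttft_s, tpot_s, num_tokens=100))  # about 5.2 s
```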

Ways to capture endpoint statistics

For applications where latency is not crucial, you can add the inference output with endpoint metrics to persistent storage before returning the inference from your endpoint. If you are using serverless infrastructure like AWS Lambda, you can extend the inference Lambda so that it also writes its output to a relational database on RDS, a key-value store like DynamoDB, or an object storage like S3.

If calculating endpoint metrics within the endpoint code is not feasible then simply store them in the storage and process the output in batches later on.

For low-latency applications, adding logic to append the outputs to persistent storage before returning the final predictions is not feasible. In such cases you can log/print the predictions and then process the logs asynchronously. If you are using a log aggregator like Loki, then you can calculate the endpoint statistics after the logs are indexed.

To decouple endpoint metric calculation from your main inference logic, you can use a data stream. Inference code will log the outputs. Another service will index the logs and add them to a data stream. You process the logs in the stream or deliver the logs to a persistent storage and process them in batches.
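The logging side of this design can be sketched in a few lines: the inference code emits one JSON line per request, and a separate job aggregates those lines later. The field names, the logger name and the nearest-rank p95 calculation are my assumptions; the log shipping and the stream itself are out of scope here.

```python
import json
import logging
import math
import statistics

logger = logging.getLogger("inference")  # hypothetical logger name


def log_inference(request_id: str, latency_ms: float, num_output_tokens: int) -> None:
    """Emit one JSON line per request; a log shipper can forward it to a stream."""
    logger.info(json.dumps({
        "request_id": request_id,
        "latency_ms": latency_ms,
        "num_output_tokens": num_output_tokens,
    }))


def summarize(log_lines) -> dict:
    """Aggregate latency statistics, skipping lines that are not valid records."""
    latencies = []
    for line in log_lines:
        try:
            latencies.append(float(json.loads(line)["latency_ms"]))
        except (ValueError, KeyError, TypeError):
            continue  # not a JSON record produced by log_inference
    if not latencies:
        return {"count": 0}
    latencies.sort()
    p95_index = max(0, math.ceil(0.95 * len(latencies)) - 1)  # nearest-rank p95
    return {
        "count": len(latencies),
        "mean_ms": statistics.fmean(latencies),
        "p95_ms": latencies[p95_index],
    }
```

Because summarize only needs an iterable of lines, the same function works on a batch read from storage or on a window of records taken from a stream consumer.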

Apache Kafka or AWS Kinesis Data Streams can be used for data streams. Apache Flink is my favourite stream processing tool. If you use AWS and Lambda for inference, you can stream CloudWatch logs to Kinesis in a few clicks. Once you have the logs in Kinesis, you can either use a stream processor like Flink or add a stream consumer and calculate the endpoint metrics.

Cloud provider endpoints

Major cloud providers like Google Cloud, AWS and Azure provide a pre-defined set of endpoint metrics out-of-the-box. The metrics include latency, model initialisation time, 4XX errors, 5XX errors, invocations per instance, CPU utilisation, memory usage, disk usage and other general metrics. These are all good operational metrics and are used for activities like auto-scaling your endpoint and health determination (i.e. HA and scalability).

The cloud providers also give an option to store the input and output data of the endpoint to persistent storage like S3.

SageMaker endpoint configuration Data Capture option

Utilise this option if you can process the logs in batches and don’t require hot data on metrics. You can also add triggers, so that when new data arrives in S3 it is processed immediately to calculate your endpoint metrics. I recommend estimating the cost of this whole pipeline before committing to this approach.
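If you create the endpoint configuration from code instead of the console, data capture is one extra block in the request. The sketch below builds that block in the shape boto3 expects and passes it to create_endpoint_config; the helper names, the variant name and the bucket in the usage comment are placeholders of mine.

```python
def build_data_capture_config(s3_uri: str, sampling_percentage: int = 100) -> dict:
    """Data capture settings: store request and response payloads in S3."""
    return {
        "EnableCapture": True,
        "InitialSamplingPercentage": sampling_percentage,
        "DestinationS3Uri": s3_uri,
        "CaptureOptions": [{"CaptureMode": "Input"}, {"CaptureMode": "Output"}],
    }


def create_endpoint_config_with_capture(sm_client, config_name, model_name,
                                        instance_type, s3_uri,
                                        sampling_percentage=100):
    """Create an endpoint configuration with data capture enabled."""
    return sm_client.create_endpoint_config(
        EndpointConfigName=config_name,
        ProductionVariants=[{
            "VariantName": "AllTraffic",  # placeholder variant name
            "ModelName": model_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
        }],
        DataCaptureConfig=build_data_capture_config(s3_uri, sampling_percentage),
    )


# Example (needs boto3 and AWS credentials; all names are placeholders):
# import boto3
# sm = boto3.client("sagemaker")
# create_endpoint_config_with_capture(
#     sm, "llm-config", "my-llm-model", "ml.g5.2xlarge", "s3://my-bucket/capture")
```

Lowering the sampling percentage is the simplest lever for controlling the storage and processing cost of captured data.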

6. Productionize LLM based projects

  • An Intelligent QA chatbot powered by Llama 2 Chat
  • LLM based recommendation system chatbot
  • Customer support chatbot using agents

An Intelligent QA chatbot powered by Llama 2 Chat

In this article I build a question answering chatbot using the Llama 2 chat model. I adopted a RAG architecture and utilised the FAISS approximate nearest neighbour (ANN) library for retrieval. For the dataset, I copied a few pages from the Jira cloud resources page to create a knowledge base corpus.

Though I do not delve into production details, the article mentions a few high-level architectures to give readers an understanding of production architectures.

LLM based recommendation system chatbot

I enjoy recommendation system problems. At the time of writing this article, whatever I could find that combined LLMs, recommendation systems and chatbots was about predicting the next most likely liked item or re-ordering. This article provides a walk-through of a session based apparel recommendation engine with contextual understanding.

The benefit of using an LLM is being able to recommend items when users ask questions like “What should I wear for Thanksgiving?”, “What should I wear for my first office party?” and so on. The items are recommended from a corpus of apparel products taken from the internet. I utilised 2 LLMs: the first converts the user query to apparel items, for example “winter vacation” is converted to “jacket, boots, sweater, …”. The second LLM understands the user’s chat questions and answers them based on the given context of items.

A RAG architecture is adopted in this solution. The internet asked very interesting questions to the chatbot when I had the demo open to the public. I have explained the deployment architecture in the same article. I used AWS services like Elastic Container Service, MongoDB, Lambda, CloudFront, LoadBalancer, EFS and Aurora Serverless (for vector storage and retrieval via pgvector).

Customer support chatbot using agents

This is an upcoming project. I am yet to finish it. I shall inform you and publish it here as soon as it is ready.

7. Upcoming

Prompt compression, like model compression, has shown some promising results in reducing prompt cost and improving speed. This technique involves removing unimportant tokens from prompts using a well-trained small language model.

GPT-5 is supposed to be a massive upgrade from GPT-4, similar to the jump we saw from 3 to 4. In the words of Sam Altman: “If you overlook the pace of improvement, you’ll be ‘steamrolled’ …”. Whatever it might turn out to be, you can build your LLM based app pipeline so that you can test and switch models easily, like two-way door decisions.

Personal assistants powered by LLMs: NVIDIA GR00T is one example of phones and robots powered by LLMs. There will be many more coming in the future.

LLMOps: for people pretraining (training from scratch), fine-tuning or even just using open-source models for inference in their apps, LLMOps will continue to improve across the industry. New benchmarks will pop up, new tools will gain stars, a few repositories will be archived, and the gap between LLMOps, MLOps and DevOps will reduce even further.

AI software engineers like Devin and Devika will continue to evolve. We will see agents performing more actions and getting close to humans in monotonous tasks.


  • Don’t marry 1 vendor.
  • Don’t trust benchmarks too much. If the data is public then it may have leaked into the training dataset.
  • Agents are worth exploring and putting effort into.
  • Investing time in evaluation (and automating it) is also worth it if you want to continue exploring newer models.
  • LLMOps is not easy. Running LLMs in production is not easy at this moment in time, but it is like any other ops project.
  • Only the official documentation should be considered the holy grail, and nothing else.

You can connect with me on LinkedIn: