Image Source: https://github.com/gventuri/pandas-ai

Ask Questions and Get Plots on your Data with PandasAI & ChatGPT

Mahesh
5 min readJul 8, 2023

--

Make dataframes conversational and generate automatic plots based on questions in natural language

In my previous blog post titled “How to Build an Intelligent QA Chatbot on Your Data with LLM or ChatGPT,” I discussed one limitation of a basic QA bot that relies on a Large Language Model (LLM) as its base model. When a user needs to find their meetings for the day, LLMs may encounter difficulties in directly querying the database and retrieving the relevant information. This kind of interaction necessitates a seamless integration between the LLM, database systems, and data processing logic. This integration is an area that requires our attention and focus.

Rather than merely guiding users towards self-help, it is more effective to provide them with actual information they seek.

The PandasAI library represents a significant step in this direction.

PandasAI is a Python library that adds Generative AI capabilities to pandas, the popular data analysis and manipulation tool. It is designed to be used in conjunction with pandas, and is not a replacement for it.

In this article, we will utilize this library to derive insightful analysis from a Kaggle dataset — Data Science Salaries: 2020–2023.

Dataset Structure

All the code is available in this Colab notebook.

First install the pandasai library and plotly with pip:

pip install pandasai plotly==5.14.1 -q

Download the dataset from above kaggle link after login and import the modules:

import pandas as pd

from pandasai import PandasAI
from pandasai.llm.openai import OpenAI

Create a LLM instance:

llm = OpenAI(api_token="USE_ENV_VARIABLE_OR_DOTENV")

pandas_ai = PandasAI(llm)

WARNING: Never put your OpenAI api key in plain text in the notebook. Use env variable or it is best to go with dotenv.

Basic Questions

Now we can start by asking basic questions like “What are the top 5 Company Location with highest Salary?”

Result

It appears that the model is incorporating the salary amount, although I haven’t verified it myself. Nonetheless, the generated result appears to be suitable.

Result

Honestly, at first glance, I am unable to determine the intended significance of the index. Therefore, let’s request the model to generate charts to gain further clarity.

This time, I have added the following to the end of the question:

What are the top 5 Company Location with highest average Salary? Show the results in an appropriate chart.

And the result:

Result with chart

With a different question:

Judging by the prompt, one can tell that it must have taken a few tries to get the plot right. To fully leverage the potential of LLMs using prompts, I highly recommend checking out ChatGPT Prompt Engineering for Developers by Isa Fulford and Andrew Ng.

I also experimented with a similar approach using the following prompt:

So, there are a few issues. Firstly, I’m having difficulty understanding the distribution, and secondly, the ticks on the chart are not readable.

Consequently, I have made the decision to utilize Plotly, a graphing library renowned for its interactive capabilities in creating 2D and 3D charts for statistical, financial, and scientific data.

In the output provided below, the prompt is nearly identical to the one in the previous image, with the addition of the instruction, “Plot the results in an appropriate Plotly chart…”

Plotly Result

This is interactive and therefore easier to understand.

Production setting

In a production environment, unless your data engineering team has already established ETL pipelines to handle every possible ad-hoc query, it might be beneficial to directly query your data warehouse for quick wins. However, it’s important to note that data in the data warehouse is not denormalized. Depending on the architecture (such as Kimball, Inmon, Data Mesh, etc.), you can also pass multiple dataframes to the PandasAI library.

In the example below, we create a new dummy dataset consisting of:

  1. clickstream_df - representing clickstream data
  2. user_metadata - representing user metadata
  3. item_metadata - representing item metadata

This template closely resembles a star schema.

import time
import random

random.seed(24)


clickstream_df = pd.DataFrame(
{
"timestamp": [time.time()]*10,
"user_id": random.sample(range(1, 50), 10),
"action" : ["click","up","down"]*3 + ["click"],
"item_id": random.sample(range(100, 500), 10),
}
)

user_metadata = pd.DataFrame(
{
"timestamp": [time.time()]*10,
"user_id": clickstream_df["user_id"],
"continent" : ["AUSTRALIA","SA","AFRICA"]*3 + ["GREENLAND"],
}
)

item_metadata = pd.DataFrame(
{
"timestamp": [time.time()]*10,
"item_id": clickstream_df["item_id"],
"item_type" : ["advertisement","product","comment"]*3 + ["search"],
}
)
clickstream_df
user_metadata
item_metadata

We start by asking basic questions and passing all three dataframes:

Result

It was able to comprehend the question correctly.

However, in the subsequent question, it was unable to generate any output:

Result

This outcome is not surprising. Even after including a Plotly chart in the prompt message, it still failed to produce any output.

Project Augmentation

Overall, I believe that combining PandasAI with ChatGPT provides a solid starting point for analytics projects involving Large Language Models (LLMs). It’s worth noting that besides ChatGPT, you can also explore other LLMs, all of which are well-documented in the PandasAI documentation.

To add more excitement to the mix, one can consider having a model convert user questions into SQL queries. This would enable direct querying of the data warehouse or data mart, allowing for seamless integration with dataset visualization and plotting.

Research in the NLP to SQL domain has already made significant progress, and I consider it a highly valuable project for companies where analytics is a fundamental part of their product. Faster insights, reduced burden on data engineers for ETL pipelines, and efficient ad-hoc analytics are just a few benefits that make this project worthwhile.

--

--