Added support for a Pandas DataFrame OutputParser (#13257)

**Description:**

Added support for a Pandas DataFrame OutputParser with format
instructions, along with unit tests and a demo notebook. Namely, we've
added the ability to request data from a DataFrame, have the LLM parse
the request, and then use that request to retrieve a well-formatted
response.

Within LangChain, the parser works with language models such as
OpenAI's `text-davinci-003` through its format instructions, in the
same way as the other output parsers.

This parser structures its requests as
`<operation/column/row>:<target>[<optional_array_params>]`, where
`<target>` is a column name or row index. The format instructions
detail the permissible operations, valid columns, and array formats, so
that the LLM's output adheres to the required format.
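
To make the request format concrete, here is a minimal sketch of how such a string could be split into its parts. It is not the parser's actual implementation. The grammar is inferred from the examples in this description (`mean:num_legs[1..3]`, `column:num_wings`, `row:1`), the comma-separated array form is an assumption, and `ParsedRequest`, `parse_indices`, and `parse_request` are hypothetical names used only for illustration.

```python
import re
from typing import List, NamedTuple, Optional


class ParsedRequest(NamedTuple):
    """Hypothetical container for the pieces of a request string."""

    keyword: str  # e.g. "mean", "column", or "row"
    target: str  # column name or row index, kept as text
    indices: Optional[List[int]]  # expanded array params, or None if absent


# keyword ":" target, optionally followed by "[...]" (assumed grammar).
_REQUEST_RE = re.compile(r"^(\w+):(\w+)(?:\[([^\]]*)\])?$")


def parse_indices(params: str) -> List[int]:
    """Expand "1..3" to [1, 2, 3] (inclusive) or "1,3,5" to [1, 3, 5]."""
    if ".." in params:
        start, end = params.split("..")
        return list(range(int(start), int(end) + 1))
    return [int(item) for item in params.split(",")]


def parse_request(request: str) -> ParsedRequest:
    """Split a request string into keyword, target, and optional indices."""
    match = _REQUEST_RE.match(request.strip())
    if match is None:
        raise ValueError(f"Malformed request: {request!r}")
    keyword, target, params = match.groups()
    indices = parse_indices(params) if params is not None else None
    return ParsedRequest(keyword, target, indices)


print(parse_request("mean:num_legs[1..3]"))
print(parse_request("column:num_wings"))
```

Treating the range as inclusive matches the example in this description, where rows 1 to 3 are all part of the request.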

For example:

- When the LLM receives the input: "Retrieve the mean of `num_legs` from
rows 1 to 3."
- The provided format instructions guide the LLM to structure the
request as: "mean:num_legs[1..3]".

The parser then processes this formatted request: it parses the string
and computes the mean of `num_legs` over rows 1 to 3 of the Pandas
DataFrame.
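
As a rough illustration of what evaluating `mean:num_legs[1..3]` amounts to, the sketch below averages a column over an inclusive row range. It is a dependency-free stand-in, not the parser's code: the data is copied from the demo notebook's DataFrame into a plain dict of lists, and `columns` and `mean_over_rows` are hypothetical names.

```python
from statistics import mean
from typing import Dict, List

# Stand-in for the DataFrame in the demo notebook: column name -> values,
# where the list position plays the role of the row index.
columns: Dict[str, List[int]] = {
    "num_legs": [2, 4, 8, 0],
    "num_wings": [2, 0, 0, 0],
    "num_specimen_seen": [10, 2, 1, 8],
}


def mean_over_rows(column: str, first_row: int, last_row: int) -> float:
    """Average `column` over rows first_row..last_row, both ends included."""
    if column not in columns:
        raise ValueError(f"Invalid column: {column}")
    selected = columns[column][first_row : last_row + 1]
    return float(mean(selected))


# Equivalent of the request "mean:num_legs[1..3]": rows 1, 2, 3 hold 4, 8, 0.
print({"mean": mean_over_rows("num_legs", 1, 3)})  # {'mean': 4.0}
```

With pandas itself, the same selection can be written as `df.loc[1:3, "num_legs"].mean()`, since label-based slicing with `.loc` includes both endpoints. Whether the parser uses that exact call is not stated here.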

This integration allows users to state requests in natural language,
with the LLM transforming them into structured commands understood by
the `PandasDataFrameOutputParser`. The format instructions act as a
bridge between natural language queries and precise DataFrame
operations.

**Issue:**

- https://github.com/langchain-ai/langchain/issues/11532

**Dependencies:**

No additional dependencies :)

**Tag maintainer:**

@baskaryan 

**Twitter handle:**

No need. :)

---------

Co-authored-by: Wasee Alam <waseealam@protonmail.com>
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
Rohan Dey
2023-11-29 22:08:50 -05:00
committed by GitHub
parent 235bdb9fa7
commit 41a4c06a94
6 changed files with 521 additions and 0 deletions

@@ -0,0 +1,229 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Pandas DataFrame Parser\n",
"\n",
"A Pandas DataFrame is a popular data structure in the Python programming language, commonly used for data manipulation and analysis. It provides a comprehensive set of tools for working with structured data, making it a versatile option for tasks such as data cleaning, transformation, and analysis.\n",
"\n",
"This output parser lets users specify an arbitrary Pandas DataFrame and query an LLM for data from it. The parser returns the result as a formatted dictionary extracted from that DataFrame. Keep in mind that large language models are leaky abstractions! You'll have to use an LLM with sufficient capacity to generate a well-formed query as per the defined format instructions.\n",
"\n",
"Use Pandas' DataFrame object to declare the DataFrame you wish to perform queries on."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pprint\n",
"from typing import Any, Dict\n",
"\n",
"import pandas as pd\n",
"from langchain.llms import OpenAI\n",
"from langchain.output_parsers import PandasDataFrameOutputParser\n",
"from langchain.prompts import PromptTemplate"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model_name = \"text-davinci-003\"\n",
"temperature = 0.5\n",
"model = OpenAI(model_name=model_name, temperature=temperature)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Solely for documentation purposes: converts each pandas object in the\n",
"# parser output to a plain dict (in place) and pretty-prints the result.\n",
"def format_parser_output(parser_output: Dict[str, Any]) -> None:\n",
"    for key in parser_output.keys():\n",
"        parser_output[key] = parser_output[key].to_dict()\n",
"    pprint.PrettyPrinter(width=4, compact=True).pprint(parser_output)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# Define your desired Pandas DataFrame.\n",
"df = pd.DataFrame(\n",
"    {\n",
"        \"num_legs\": [2, 4, 8, 0],\n",
"        \"num_wings\": [2, 0, 0, 0],\n",
"        \"num_specimen_seen\": [10, 2, 1, 8],\n",
"    }\n",
")\n",
"\n",
"# Set up a parser + inject instructions into the prompt template.\n",
"parser = PandasDataFrameOutputParser(dataframe=df)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"LLM Output: column:num_wings\n",
"{'num_wings': {0: 2,\n",
"               1: 0,\n",
"               2: 0,\n",
"               3: 0}}\n"
]
}
],
"source": [
"# Here's an example of a column operation being performed.\n",
"df_query = \"Retrieve the num_wings column.\"\n",
"\n",
"# Set up the prompt.\n",
"prompt = PromptTemplate(\n",
"    template=\"Answer the user query.\\n{format_instructions}\\n{query}\\n\",\n",
"    input_variables=[\"query\"],\n",
"    partial_variables={\"format_instructions\": parser.get_format_instructions()},\n",
")\n",
"\n",
"_input = prompt.format_prompt(query=df_query)\n",
"output = model(_input.to_string())\n",
"print(\"LLM Output:\", output)\n",
"parser_output = parser.parse(output)\n",
"\n",
"format_parser_output(parser_output)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"LLM Output: row:1\n",
"{'1': {'num_legs': 4,\n",
"       'num_specimen_seen': 2,\n",
"       'num_wings': 0}}\n"
]
}
],
"source": [
"# Here's an example of a row operation being performed.\n",
"df_query = \"Retrieve the first row.\"\n",
"\n",
"# Set up the prompt.\n",
"prompt = PromptTemplate(\n",
"    template=\"Answer the user query.\\n{format_instructions}\\n{query}\\n\",\n",
"    input_variables=[\"query\"],\n",
"    partial_variables={\"format_instructions\": parser.get_format_instructions()},\n",
")\n",
"\n",
"_input = prompt.format_prompt(query=df_query)\n",
"output = model(_input.to_string())\n",
"print(\"LLM Output:\", output)\n",
"parser_output = parser.parse(output)\n",
"\n",
"format_parser_output(parser_output)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"LLM Output: mean:num_legs[1..3]\n"
]
},
{
"data": {
"text/plain": [
"{'mean': 4.0}"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Here's an example of an arbitrary Pandas DataFrame operation restricted to a range of rows.\n",
"df_query = \"Retrieve the average of the num_legs column from rows 1 to 3.\"\n",
"\n",
"# Set up the prompt.\n",
"prompt = PromptTemplate(\n",
"    template=\"Answer the user query.\\n{format_instructions}\\n{query}\\n\",\n",
"    input_variables=[\"query\"],\n",
"    partial_variables={\"format_instructions\": parser.get_format_instructions()},\n",
")\n",
"\n",
"_input = prompt.format_prompt(query=df_query)\n",
"output = model(_input.to_string())\n",
"print(\"LLM Output:\", output)\n",
"parser.parse(output)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Here's an example of a query that cannot be satisfied: the DataFrame has no num_fingers column.\n",
"df_query = \"Retrieve the mean of the num_fingers column.\"\n",
"\n",
"# Set up the prompt.\n",
"prompt = PromptTemplate(\n",
"    template=\"Answer the user query.\\n{format_instructions}\\n{query}\\n\",\n",
"    input_variables=[\"query\"],\n",
"    partial_variables={\"format_instructions\": parser.get_format_instructions()},\n",
")\n",
"\n",
"_input = prompt.format_prompt(query=df_query)\n",
"output = model(_input.to_string())  # Expected Output: \"Invalid column: num_fingers\".\n",
"print(\"LLM Output:\", output)\n",
"parser.parse(output)  # Expected Output: Will raise an OutputParserException."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}