Added support for a Pandas DataFrame OutputParser (#13257)

**Description:** Added support for a Pandas DataFrame OutputParser with format instructions, along with unit tests and a demo notebook. Namely, we've added the ability to request data from a DataFrame, have the LLM parse the request, and then use that request to retrieve a well-formatted response.

Within LangChain, it seamlessly integrates with language models like OpenAI's `text-davinci-003`, facilitating streamlined interaction using the format instructions (just like the other output parsers).

This parser structures its requests as `<operation/column/row>[<optional_array_params>]`. The instructions detail permissible operations, valid columns, and array formats, ensuring clarity and adherence to the required format.

For example:
- When the LLM receives the input: "Retrieve the mean of `num_legs` from rows 1 to 3."
- The provided format instructions guide the LLM to structure the request as: "mean:num_legs[1..3]".

The parser processes this formatted request, leveraging the LLM's understanding to extract the mean of `num_legs` from rows 1 to 3 within the Pandas DataFrame.

This integration allows users to communicate requests naturally, with the LLM transforming these instructions into structured commands understood by the `PandasDataFrameOutputParser`. The format instructions act as a bridge between natural language queries and precise DataFrame operations, optimizing communication and data retrieval.

**Issue:**
- https://github.com/langchain-ai/langchain/issues/11532

**Dependencies:** No additional dependencies :)

**Tag maintainer:** @baskaryan

**Twitter handle:** No need. :)

---------

Co-authored-by: Wasee Alam <waseealam@protonmail.com>
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
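
To make the request format from the description concrete, here is a minimal standalone sketch of how a string such as `mean:num_legs[1..3]` could be split into its parts. It is not the `PandasDataFrameOutputParser` implementation: `parse_request`, `ParsedRequest`, and the regular expression are hypothetical names introduced for illustration, only the `start..end` range form shown above is handled, and a plain list stands in for the `num_legs` column. The range is assumed to be inclusive, which agrees with the mean of 4.0 recorded in the demo notebook.

```python
import re
from dataclasses import dataclass
from statistics import fmean
from typing import Optional, Tuple

# Hypothetical pattern for "<operation/column/row>:<target>[<start>..<end>]".
# Only the range form shown in the commit message is handled here.
_REQUEST_RE = re.compile(r"(\w+):(\w+)(?:\[(\d+)\.\.(\d+)\])?")


@dataclass(frozen=True)
class ParsedRequest:
    kind: str  # "column", "row", or an operation name such as "mean"
    target: str  # column name, or a row index kept as a string
    rows: Optional[Tuple[int, int]]  # (start, end) range, or None if absent


def parse_request(request: str) -> ParsedRequest:
    """Split a request string into its kind, target, and optional row range."""
    match = _REQUEST_RE.fullmatch(request.strip())
    if match is None:
        raise ValueError(f"Malformed request: {request!r}")
    kind, target, start, end = match.groups()
    rows = (int(start), int(end)) if start is not None else None
    return ParsedRequest(kind, target, rows)


# Plain-Python stand-in for the num_legs column from the demo notebook.
num_legs = [2, 4, 8, 0]

parsed = parse_request("mean:num_legs[1..3]")
start, end = parsed.rows
# Assumption: the range is inclusive, so rows 1, 2, and 3 are averaged.
print(parsed.kind, fmean(num_legs[start : end + 1]))
```

Keeping the parsing step separate from the DataFrame lookup makes it easy to reject malformed LLM output before any data is touched, which is the role the format instructions and the parser play together in this change.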
2025-09-05 04:55:14 +00:00 · 2023-11-29 22:08:50 -05:00
parent 235bdb9fa7
commit 41a4c06a94
6 changed files with 521 additions and 0 deletions
--- a/docs/docs/modules/model_io/output_parsers/pandas_dataframe.ipynb
+++ b/docs/docs/modules/model_io/output_parsers/pandas_dataframe.ipynb
@@ -0,0 +1,229 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Pandas DataFrame Parser\n",
+    "\n",
+    "A Pandas DataFrame is a popular data structure in the Python programming language, commonly used for data manipulation and analysis. It provides a comprehensive set of tools for working with structured data, making it a versatile option for tasks such as data cleaning, transformation, and analysis.\n",
+    "\n",
+    "This output parser allows users to specify an arbitrary Pandas DataFrame and query LLMs for data from it. The parser returns the result as a formatted dictionary extracted from the corresponding DataFrame. Keep in mind that large language models are leaky abstractions! You'll have to use an LLM with sufficient capacity to generate a well-formed query that follows the defined format instructions.\n",
+    "\n",
+    "Use Pandas' DataFrame object to declare the DataFrame you wish to perform queries on."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pprint\n",
+    "from typing import Any, Dict\n",
+    "\n",
+    "import pandas as pd\n",
+    "from langchain.llms import OpenAI\n",
+    "from langchain.output_parsers import PandasDataFrameOutputParser\n",
+    "from langchain.prompts import PromptTemplate"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "model_name = \"text-davinci-003\"\n",
+    "temperature = 0.5\n",
+    "model = OpenAI(model_name=model_name, temperature=temperature)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Solely for documentation purposes.\n",
+    "def format_parser_output(parser_output: Dict[str, Any]) -> None:\n",
+    "    # Convert each pandas object to a plain dict without mutating the input.\n",
+    "    formatted = {key: value.to_dict() for key, value in parser_output.items()}\n",
+    "    pprint.PrettyPrinter(width=4, compact=True).pprint(formatted)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Define your desired Pandas DataFrame.\n",
+    "df = pd.DataFrame(\n",
+    "    {\n",
+    "        \"num_legs\": [2, 4, 8, 0],\n",
+    "        \"num_wings\": [2, 0, 0, 0],\n",
+    "        \"num_specimen_seen\": [10, 2, 1, 8],\n",
+    "    }\n",
+    ")\n",
+    "\n",
+    "# Set up a parser. Its format instructions are injected into the prompt templates below.\n",
+    "parser = PandasDataFrameOutputParser(dataframe=df)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "LLM Output: column:num_wings\n",
+      "{'num_wings': {0: 2,\n",
+      "               1: 0,\n",
+      "               2: 0,\n",
+      "               3: 0}}\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Here's an example of a column operation being performed.\n",
+    "df_query = \"Retrieve the num_wings column.\"\n",
+    "\n",
+    "# Set up the prompt.\n",
+    "prompt = PromptTemplate(\n",
+    "    template=\"Answer the user query.\\n{format_instructions}\\n{query}\\n\",\n",
+    "    input_variables=[\"query\"],\n",
+    "    partial_variables={\"format_instructions\": parser.get_format_instructions()},\n",
+    ")\n",
+    "\n",
+    "_input = prompt.format_prompt(query=df_query)\n",
+    "output = model(_input.to_string())\n",
+    "print(\"LLM Output:\", output)\n",
+    "parser_output = parser.parse(output)\n",
+    "\n",
+    "format_parser_output(parser_output)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "LLM Output: row:1\n",
+      "{'1': {'num_legs': 4,\n",
+      "       'num_specimen_seen': 2,\n",
+      "       'num_wings': 0}}\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Here's an example of a row operation. Rows are zero-indexed, so the recorded `row:1` selects the row at index 1.\n",
+    "df_query = \"Retrieve the first row.\"\n",
+    "\n",
+    "# Set up the prompt.\n",
+    "prompt = PromptTemplate(\n",
+    "    template=\"Answer the user query.\\n{format_instructions}\\n{query}\\n\",\n",
+    "    input_variables=[\"query\"],\n",
+    "    partial_variables={\"format_instructions\": parser.get_format_instructions()},\n",
+    ")\n",
+    "\n",
+    "_input = prompt.format_prompt(query=df_query)\n",
+    "output = model(_input.to_string())\n",
+    "print(\"LLM Output:\", output)\n",
+    "parser_output = parser.parse(output)\n",
+    "\n",
+    "format_parser_output(parser_output)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "LLM Output: mean:num_legs[1..3]\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "{'mean': 4.0}"
+      ]
+     },
+     "execution_count": 6,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "# Here's an example of an arbitrary Pandas DataFrame operation restricted to a range of rows.\n",
+    "df_query = \"Retrieve the average of the num_legs column from rows 1 to 3.\"\n",
+    "\n",
+    "# Set up the prompt.\n",
+    "prompt = PromptTemplate(\n",
+    "    template=\"Answer the user query.\\n{format_instructions}\\n{query}\\n\",\n",
+    "    input_variables=[\"query\"],\n",
+    "    partial_variables={\"format_instructions\": parser.get_format_instructions()},\n",
+    ")\n",
+    "\n",
+    "_input = prompt.format_prompt(query=df_query)\n",
+    "output = model(_input.to_string())\n",
+    "print(\"LLM Output:\", output)\n",
+    "parser.parse(output)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Here's an example of an invalid query that references a nonexistent column.\n",
+    "df_query = \"Retrieve the mean of the num_fingers column.\"\n",
+    "\n",
+    "# Set up the prompt.\n",
+    "prompt = PromptTemplate(\n",
+    "    template=\"Answer the user query.\\n{format_instructions}\\n{query}\\n\",\n",
+    "    input_variables=[\"query\"],\n",
+    "    partial_variables={\"format_instructions\": parser.get_format_instructions()},\n",
+    ")\n",
+    "\n",
+    "_input = prompt.format_prompt(query=df_query)\n",
+    "output = model(_input.to_string())  # Expected Output: \"Invalid column: num_fingers\".\n",
+    "print(\"LLM Output:\", output)\n",
+    "parser.parse(output)  # Expected: raises an OutputParserException."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "venv",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.2"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}