langchain/docs/docs/how_to/query_high_cardinality.ipynb

749 lines
20 KiB
Plaintext

{
"cells": [
{
"cell_type": "raw",
"id": "df7d42b9-58a6-434c-a2d7-0b61142f6d3e",
"metadata": {},
"source": [
"---\n",
"sidebar_position: 7\n",
"---"
]
},
{
"cell_type": "markdown",
"id": "f2195672-0cab-4967-ba8a-c6544635547d",
"metadata": {},
"source": [
"# How deal with high cardinality categoricals when doing query analysis\n",
"\n",
"You may want to do query analysis to create a filter on a categorical column. One of the difficulties here is that you usually need to specify the EXACT categorical value. The issue is you need to make sure the LLM generates that categorical value exactly. This can be done relatively easy with prompting when there are only a few values that are valid. When there are a high number of valid values then it becomes more difficult, as those values may not fit in the LLM context, or (if they do) there may be too many for the LLM to properly attend to.\n",
"\n",
"In this notebook we take a look at how to approach this."
]
},
{
"cell_type": "markdown",
"id": "a4079b57-4369-49c9-b2ad-c809b5408d7e",
"metadata": {},
"source": [
"## Setup\n",
"#### Install dependencies"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e168ef5c-e54e-49a6-8552-5502854a6f01",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Note: you may need to restart the kernel to use updated packages.\n"
]
}
],
"source": [
"%pip install -qU langchain langchain-community langchain-openai faker langchain-chroma"
]
},
{
"cell_type": "markdown",
"id": "79d66a45-a05c-4d22-b011-b1cdbdfc8f9c",
"metadata": {},
"source": [
"#### Set environment variables\n",
"\n",
"We'll use OpenAI in this example:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "40e2979e-a818-4b96-ac25-039336f94319",
"metadata": {
"execution": {
"iopub.execute_input": "2024-09-11T02:34:54.036110Z",
"iopub.status.busy": "2024-09-11T02:34:54.035829Z",
"iopub.status.idle": "2024-09-11T02:34:54.038746Z",
"shell.execute_reply": "2024-09-11T02:34:54.038430Z"
}
},
"outputs": [],
"source": [
"import getpass\n",
"import os\n",
"\n",
"if \"OPENAI_API_KEY\" not in os.environ:\n",
" os.environ[\"OPENAI_API_KEY\"] = getpass.getpass()\n",
"\n",
"# Optional, uncomment to trace runs with LangSmith. Sign up here: https://smith.langchain.com.\n",
"# os.environ[\"LANGSMITH_TRACING\"] = \"true\"\n",
"# os.environ[\"LANGSMITH_API_KEY\"] = getpass.getpass()"
]
},
{
"cell_type": "markdown",
"id": "d8d47f4b",
"metadata": {},
"source": [
"#### Set up data\n",
"\n",
"We will generate a bunch of fake names"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "e5ba65c2",
"metadata": {
"execution": {
"iopub.execute_input": "2024-09-11T02:34:54.040738Z",
"iopub.status.busy": "2024-09-11T02:34:54.040515Z",
"iopub.status.idle": "2024-09-11T02:34:54.622643Z",
"shell.execute_reply": "2024-09-11T02:34:54.622382Z"
}
},
"outputs": [],
"source": [
"from faker import Faker\n",
"\n",
"fake = Faker()\n",
"\n",
"names = [fake.name() for _ in range(10000)]"
]
},
{
"cell_type": "markdown",
"id": "41133694",
"metadata": {},
"source": [
"Let's look at some of the names"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "c901ea97",
"metadata": {
"execution": {
"iopub.execute_input": "2024-09-11T02:34:54.624195Z",
"iopub.status.busy": "2024-09-11T02:34:54.624106Z",
"iopub.status.idle": "2024-09-11T02:34:54.627231Z",
"shell.execute_reply": "2024-09-11T02:34:54.626971Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'Jacob Adams'"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"names[0]"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "b0d42ae2",
"metadata": {
"execution": {
"iopub.execute_input": "2024-09-11T02:34:54.628545Z",
"iopub.status.busy": "2024-09-11T02:34:54.628460Z",
"iopub.status.idle": "2024-09-11T02:34:54.630474Z",
"shell.execute_reply": "2024-09-11T02:34:54.630282Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'Eric Acevedo'"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"names[567]"
]
},
{
"cell_type": "markdown",
"id": "1725883d",
"metadata": {},
"source": [
"## Query Analysis\n",
"\n",
"We can now set up a baseline query analysis"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "0ae69afc",
"metadata": {
"execution": {
"iopub.execute_input": "2024-09-11T02:34:54.631758Z",
"iopub.status.busy": "2024-09-11T02:34:54.631678Z",
"iopub.status.idle": "2024-09-11T02:34:54.666448Z",
"shell.execute_reply": "2024-09-11T02:34:54.666216Z"
}
},
"outputs": [],
"source": [
"from pydantic import BaseModel, Field, model_validator"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "6c9485ce",
"metadata": {
"execution": {
"iopub.execute_input": "2024-09-11T02:34:54.667852Z",
"iopub.status.busy": "2024-09-11T02:34:54.667733Z",
"iopub.status.idle": "2024-09-11T02:34:54.700224Z",
"shell.execute_reply": "2024-09-11T02:34:54.700004Z"
}
},
"outputs": [],
"source": [
"class Search(BaseModel):\n",
" query: str\n",
" author: str"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "aebd704a",
"metadata": {
"execution": {
"iopub.execute_input": "2024-09-11T02:34:54.701556Z",
"iopub.status.busy": "2024-09-11T02:34:54.701465Z",
"iopub.status.idle": "2024-09-11T02:34:55.179986Z",
"shell.execute_reply": "2024-09-11T02:34:55.179640Z"
}
},
"outputs": [],
"source": [
"from langchain_core.prompts import ChatPromptTemplate\n",
"from langchain_core.runnables import RunnablePassthrough\n",
"from langchain_openai import ChatOpenAI\n",
"\n",
"system = \"\"\"Generate a relevant search query for a library system\"\"\"\n",
"prompt = ChatPromptTemplate.from_messages(\n",
" [\n",
" (\"system\", system),\n",
" (\"human\", \"{question}\"),\n",
" ]\n",
")\n",
"llm = ChatOpenAI(model=\"gpt-4o-mini\", temperature=0)\n",
"structured_llm = llm.with_structured_output(Search)\n",
"query_analyzer = {\"question\": RunnablePassthrough()} | prompt | structured_llm"
]
},
{
"cell_type": "markdown",
"id": "41709a2e",
"metadata": {},
"source": [
"We can see that if we spell the name exactly correctly, it knows how to handle it"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "cc0d344b",
"metadata": {
"execution": {
"iopub.execute_input": "2024-09-11T02:34:55.181603Z",
"iopub.status.busy": "2024-09-11T02:34:55.181500Z",
"iopub.status.idle": "2024-09-11T02:34:55.778884Z",
"shell.execute_reply": "2024-09-11T02:34:55.778324Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Search(query='aliens', author='Jesse Knight')"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query_analyzer.invoke(\"what are books about aliens by Jesse Knight\")"
]
},
{
"cell_type": "markdown",
"id": "a1b57eab",
"metadata": {},
"source": [
"The issue is that the values you want to filter on may NOT be spelled exactly correctly"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "82b6b2ad",
"metadata": {
"execution": {
"iopub.execute_input": "2024-09-11T02:34:55.784266Z",
"iopub.status.busy": "2024-09-11T02:34:55.782603Z",
"iopub.status.idle": "2024-09-11T02:34:56.206779Z",
"shell.execute_reply": "2024-09-11T02:34:56.206068Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Search(query='aliens', author='Jess Knight')"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query_analyzer.invoke(\"what are books about aliens by jess knight\")"
]
},
{
"cell_type": "markdown",
"id": "0b60b7c2",
"metadata": {},
"source": [
"### Add in all values\n",
"\n",
"One way around this is to add ALL possible values to the prompt. That will generally guide the query in the right direction"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "98788a94",
"metadata": {
"execution": {
"iopub.execute_input": "2024-09-11T02:34:56.210043Z",
"iopub.status.busy": "2024-09-11T02:34:56.209657Z",
"iopub.status.idle": "2024-09-11T02:34:56.213962Z",
"shell.execute_reply": "2024-09-11T02:34:56.213413Z"
}
},
"outputs": [],
"source": [
"system = \"\"\"Generate a relevant search query for a library system.\n",
"\n",
"`author` attribute MUST be one of:\n",
"\n",
"{authors}\n",
"\n",
"Do NOT hallucinate author name!\"\"\"\n",
"base_prompt = ChatPromptTemplate.from_messages(\n",
" [\n",
" (\"system\", system),\n",
" (\"human\", \"{question}\"),\n",
" ]\n",
")\n",
"prompt = base_prompt.partial(authors=\", \".join(names))"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "e65412f5",
"metadata": {
"execution": {
"iopub.execute_input": "2024-09-11T02:34:56.216144Z",
"iopub.status.busy": "2024-09-11T02:34:56.216005Z",
"iopub.status.idle": "2024-09-11T02:34:56.218754Z",
"shell.execute_reply": "2024-09-11T02:34:56.218416Z"
}
},
"outputs": [],
"source": [
"query_analyzer_all = {\"question\": RunnablePassthrough()} | prompt | structured_llm"
]
},
{
"cell_type": "markdown",
"id": "e639285a",
"metadata": {},
"source": [
"However... if the list of categoricals is long enough, it may error!"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "696b000f",
"metadata": {
"execution": {
"iopub.execute_input": "2024-09-11T02:34:56.220827Z",
"iopub.status.busy": "2024-09-11T02:34:56.220680Z",
"iopub.status.idle": "2024-09-11T02:34:58.846872Z",
"shell.execute_reply": "2024-09-11T02:34:58.846273Z"
}
},
"outputs": [],
"source": [
"try:\n",
" res = query_analyzer_all.invoke(\"what are books about aliens by jess knight\")\n",
"except Exception as e:\n",
" print(e)"
]
},
{
"cell_type": "markdown",
"id": "1d5d7891",
"metadata": {},
"source": [
"We can try to use a longer context window... but with so much information in there, it is not garunteed to pick it up reliably"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "0f0d0757",
"metadata": {
"execution": {
"iopub.execute_input": "2024-09-11T02:34:58.850318Z",
"iopub.status.busy": "2024-09-11T02:34:58.850100Z",
"iopub.status.idle": "2024-09-11T02:34:58.873883Z",
"shell.execute_reply": "2024-09-11T02:34:58.873525Z"
}
},
"outputs": [],
"source": [
"llm_long = ChatOpenAI(model=\"gpt-4-turbo-preview\", temperature=0)\n",
"structured_llm_long = llm_long.with_structured_output(Search)\n",
"query_analyzer_all = {\"question\": RunnablePassthrough()} | prompt | structured_llm_long"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "03e5b7b2",
"metadata": {
"execution": {
"iopub.execute_input": "2024-09-11T02:34:58.875940Z",
"iopub.status.busy": "2024-09-11T02:34:58.875811Z",
"iopub.status.idle": "2024-09-11T02:35:02.947273Z",
"shell.execute_reply": "2024-09-11T02:35:02.946220Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Search(query='aliens', author='jess knight')"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query_analyzer_all.invoke(\"what are books about aliens by jess knight\")"
]
},
{
"cell_type": "markdown",
"id": "73ecf52b",
"metadata": {},
"source": [
"### Find and all relevant values\n",
"\n",
"Instead, what we can do is create an index over the relevant values and then query that for the N most relevant values,"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "32b19e07",
"metadata": {
"execution": {
"iopub.execute_input": "2024-09-11T02:35:02.951939Z",
"iopub.status.busy": "2024-09-11T02:35:02.951583Z",
"iopub.status.idle": "2024-09-11T02:35:41.777839Z",
"shell.execute_reply": "2024-09-11T02:35:41.777392Z"
}
},
"outputs": [],
"source": [
"from langchain_chroma import Chroma\n",
"from langchain_openai import OpenAIEmbeddings\n",
"\n",
"embeddings = OpenAIEmbeddings(model=\"text-embedding-3-small\")\n",
"vectorstore = Chroma.from_texts(names, embeddings, collection_name=\"author_names\")"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "774cb7b0",
"metadata": {
"execution": {
"iopub.execute_input": "2024-09-11T02:35:41.780883Z",
"iopub.status.busy": "2024-09-11T02:35:41.780774Z",
"iopub.status.idle": "2024-09-11T02:35:41.782739Z",
"shell.execute_reply": "2024-09-11T02:35:41.782498Z"
}
},
"outputs": [],
"source": [
"def select_names(question):\n",
" _docs = vectorstore.similarity_search(question, k=10)\n",
" _names = [d.page_content for d in _docs]\n",
" return \", \".join(_names)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "1173159c",
"metadata": {
"execution": {
"iopub.execute_input": "2024-09-11T02:35:41.783992Z",
"iopub.status.busy": "2024-09-11T02:35:41.783913Z",
"iopub.status.idle": "2024-09-11T02:35:41.785911Z",
"shell.execute_reply": "2024-09-11T02:35:41.785632Z"
}
},
"outputs": [],
"source": [
"create_prompt = {\n",
" \"question\": RunnablePassthrough(),\n",
" \"authors\": select_names,\n",
"} | base_prompt"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "0a892607",
"metadata": {
"execution": {
"iopub.execute_input": "2024-09-11T02:35:41.787082Z",
"iopub.status.busy": "2024-09-11T02:35:41.787008Z",
"iopub.status.idle": "2024-09-11T02:35:41.788543Z",
"shell.execute_reply": "2024-09-11T02:35:41.788362Z"
}
},
"outputs": [],
"source": [
"query_analyzer_select = create_prompt | structured_llm"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "8195d7cd",
"metadata": {
"execution": {
"iopub.execute_input": "2024-09-11T02:35:41.789624Z",
"iopub.status.busy": "2024-09-11T02:35:41.789551Z",
"iopub.status.idle": "2024-09-11T02:35:42.099839Z",
"shell.execute_reply": "2024-09-11T02:35:42.099042Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"ChatPromptValue(messages=[SystemMessage(content='Generate a relevant search query for a library system.\\n\\n`author` attribute MUST be one of:\\n\\nJennifer Knight, Jill Knight, John Knight, Dr. Jeffrey Knight, Christopher Knight, Andrea Knight, Brandy Knight, Jennifer Keller, Becky Chambers, Sarah Knapp\\n\\nDo NOT hallucinate author name!'), HumanMessage(content='what are books by jess knight')])"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"create_prompt.invoke(\"what are books by jess knight\")"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "d3228b4e",
"metadata": {
"execution": {
"iopub.execute_input": "2024-09-11T02:35:42.106571Z",
"iopub.status.busy": "2024-09-11T02:35:42.105861Z",
"iopub.status.idle": "2024-09-11T02:35:42.909738Z",
"shell.execute_reply": "2024-09-11T02:35:42.908875Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Search(query='books about aliens', author='Jennifer Knight')"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query_analyzer_select.invoke(\"what are books about aliens by jess knight\")"
]
},
{
"cell_type": "markdown",
"id": "46ef88bb",
"metadata": {},
"source": [
"### Replace after selection\n",
"\n",
"Another method is to let the LLM fill in whatever value, but then convert that value to a valid value.\n",
"This can actually be done with the Pydantic class itself!"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "a2e8b434",
"metadata": {
"execution": {
"iopub.execute_input": "2024-09-11T02:35:42.915376Z",
"iopub.status.busy": "2024-09-11T02:35:42.914923Z",
"iopub.status.idle": "2024-09-11T02:35:42.923958Z",
"shell.execute_reply": "2024-09-11T02:35:42.922391Z"
}
},
"outputs": [],
"source": [
"class Search(BaseModel):\n",
" query: str\n",
" author: str\n",
"\n",
" @model_validator(mode=\"before\")\n",
" @classmethod\n",
" def double(cls, values: dict) -> dict:\n",
" author = values[\"author\"]\n",
" closest_valid_author = vectorstore.similarity_search(author, k=1)[\n",
" 0\n",
" ].page_content\n",
" values[\"author\"] = closest_valid_author\n",
" return values"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "919c0601",
"metadata": {
"execution": {
"iopub.execute_input": "2024-09-11T02:35:42.927718Z",
"iopub.status.busy": "2024-09-11T02:35:42.927428Z",
"iopub.status.idle": "2024-09-11T02:35:42.933784Z",
"shell.execute_reply": "2024-09-11T02:35:42.933344Z"
}
},
"outputs": [],
"source": [
"system = \"\"\"Generate a relevant search query for a library system\"\"\"\n",
"prompt = ChatPromptTemplate.from_messages(\n",
" [\n",
" (\"system\", system),\n",
" (\"human\", \"{question}\"),\n",
" ]\n",
")\n",
"corrective_structure_llm = llm.with_structured_output(Search)\n",
"corrective_query_analyzer = (\n",
" {\"question\": RunnablePassthrough()} | prompt | corrective_structure_llm\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "6c4f3e9a",
"metadata": {
"execution": {
"iopub.execute_input": "2024-09-11T02:35:42.936506Z",
"iopub.status.busy": "2024-09-11T02:35:42.936186Z",
"iopub.status.idle": "2024-09-11T02:35:43.711754Z",
"shell.execute_reply": "2024-09-11T02:35:43.710695Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Search(query='aliens', author='John Knight')"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"corrective_query_analyzer.invoke(\"what are books about aliens by jes knight\")"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "a309cb11",
"metadata": {
"execution": {
"iopub.execute_input": "2024-09-11T02:35:43.717567Z",
"iopub.status.busy": "2024-09-11T02:35:43.717189Z",
"iopub.status.idle": "2024-09-11T02:35:43.722339Z",
"shell.execute_reply": "2024-09-11T02:35:43.720537Z"
}
},
"outputs": [],
"source": [
"# TODO: show trigram similarity"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "poetry-venv-311",
"language": "python",
"name": "poetry-venv-311"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}