langchain/docs/docs/how_to/document_loader_csv.ipynb
2024-08-23 10:01:16 -07:00


{
"cells": [
{
"cell_type": "markdown",
"id": "dfc274c4-0c24-4c5f-865a-ee7fcdaafdac",
"metadata": {},
"source": [
"# How to load CSVs\n",
"\n",
"A [comma-separated values (CSV)](https://en.wikipedia.org/wiki/Comma-separated_values) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas.\n",
"\n",
"LangChain implements a [CSV Loader](https://python.langchain.com/v0.2/api_reference/community/document_loaders/langchain_community.document_loaders.csv_loader.CSVLoader.html) that will load CSV files into a sequence of [Document](https://python.langchain.com/v0.2/api_reference/core/documents/langchain_core.documents.base.Document.html#langchain_core.documents.base.Document) objects. Each row of the CSV file is translated to one document."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "64a25376-c31a-422e-845b-6538dcc68898",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"page_content='Team: Nationals\\n\"Payroll (millions)\": 81.34\\n\"Wins\": 98' metadata={'source': '../../../docs/integrations/document_loaders/example_data/mlb_teams_2012.csv', 'row': 0}\n",
"page_content='Team: Reds\\n\"Payroll (millions)\": 82.20\\n\"Wins\": 97' metadata={'source': '../../../docs/integrations/document_loaders/example_data/mlb_teams_2012.csv', 'row': 1}\n"
]
}
],
"source": [
"from langchain_community.document_loaders.csv_loader import CSVLoader\n",
"\n",
"file_path = (\n",
" \"../../../docs/integrations/document_loaders/example_data/mlb_teams_2012.csv\"\n",
")\n",
"\n",
"loader = CSVLoader(file_path=file_path)\n",
"data = loader.load()\n",
"\n",
"for record in data[:2]:\n",
" print(record)"
]
},
{
"cell_type": "markdown",
"id": "1c716f76-364d-4515-ada9-0ae7c75e61b2",
"metadata": {},
"source": [
"## Customizing the CSV parsing and loading\n",
"\n",
"`CSVLoader` will accept a `csv_args` kwarg that supports customization of arguments passed to Python's `csv.DictReader`. See the [csv module](https://docs.python.org/3/library/csv.html) documentation for more information of what csv args are supported."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "bf07fdee-d3a6-49c3-a517-bcba6819e8ea",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"page_content='MLB Team: Team\\nPayroll in millions: \"Payroll (millions)\"\\nWins: \"Wins\"' metadata={'source': '../../../docs/integrations/document_loaders/example_data/mlb_teams_2012.csv', 'row': 0}\n",
"page_content='MLB Team: Nationals\\nPayroll in millions: 81.34\\nWins: 98' metadata={'source': '../../../docs/integrations/document_loaders/example_data/mlb_teams_2012.csv', 'row': 1}\n"
]
}
],
"source": [
"loader = CSVLoader(\n",
" file_path=file_path,\n",
" csv_args={\n",
" \"delimiter\": \",\",\n",
" \"quotechar\": '\"',\n",
" \"fieldnames\": [\"MLB Team\", \"Payroll in millions\", \"Wins\"],\n",
" },\n",
")\n",
"\n",
"data = loader.load()\n",
"for record in data[:2]:\n",
" print(record)"
]
},
{
"cell_type": "markdown",
"id": "433536be-1531-43ae-920a-14fe4deef844",
"metadata": {},
"source": [
"## Specify a column to identify the document source\n",
"\n",
"The `\"source\"` key on [Document](https://python.langchain.com/v0.2/api_reference/core/documents/langchain_core.documents.base.Document.html#langchain_core.documents.base.Document) metadata can be set using a column of the CSV. Use the `source_column` argument to specify a source for the document created from each row. Otherwise `file_path` will be used as the source for all documents created from the CSV file.\n",
"\n",
"This is useful when using documents loaded from CSV files for chains that answer questions using sources."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "d927392c-95e6-4a82-86c2-978387ebe91a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"page_content='Team: Nationals\\n\"Payroll (millions)\": 81.34\\n\"Wins\": 98' metadata={'source': 'Nationals', 'row': 0}\n",
"page_content='Team: Reds\\n\"Payroll (millions)\": 82.20\\n\"Wins\": 97' metadata={'source': 'Reds', 'row': 1}\n"
]
}
],
"source": [
"loader = CSVLoader(file_path=file_path, source_column=\"Team\")\n",
"\n",
"data = loader.load()\n",
"for record in data[:2]:\n",
" print(record)"
]
},
{
"cell_type": "markdown",
"id": "cab6a4bd-476b-4f4c-92e0-5d1cbcd1f6bf",
"metadata": {},
"source": [
"## Load from a string\n",
"\n",
"Python's `tempfile` can be used when working with CSV strings directly."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "f3fb28b7-8ebe-4af9-9b7d-719e9a252a46",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"page_content='Team: Nationals\\n\"Payroll (millions)\": 81.34\\n\"Wins\": 98' metadata={'source': 'Nationals', 'row': 0}\n",
"page_content='Team: Reds\\n\"Payroll (millions)\": 82.20\\n\"Wins\": 97' metadata={'source': 'Reds', 'row': 1}\n"
]
}
],
"source": [
"import tempfile\n",
"from io import StringIO\n",
"\n",
"string_data = \"\"\"\n",
"\"Team\", \"Payroll (millions)\", \"Wins\"\n",
"\"Nationals\", 81.34, 98\n",
"\"Reds\", 82.20, 97\n",
"\"Yankees\", 197.96, 95\n",
"\"Giants\", 117.62, 94\n",
"\"\"\".strip()\n",
"\n",
"\n",
"with tempfile.NamedTemporaryFile(delete=False, mode=\"w+\") as temp_file:\n",
" temp_file.write(string_data)\n",
" temp_file_path = temp_file.name\n",
"\n",
"loader = CSVLoader(file_path=temp_file_path)\n",
"loader.load()\n",
"for record in data[:2]:\n",
" print(record)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}