mirror of
https://github.com/hwchase17/langchain.git
synced 2026-02-21 06:33:41 +00:00
``` https://api\.python\.langchain\.com/en/latest/([^/]*)/langchain_([^.]*)\.(.*)\.html([^"]*) https://python.langchain.com/v0.2/api_reference/$2/$1/langchain_$2.$3.html$4 ``` --------- Co-authored-by: Bagatur <baskaryan@gmail.com>
190 lines
6.5 KiB
Plaintext
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "dfc274c4-0c24-4c5f-865a-ee7fcdaafdac",
   "metadata": {},
   "source": [
    "# How to load CSVs\n",
    "\n",
    "A [comma-separated values (CSV)](https://en.wikipedia.org/wiki/Comma-separated_values) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas.\n",
    "\n",
    "LangChain implements a [CSV Loader](https://python.langchain.com/v0.2/api_reference/community/document_loaders/langchain_community.document_loaders.csv_loader.CSVLoader.html) that will load CSV files into a sequence of [Document](https://python.langchain.com/v0.2/api_reference/core/documents/langchain_core.documents.base.Document.html#langchain_core.documents.base.Document) objects. Each row of the CSV file is translated to one document."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "64a25376-c31a-422e-845b-6538dcc68898",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "page_content='Team: Nationals\\n\"Payroll (millions)\": 81.34\\n\"Wins\": 98' metadata={'source': '../../../docs/integrations/document_loaders/example_data/mlb_teams_2012.csv', 'row': 0}\n",
      "page_content='Team: Reds\\n\"Payroll (millions)\": 82.20\\n\"Wins\": 97' metadata={'source': '../../../docs/integrations/document_loaders/example_data/mlb_teams_2012.csv', 'row': 1}\n"
     ]
    }
   ],
   "source": [
    "from langchain_community.document_loaders.csv_loader import CSVLoader\n",
    "\n",
    "file_path = (\n",
    "    \"../../../docs/integrations/document_loaders/example_data/mlb_teams_2012.csv\"\n",
    ")\n",
    "\n",
    "loader = CSVLoader(file_path=file_path)\n",
    "data = loader.load()\n",
    "\n",
    "for record in data[:2]:\n",
    "    print(record)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1c716f76-364d-4515-ada9-0ae7c75e61b2",
   "metadata": {},
   "source": [
    "## Customizing the CSV parsing and loading\n",
    "\n",
    "`CSVLoader` accepts a `csv_args` keyword argument whose entries are passed through to Python's `csv.DictReader`. See the [csv module](https://docs.python.org/3/library/csv.html) documentation for the arguments it supports.\n",
    "\n",
    "When `fieldnames` is supplied, `csv.DictReader` treats the first line of the file as data rather than as a header, so the original header row is loaded as the first document."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "bf07fdee-d3a6-49c3-a517-bcba6819e8ea",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "page_content='MLB Team: Team\\nPayroll in millions: \"Payroll (millions)\"\\nWins: \"Wins\"' metadata={'source': '../../../docs/integrations/document_loaders/example_data/mlb_teams_2012.csv', 'row': 0}\n",
      "page_content='MLB Team: Nationals\\nPayroll in millions: 81.34\\nWins: 98' metadata={'source': '../../../docs/integrations/document_loaders/example_data/mlb_teams_2012.csv', 'row': 1}\n"
     ]
    }
   ],
   "source": [
    "loader = CSVLoader(\n",
    "    file_path=file_path,\n",
    "    csv_args={\n",
    "        \"delimiter\": \",\",\n",
    "        \"quotechar\": '\"',\n",
    "        \"fieldnames\": [\"MLB Team\", \"Payroll in millions\", \"Wins\"],\n",
    "    },\n",
    ")\n",
    "\n",
    "data = loader.load()\n",
    "for record in data[:2]:\n",
    "    print(record)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "433536be-1531-43ae-920a-14fe4deef844",
   "metadata": {},
   "source": [
    "## Specify a column to identify the document source\n",
    "\n",
    "The `\"source\"` key on [Document](https://python.langchain.com/v0.2/api_reference/core/documents/langchain_core.documents.base.Document.html#langchain_core.documents.base.Document) metadata can be set from a column of the CSV. Use the `source_column` argument to name the column whose value becomes the source of the document created from each row. Otherwise, `file_path` is used as the source for all documents created from the CSV file.\n",
    "\n",
    "This is useful when documents loaded from CSV files feed chains that answer questions using sources."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "d927392c-95e6-4a82-86c2-978387ebe91a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "page_content='Team: Nationals\\n\"Payroll (millions)\": 81.34\\n\"Wins\": 98' metadata={'source': 'Nationals', 'row': 0}\n",
      "page_content='Team: Reds\\n\"Payroll (millions)\": 82.20\\n\"Wins\": 97' metadata={'source': 'Reds', 'row': 1}\n"
     ]
    }
   ],
   "source": [
    "loader = CSVLoader(file_path=file_path, source_column=\"Team\")\n",
    "\n",
    "data = loader.load()\n",
    "for record in data[:2]:\n",
    "    print(record)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cab6a4bd-476b-4f4c-92e0-5d1cbcd1f6bf",
   "metadata": {},
   "source": [
    "## Load from a string\n",
    "\n",
    "To load CSV data held in a string, write it to a temporary file with Python's `tempfile` module and pass that file's path to `CSVLoader`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "f3fb28b7-8ebe-4af9-9b7d-719e9a252a46",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "page_content='Team: Nationals\\n\"Payroll (millions)\": 81.34\\n\"Wins\": 98' metadata={'source': 'Nationals', 'row': 0}\n",
      "page_content='Team: Reds\\n\"Payroll (millions)\": 82.20\\n\"Wins\": 97' metadata={'source': 'Reds', 'row': 1}\n"
     ]
    }
   ],
   "source": [
    "import tempfile\n",
    "\n",
    "string_data = \"\"\"\n",
    "\"Team\", \"Payroll (millions)\", \"Wins\"\n",
    "\"Nationals\", 81.34, 98\n",
    "\"Reds\", 82.20, 97\n",
    "\"Yankees\", 197.96, 95\n",
    "\"Giants\", 117.62, 94\n",
    "\"\"\".strip()\n",
    "\n",
    "\n",
    "# Write the CSV string to a temporary file so CSVLoader can read it by path.\n",
    "with tempfile.NamedTemporaryFile(delete=False, mode=\"w+\") as temp_file:\n",
    "    temp_file.write(string_data)\n",
    "    temp_file_path = temp_file.name\n",
    "\n",
    "# source_column=\"Team\" keeps the temporary path out of the document metadata.\n",
    "loader = CSVLoader(file_path=temp_file_path, source_column=\"Team\")\n",
    "data = loader.load()\n",
    "for record in data[:2]:\n",
    "    print(record)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
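Because `csv_args` is forwarded to `csv.DictReader`, its effect can be previewed with the standard library alone. The sketch below is not part of the notebook and does not call `CSVLoader`. The sample rows are copied from the string example above, and `preview_rows` is a hypothetical helper introduced only for illustration. It shows two behaviors that explain the recorded outputs: supplying `fieldnames` makes the header line come back as the first data row, and a space after each comma leaves the quotes in later fields as literal characters. `skipinitialspace` is a standard `csv` dialect option; that it can be set through `csv_args` like any other reader argument is an assumption here.

```python
import csv
from io import StringIO

# Sample rows copied from the "Load from a string" example above.
SAMPLE = '''"Team", "Payroll (millions)", "Wins"
"Nationals", 81.34, 98
"Reds", 82.20, 97'''


def preview_rows(text, **csv_args):
    """Hypothetical helper: parse text with csv.DictReader using the given csv_args."""
    return list(csv.DictReader(StringIO(text), **csv_args))


# Default dialect: a space before an opening quote makes the quotes literal.
default_rows = preview_rows(SAMPLE)

# Explicit fieldnames: the header line is no longer consumed as a header.
renamed_rows = preview_rows(
    SAMPLE, fieldnames=["MLB Team", "Payroll in millions", "Wins"]
)

# skipinitialspace=True ignores the space after each delimiter, so quoting works.
clean_rows = preview_rows(SAMPLE, skipinitialspace=True)

print(list(default_rows[0].keys()))
print(renamed_rows[0])
print(clean_rows[0])
```

Checking a file this way first shows which keys each row will have, and therefore which names are valid choices for `source_column`.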