langchain/docs/docs/how_to/document_loader_directory.ipynb
Erick Friis 580fbd9ada
docs: api ref to new site somewheres (#25679)
```
https://api\.python\.langchain\.com/en/latest/([^/]*)/langchain_([^.]*)\.(.*)\.html([^"]*)
https://python.langchain.com/v0.2/api_reference/$2/$1/langchain_$2.$3.html$4
```

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
2024-08-23 10:01:16 -07:00

393 lines
18 KiB
Plaintext

{
"cells": [
{
"cell_type": "markdown",
"id": "9122e4b9-4883-4e6e-940b-ab44a70f0951",
"metadata": {},
"source": [
"# How to load documents from a directory\n",
"\n",
"LangChain's [DirectoryLoader](https://python.langchain.com/v0.2/api_reference/community/document_loaders/langchain_community.document_loaders.directory.DirectoryLoader.html) implements functionality for reading files from disk into LangChain [Document](https://python.langchain.com/v0.2/api_reference/core/documents/langchain_core.documents.base.Document.html#langchain_core.documents.base.Document) objects. Here we demonstrate:\n",
"\n",
"- How to load from a filesystem, including use of wildcard patterns;\n",
"- How to use multithreading for file I/O;\n",
"- How to use custom loader classes to parse specific file types (e.g., code);\n",
"- How to handle errors, such as those due to decoding."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "1c1e3796-bee8-4882-8065-6b98e48ec53a",
"metadata": {},
"outputs": [],
"source": [
"from langchain_community.document_loaders import DirectoryLoader"
]
},
{
"cell_type": "markdown",
"id": "e3cdb7bb-1f58-4a7a-af83-599443127834",
"metadata": {},
"source": [
"`DirectoryLoader` accepts a `loader_cls` kwarg, which defaults to [UnstructuredLoader](/docs/integrations/document_loaders/unstructured_file). [Unstructured](https://unstructured-io.github.io/unstructured/) supports parsing for a number of formats, such as PDF and HTML. Here we use it to read in a markdown (.md) file.\n",
"\n",
"We can use the `glob` parameter to control which files to load. Note that here it doesn't load the `.rst` file or the `.html` files."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "bd2fcd1f-8286-499b-b43a-0c17084ae8ee",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"20"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"loader = DirectoryLoader(\"../\", glob=\"**/*.md\")\n",
"docs = loader.load()\n",
"len(docs)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "9ff1503d-3ac0-4172-99ec-15c9a4a707d8",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Security\n",
"\n",
"LangChain has a large ecosystem of integrations with various external resources like local\n"
]
}
],
"source": [
"print(docs[0].page_content[:100])"
]
},
{
"cell_type": "markdown",
"id": "b8b1cee8-626a-461a-8d33-1c56120f1cc0",
"metadata": {},
"source": [
"## Show a progress bar\n",
"\n",
"By default a progress bar will not be shown. To show a progress bar, install the `tqdm` library (e.g. `pip install tqdm`), and set the `show_progress` parameter to `True`."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "cfa48224-5d02-4aa7-93c7-ce48241645d5",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 54.56it/s]\n"
]
}
],
"source": [
"loader = DirectoryLoader(\"../\", glob=\"**/*.md\", show_progress=True)\n",
"docs = loader.load()"
]
},
{
"cell_type": "markdown",
"id": "5e02c922-6a4b-48e6-8c46-5015553eafbe",
"metadata": {},
"source": [
"## Use multithreading\n",
"\n",
"By default the loading happens in one thread. In order to utilize several threads set the `use_multithreading` flag to true."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "aae1c580-6d7c-409c-bfc8-3049fa8bdbf9",
"metadata": {},
"outputs": [],
"source": [
"loader = DirectoryLoader(\"../\", glob=\"**/*.md\", use_multithreading=True)\n",
"docs = loader.load()"
]
},
{
"cell_type": "markdown",
"id": "5add3f54-f303-4006-90c9-540a90ab8c46",
"metadata": {},
"source": [
"## Change loader class\n",
"By default this uses the `UnstructuredLoader` class. To customize the loader, specify the loader class in the `loader_cls` kwarg. Below we show an example using [TextLoader](https://python.langchain.com/v0.2/api_reference/community/document_loaders/langchain_community.document_loaders.text.TextLoader.html):"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "d369ee78-ea24-48cc-9f46-1f5cd4b56f48",
"metadata": {},
"outputs": [],
"source": [
"from langchain_community.document_loaders import TextLoader\n",
"\n",
"loader = DirectoryLoader(\"../\", glob=\"**/*.md\", loader_cls=TextLoader)\n",
"docs = loader.load()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "2863d7dd-2d56-4fef-8bfd-95c48a6b4a71",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"# Security\n",
"\n",
"LangChain has a large ecosystem of integrations with various external resources like loc\n"
]
}
],
"source": [
"print(docs[0].page_content[:100])"
]
},
{
"cell_type": "markdown",
"id": "c97ed37b-38c0-4f31-9403-d3a5d5444f78",
"metadata": {},
"source": [
"Notice that while the `UnstructuredLoader` parses Markdown headers, `TextLoader` does not.\n",
"\n",
"If you need to load Python source code files, use the `PythonLoader`:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "5ef483a8-57d3-45e5-93be-37c8416c543c",
"metadata": {},
"outputs": [],
"source": [
"from langchain_community.document_loaders import PythonLoader\n",
"\n",
"loader = DirectoryLoader(\"../../../../../\", glob=\"**/*.py\", loader_cls=PythonLoader)"
]
},
{
"cell_type": "markdown",
"id": "61dd1428-8246-47e3-b1da-f6a3d6f05566",
"metadata": {},
"source": [
"## Auto-detect file encodings with TextLoader\n",
"\n",
"`DirectoryLoader` can help manage errors due to variations in file encodings. Below we will attempt to load in a collection of files, one of which includes non-UTF8 encodings."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "e69db7ae-0385-4129-968f-17c42c7a635c",
"metadata": {},
"outputs": [],
"source": [
"path = \"../../../../libs/langchain/tests/unit_tests/examples/\"\n",
"\n",
"loader = DirectoryLoader(path, glob=\"**/*.txt\", loader_cls=TextLoader)"
]
},
{
"cell_type": "markdown",
"id": "e3b61cf0-809b-4c97-b1a4-17c6aa4343e1",
"metadata": {},
"source": [
"### A. Default Behavior\n",
"\n",
"By default we raise an error:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "4b8f56be-122a-4c56-86a5-a70631a78ec7",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Error loading file ../../../../libs/langchain/tests/unit_tests/examples/example-non-utf8.txt\n"
]
},
{
"ename": "RuntimeError",
"evalue": "Error loading ../../../../libs/langchain/tests/unit_tests/examples/example-non-utf8.txt",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mUnicodeDecodeError\u001b[0m Traceback (most recent call last)",
"File \u001b[0;32m~/repos/langchain/libs/community/langchain_community/document_loaders/text.py:43\u001b[0m, in \u001b[0;36mTextLoader.lazy_load\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 42\u001b[0m \u001b[38;5;28;01mwith\u001b[39;00m \u001b[38;5;28mopen\u001b[39m(\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mfile_path, encoding\u001b[38;5;241m=\u001b[39m\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mencoding) \u001b[38;5;28;01mas\u001b[39;00m f:\n\u001b[0;32m---> 43\u001b[0m text \u001b[38;5;241m=\u001b[39m \u001b[43mf\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mread\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 44\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mUnicodeDecodeError\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m e:\n",
"File \u001b[0;32m~/.pyenv/versions/3.10.4/lib/python3.10/codecs.py:322\u001b[0m, in \u001b[0;36mBufferedIncrementalDecoder.decode\u001b[0;34m(self, input, final)\u001b[0m\n\u001b[1;32m 321\u001b[0m data \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mbuffer \u001b[38;5;241m+\u001b[39m \u001b[38;5;28minput\u001b[39m\n\u001b[0;32m--> 322\u001b[0m (result, consumed) \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_buffer_decode\u001b[49m\u001b[43m(\u001b[49m\u001b[43mdata\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43merrors\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mfinal\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 323\u001b[0m \u001b[38;5;66;03m# keep undecoded input until the next call\u001b[39;00m\n",
"\u001b[0;31mUnicodeDecodeError\u001b[0m: 'utf-8' codec can't decode byte 0xca in position 0: invalid continuation byte",
"\nThe above exception was the direct cause of the following exception:\n",
"\u001b[0;31mRuntimeError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[10], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[43mloader\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mload\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n",
"File \u001b[0;32m~/repos/langchain/libs/community/langchain_community/document_loaders/directory.py:117\u001b[0m, in \u001b[0;36mDirectoryLoader.load\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 115\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21mload\u001b[39m(\u001b[38;5;28mself\u001b[39m) \u001b[38;5;241m-\u001b[39m\u001b[38;5;241m>\u001b[39m List[Document]:\n\u001b[1;32m 116\u001b[0m \u001b[38;5;250m \u001b[39m\u001b[38;5;124;03m\"\"\"Load documents.\"\"\"\u001b[39;00m\n\u001b[0;32m--> 117\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mlist\u001b[39;49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mlazy_load\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\u001b[43m)\u001b[49m\n",
"File \u001b[0;32m~/repos/langchain/libs/community/langchain_community/document_loaders/directory.py:182\u001b[0m, in \u001b[0;36mDirectoryLoader.lazy_load\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 180\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[1;32m 181\u001b[0m \u001b[38;5;28;01mfor\u001b[39;00m i \u001b[38;5;129;01min\u001b[39;00m items:\n\u001b[0;32m--> 182\u001b[0m \u001b[38;5;28;01myield from\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_lazy_load_file(i, p, pbar)\n\u001b[1;32m 184\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m pbar:\n\u001b[1;32m 185\u001b[0m pbar\u001b[38;5;241m.\u001b[39mclose()\n",
"File \u001b[0;32m~/repos/langchain/libs/community/langchain_community/document_loaders/directory.py:220\u001b[0m, in \u001b[0;36mDirectoryLoader._lazy_load_file\u001b[0;34m(self, item, path, pbar)\u001b[0m\n\u001b[1;32m 218\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[1;32m 219\u001b[0m logger\u001b[38;5;241m.\u001b[39merror(\u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mError loading file \u001b[39m\u001b[38;5;132;01m{\u001b[39;00m\u001b[38;5;28mstr\u001b[39m(item)\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m\"\u001b[39m)\n\u001b[0;32m--> 220\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m e\n\u001b[1;32m 221\u001b[0m \u001b[38;5;28;01mfinally\u001b[39;00m:\n\u001b[1;32m 222\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m pbar:\n",
"File \u001b[0;32m~/repos/langchain/libs/community/langchain_community/document_loaders/directory.py:210\u001b[0m, in \u001b[0;36mDirectoryLoader._lazy_load_file\u001b[0;34m(self, item, path, pbar)\u001b[0m\n\u001b[1;32m 208\u001b[0m loader \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mloader_cls(\u001b[38;5;28mstr\u001b[39m(item), \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39m\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mloader_kwargs)\n\u001b[1;32m 209\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[0;32m--> 210\u001b[0m \u001b[38;5;28;01mfor\u001b[39;00m subdoc \u001b[38;5;129;01min\u001b[39;00m loader\u001b[38;5;241m.\u001b[39mlazy_load():\n\u001b[1;32m 211\u001b[0m \u001b[38;5;28;01myield\u001b[39;00m subdoc\n\u001b[1;32m 212\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mNotImplementedError\u001b[39;00m:\n",
"File \u001b[0;32m~/repos/langchain/libs/community/langchain_community/document_loaders/text.py:56\u001b[0m, in \u001b[0;36mTextLoader.lazy_load\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 54\u001b[0m \u001b[38;5;28;01mcontinue\u001b[39;00m\n\u001b[1;32m 55\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[0;32m---> 56\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mRuntimeError\u001b[39;00m(\u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mError loading \u001b[39m\u001b[38;5;132;01m{\u001b[39;00m\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mfile_path\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m\"\u001b[39m) \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01me\u001b[39;00m\n\u001b[1;32m 57\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mException\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m e:\n\u001b[1;32m 58\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mRuntimeError\u001b[39;00m(\u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mError loading \u001b[39m\u001b[38;5;132;01m{\u001b[39;00m\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mfile_path\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m\"\u001b[39m) \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01me\u001b[39;00m\n",
"\u001b[0;31mRuntimeError\u001b[0m: Error loading ../../../../libs/langchain/tests/unit_tests/examples/example-non-utf8.txt"
]
}
],
"source": [
"loader.load()"
]
},
{
"cell_type": "markdown",
"id": "48308077-2d99-4dd6-9bf1-dd1ad6c64b0f",
"metadata": {},
"source": [
"The file `example-non-utf8.txt` uses a different encoding, so the `load()` function fails with a helpful message indicating which file failed decoding.\n",
"\n",
"With the default behavior of `TextLoader` any failure to load any of the documents will fail the whole loading process and no documents are loaded.\n",
"\n",
"### B. Silent fail\n",
"\n",
"We can pass the parameter `silent_errors` to the `DirectoryLoader` to skip the files which could not be loaded and continue the load process."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "b333c652-a7ad-47f4-8be8-d27c18ef11b7",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Error loading file ../../../../libs/langchain/tests/unit_tests/examples/example-non-utf8.txt: Error loading ../../../../libs/langchain/tests/unit_tests/examples/example-non-utf8.txt\n"
]
}
],
"source": [
"loader = DirectoryLoader(\n",
" path, glob=\"**/*.txt\", loader_cls=TextLoader, silent_errors=True\n",
")\n",
"docs = loader.load()"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "b99ef682-b892-4790-8964-40185fea41a2",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['../../../../libs/langchain/tests/unit_tests/examples/example-utf8.txt']"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"doc_sources = [doc.metadata[\"source\"] for doc in docs]\n",
"doc_sources"
]
},
{
"cell_type": "markdown",
"id": "da475bff-2f4f-4ea3-a058-2979042c5326",
"metadata": {},
"source": [
"### C. Auto detect encodings\n",
"\n",
"We can also ask `TextLoader` to auto detect the file encoding before failing, by passing the `autodetect_encoding` to the loader class."
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "832760da-ed9f-4e68-a67c-35493bde2214",
"metadata": {},
"outputs": [],
"source": [
"text_loader_kwargs = {\"autodetect_encoding\": True}\n",
"loader = DirectoryLoader(\n",
" path, glob=\"**/*.txt\", loader_cls=TextLoader, loader_kwargs=text_loader_kwargs\n",
")\n",
"docs = loader.load()"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "5c4f4dba-f84f-496e-9378-3e6858305619",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['../../../../libs/langchain/tests/unit_tests/examples/example-utf8.txt',\n",
" '../../../../libs/langchain/tests/unit_tests/examples/example-non-utf8.txt']"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"doc_sources = [doc.metadata[\"source\"] for doc in docs]\n",
"doc_sources"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}