mirror of
https://github.com/hwchase17/langchain.git
synced 2025-09-01 11:02:37 +00:00
community: Allow other than default parsers in SharePointLoader and OneDriveLoader (#27716)
## What this PR does?
### Currently `O365BaseLoader` (and consequently both derived loaders) is limited to `pdf`, `doc`, and `docx` files.
- **Solution: this PR introduces a _handlers_ attribute that lets callers pass in custom handlers. The handlers are given in _dict_ form:**
**Example:**
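```python
from langchain_community.document_loaders import OneDriveLoader, SharePointLoader
from langchain_community.document_loaders.excel import UnstructuredExcelLoader
# PR for DocumentLoaderAsParser here: https://github.com/langchain-ai/langchain/pull/27749
from langchain_community.document_loaders.parsers.documentloader_adapter import DocumentLoaderAsParser
from langchain_community.document_loaders.parsers.msword import MsWordParser
from langchain_community.document_loaders.parsers.pdf import PDFMinerParser
from langchain_community.document_loaders.parsers.txt import TextParser

# wrap a document loader so it can act as the parser for xlsx files
xlsx_parser = DocumentLoaderAsParser(UnstructuredExcelLoader, mode="paged")

# create dictionary mapping file types to handlers (parsers)
handlers = {
    "doc": MsWordParser(),
    "pdf": PDFMinerParser(),
    "txt": TextParser(),
    "xlsx": xlsx_parser,
}

# pass handlers to SharePointLoader
loader = SharePointLoader(document_library_id="...", handlers=handlers)
documents = loader.load()

# works the same in OneDriveLoader, which takes `drive_id` instead
loader = OneDriveLoader(drive_id="...", handlers=handlers)
```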
This dictionary is then passed to `MimeTypeBasedParser`, the same as in the [current implementation](5a2cfb49e0/libs/community/langchain_community/document_loaders/parsers/registry.py (L13)).
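As a rough illustration of how an extension-keyed dictionary could be turned into the MIME-type keys that `MimeTypeBasedParser` works with, here is a minimal standalone sketch. It is not the PR's implementation: `normalize_handlers` is a hypothetical helper, the parsers are plain placeholder strings, and resolving extensions through Python's standard `mimetypes` module is an assumption. It follows the rules stated in the updated notebooks: keys are either all file extensions or all MIME types, and when two extensions share a MIME type the later entry wins.

```python
import mimetypes
from typing import Any, Dict


def normalize_handlers(handlers: Dict[str, Any]) -> Dict[str, Any]:
    """Return handlers keyed by MIME type (hypothetical helper, not LangChain API)."""
    mime_keys = ["/" in key for key in handlers]
    if all(mime_keys):
        # Keys already look like MIME types ("type/subtype"); use them unchanged.
        return dict(handlers)
    if any(mime_keys):
        raise ValueError("Use either file extensions or MIME types, not both.")

    by_mime: Dict[str, Any] = {}
    for extension, parser in handlers.items():
        # Extensions are given without the leading dot, e.g. "pdf".
        mime_type, _ = mimetypes.guess_type(f"file.{extension}")
        if mime_type is None:
            raise ValueError(f"Unknown file extension: {extension!r}")
        # A later entry overwrites an earlier one with the same MIME type.
        by_mime[mime_type] = parser
    return by_mime


# "jpg" and "jpeg" resolve to the same MIME type, so only the later parser is kept.
print(normalize_handlers({"jpg": "FirstParser", "jpeg": "SecondParser"}))
```

Rejecting mixed keys up front keeps the behavior predictable: a dictionary is interpreted one way or the other, never partly as extensions and partly as MIME types.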
### Currently `SharePointLoader` and `OneDriveLoader` are separate loaders that both inherit from `O365BaseLoader`
However, both implement the same functionality. The only differences are:
- `SharePointLoader` requires the argument `document_library_id`, whereas `OneDriveLoader` requires `drive_id`. These are just different names for the same thing.
- `SharePointLoader` implements significantly more features.
- **Solution: `OneDriveLoader` is replaced with an empty shell that inherits from `SharePointLoader` and only renames `drive_id` to `document_library_id`.**
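The "empty shell" idea can be sketched with two stand-in classes. This is an illustration only: `SharePointLoaderSketch` and `OneDriveLoaderSketch` are hypothetical names, not the real loaders, and the real classes carry authentication and loading logic that is omitted here.

```python
from typing import Any, Dict, Optional


class SharePointLoaderSketch:
    """Stand-in for the feature-rich base loader (hypothetical, heavily simplified)."""

    def __init__(
        self,
        document_library_id: str,
        handlers: Optional[Dict[str, Any]] = None,
    ) -> None:
        self.document_library_id = document_library_id
        self.handlers = handlers or {}


class OneDriveLoaderSketch(SharePointLoaderSketch):
    """Thin subclass: accepts `drive_id` and forwards it under the base class's name."""

    def __init__(self, drive_id: str, **kwargs: Any) -> None:
        super().__init__(document_library_id=drive_id, **kwargs)


loader = OneDriveLoaderSketch(drive_id="YOUR DRIVE ID", handlers={"pdf": "SomeParser"})
print(loader.document_library_id)
```

With this shape, any feature added to the base loader (such as `handlers`) is available to the OneDrive variant without duplicating code.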
**Dependencies:** None
**Twitter handle:** @martintriska1
@@ -8,7 +8,7 @@
 "\n",
 ">[Microsoft OneDrive](https://en.wikipedia.org/wiki/OneDrive) (formerly `SkyDrive`) is a file hosting service operated by Microsoft.\n",
 "\n",
-"This notebook covers how to load documents from `OneDrive`. Currently, only docx, doc, and pdf files are supported.\n",
+"This notebook covers how to load documents from `OneDrive`. By default the document loader loads `pdf`, `doc`, `docx` and `txt` files. You can load other file types by providing appropriate parsers (see more below).\n",
 "\n",
 "## Prerequisites\n",
 "1. Register an application with the [Microsoft identity platform](https://learn.microsoft.com/en-us/azure/active-directory/develop/quickstart-register-app) instructions.\n",
@@ -77,15 +77,64 @@
 "\n",
 "loader = OneDriveLoader(drive_id=\"YOUR DRIVE ID\", object_ids=[\"ID_1\", \"ID_2\"], auth_with_token=True)\n",
 "documents = loader.load()\n",
-"```\n"
+"```\n",
+"\n",
+"#### 📑 Choosing supported file types and preferred parsers\n",
+"By default `OneDriveLoader` loads file types defined in [`document_loaders/parsers/registry`](https://github.com/langchain-ai/langchain/blob/master/libs/community/langchain_community/document_loaders/parsers/registry.py#L10-L22) using the default parsers (see below).\n",
+"```python\n",
+"def _get_default_parser() -> BaseBlobParser:\n",
+"    \"\"\"Get default mime-type based parser.\"\"\"\n",
+"    return MimeTypeBasedParser(\n",
+"        handlers={\n",
+"            \"application/pdf\": PyMuPDFParser(),\n",
+"            \"text/plain\": TextParser(),\n",
+"            \"application/msword\": MsWordParser(),\n",
+"            \"application/vnd.openxmlformats-officedocument.wordprocessingml.document\": (\n",
+"                MsWordParser()\n",
+"            ),\n",
+"        },\n",
+"        fallback_parser=None,\n",
+"    )\n",
+"```\n",
+"You can override this behavior by passing `handlers` argument to `OneDriveLoader`. \n",
+"Pass a dictionary mapping either file extensions (like `\"doc\"`, `\"pdf\"`, etc.) \n",
+"or MIME types (like `\"application/pdf\"`, `\"text/plain\"`, etc.) to parsers. \n",
+"Note that you must use either file extensions or MIME types exclusively and \n",
+"cannot mix them.\n",
+"\n",
+"Do not include the leading dot for file extensions.\n",
+"\n",
+"```python\n",
+"# using file extensions:\n",
+"handlers = {\n",
+"    \"doc\": MsWordParser(),\n",
+"    \"pdf\": PDFMinerParser(),\n",
+"    \"mp3\": OpenAIWhisperParser()\n",
+"}\n",
+"\n",
+"# using MIME types:\n",
+"handlers = {\n",
+"    \"application/msword\": MsWordParser(),\n",
+"    \"application/pdf\": PDFMinerParser(),\n",
+"    \"audio/mpeg\": OpenAIWhisperParser()\n",
+"}\n",
+"\n",
+"loader = OneDriveLoader(drive_id=\"...\",\n",
+"                        handlers=handlers # pass handlers to OneDriveLoader\n",
+"                        )\n",
+"```\n",
+"In case multiple file extensions map to the same MIME type, the last dictionary item will\n",
+"apply.\n",
+"Example:\n",
+"```python\n",
+"# 'jpg' and 'jpeg' both map to 'image/jpeg' MIME type. SecondParser() will be used \n",
+"# to parse all jpg/jpeg files.\n",
+"handlers = {\n",
+"    \"jpg\": FirstParser(),\n",
+"    \"jpeg\": SecondParser()\n",
+"}\n",
+"```"
 ]
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {},
-"outputs": [],
-"source": []
 }
 ],
 "metadata": {
@@ -9,7 +9,7 @@
 "\n",
 "> [Microsoft SharePoint](https://en.wikipedia.org/wiki/SharePoint) is a website-based collaboration system that uses workflow applications, “list” databases, and other web parts and security features to empower business teams to work together developed by Microsoft.\n",
 "\n",
-"This notebook covers how to load documents from the [SharePoint Document Library](https://support.microsoft.com/en-us/office/what-is-a-document-library-3b5976dd-65cf-4c9e-bf5a-713c10ca2872). Currently, only docx, doc, and pdf files are supported.\n",
+"This notebook covers how to load documents from the [SharePoint Document Library](https://support.microsoft.com/en-us/office/what-is-a-document-library-3b5976dd-65cf-4c9e-bf5a-713c10ca2872). By default the document loader loads `pdf`, `doc`, `docx` and `txt` files. You can load other file types by providing appropriate parsers (see more below).\n",
 "\n",
 "## Prerequisites\n",
 "1. Register an application with the [Microsoft identity platform](https://learn.microsoft.com/en-us/azure/active-directory/develop/quickstart-register-app) instructions.\n",
@@ -100,7 +100,63 @@
 "\n",
 "loader = SharePointLoader(document_library_id=\"YOUR DOCUMENT LIBRARY ID\", object_ids=[\"ID_1\", \"ID_2\"], auth_with_token=True)\n",
 "documents = loader.load()\n",
-"```\n"
+"```\n",
+"\n",
+"#### 📑 Choosing supported file types and preferred parsers\n",
+"By default `SharePointLoader` loads file types defined in [`document_loaders/parsers/registry`](https://github.com/langchain-ai/langchain/blob/master/libs/community/langchain_community/document_loaders/parsers/registry.py#L10-L22) using the default parsers (see below).\n",
+"```python\n",
+"def _get_default_parser() -> BaseBlobParser:\n",
+"    \"\"\"Get default mime-type based parser.\"\"\"\n",
+"    return MimeTypeBasedParser(\n",
+"        handlers={\n",
+"            \"application/pdf\": PyMuPDFParser(),\n",
+"            \"text/plain\": TextParser(),\n",
+"            \"application/msword\": MsWordParser(),\n",
+"            \"application/vnd.openxmlformats-officedocument.wordprocessingml.document\": (\n",
+"                MsWordParser()\n",
+"            ),\n",
+"        },\n",
+"        fallback_parser=None,\n",
+"    )\n",
+"```\n",
+"You can override this behavior by passing `handlers` argument to `SharePointLoader`. \n",
+"Pass a dictionary mapping either file extensions (like `\"doc\"`, `\"pdf\"`, etc.) \n",
+"or MIME types (like `\"application/pdf\"`, `\"text/plain\"`, etc.) to parsers. \n",
+"Note that you must use either file extensions or MIME types exclusively and \n",
+"cannot mix them.\n",
+"\n",
+"Do not include the leading dot for file extensions.\n",
+"\n",
+"```python\n",
+"# using file extensions:\n",
+"handlers = {\n",
+"    \"doc\": MsWordParser(),\n",
+"    \"pdf\": PDFMinerParser(),\n",
+"    \"mp3\": OpenAIWhisperParser()\n",
+"}\n",
+"\n",
+"# using MIME types:\n",
+"handlers = {\n",
+"    \"application/msword\": MsWordParser(),\n",
+"    \"application/pdf\": PDFMinerParser(),\n",
+"    \"audio/mpeg\": OpenAIWhisperParser()\n",
+"}\n",
+"\n",
+"loader = SharePointLoader(document_library_id=\"...\",\n",
+"                          handlers=handlers # pass handlers to SharePointLoader\n",
+"                          )\n",
+"```\n",
+"In case multiple file extensions map to the same MIME type, the last dictionary item will\n",
+"apply.\n",
+"Example:\n",
+"```python\n",
+"# 'jpg' and 'jpeg' both map to 'image/jpeg' MIME type. SecondParser() will be used \n",
+"# to parse all jpg/jpeg files.\n",
+"handlers = {\n",
+"    \"jpg\": FirstParser(),\n",
+"    \"jpeg\": SecondParser()\n",
+"}\n",
+"```"
 ]
 }
 ],