community: Allow other than default parsers in SharePointLoader and OneDriveLoader (#27716)

## What does this PR do?

### Currently `O365BaseLoader` (and consequently both derived loaders)
are limited to `pdf`, `doc`, and `docx` files.
- **Solution: this PR introduces a `handlers` attribute that allows
custom handlers (parsers) to be passed in as a dictionary:**

**Example:**
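```python
from langchain_community.document_loaders import OneDriveLoader, SharePointLoader
from langchain_community.document_loaders.excel import UnstructuredExcelLoader
from langchain_community.document_loaders.parsers.documentloader_adapter import DocumentLoaderAsParser
# PR for DocumentLoaderAsParser here: https://github.com/langchain-ai/langchain/pull/27749
from langchain_community.document_loaders.parsers.msword import MsWordParser
from langchain_community.document_loaders.parsers.pdf import PDFMinerParser
from langchain_community.document_loaders.parsers.txt import TextParser

# wrap a document loader so it can be used as a parser for xlsx files
xlsx_parser = DocumentLoaderAsParser(UnstructuredExcelLoader, mode="paged")

# create dictionary mapping file types to handlers (parsers)
handlers = {
    "doc": MsWordParser(),
    "pdf": PDFMinerParser(),
    "txt": TextParser(),
    "xlsx": xlsx_parser,
}
loader = SharePointLoader(
    document_library_id="...",
    handlers=handlers,  # pass handlers to SharePointLoader
)
documents = loader.load()

# works the same in OneDriveLoader, which takes drive_id instead
loader = OneDriveLoader(
    drive_id="...",
    handlers=handlers,
)
```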
This dictionary is then passed to `MimeTypeBasedParser`, in the same way
as in the [current
implementation](5a2cfb49e0/libs/community/langchain_community/document_loaders/parsers/registry.py (L13)).
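
Since `MimeTypeBasedParser` is keyed by MIME type while the example above uses file extensions, some translation step is needed in between. The following is only a minimal sketch of what such a step could look like, not the code from this PR: `normalize_handlers` is a hypothetical helper, the standard-library `mimetypes` module is an assumed lookup mechanism, and plain strings stand in for parser objects.

```python
import mimetypes
from typing import Any, Dict


def normalize_handlers(handlers: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: return a copy of `handlers` keyed by MIME type."""
    # Keys containing "/" are treated as MIME types, all others as extensions.
    is_mime = ["/" in key for key in handlers]
    if all(is_mime):
        return dict(handlers)  # already keyed by MIME type
    if any(is_mime):
        raise ValueError("Use either file extensions or MIME types, not both.")

    by_mime: Dict[str, Any] = {}
    for extension, parser in handlers.items():
        # Extensions are given without the leading dot, e.g. "pdf".
        mime_type, _ = mimetypes.guess_type(f"file.{extension}")
        if mime_type is None:
            raise ValueError(f"Unknown MIME type for extension {extension!r}")
        by_mime[mime_type] = parser  # a later entry overwrites an earlier one
    return by_mime


# Strings stand in for parser instances in this sketch.
print(normalize_handlers({"jpg": "FirstParser", "jpeg": "SecondParser"}))
```

Because `jpg` and `jpeg` resolve to the same MIME type, only the parser given last survives in the result, which is why the order of dictionary items matters when extensions are used as keys.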


### Currently `SharePointLoader` and `OneDriveLoader` are separate
loaders that both inherit from `O365BaseLoader`
However, both implement the same functionality. The only
differences are:
- `SharePointLoader` requires the argument `document_library_id`, whereas
`OneDriveLoader` requires `drive_id`. These are just different names for
the same thing.
- `SharePointLoader` implements significantly more features.

- **Solution: `OneDriveLoader` is replaced with an empty shell that
inherits from `SharePointLoader` and only renames `drive_id` to
`document_library_id`.**
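
The "empty shell" idea can be illustrated with plain Python classes. This is a simplified sketch under stated assumptions, not the actual loader code: `SharePointLoaderSketch` and `OneDriveLoaderSketch` are hypothetical stand-ins that only store their arguments, and the real loaders may validate and forward arguments differently.

```python
from typing import Any


class SharePointLoaderSketch:
    """Hypothetical stand-in for the feature-rich loader; it only stores arguments."""

    def __init__(self, document_library_id: str, **kwargs: Any) -> None:
        self.document_library_id = document_library_id
        self.options = kwargs  # e.g. handlers, object_ids, ...


class OneDriveLoaderSketch(SharePointLoaderSketch):
    """Thin shell: accepts `drive_id` and forwards it as `document_library_id`."""

    def __init__(self, drive_id: str, **kwargs: Any) -> None:
        super().__init__(document_library_id=drive_id, **kwargs)
        self.drive_id = drive_id


loader = OneDriveLoaderSketch(drive_id="YOUR DRIVE ID", handlers={})
print(loader.document_library_id)
```

With this layout every feature added to the parent class, including `handlers`, is available in the subclass without duplicating any logic.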

**Dependencies:** None
**Twitter handle:** @martintriska1

Martin Triska
2024-11-06 23:44:34 +01:00
committed by GitHub
parent 482c168b3e
commit 90189f5639
5 changed files with 229 additions and 138 deletions


@@ -8,7 +8,7 @@
"\n",
">[Microsoft OneDrive](https://en.wikipedia.org/wiki/OneDrive) (formerly `SkyDrive`) is a file hosting service operated by Microsoft.\n",
"\n",
"This notebook covers how to load documents from `OneDrive`. Currently, only docx, doc, and pdf files are supported.\n",
"This notebook covers how to load documents from `OneDrive`. By default the document loader loads `pdf`, `doc`, `docx` and `txt` files. You can load other file types by providing appropriate parsers (see more below).\n",
"\n",
"## Prerequisites\n",
"1. Register an application with the [Microsoft identity platform](https://learn.microsoft.com/en-us/azure/active-directory/develop/quickstart-register-app) instructions.\n",
@@ -77,15 +77,64 @@
"\n",
"loader = OneDriveLoader(drive_id=\"YOUR DRIVE ID\", object_ids=[\"ID_1\", \"ID_2\"], auth_with_token=True)\n",
"documents = loader.load()\n",
"```\n"
"```\n",
"\n",
"#### 📑 Choosing supported file types and preferred parsers\n",
"By default `OneDriveLoader` loads file types defined in [`document_loaders/parsers/registry`](https://github.com/langchain-ai/langchain/blob/master/libs/community/langchain_community/document_loaders/parsers/registry.py#L10-L22) using the default parsers (see below).\n",
"```python\n",
"def _get_default_parser() -> BaseBlobParser:\n",
" \"\"\"Get default mime-type based parser.\"\"\"\n",
" return MimeTypeBasedParser(\n",
" handlers={\n",
" \"application/pdf\": PyMuPDFParser(),\n",
" \"text/plain\": TextParser(),\n",
" \"application/msword\": MsWordParser(),\n",
" \"application/vnd.openxmlformats-officedocument.wordprocessingml.document\": (\n",
" MsWordParser()\n",
" ),\n",
" },\n",
" fallback_parser=None,\n",
" )\n",
"```\n",
"You can override this behavior by passing the `handlers` argument to `OneDriveLoader`. \n",
"Pass a dictionary mapping either file extensions (like `\"doc\"`, `\"pdf\"`, etc.) \n",
"or MIME types (like `\"application/pdf\"`, `\"text/plain\"`, etc.) to parsers. \n",
"Note that you must use either file extensions or MIME types exclusively and \n",
"cannot mix them.\n",
"\n",
"Do not include the leading dot for file extensions.\n",
"\n",
"```python\n",
"# using file extensions:\n",
"handlers = {\n",
" \"doc\": MsWordParser(),\n",
" \"pdf\": PDFMinerParser(),\n",
" \"mp3\": OpenAIWhisperParser()\n",
"}\n",
"\n",
"# using MIME types:\n",
"handlers = {\n",
" \"application/msword\": MsWordParser(),\n",
" \"application/pdf\": PDFMinerParser(),\n",
" \"audio/mpeg\": OpenAIWhisperParser()\n",
"}\n",
"\n",
"loader = OneDriveLoader(drive_id=\"...\",\n",
" handlers=handlers # pass handlers to OneDriveLoader\n",
" )\n",
"```\n",
"In case multiple file extensions map to the same MIME type, the last dictionary item will\n",
"apply.\n",
"Example:\n",
"```python\n",
"# 'jpg' and 'jpeg' both map to 'image/jpeg' MIME type. SecondParser() will be used \n",
"# to parse all jpg/jpeg files.\n",
"handlers = {\n",
" \"jpg\": FirstParser(),\n",
" \"jpeg\": SecondParser()\n",
"}\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {


@@ -9,7 +9,7 @@
"\n",
"> [Microsoft SharePoint](https://en.wikipedia.org/wiki/SharePoint) is a website-based collaboration system that uses workflow applications, “list” databases, and other web parts and security features to empower business teams to work together developed by Microsoft.\n",
"\n",
"This notebook covers how to load documents from the [SharePoint Document Library](https://support.microsoft.com/en-us/office/what-is-a-document-library-3b5976dd-65cf-4c9e-bf5a-713c10ca2872). Currently, only docx, doc, and pdf files are supported.\n",
"This notebook covers how to load documents from the [SharePoint Document Library](https://support.microsoft.com/en-us/office/what-is-a-document-library-3b5976dd-65cf-4c9e-bf5a-713c10ca2872). By default the document loader loads `pdf`, `doc`, `docx` and `txt` files. You can load other file types by providing appropriate parsers (see more below).\n",
"\n",
"## Prerequisites\n",
"1. Register an application with the [Microsoft identity platform](https://learn.microsoft.com/en-us/azure/active-directory/develop/quickstart-register-app) instructions.\n",
@@ -100,7 +100,63 @@
"\n",
"loader = SharePointLoader(document_library_id=\"YOUR DOCUMENT LIBRARY ID\", object_ids=[\"ID_1\", \"ID_2\"], auth_with_token=True)\n",
"documents = loader.load()\n",
"```\n"
"```\n",
"\n",
"#### 📑 Choosing supported file types and preferred parsers\n",
"By default `SharePointLoader` loads file types defined in [`document_loaders/parsers/registry`](https://github.com/langchain-ai/langchain/blob/master/libs/community/langchain_community/document_loaders/parsers/registry.py#L10-L22) using the default parsers (see below).\n",
"```python\n",
"def _get_default_parser() -> BaseBlobParser:\n",
" \"\"\"Get default mime-type based parser.\"\"\"\n",
" return MimeTypeBasedParser(\n",
" handlers={\n",
" \"application/pdf\": PyMuPDFParser(),\n",
" \"text/plain\": TextParser(),\n",
" \"application/msword\": MsWordParser(),\n",
" \"application/vnd.openxmlformats-officedocument.wordprocessingml.document\": (\n",
" MsWordParser()\n",
" ),\n",
" },\n",
" fallback_parser=None,\n",
" )\n",
"```\n",
"You can override this behavior by passing the `handlers` argument to `SharePointLoader`. \n",
"Pass a dictionary mapping either file extensions (like `\"doc\"`, `\"pdf\"`, etc.) \n",
"or MIME types (like `\"application/pdf\"`, `\"text/plain\"`, etc.) to parsers. \n",
"Note that you must use either file extensions or MIME types exclusively and \n",
"cannot mix them.\n",
"\n",
"Do not include the leading dot for file extensions.\n",
"\n",
"```python\n",
"# using file extensions:\n",
"handlers = {\n",
" \"doc\": MsWordParser(),\n",
" \"pdf\": PDFMinerParser(),\n",
" \"mp3\": OpenAIWhisperParser()\n",
"}\n",
"\n",
"# using MIME types:\n",
"handlers = {\n",
" \"application/msword\": MsWordParser(),\n",
" \"application/pdf\": PDFMinerParser(),\n",
" \"audio/mpeg\": OpenAIWhisperParser()\n",
"}\n",
"\n",
"loader = SharePointLoader(document_library_id=\"...\",\n",
" handlers=handlers # pass handlers to SharePointLoader\n",
" )\n",
"```\n",
"In case multiple file extensions map to the same MIME type, the last dictionary item will\n",
"apply.\n",
"Example:\n",
"```python\n",
"# 'jpg' and 'jpeg' both map to 'image/jpeg' MIME type. SecondParser() will be used \n",
"# to parse all jpg/jpeg files.\n",
"handlers = {\n",
" \"jpg\": FirstParser(),\n",
" \"jpeg\": SecondParser()\n",
"}\n",
"```"
]
}
],