mirror of
https://github.com/hwchase17/langchain.git
synced 2025-07-06 21:20:33 +00:00
feat: enable UnstructuredEmailLoader to process attachments (#6977)

### Summary

Updates `UnstructuredEmailLoader` so that it can process attachments in addition to the e-mail content. The loader will process attachments if the `process_attachments` kwarg is passed when the loader is instantiated.

### Testing

```python
file_path = "fake-email-attachment.eml"
loader = UnstructuredEmailLoader(
    file_path, mode="elements", process_attachments=True
)
docs = loader.load()
docs[-1]
```

### Reviewers

- @rlancemartin
- @eyurtsev
- @hwchase17
This commit is contained in:
parent
59697b406d
commit
0498dad562
@@ -32,7 +32,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 5,
+   "execution_count": 1,
    "id": "40cd9806",
    "metadata": {
     "tags": []
@@ -44,7 +44,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 6,
+   "execution_count": 2,
    "id": "2d20b852",
    "metadata": {
     "tags": []
@@ -56,7 +56,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 3,
    "id": "579fa702",
    "metadata": {
     "tags": []
@@ -68,7 +68,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 8,
+   "execution_count": 4,
    "id": "90c1d899",
    "metadata": {
     "tags": []
@@ -80,7 +80,7 @@
        "[Document(page_content='This is a test email to use for unit tests.\\n\\nImportant points:\\n\\nRoses are red\\n\\nViolets are blue', metadata={'source': 'example_data/fake-email.eml'})]"
       ]
      },
-     "execution_count": 8,
+     "execution_count": 4,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -128,7 +128,7 @@
     {
      "data": {
       "text/plain": [
-       "Document(page_content='This is a test email to use for unit tests.', lookup_str='', metadata={'source': 'example_data/fake-email.eml'}, lookup_index=0)"
+       "Document(page_content='This is a test email to use for unit tests.', metadata={'source': 'example_data/fake-email.eml', 'filename': 'fake-email.eml', 'file_directory': 'example_data', 'date': '2022-12-16T17:04:16-05:00', 'filetype': 'message/rfc822', 'sent_from': ['Matthew Robinson <mrobinson@unstructured.io>'], 'sent_to': ['Matthew Robinson <mrobinson@unstructured.io>'], 'subject': 'Test Email', 'category': 'NarrativeText'})"
       ]
      },
      "execution_count": 7,
@@ -140,6 +140,61 @@
     "data[0]"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "5021f20a",
+   "metadata": {},
+   "source": [
+    "### Processing Attachments\n",
+    "\n",
+    "You can process attachments with `UnstructuredEmailLoader` by setting `process_attachments=True` in the constructor. By default, attachments will be partitioned using the `partition` function from `unstructured`. You can use a different partitioning function by passing the function to the `attachment_partitioner` kwarg."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "id": "6539f166",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "loader = UnstructuredEmailLoader(\n",
+    " \"example_data/fake-email.eml\",\n",
+    " mode=\"elements\",\n",
+    " process_attachments=True,\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "id": "aebead38",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data = loader.load()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "id": "ddeb60f4",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "Document(page_content='This is a test email to use for unit tests.', metadata={'source': 'example_data/fake-email.eml', 'filename': 'fake-email.eml', 'file_directory': 'example_data', 'date': '2022-12-16T17:04:16-05:00', 'filetype': 'message/rfc822', 'sent_from': ['Matthew Robinson <mrobinson@unstructured.io>'], 'sent_to': ['Matthew Robinson <mrobinson@unstructured.io>'], 'subject': 'Test Email', 'category': 'NarrativeText'})"
+      ]
+     },
+     "execution_count": 10,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "data[0]"
+   ]
+  },
   {
    "cell_type": "markdown",
    "id": "6a074515",
@@ -234,7 +289,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.10.6"
+   "version": "3.8.13"
   }
  },
  "nbformat": 4,
@@ -0,0 +1,50 @@
+MIME-Version: 1.0
+Date: Fri, 23 Dec 2022 12:08:48 -0600
+Message-ID: <CAPgNNXSzLVJ-d1OCX_TjFgJU7ugtQrjFybPtAMmmYZzphxNFYg@mail.gmail.com>
+Subject: Fake email with attachment
+From: Mallori Harrell <mallori@unstructured.io>
+To: Mallori Harrell <mallori@unstructured.io>
+Content-Type: multipart/mixed; boundary="0000000000005d654405f082adb7"
+
+--0000000000005d654405f082adb7
+Content-Type: multipart/alternative; boundary="0000000000005d654205f082adb5"
+
+--0000000000005d654205f082adb5
+Content-Type: text/plain; charset="UTF-8"
+
+Hello!
+
+Here's the attachments!
+
+It includes:
+
+- Lots of whitespace
+- Little to no content
+- and is a quick read
+
+Best,
+
+Mallori
+
+--0000000000005d654205f082adb5
+Content-Type: text/html; charset="UTF-8"
+Content-Transfer-Encoding: quoted-printable
+
+<div dir=3D"ltr">Hello!=C2=A0<div><br></div><div>Here's the attachments=
+!</div><div><br></div><div>It includes:</div><div><ul><li style=3D"margin-l=
+eft:15px">Lots of whitespace</li><li style=3D"margin-left:15px">Little=C2=
+=A0to no content</li><li style=3D"margin-left:15px">and is a quick read</li=
+></ul><div>Best,</div></div><div><br></div><div>Mallori</div><div dir=3D"lt=
+r" class=3D"gmail_signature" data-smartmail=3D"gmail_signature"><div dir=3D=
+"ltr"><div><div><br></div></div></div></div></div>
+
+--0000000000005d654205f082adb5--
+--0000000000005d654405f082adb7
+Content-Type: text/plain; charset="US-ASCII"; name="fake-attachment.txt"
+Content-Disposition: attachment; filename="fake-attachment.txt"
+Content-Transfer-Encoding: base64
+X-Attachment-Id: f_lc0tto5j0
+Content-ID: <f_lc0tto5j0>
+
+SGV5IHRoaXMgaXMgYSBmYWtlIGF0dGFjaG1lbnQh
+--0000000000005d654405f082adb7--
@@ -1,6 +1,6 @@
 """Loader that loads email files."""
 import os
-from typing import List
+from typing import Any, List
 
 from langchain.docstore.document import Document
 from langchain.document_loaders.base import BaseLoader
@@ -11,7 +11,45 @@ from langchain.document_loaders.unstructured import (
 
 
 class UnstructuredEmailLoader(UnstructuredFileLoader):
-    """Loader that uses unstructured to load email files."""
+    """Loader that uses unstructured to load email files. Works with both
+    .eml and .msg files. You can process attachments in addition to the
+    e-mail message itself by passing process_attachments=True into the
+    constructor for the loader. By default, attachments will be processed
+    with the unstructured partition function. If you already know the document
+    types of the attachments, you can specify another partitioning function
+    with the attachment partitioner kwarg.
+
+    Example
+    -------
+    from langchain.document_loaders import UnstructuredEmailLoader
+
+    loader = UnstructuredEmailLoader("example_data/fake-email.eml", mode="elements")
+    loader.load()
+
+    Example
+    -------
+    from langchain.document_loaders import UnstructuredEmailLoader
+
+    loader = UnstructuredEmailLoader(
+        "example_data/fake-email-attachment.eml",
+        mode="elements",
+        process_attachments=True,
+    )
+    loader.load()
+    """
+
+    def __init__(
+        self, file_path: str, mode: str = "single", **unstructured_kwargs: Any
+    ):
+        process_attachments = unstructured_kwargs.get("process_attachments")
+        attachment_partitioner = unstructured_kwargs.get("attachment_partitioner")
+
+        if process_attachments and attachment_partitioner is None:
+            from unstructured.partition.auto import partition
+
+            unstructured_kwargs["attachment_partitioner"] = partition
+
+        super().__init__(file_path=file_path, mode=mode, **unstructured_kwargs)
 
     def _get_elements(self) -> List:
         from unstructured.file_utils.filetype import FileType, detect_filetype
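The defaulting rule in `__init__` above can be tried in isolation. The following is a minimal sketch, not the loader itself: `default_partition` is a hypothetical stand-in for `unstructured.partition.auto.partition`, and `resolve_attachment_kwargs` is an illustrative helper that does not exist in langchain. It only mirrors the branch that fills in `attachment_partitioner` when `process_attachments` is truthy and no partitioner was supplied.

```python
from typing import Any, Dict, List


def default_partition(filename: str) -> List[str]:
    """Hypothetical stand-in for unstructured's auto `partition` function."""
    return [f"partitioned:{filename}"]


def resolve_attachment_kwargs(**unstructured_kwargs: Any) -> Dict[str, Any]:
    """Mirror the kwarg defaulting shown in the diff (illustrative helper)."""
    process_attachments = unstructured_kwargs.get("process_attachments")
    attachment_partitioner = unstructured_kwargs.get("attachment_partitioner")

    # Only fill in a default when attachments are requested and the caller
    # did not supply a partitioner of their own.
    if process_attachments and attachment_partitioner is None:
        unstructured_kwargs["attachment_partitioner"] = default_partition

    return unstructured_kwargs


def custom_partition(filename: str) -> List[str]:
    """A caller-supplied partitioner; it should be left untouched."""
    return [f"custom:{filename}"]


with_default = resolve_attachment_kwargs(process_attachments=True)
with_custom = resolve_attachment_kwargs(
    process_attachments=True, attachment_partitioner=custom_partition
)
without_attachments = resolve_attachment_kwargs()
```

One consequence of this design is that passing `attachment_partitioner` without `process_attachments=True` leaves the kwargs unchanged; whether the partitioner is then used depends on the underlying `unstructured` call, which this sketch does not model.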
@@ -1,6 +1,6 @@
 from pathlib import Path
 
-from langchain.document_loaders import OutlookMessageLoader
+from langchain.document_loaders import OutlookMessageLoader, UnstructuredEmailLoader
 
 
 def test_outlook_message_loader() -> None:
@@ -18,3 +18,15 @@ def test_outlook_message_loader() -> None:
         "Extractor\r\n\r\n\r\n-- \r\n\r\n\r\nKind regards"
         "\r\n\r\n\r\n\r\n\r\nBrian Zhou\r\n\r\n"
     )
+
+
+def test_unstructured_email_loader_with_attachments() -> None:
+    file_path = Path(__file__).parent.parent / "examples/fake-email-attachment.eml"
+    loader = UnstructuredEmailLoader(
+        str(file_path), mode="elements", process_attachments=True
+    )
+    docs = loader.load()
+
+    assert docs[-1].page_content == "Hey this is a fake attachment!"
+    assert docs[-1].metadata["filename"] == "fake-attachment.txt"
+    assert docs[-1].metadata["source"].endswith("fake-email-attachment.eml")
50
tests/integration_tests/examples/fake-email-attachment.eml
Normal file
@@ -0,0 +1,50 @@
+MIME-Version: 1.0
+Date: Fri, 23 Dec 2022 12:08:48 -0600
+Message-ID: <CAPgNNXSzLVJ-d1OCX_TjFgJU7ugtQrjFybPtAMmmYZzphxNFYg@mail.gmail.com>
+Subject: Fake email with attachment
+From: Mallori Harrell <mallori@unstructured.io>
+To: Mallori Harrell <mallori@unstructured.io>
+Content-Type: multipart/mixed; boundary="0000000000005d654405f082adb7"
+
+--0000000000005d654405f082adb7
+Content-Type: multipart/alternative; boundary="0000000000005d654205f082adb5"
+
+--0000000000005d654205f082adb5
+Content-Type: text/plain; charset="UTF-8"
+
+Hello!
+
+Here's the attachments!
+
+It includes:
+
+- Lots of whitespace
+- Little to no content
+- and is a quick read
+
+Best,
+
+Mallori
+
+--0000000000005d654205f082adb5
+Content-Type: text/html; charset="UTF-8"
+Content-Transfer-Encoding: quoted-printable
+
+<div dir=3D"ltr">Hello!=C2=A0<div><br></div><div>Here's the attachments=
+!</div><div><br></div><div>It includes:</div><div><ul><li style=3D"margin-l=
+eft:15px">Lots of whitespace</li><li style=3D"margin-left:15px">Little=C2=
+=A0to no content</li><li style=3D"margin-left:15px">and is a quick read</li=
+></ul><div>Best,</div></div><div><br></div><div>Mallori</div><div dir=3D"lt=
+r" class=3D"gmail_signature" data-smartmail=3D"gmail_signature"><div dir=3D=
+"ltr"><div><div><br></div></div></div></div></div>
+
+--0000000000005d654205f082adb5--
+--0000000000005d654405f082adb7
+Content-Type: text/plain; charset="US-ASCII"; name="fake-attachment.txt"
+Content-Disposition: attachment; filename="fake-attachment.txt"
+Content-Transfer-Encoding: base64
+X-Attachment-Id: f_lc0tto5j0
+Content-ID: <f_lc0tto5j0>
+
+SGV5IHRoaXMgaXMgYSBmYWtlIGF0dGFjaG1lbnQh
+--0000000000005d654405f082adb7--
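To see what the loader is expected to find in this fixture, the attachment part can be inspected with Python's standard `email` package. This is a minimal sketch that does not use `unstructured` or langchain. The `RAW_EMAIL` string is a trimmed copy of the fixture (the nested `multipart/alternative` body is collapsed to a single plain-text part), and `list_attachments` is an illustrative helper, not part of the commit.

```python
import email
from email import policy
from typing import List, Tuple

# Trimmed version of the fixture: one plain-text body part plus the
# base64-encoded attachment, copied from the .eml file above.
RAW_EMAIL = """\
MIME-Version: 1.0
Subject: Fake email with attachment
Content-Type: multipart/mixed; boundary="0000000000005d654405f082adb7"

--0000000000005d654405f082adb7
Content-Type: text/plain; charset="UTF-8"

Hello!

--0000000000005d654405f082adb7
Content-Type: text/plain; charset="US-ASCII"; name="fake-attachment.txt"
Content-Disposition: attachment; filename="fake-attachment.txt"
Content-Transfer-Encoding: base64

SGV5IHRoaXMgaXMgYSBmYWtlIGF0dGFjaG1lbnQh
--0000000000005d654405f082adb7--
"""


def list_attachments(raw: str) -> List[Tuple[str, str]]:
    """Return (filename, decoded text) for every attachment part."""
    msg = email.message_from_string(raw, policy=policy.default)
    found = []
    for part in msg.walk():
        # Parts without a Content-Disposition header return None here.
        if part.get_content_disposition() == "attachment":
            payload = part.get_payload(decode=True)  # undoes the base64 encoding
            found.append((part.get_filename(), payload.decode("ascii")))
    return found


attachments = list_attachments(RAW_EMAIL)
print(attachments)
```

The decoded text and filename are the same values the integration test asserts on for `docs[-1]`.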