Harrison/msg files (#2375)

Co-authored-by: Sahil Masand <masand.sahil@gmail.com>
Co-authored-by: Sahil Masand <masands@cbh.com.au>
This commit is contained in:
Harrison Chase
2023-04-04 06:48:34 -07:00
committed by GitHub
parent 585f60a5aa
commit e90d007db3
6 changed files with 143 additions and 6 deletions

View File

@@ -7,7 +7,15 @@
"source": [
"# Email\n",
"\n",
"This notebook shows how to load email (`.eml`) files."
"This notebook shows how to load email (`.eml`) and Microsoft Outlook (`.msg`) files."
]
},
{
"cell_type": "markdown",
"id": "89caa348",
"metadata": {},
"source": [
"## Using Unstructured"
]
},
{
@@ -66,7 +74,7 @@
"id": "8bf50cba",
"metadata": {},
"source": [
"## Retain Elements\n",
"### Retain Elements\n",
"\n",
"Under the hood, Unstructured creates different \"elements\" for different chunks of text. By default we combine those together, but you can easily keep that separation by specifying `mode=\"elements\"`."
]
@@ -112,10 +120,69 @@
"data[0]"
]
},
{
"cell_type": "markdown",
"id": "6a074515",
"metadata": {},
"source": [
"## Using OutlookMessageLoader"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "1e7a8444",
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders import OutlookMessageLoader"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "77a055e6",
"metadata": {},
"outputs": [],
"source": [
"loader = OutlookMessageLoader('example_data/fake-email.msg')"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "789882de",
"metadata": {},
"outputs": [],
"source": [
"data = loader.load()"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "46aa0632",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Document(page_content='This is a test email to experiment with the MS Outlook MSG Extractor\\r\\n\\r\\n\\r\\n-- \\r\\n\\r\\n\\r\\nKind regards\\r\\n\\r\\n\\r\\n\\r\\n\\r\\nBrian Zhou\\r\\n\\r\\n', metadata={'subject': 'Test for TIF files', 'sender': 'Brian Zhou <brizhou@gmail.com>', 'date': 'Mon, 18 Nov 2013 16:26:24 +0800'})"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data[0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6a074515",
"id": "2b223ce2",
"metadata": {},
"outputs": [],
"source": []