mirror of
https://github.com/hwchase17/langchain.git
synced 2025-05-19 22:11:41 +00:00
In https://github.com/hwchase17/langchain/issues/1716 , it was identified that there were two .py files performing similar tasks. As a resolution, one of the files has been removed, as its purpose had already been fulfilled by the other file. Additionally, the init has been updated accordingly. Furthermore, the how_to_guides.rst file has been updated to include links to documentation that was previously missing. This was deemed necessary as the existing list on https://langchain.readthedocs.io/en/latest/modules/document_loaders/how_to_guides.html was incomplete, causing confusion for users who rely on the full list of documentation on the left sidebar of the website.
138 lines
2.8 KiB
Plaintext
138 lines
2.8 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "39af9ecd",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Word Documents\n",
|
|
"\n",
|
|
"This covers how to load Word documents into a document format that we can use downstream."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 1,
|
|
"id": "721c48aa",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from langchain.document_loaders import UnstructuredWordDocumentLoader"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 2,
|
|
"id": "9d3d0e35",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"loader = UnstructuredWordDocumentLoader(\"example_data/fake.docx\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 3,
|
|
"id": "06073f91",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"data = loader.load()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 4,
|
|
"id": "c9adc5cb",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"[Document(page_content='Lorem ipsum dolor sit amet.', lookup_str='', metadata={'source': 'fake.docx'}, lookup_index=0)]"
|
|
]
|
|
},
|
|
"execution_count": 4,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"data"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "525d6b67",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Retain Elements\n",
|
|
"\n",
|
|
"Under the hood, Unstructured creates different \"elements\" for different chunks of text. By default we combine those together, but you can easily keep that separation by specifying `mode=\"elements\"`."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 5,
|
|
"id": "064f9162",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"loader = UnstructuredWordDocumentLoader(\"example_data/fake.docx\", mode=\"elements\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 6,
|
|
"id": "abefbbdb",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"data = loader.load()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 7,
|
|
"id": "a547c534",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"Document(page_content='Lorem ipsum dolor sit amet.', lookup_str='', metadata={'source': 'fake.docx', 'filename': 'fake.docx', 'category': 'Title'}, lookup_index=0)"
|
|
]
|
|
},
|
|
"execution_count": 7,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"data[0]"
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3 (ipykernel)",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.8.13"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 5
|
|
}
|