feat: batch multiple files in a single Unstructured API request (#4525)

### Submit Multiple Files to the Unstructured API

Enables batching multiple files into a single Unstructured API request.
Support for requests with multiple files was added to both
`UnstructuredAPIFileLoader` and `UnstructuredAPIFileIOLoader`. Note that
if you submit multiple files in "single" mode, the result will be
concatenated into a single document. We recommend using this feature in
"elements" mode.

### Testing

The following should load both documents, using two of the example docs
from the integration tests folder.

```python
from langchain.document_loaders import UnstructuredAPIFileLoader

# Two of the example docs from the integration tests folder.
file_paths = ["examples/layout-parser-paper.pdf", "examples/whatsapp_chat.txt"]

# `file_path` accepts either a single path or a list of paths.
loader = UnstructuredAPIFileLoader(
    file_path=file_paths,
    api_key="FAKE_API_KEY",
    strategy="fast",
    mode="elements",
)
docs = loader.load()
```
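
As a rough illustration of why "elements" mode is recommended for batches, the sketch below mimics how the two modes shape the result. It does not call the Unstructured API: `Doc` and `to_docs` are hypothetical stand-ins, the per-element `source` field is an assumption, and the `"\n\n"` separator is inferred from the notebook output in this commit.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in for a loaded document (not the langchain class)."""
    page_content: str
    metadata: Dict[str, object] = field(default_factory=dict)


def to_docs(elements: List[Dict[str, str]], sources: List[str], mode: str) -> List[Doc]:
    """Illustrative only: shape parsed elements the way each mode is described."""
    if mode == "elements":
        # One document per element; assumes each element records its own file.
        return [Doc(el["text"], {"source": el["source"]}) for el in elements]
    if mode == "single":
        # All text is concatenated into one document that lists every source.
        text = "\n\n".join(el["text"] for el in elements)
        return [Doc(text, {"source": sources})]
    raise ValueError(f"unsupported mode: {mode!r}")


sources = ["example_data/fake.docx", "example_data/fake-email.eml"]
elements = [
    {"text": "Lorem ipsum dolor sit amet.", "source": sources[0]},
    {"text": "This is a test email to use for unit tests.", "source": sources[1]},
]

single_docs = to_docs(elements, sources, "single")
element_docs = to_docs(elements, sources, "elements")
```

In "single" mode the file boundary is lost in `page_content`, whereas in "elements" mode each piece of text can still be traced back to one file.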
Matt Robinson
2023-05-21 23:48:20 -04:00
committed by GitHub
parent 0c3de0a0b3
commit bf3f554357
5 changed files with 259 additions and 28 deletions

@@ -287,10 +287,118 @@
"docs[:5]"
]
},
{
"cell_type": "markdown",
"id": "b066cb5a",
"metadata": {},
"source": [
"## Unstructured API\n",
"\n",
"If you want to get up and running with less setup, you can simply run `pip install unstructured` and use `UnstructuredAPIFileLoader` or `UnstructuredAPIFileIOLoader`. That will process your document using the hosted Unstructured API. Note that currently (as of 11 May 2023) the Unstructured API is open, but it will soon require an API key. The [Unstructured documentation](https://unstructured-io.github.io/) page will have instructions on how to generate an API key once they're available. Check out the instructions [here](https://github.com/Unstructured-IO/unstructured-api#dizzy-instructions-for-using-the-docker-image) if you'd like to self-host the Unstructured API or run it locally."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "b50c70bc",
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders import UnstructuredAPIFileLoader"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "12b6d2cf",
"metadata": {},
"outputs": [],
"source": [
"filenames = [\"example_data/fake.docx\", \"example_data/fake-email.eml\"]"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "39a9894d",
"metadata": {},
"outputs": [],
"source": [
"loader = UnstructuredAPIFileLoader(\n",
" file_path=filenames[0],\n",
" api_key=\"FAKE_API_KEY\",\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "386eb63c",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Document(page_content='Lorem ipsum dolor sit amet.', metadata={'source': 'example_data/fake.docx'})"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs = loader.load()\n",
"docs[0]"
]
},
{
"cell_type": "markdown",
"id": "94158999",
"metadata": {},
"source": [
"You can also batch multiple files through the Unstructured API in a single API call using `UnstructuredAPIFileLoader`."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "79a18e7e",
"metadata": {},
"outputs": [],
"source": [
"loader = UnstructuredAPIFileLoader(\n",
" file_path=filenames,\n",
" api_key=\"FAKE_API_KEY\",\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "a3d7c846",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Document(page_content='Lorem ipsum dolor sit amet.\\n\\nThis is a test email to use for unit tests.\\n\\nImportant points:\\n\\nRoses are red\\n\\nViolets are blue', metadata={'source': ['example_data/fake.docx', 'example_data/fake-email.eml']})"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs = loader.load()\n",
"docs[0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f52b04cb",
"id": "0e510495",
"metadata": {},
"outputs": [],
"source": []
@@ -312,7 +420,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
"version": "3.8.13"
}
},
"nbformat": 4,