### Summary
Allows users to pass in `**unstructured_kwargs` to Unstructured document
loaders. Implemented with the `strategy` kwargs in mind, but will pass
in other kwargs like `include_page_breaks` as well. The two currently
supported strategies are `"hi_res"`, which is more accurate but takes
longer, and `"fast"`, which processes faster but with lower accuracy.
The `"hi_res"` strategy is the default. For PDFs, if `detectron2` is not
available and the user selects `"hi_res"`, the loader will fallback to
using the `"fast"` strategy.
### Testing
#### Make sure the `strategy` kwarg works
Run the following in iPython to verify that the `"fast"` strategy is
indeed faster.
```python
from langchain.document_loaders import UnstructuredFileLoader
loader = UnstructuredFileLoader("layout-parser-paper-fast.pdf", strategy="fast", mode="elements")
%timeit loader.load()
loader = UnstructuredFileLoader("layout-parser-paper-fast.pdf", mode="elements")
%timeit loader.load()
```
On my system I get:
```python
In [3]: from langchain.document_loaders import UnstructuredFileLoader
In [4]: loader = UnstructuredFileLoader("layout-parser-paper-fast.pdf", strategy="fast", mode="elements")
In [5]: %timeit loader.load()
247 ms ± 369 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [6]: loader = UnstructuredFileLoader("layout-parser-paper-fast.pdf", mode="elements")
In [7]: %timeit loader.load()
2.45 s ± 31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
#### Make sure older versions of `unstructured` still work
Run `pip install unstructured==0.5.3` and then verify the following runs
without error:
```python
from langchain.document_loaders import UnstructuredFileLoader
loader = UnstructuredFileLoader("layout-parser-paper-fast.pdf", mode="elements")
loader.load()
```
1.8 KiB
Unstructured
This page covers how to use the unstructured
ecosystem within LangChain. The unstructured package from
Unstructured.IO extracts clean text from raw source documents like
PDFs and Word documents.
This page is broken into two parts: installation and setup, and then references to specific
unstructured wrappers.
Installation and Setup
- Install the Python SDK with
pip install "unstructured[local-inference]" - Install the following system dependencies if they are not already available on your system.
Depending on what document types you're parsing, you may not need all of these.
libmagic-devpoppler-utilstesseract-ocrlibreoffice
- If you are parsing PDFs using the
"hi_res"strategy, run the following to install thedetectron2model, whichunstructureduses for layout detection:pip install "detectron2@git+https://github.com/facebookresearch/detectron2.git@v0.6#egg=detectron2"- If
detectron2is not installed,unstructuredwill fallback to processing PDFs using the"fast"strategy, which usespdfminerdirectly and doesn't requiredetectron2.
Wrappers
Data Loaders
The primary unstructured wrappers within langchain are data loaders. The following
shows how to use the most basic unstructured data loader. There are other file-specific
data loaders available in the langchain.document_loaders module.
from langchain.document_loaders import UnstructuredFileLoader
loader = UnstructuredFileLoader("state_of_the_union.txt")
loader.load()
If you instantiate the loader with UnstructuredFileLoader(mode="elements"), the loader
will track additional metadata like the page number and text type (i.e. title, narrative text)
when that information is available.