langchain/docs/versioned_docs/version-0.2.x/integrations/providers/unstructured.mdx
Jacob Lee aff771923a Jacob/new docs (#20570)
Use docusaurus versioning with a callout, merged master as well
2024-04-18 11:10:55 -07:00

# Unstructured
>The `unstructured` package from
[Unstructured.IO](https://www.unstructured.io/) extracts clean text from raw source documents like
PDFs and Word documents.
This page covers how to use the [`unstructured`](https://github.com/Unstructured-IO/unstructured)
ecosystem within LangChain.
## Installation and Setup
If you are using a loader that runs locally, follow these steps to set up `unstructured` and
its dependencies.
- Install the Python SDK with `pip install unstructured`.
- You can install document-specific dependencies with extras, e.g. `pip install "unstructured[docx]"`.
- To install the dependencies for all document types, use `pip install "unstructured[all-docs]"`.
- Install the following system dependencies if they are not already available on your system.
  Depending on what document types you're parsing, you may not need all of these.
    - `libmagic-dev` (filetype detection)
    - `poppler-utils` (images and PDFs)
    - `tesseract-ocr` (images and PDFs)
    - `libreoffice` (MS Office docs)
    - `pandoc` (EPUBs)
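
To see which of these system dependencies are already present, you can look for their command-line tools on `PATH`. The following is a rough sketch, not an official check: the executable names (`pdftotext`, `tesseract`, `soffice`, `pandoc`) are assumptions about how each package is usually exposed and may differ on your platform.

```python
import shutil
from ctypes.util import find_library

# Assumed executable names for each system package; adjust for your platform.
TOOL_FOR_PACKAGE = {
    "poppler-utils": "pdftotext",
    "tesseract-ocr": "tesseract",
    "libreoffice": "soffice",
    "pandoc": "pandoc",
}


def check_system_dependencies() -> dict:
    """Return a mapping of package name -> True if it looks installed."""
    status = {
        package: shutil.which(tool) is not None
        for package, tool in TOOL_FOR_PACKAGE.items()
    }
    # libmagic is a shared library rather than a command, so look it up by name.
    status["libmagic-dev"] = find_library("magic") is not None
    return status


if __name__ == "__main__":
    for package, found in check_system_dependencies().items():
        print(f"{package}: {'found' if found else 'missing'}")
```

A "missing" result only means the tool was not found under the assumed name, so treat it as a hint rather than proof.
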
If you want to get up and running with less setup, you can
simply run `pip install unstructured` and use `UnstructuredAPIFileLoader` or
`UnstructuredAPIFileIOLoader`. These loaders process your document using the hosted Unstructured API.
The Unstructured API requires an API key to make requests.
You can request an API key [here](https://unstructured.io/api-key-hosted) and start using it today!
Check out the README [here](https://github.com/Unstructured-IO/unstructured-api) to get started making API calls.
We'd love to hear your feedback. Let us know how it goes in our [community slack](https://join.slack.com/t/unstructuredw-kbe4326/shared_invite/zt-1x7cgo0pg-PTptXWylzPQF9xZolzCnwQ).
And stay tuned for improvements to both quality and performance!
Check out the instructions
[here](https://github.com/Unstructured-IO/unstructured-api#dizzy-instructions-for-using-the-docker-image) if you'd like to self-host the Unstructured API or run it locally.
## Data Loaders
The primary use of `unstructured` in LangChain is in data loaders.
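
Because each loader below targets one file type, a small dispatch table can pick a loader from a file's extension. This is a minimal sketch: the class names come from the sections below, but the extension mapping and the helper names `loader_name_for` and `make_loader` are illustrative assumptions, not part of LangChain.

```python
from pathlib import Path

# Loader class names are taken from this page; the extension mapping
# itself is an illustrative assumption, not part of LangChain.
LOADER_FOR_EXTENSION = {
    ".chm": "UnstructuredCHMLoader",
    ".csv": "UnstructuredCSVLoader",
    ".eml": "UnstructuredEmailLoader",
    ".epub": "UnstructuredEPubLoader",
    ".xlsx": "UnstructuredExcelLoader",
    ".html": "UnstructuredHTMLLoader",
    ".md": "UnstructuredMarkdownLoader",
    ".odt": "UnstructuredODTLoader",
    ".org": "UnstructuredOrgModeLoader",
    ".pdf": "UnstructuredPDFLoader",
    ".pptx": "UnstructuredPowerPointLoader",
    ".rst": "UnstructuredRSTLoader",
    ".rtf": "UnstructuredRTFLoader",
    ".tsv": "UnstructuredTSVLoader",
    ".docx": "UnstructuredWordDocumentLoader",
    ".xml": "UnstructuredXMLLoader",
}


def loader_name_for(path: str) -> str:
    """Pick a loader class name by extension, falling back to the generic loader."""
    suffix = Path(path).suffix.lower()
    return LOADER_FOR_EXTENSION.get(suffix, "UnstructuredFileLoader")


def make_loader(path: str, **kwargs):
    """Instantiate the chosen loader (requires `langchain-community`)."""
    import importlib

    module = importlib.import_module("langchain_community.document_loaders")
    return getattr(module, loader_name_for(path))(path, **kwargs)
```

Unknown extensions fall back to `UnstructuredFileLoader`, which leaves file type detection to `unstructured`.
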
### UnstructuredAPIFileIOLoader
See a [usage example](/docs/integrations/document_loaders/unstructured_file#unstructured-api).
```python
from langchain_community.document_loaders import UnstructuredAPIFileIOLoader
```
### UnstructuredAPIFileLoader
See a [usage example](/docs/integrations/document_loaders/unstructured_file#unstructured-api).
```python
from langchain_community.document_loaders import UnstructuredAPIFileLoader
```
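
As a rough sketch of how this loader might be used with the hosted API, the helper below reads the API key from an environment variable and returns the loaded documents. The variable name `UNSTRUCTURED_API_KEY` and the helper `load_via_api` are choices made for this example, not names required by LangChain or Unstructured.

```python
import os


def load_via_api(file_path: str):
    """Partition a file with the hosted Unstructured API and return Documents.

    Assumes the API key is stored in the UNSTRUCTURED_API_KEY environment
    variable (a naming choice made for this example).
    """
    api_key = os.environ.get("UNSTRUCTURED_API_KEY")
    if not api_key:
        raise RuntimeError("Set UNSTRUCTURED_API_KEY before calling the API.")

    # Imported lazily so this module can be imported without langchain-community.
    from langchain_community.document_loaders import UnstructuredAPIFileLoader

    loader = UnstructuredAPIFileLoader(file_path=file_path, api_key=api_key)
    return loader.load()
```

Calling `load_via_api("example.pdf")` sends the file to the hosted service, so it needs network access and a valid key.
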
### UnstructuredCHMLoader
`CHM` means `Microsoft Compiled HTML Help`.
See a usage example in the API documentation.
```python
from langchain_community.document_loaders import UnstructuredCHMLoader
```
### UnstructuredCSVLoader
A `comma-separated values` (`CSV`) file is a delimited text file that uses
a comma to separate values. Each line of the file is a data record.
Each record consists of one or more fields, separated by commas.
See a [usage example](/docs/integrations/document_loaders/csv#unstructuredcsvloader).
```python
from langchain_community.document_loaders import UnstructuredCSVLoader
```
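
To make the record and field structure described above concrete, here is a small sketch that parses an in-memory CSV string with Python's standard `csv` module. The sample data is made up for illustration and is independent of the loader.

```python
import csv
import io

# A small in-memory CSV document: one header line, then one record per line.
raw = "name,language\nAda,English\nBlaise,French\n"

records = list(csv.reader(io.StringIO(raw)))
header, rows = records[0], records[1:]

print(header)  # ['name', 'language']
print(rows)    # [['Ada', 'English'], ['Blaise', 'French']]
```
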
### UnstructuredEmailLoader
See a [usage example](/docs/integrations/document_loaders/email).
```python
from langchain_community.document_loaders import UnstructuredEmailLoader
```
### UnstructuredEPubLoader
[EPUB](https://en.wikipedia.org/wiki/EPUB) is an `e-book file format` that uses
the “.epub” file extension. The term is short for electronic publication and
is sometimes styled `ePub`. `EPUB` is supported by many e-readers, and compatible
software is available for most smartphones, tablets, and computers.
See a [usage example](/docs/integrations/document_loaders/epub).
```python
from langchain_community.document_loaders import UnstructuredEPubLoader
```
### UnstructuredExcelLoader
See a [usage example](/docs/integrations/document_loaders/microsoft_excel).
```python
from langchain_community.document_loaders import UnstructuredExcelLoader
```
### UnstructuredFileIOLoader
See a [usage example](/docs/integrations/document_loaders/google_drive#passing-in-optional-file-loaders).
```python
from langchain_community.document_loaders import UnstructuredFileIOLoader
```
### UnstructuredFileLoader
See a [usage example](/docs/integrations/document_loaders/unstructured_file).
```python
from langchain_community.document_loaders import UnstructuredFileLoader
```
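
The sketch below shows one way this loader might be used with `mode="elements"`, which keeps the individual elements detected by `unstructured` instead of merging them into one document. The helper names are hypothetical, and the `category` metadata key is an assumption about what element mode records, so verify it against your installed version.

```python
from collections import Counter


def load_elements(file_path: str):
    """Load a file with one Document per detected element."""
    # Lazy import: requires `langchain-community` and `unstructured`.
    from langchain_community.document_loaders import UnstructuredFileLoader

    loader = UnstructuredFileLoader(file_path, mode="elements")
    return loader.load()


def count_categories(docs) -> Counter:
    """Count documents by the `category` stored in their metadata."""
    return Counter(doc.metadata.get("category", "unknown") for doc in docs)
```

For example, `count_categories(load_elements("example.docx"))` gives a quick overview of how many titles, paragraphs, and other elements were found.
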
### UnstructuredHTMLLoader
See a [usage example](/docs/modules/data_connection/document_loaders/html).
```python
from langchain_community.document_loaders import UnstructuredHTMLLoader
```
### UnstructuredImageLoader
See a [usage example](/docs/integrations/document_loaders/image).
```python
from langchain_community.document_loaders import UnstructuredImageLoader
```
### UnstructuredMarkdownLoader
See a [usage example](/docs/integrations/vectorstores/starrocks).
```python
from langchain_community.document_loaders import UnstructuredMarkdownLoader
```
### UnstructuredODTLoader
The `Open Document Format for Office Applications (ODF)`, also known as `OpenDocument`,
is an open file format for word processing documents, spreadsheets, presentations
and graphics that uses ZIP-compressed XML files. It was developed with the aim of
providing an open, XML-based file format specification for office applications.
See a [usage example](/docs/integrations/document_loaders/odt).
```python
from langchain_community.document_loaders import UnstructuredODTLoader
```
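
To illustrate the "ZIP-compressed XML" layout mentioned above, the sketch below builds a tiny stand-in archive in memory and reads it back with the standard `zipfile` module. It only mimics the container layout and is not a complete, valid `.odt` document.

```python
import io
import zipfile

# Build a tiny stand-in archive in memory. It mimics the layout of an
# OpenDocument file but is NOT a complete, valid .odt document.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("mimetype", "application/vnd.oasis.opendocument.text")
    archive.writestr("content.xml", "<office:document-content/>")

# Reading it back shows the ZIP container and the XML member inside it.
with zipfile.ZipFile(buffer) as archive:
    names = archive.namelist()
    content = archive.read("content.xml").decode("utf-8")

print(names)    # ['mimetype', 'content.xml']
print(content)  # <office:document-content/>
```
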
### UnstructuredOrgModeLoader
[Org Mode](https://en.wikipedia.org/wiki/Org-mode) is a document editing, formatting, and organizing mode designed for notes, planning, and authoring within the free software text editor Emacs.
See a [usage example](/docs/integrations/document_loaders/org_mode).
```python
from langchain_community.document_loaders import UnstructuredOrgModeLoader
```
### UnstructuredPDFLoader
See a [usage example](/docs/modules/data_connection/document_loaders/pdf#using-unstructured).
```python
from langchain_community.document_loaders import UnstructuredPDFLoader
```
### UnstructuredPowerPointLoader
See a [usage example](/docs/integrations/document_loaders/microsoft_powerpoint).
```python
from langchain_community.document_loaders import UnstructuredPowerPointLoader
```
### UnstructuredRSTLoader
A `reStructuredText` (`RST`) file is a file format for textual data
used primarily in the Python programming language community for technical documentation.
See a [usage example](/docs/integrations/document_loaders/rst).
```python
from langchain_community.document_loaders import UnstructuredRSTLoader
```
### UnstructuredRTFLoader
See a usage example in the API documentation.
```python
from langchain_community.document_loaders import UnstructuredRTFLoader
```
### UnstructuredTSVLoader
A `tab-separated values` (`TSV`) file is a simple, text-based file format for storing tabular data.
Records are separated by newlines, and values within a record are separated by tab characters.
See a [usage example](/docs/integrations/document_loaders/tsv).
```python
from langchain_community.document_loaders import UnstructuredTSVLoader
```
### UnstructuredURLLoader
See a [usage example](/docs/integrations/document_loaders/url).
```python
from langchain_community.document_loaders import UnstructuredURLLoader
```
### UnstructuredWordDocumentLoader
See a [usage example](/docs/integrations/document_loaders/microsoft_word#using-unstructured).
```python
from langchain_community.document_loaders import UnstructuredWordDocumentLoader
```
### UnstructuredXMLLoader
See a [usage example](/docs/integrations/document_loaders/xml).
```python
from langchain_community.document_loaders import UnstructuredXMLLoader
```