
# Unstructured

>The `unstructured` package from
[Unstructured.IO](https://www.unstructured.io/) extracts clean text from raw source documents like
PDFs and Word documents.
This page covers how to use the [`unstructured`](https://github.com/Unstructured-IO/unstructured)
ecosystem within LangChain.

## Installation and Setup

If you are using a loader that runs locally, follow these steps to install `unstructured` and
its dependencies.

- Install the Python SDK with `pip install unstructured`.
    - You can install document-specific dependencies with extras, e.g. `pip install "unstructured[docx]"`.
    - To install the dependencies for all document types, use `pip install "unstructured[all-docs]"`.
- Install the following system dependencies if they are not already available on your system.
  Depending on what document types you're parsing, you may not need all of these.
    - `libmagic-dev` (filetype detection)
    - `poppler-utils` (images and PDFs)
    - `tesseract-ocr` (images and PDFs)
    - `libreoffice` (MS Office docs)
    - `pandoc` (EPUBs)
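
As a rough sanity check after installation, the sketch below reports whether the `unstructured` package and some of the system tools above appear to be available. The command names in `SYSTEM_COMMANDS` are assumptions about what each system package typically installs and may differ on your platform. `libmagic-dev` is left out because it is a library rather than a command.

```python
import importlib.util
import shutil

# Assumed mapping from system package to a command it typically installs.
# These command names are assumptions and may differ on your platform.
SYSTEM_COMMANDS = {
    "poppler-utils": "pdftotext",
    "tesseract-ocr": "tesseract",
    "libreoffice": "soffice",
    "pandoc": "pandoc",
}


def check_environment(commands=SYSTEM_COMMANDS):
    """Return a dict of dependency name -> True if it appears to be available."""
    report = {
        # True when the `unstructured` Python package can be imported.
        "unstructured": importlib.util.find_spec("unstructured") is not None
    }
    for package, command in commands.items():
        # shutil.which returns None when the command is not on PATH.
        report[package] = shutil.which(command) is not None
    return report


if __name__ == "__main__":
    for name, available in check_environment().items():
        print(f"{name}: {'found' if available else 'missing'}")
```

A missing entry is not necessarily a problem, since you only need the tools for the document types you parse.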

If you want to get up and running with less setup, you can
simply run `pip install unstructured` and use `UnstructuredAPIFileLoader` or
`UnstructuredAPIFileIOLoader`. These loaders process your document using the hosted Unstructured API.


The `Unstructured API` requires an API key to make requests.
You can request an API key [here](https://unstructured.io/api-key-hosted).
Check out the [README](https://github.com/Unstructured-IO/unstructured-api) to get started making API calls.
We'd love to hear your feedback. Let us know how it goes in our [community slack](https://join.slack.com/t/unstructuredw-kbe4326/shared_invite/zt-1x7cgo0pg-PTptXWylzPQF9xZolzCnwQ).
Stay tuned for improvements to both quality and performance.
If you'd like to self-host the Unstructured API or run it locally, check out the instructions
[here](https://github.com/Unstructured-IO/unstructured-api#dizzy-instructions-for-using-the-docker-image).
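
One way to keep the API key out of source code is to read it from the environment before constructing the loader. The sketch below is a minimal illustration, not the only way to configure the loader: the helper name `build_api_loader` and the environment variable name `UNSTRUCTURED_API_KEY` are assumptions made for this example.

```python
import os


def build_api_loader(file_path, env_var="UNSTRUCTURED_API_KEY"):
    """Create an UnstructuredAPIFileLoader with a key read from the environment.

    The helper and the default variable name are hypothetical conventions
    chosen for this example.
    """
    api_key = os.environ.get(env_var, "")
    if not api_key:
        raise RuntimeError(f"Set {env_var} before calling the hosted API.")
    # Imported lazily so the key check works even if the package is missing.
    from langchain_community.document_loaders import UnstructuredAPIFileLoader

    return UnstructuredAPIFileLoader(file_path=file_path, api_key=api_key)
```

With the key set, `build_api_loader("example.pdf").load()` would send the (hypothetical) file `example.pdf` to the hosted API and return the resulting documents.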


## Data Loaders

`Unstructured` is used primarily through data loaders.
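
Since there is one loader per document type, a small lookup can help pick a loader from a file name. The sketch below is only an illustration: the extension mapping and the helper `loader_name_for` are assumptions for this example, and the class names are the loaders listed on this page.

```python
from pathlib import Path

# Assumed extension-to-loader mapping; adjust it to the formats you handle.
LOADER_BY_EXTENSION = {
    ".chm": "UnstructuredCHMLoader",
    ".csv": "UnstructuredCSVLoader",
    ".eml": "UnstructuredEmailLoader",
    ".epub": "UnstructuredEPubLoader",
    ".xlsx": "UnstructuredExcelLoader",
    ".html": "UnstructuredHTMLLoader",
    ".png": "UnstructuredImageLoader",
    ".md": "UnstructuredMarkdownLoader",
    ".odt": "UnstructuredODTLoader",
    ".org": "UnstructuredOrgModeLoader",
    ".pdf": "UnstructuredPDFLoader",
    ".pptx": "UnstructuredPowerPointLoader",
    ".rst": "UnstructuredRSTLoader",
    ".rtf": "UnstructuredRTFLoader",
    ".tsv": "UnstructuredTSVLoader",
    ".docx": "UnstructuredWordDocumentLoader",
    ".xml": "UnstructuredXMLLoader",
}


def loader_name_for(path, default="UnstructuredFileLoader"):
    """Return the name of a loader class for `path`, based on its extension."""
    suffix = Path(path).suffix.lower()
    # Fall back to the generic file loader for unknown extensions.
    return LOADER_BY_EXTENSION.get(suffix, default)


print(loader_name_for("report.PDF"))
print(loader_name_for("notes.txt"))
```

The returned name can then be resolved to a class with `getattr` on the `langchain_community.document_loaders` module.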

### UnstructuredAPIFileIOLoader

See a [usage example](/docs/integrations/document_loaders/unstructured_file#unstructured-api).

```python
from langchain_community.document_loaders import UnstructuredAPIFileIOLoader
```

### UnstructuredAPIFileLoader

See a [usage example](/docs/integrations/document_loaders/unstructured_file#unstructured-api).

```python
from langchain_community.document_loaders import UnstructuredAPIFileLoader
```

### UnstructuredCHMLoader

`CHM` stands for `Microsoft Compiled HTML Help`.

See a usage example in the API documentation.

```python
from langchain_community.document_loaders import UnstructuredCHMLoader
```

### UnstructuredCSVLoader

A `comma-separated values` (`CSV`) file is a delimited text file that uses
a comma to separate values. Each line of the file is a data record.
Each record consists of one or more fields, separated by commas.

See a [usage example](/docs/integrations/document_loaders/csv#unstructuredcsvloader).

```python
from langchain_community.document_loaders import UnstructuredCSVLoader
```
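
To make the record and field structure described above concrete, here is a small sketch that parses CSV text with Python's standard `csv` module. It does not use the loader itself, and the sample data is made up for illustration.

```python
import csv
import io

# Made-up sample: one header line followed by two data records.
sample = "name,format\nreport,pdf\nnotes,docx\n"

# Each line becomes one record; each record is a list of fields.
records = list(csv.reader(io.StringIO(sample)))
header, rows = records[0], records[1:]

print(header)
for row in rows:
    # Pair each field with its column name from the header.
    print(dict(zip(header, row)))
```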

### UnstructuredEmailLoader

See a [usage example](/docs/integrations/document_loaders/email).

```python
from langchain_community.document_loaders import UnstructuredEmailLoader
```

### UnstructuredEPubLoader

[EPUB](https://en.wikipedia.org/wiki/EPUB) is an `e-book file format` that uses
the `.epub` file extension. The term is short for electronic publication and
is sometimes styled `ePub`. `EPUB` is supported by many e-readers, and compatible
software is available for most smartphones, tablets, and computers.

See a [usage example](/docs/integrations/document_loaders/epub).

```python
from langchain_community.document_loaders import UnstructuredEPubLoader
```

### UnstructuredExcelLoader

See a [usage example](/docs/integrations/document_loaders/microsoft_excel).

```python
from langchain_community.document_loaders import UnstructuredExcelLoader
```

### UnstructuredFileIOLoader

See a [usage example](/docs/integrations/document_loaders/google_drive#passing-in-optional-file-loaders).

```python
from langchain_community.document_loaders import UnstructuredFileIOLoader
```

### UnstructuredFileLoader

See a [usage example](/docs/integrations/document_loaders/unstructured_file).


```python
from langchain_community.document_loaders import UnstructuredFileLoader
```

### UnstructuredHTMLLoader

See a [usage example](/docs/modules/data_connection/document_loaders/html).

```python
from langchain_community.document_loaders import UnstructuredHTMLLoader
```

### UnstructuredImageLoader

See a [usage example](/docs/integrations/document_loaders/image).

```python
from langchain_community.document_loaders import UnstructuredImageLoader
```

### UnstructuredMarkdownLoader

See a [usage example](/docs/integrations/vectorstores/starrocks).

```python
from langchain_community.document_loaders import UnstructuredMarkdownLoader
```

### UnstructuredODTLoader

The `Open Document Format for Office Applications (ODF)`, also known as `OpenDocument`,
is an open file format for word processing documents, spreadsheets, presentations,
and graphics that uses ZIP-compressed XML files. It was developed with the aim of
providing an open, XML-based file format specification for office applications.

See a [usage example](/docs/integrations/document_loaders/odt).

```python
from langchain_community.document_loaders import UnstructuredODTLoader
```
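
The "ZIP-compressed XML" layout mentioned above can be illustrated with the standard library alone. The sketch below builds a toy archive in memory and reads it back. It is not a valid ODT file: real documents contain more members and use XML namespaces, and the simplified `content.xml` body here is an assumption made for illustration.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# Build a toy archive in memory. Real ODT files also carry a `mimetype`
# and a `content.xml` member, but with richer, namespaced XML.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("mimetype", "application/vnd.oasis.opendocument.text")
    archive.writestr("content.xml", "<document><p>Hello, ODF</p></document>")

# Reopen the archive, list its members, and parse the XML payload.
with zipfile.ZipFile(buffer) as archive:
    members = archive.namelist()
    root = ET.fromstring(archive.read("content.xml"))

paragraphs = [p.text for p in root.iter("p")]
print(members)
print(paragraphs)
```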

### UnstructuredOrgModeLoader

[Org Mode](https://en.wikipedia.org/wiki/Org-mode) is a document editing, formatting, and organizing mode designed for notes, planning, and authoring within the free software text editor Emacs.

See a [usage example](/docs/integrations/document_loaders/org_mode).

```python
from langchain_community.document_loaders import UnstructuredOrgModeLoader
```

### UnstructuredPDFLoader

See a [usage example](/docs/modules/data_connection/document_loaders/pdf#using-unstructured).

```python
from langchain_community.document_loaders import UnstructuredPDFLoader
```

### UnstructuredPowerPointLoader

See a [usage example](/docs/integrations/document_loaders/microsoft_powerpoint).

```python
from langchain_community.document_loaders import UnstructuredPowerPointLoader
```

### UnstructuredRSTLoader

`reStructuredText` (`RST`) is a file format for textual data
used primarily in the Python programming language community for technical documentation.

See a [usage example](/docs/integrations/document_loaders/rst).

```python
from langchain_community.document_loaders import UnstructuredRSTLoader
```

### UnstructuredRTFLoader

See a usage example in the API documentation.

```python
from langchain_community.document_loaders import UnstructuredRTFLoader
```

### UnstructuredTSVLoader

A `tab-separated values` (`TSV`) file is a simple, text-based file format for storing tabular data.
Records are separated by newlines, and values within a record are separated by tab characters.

See a [usage example](/docs/integrations/document_loaders/tsv).

```python
from langchain_community.document_loaders import UnstructuredTSVLoader
```
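
As a quick illustration of the format described above, the standard `csv` module can parse TSV text when given a tab delimiter. This sketch does not use the loader itself, and the sample data is made up. Note that a comma inside a value needs no special handling, because only tabs separate fields.

```python
import csv
import io

# Made-up sample: records are separated by newlines, values by tabs.
sample = "title\tauthors\nIntro\tDoe, Jane\n"

# Passing delimiter="\t" makes the reader split fields on tab characters.
rows = list(csv.reader(io.StringIO(sample), delimiter="\t"))

for row in rows:
    print(row)
```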

### UnstructuredURLLoader

See a [usage example](/docs/integrations/document_loaders/url).

```python
from langchain_community.document_loaders import UnstructuredURLLoader
```

### UnstructuredWordDocumentLoader

See a [usage example](/docs/integrations/document_loaders/microsoft_word#using-unstructured).

```python
from langchain_community.document_loaders import UnstructuredWordDocumentLoader
```

### UnstructuredXMLLoader

See a [usage example](/docs/integrations/document_loaders/xml).

```python
from langchain_community.document_loaders import UnstructuredXMLLoader
```