# Unstructured

>The `unstructured` package from
[Unstructured.IO](https://www.unstructured.io/) extracts clean text from raw source documents like
PDFs and Word documents.
This page covers how to use the [`unstructured`](https://github.com/Unstructured-IO/unstructured)
ecosystem within LangChain.

## Installation and Setup

If you are using a loader that runs locally, use the following steps to install `unstructured` and
its dependencies.

- Install the Python SDK with `pip install unstructured`.
- You can install document-specific dependencies with extras, e.g. `pip install "unstructured[docx]"`.
- To install the dependencies for all document types, use `pip install "unstructured[all-docs]"`.
- Install the following system dependencies if they are not already available on your system.
  Depending on what document types you're parsing, you may not need all of these.
  - `libmagic-dev` (filetype detection)
  - `poppler-utils` (images and PDFs)
  - `tesseract-ocr` (images and PDFs)
  - `libreoffice` (MS Office docs)
  - `pandoc` (EPUBs)

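As a quick sanity check before running a local loader, the sketch below reports which of the command-line tools behind the packages above are on `PATH`. The executable names (`pdftotext` for `poppler-utils`, `tesseract`, `soffice` for `libreoffice`, `pandoc`) are assumptions about how these packages are typically installed, and the helper `missing_tools` is hypothetical. `libmagic-dev` is a library rather than an executable, so it is not covered.

```python
import shutil

# Assumed executable names for the system packages listed above.
# These are typical on Debian/Ubuntu but may differ on other systems.
TOOLS = {
    "poppler-utils": "pdftotext",
    "tesseract-ocr": "tesseract",
    "libreoffice": "soffice",
    "pandoc": "pandoc",
}


def missing_tools(tools=TOOLS, which=shutil.which):
    """Return the package names whose executable is not found on PATH."""
    return sorted(pkg for pkg, exe in tools.items() if which(exe) is None)


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All checked tools were found.")
```

A missing tool only matters if you parse the document types it supports.
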
If you want to get up and running with less setup, you can
simply run `pip install unstructured` and use `UnstructuredAPIFileLoader` or
`UnstructuredAPIFileIOLoader`. That will process your document using the hosted Unstructured API.

The `Unstructured API` requires an API key to make requests.
You can request an API key [here](https://unstructured.io/api-key-hosted) and start using it today!
Check out the README [here](https://github.com/Unstructured-IO/unstructured-api) to get started making API calls.
We'd love to hear your feedback. Let us know how it goes in our [community slack](https://join.slack.com/t/unstructuredw-kbe4326/shared_invite/zt-1x7cgo0pg-PTptXWylzPQF9xZolzCnwQ).
And stay tuned for improvements to both quality and performance!
Check out the instructions
[here](https://github.com/Unstructured-IO/unstructured-api#dizzy-instructions-for-using-the-docker-image) if you'd like to self-host the Unstructured API or run it locally.

## Data Loaders

The primary use of `unstructured` is in data loaders.

### UnstructuredAPIFileIOLoader

See a [usage example](/docs/integrations/document_loaders/unstructured_file#unstructured-api).

```python
from langchain_community.document_loaders import UnstructuredAPIFileIOLoader
```

### UnstructuredAPIFileLoader

See a [usage example](/docs/integrations/document_loaders/unstructured_file#unstructured-api).

```python
from langchain_community.document_loaders import UnstructuredAPIFileLoader
```

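A minimal sketch of sending one file to the hosted API might look like the following. The wrapper `load_with_hosted_api` and the `UNSTRUCTURED_API_KEY` environment variable name are assumptions made for illustration. The loader is imported inside the function so the key check runs before `langchain_community` is needed.

```python
import os


def load_with_hosted_api(file_path, api_key=None):
    """Load one file through the hosted Unstructured API and return Documents."""
    # UNSTRUCTURED_API_KEY is an assumed environment variable name.
    key = api_key or os.environ.get("UNSTRUCTURED_API_KEY", "")
    if not key:
        raise ValueError("An Unstructured API key is required.")

    # Imported lazily so the key check above works without the package.
    from langchain_community.document_loaders import UnstructuredAPIFileLoader

    loader = UnstructuredAPIFileLoader(file_path=file_path, api_key=key)
    return loader.load()


# Example call (hypothetical file name):
# docs = load_with_hosted_api("example.pdf")
```
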
### UnstructuredCHMLoader

`CHM` means `Microsoft Compiled HTML Help`.

See a usage example in the API documentation.

```python
from langchain_community.document_loaders import UnstructuredCHMLoader
```

### UnstructuredCSVLoader

A `comma-separated values` (`CSV`) file is a delimited text file that uses
a comma to separate values. Each line of the file is a data record.
Each record consists of one or more fields, separated by commas.

See a [usage example](/docs/integrations/document_loaders/csv#unstructuredcsvloader).

```python
from langchain_community.document_loaders import UnstructuredCSVLoader
```

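To make the record and field structure concrete, here is a small sketch that parses an in-memory CSV string with Python's standard `csv` module. It illustrates the file format only, not how `UnstructuredCSVLoader` parses files, and the sample data is made up.

```python
import csv
import io

# Made-up sample data: one header line followed by two records.
sample = "name,team\nAda,Engineering\nGrace,Research\n"

# Each line is a record; each record is split into fields at the commas.
rows = list(csv.reader(io.StringIO(sample)))

for row in rows:
    print(row)
```
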
### UnstructuredEmailLoader

See a [usage example](/docs/integrations/document_loaders/email).

```python
from langchain_community.document_loaders import UnstructuredEmailLoader
```

### UnstructuredEPubLoader

[EPUB](https://en.wikipedia.org/wiki/EPUB) is an `e-book file format` that uses
the “.epub” file extension. The term is short for electronic publication and
is sometimes styled `ePub`. `EPUB` is supported by many e-readers, and compatible
software is available for most smartphones, tablets, and computers.

See a [usage example](/docs/integrations/document_loaders/epub).

```python
from langchain_community.document_loaders import UnstructuredEPubLoader
```

### UnstructuredExcelLoader

See a [usage example](/docs/integrations/document_loaders/microsoft_excel).

```python
from langchain_community.document_loaders import UnstructuredExcelLoader
```

### UnstructuredFileIOLoader

See a [usage example](/docs/integrations/document_loaders/google_drive#passing-in-optional-file-loaders).

```python
from langchain_community.document_loaders import UnstructuredFileIOLoader
```

### UnstructuredFileLoader

See a [usage example](/docs/integrations/document_loaders/unstructured_file).

```python
from langchain_community.document_loaders import UnstructuredFileLoader
```

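As a rough sketch, loading a local file might look like this. The wrapper `load_local_file` is hypothetical. The `mode` values reflect the loader's `"single"`, `"elements"`, and `"paged"` options as found in `langchain_community` at the time of writing; treat them as an assumption and check the API reference for your installed version.

```python
# Modes accepted by the loader, as assumed from the langchain_community source.
VALID_MODES = {"single", "elements", "paged"}


def load_local_file(file_path, mode="single"):
    """Load a local file with UnstructuredFileLoader and return Documents."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}, got {mode!r}")

    # Imported lazily so the mode check works without the package installed.
    from langchain_community.document_loaders import UnstructuredFileLoader

    loader = UnstructuredFileLoader(file_path, mode=mode)
    return loader.load()


# Example call (hypothetical file name):
# docs = load_local_file("example.txt", mode="elements")
```

With `mode="single"` the file comes back as one document, while `mode="elements"` returns one document per extracted element.
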
### UnstructuredHTMLLoader

See a [usage example](/docs/modules/data_connection/document_loaders/html).

```python
from langchain_community.document_loaders import UnstructuredHTMLLoader
```

### UnstructuredImageLoader

See a [usage example](/docs/integrations/document_loaders/image).

```python
from langchain_community.document_loaders import UnstructuredImageLoader
```

### UnstructuredMarkdownLoader

See a [usage example](/docs/integrations/vectorstores/starrocks).

```python
from langchain_community.document_loaders import UnstructuredMarkdownLoader
```

### UnstructuredODTLoader

The `Open Document Format for Office Applications (ODF)`, also known as `OpenDocument`,
is an open file format for word processing documents, spreadsheets, presentations,
and graphics that uses ZIP-compressed XML files. It was developed with the aim of
providing an open, XML-based file format specification for office applications.

See a [usage example](/docs/integrations/document_loaders/odt).

```python
from langchain_community.document_loaders import UnstructuredODTLoader
```

### UnstructuredOrgModeLoader

[Org Mode](https://en.wikipedia.org/wiki/Org-mode) is a document editing, formatting, and organizing mode designed for notes, planning, and authoring within the free software text editor Emacs.

See a [usage example](/docs/integrations/document_loaders/org_mode).

```python
from langchain_community.document_loaders import UnstructuredOrgModeLoader
```

### UnstructuredPDFLoader

See a [usage example](/docs/modules/data_connection/document_loaders/pdf#using-unstructured).

```python
from langchain_community.document_loaders import UnstructuredPDFLoader
```

### UnstructuredPowerPointLoader

See a [usage example](/docs/integrations/document_loaders/microsoft_powerpoint).

```python
from langchain_community.document_loaders import UnstructuredPowerPointLoader
```

### UnstructuredRSTLoader

`reStructuredText` (`RST`) is a file format for textual data
used primarily in the Python programming language community for technical documentation.

See a [usage example](/docs/integrations/document_loaders/rst).

```python
from langchain_community.document_loaders import UnstructuredRSTLoader
```

### UnstructuredRTFLoader

See a usage example in the API documentation.

```python
from langchain_community.document_loaders import UnstructuredRTFLoader
```

### UnstructuredTSVLoader

A `tab-separated values` (`TSV`) file is a simple, text-based file format for storing tabular data.
Records are separated by newlines, and values within a record are separated by tab characters.

See a [usage example](/docs/integrations/document_loaders/tsv).

```python
from langchain_community.document_loaders import UnstructuredTSVLoader
```

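The newline and tab structure described above can be illustrated with Python's standard `csv` module, this time keyed by the header row. This is a sketch of the file format only, not of how `UnstructuredTSVLoader` works, and the sample data is made up.

```python
import csv
import io

# Made-up sample data; "\t" is the tab character separating values.
sample = "city\tcountry\nLima\tPeru\nOslo\tNorway\n"

# DictReader uses the first record as field names for the remaining records.
records = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))

for record in records:
    print(record["city"], record["country"])
```
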
### UnstructuredURLLoader

See a [usage example](/docs/integrations/document_loaders/url).

```python
from langchain_community.document_loaders import UnstructuredURLLoader
```

### UnstructuredWordDocumentLoader

See a [usage example](/docs/integrations/document_loaders/microsoft_word#using-unstructured).

```python
from langchain_community.document_loaders import UnstructuredWordDocumentLoader
```

### UnstructuredXMLLoader

See a [usage example](/docs/integrations/document_loaders/xml).

```python
from langchain_community.document_loaders import UnstructuredXMLLoader
```
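
With this many format-specific loaders, one possible way to choose between them is by file extension. The sketch below is illustrative only: the suffix-to-loader table and the helpers `loader_name_for` and `loader_class_for` are assumptions, not part of LangChain, and unknown extensions fall back to the general `UnstructuredFileLoader`.

```python
import importlib
from pathlib import Path

# Assumed mapping from file suffix to the loader classes listed on this page.
LOADER_BY_SUFFIX = {
    ".chm": "UnstructuredCHMLoader",
    ".csv": "UnstructuredCSVLoader",
    ".eml": "UnstructuredEmailLoader",
    ".epub": "UnstructuredEPubLoader",
    ".xlsx": "UnstructuredExcelLoader",
    ".html": "UnstructuredHTMLLoader",
    ".md": "UnstructuredMarkdownLoader",
    ".odt": "UnstructuredODTLoader",
    ".org": "UnstructuredOrgModeLoader",
    ".pdf": "UnstructuredPDFLoader",
    ".pptx": "UnstructuredPowerPointLoader",
    ".rst": "UnstructuredRSTLoader",
    ".rtf": "UnstructuredRTFLoader",
    ".tsv": "UnstructuredTSVLoader",
    ".docx": "UnstructuredWordDocumentLoader",
    ".xml": "UnstructuredXMLLoader",
}


def loader_name_for(path):
    """Return the loader class name for a path, based on its suffix."""
    suffix = Path(path).suffix.lower()
    return LOADER_BY_SUFFIX.get(suffix, "UnstructuredFileLoader")


def loader_class_for(path):
    """Import and return the loader class (requires langchain_community)."""
    module = importlib.import_module("langchain_community.document_loaders")
    return getattr(module, loader_name_for(path))


print(loader_name_for("report.pdf"))
```
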