mirror of
https://github.com/hwchase17/langchain.git
synced 2025-06-23 15:19:33 +00:00
docs: integrations/providers/unstructured update (#19892)

Updated a page with existing document loaders with links to examples. Fixed formatting of one example.

Co-authored-by: Erick Friis <erick@langchain.dev>
parent 1b7ed6071a
commit 69bf6262aa
@@ -7,7 +7,35 @@
"source": [
"# URL\n",
"\n",
"This covers how to load HTML documents from a list of URLs into a document format that we can use downstream."
"This example covers how to load `HTML` documents from a list of `URLs` into the `Document` format that we can use downstream."
]
},
{
"cell_type": "markdown",
"id": "5ccca101-b167-43bc-849e-9d456b16a123",
"metadata": {
"execution": {
"iopub.execute_input": "2024-04-02T00:13:43.279309Z",
"iopub.status.busy": "2024-04-02T00:13:43.278977Z",
"iopub.status.idle": "2024-04-02T00:13:43.282230Z",
"shell.execute_reply": "2024-04-02T00:13:43.281907Z",
"shell.execute_reply.started": "2024-04-02T00:13:43.279282Z"
}
},
"source": [
"## Unstructured URL Loader\n",
"\n",
"You have to install the `unstructured` library:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cb26084d-a2b0-4685-9ec4-346139ffe0fb",
"metadata": {},
"outputs": [],
"source": [
"!pip install -U unstructured"
]
},
{
@@ -67,15 +95,24 @@
"id": "f3afa135",
"metadata": {},
"source": [
"# Selenium URL Loader\n",
"## Selenium URL Loader\n",
"\n",
"This covers how to load HTML documents from a list of URLs using the `SeleniumURLLoader`.\n",
"\n",
"Using selenium allows us to load pages that require JavaScript to render.\n",
"Using `Selenium` allows us to load pages that require JavaScript to render.\n",
"\n",
"## Setup\n",
"\n",
"To use the `SeleniumURLLoader`, you will need to install `selenium` and `unstructured`.\n"
"To use the `SeleniumURLLoader`, you have to install `selenium` and `unstructured`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4d2b86cf-55c6-430d-bf31-45591a1aa25a",
"metadata": {},
"outputs": [],
"source": [
"!pip install -U selenium unstructured"
]
},
{
@@ -127,15 +164,25 @@
"id": "a2c1c79f",
"metadata": {},
"source": [
"# Playwright URL Loader\n",
"## Playwright URL Loader\n",
"\n",
"This covers how to load HTML documents from a list of URLs using the `PlaywrightURLLoader`.\n",
"\n",
"As in the Selenium case, Playwright allows us to load pages that need JavaScript to render.\n",
"[Playwright](https://playwright.dev/) enables reliable end-to-end testing for modern web apps.\n",
"\n",
"## Setup\n",
"As in the Selenium case, `Playwright` allows us to load and render the JavaScript pages.\n",
"\n",
"To use the `PlaywrightURLLoader`, you will need to install `playwright` and `unstructured`. Additionally, you will need to install the Playwright Chromium browser:"
"To use the `PlaywrightURLLoader`, you have to install `playwright` and `unstructured`. Additionally, you have to install the `Playwright Chromium` browser:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "017ba3d2-ccb0-4c24-a079-44a8e524b2fa",
"metadata": {},
"outputs": [],
"source": [
"!pip install -U playwright unstructured"
]
},
{
@@ -145,9 +192,6 @@
"metadata": {},
"outputs": [],
"source": [
"# Install playwright\n",
"%pip install --upgrade --quiet \"playwright\"\n",
"%pip install --upgrade --quiet \"unstructured\"\n",
"!playwright install"
]
},
@@ -211,7 +255,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
"version": "3.10.12"
}
},
"nbformat": 4,

@@ -27,7 +27,7 @@ simply run `pip install unstructured` and use `UnstructuredAPIFileLoader` or
`UnstructuredAPIFileIOLoader`. That will process your document using the hosted Unstructured API.

The Unstructured API requires API keys to make requests.
The `Unstructured API` requires API keys to make requests.
You can request an API key [here](https://unstructured.io/api-key-hosted) and start using it today!
Check out the README [here](https://github.com/Unstructured-IO/unstructured-api) to get started making API calls.
We'd love to hear your feedback; let us know how it goes in our [community slack](https://join.slack.com/t/unstructuredw-kbe4326/shared_invite/zt-1x7cgo0pg-PTptXWylzPQF9xZolzCnwQ).
@@ -35,21 +35,209 @@ And stay tuned for improvements to both quality and performance!
Check out the instructions
[here](https://github.com/Unstructured-IO/unstructured-api#dizzy-instructions-for-using-the-docker-image) if you'd like to self-host the Unstructured API or run it locally.
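
Since requests to the hosted API need an API key, one way to supply it is sketched below: read the key from an environment variable and pass it to the loader's `api_key` argument. The variable name `UNSTRUCTURED_API_KEY` and the helper names `get_api_key` and `build_api_loader` are assumptions made for this illustration, not names defined by the library.

```python
import os


def get_api_key(env_var: str = "UNSTRUCTURED_API_KEY") -> str:
    """Return the API key stored in `env_var`, or raise if it is unset or empty."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set {env_var} before calling the hosted Unstructured API.")
    return key


def build_api_loader(file_path: str):
    """Create a loader that sends `file_path` to the hosted Unstructured API."""
    # Imported lazily so get_api_key() works without langchain-community installed.
    from langchain_community.document_loaders import UnstructuredAPIFileLoader

    return UnstructuredAPIFileLoader(file_path, api_key=get_api_key())
```

With the packages installed and a valid key set, `build_api_loader("example.pdf").load()` should return the parsed documents; `example.pdf` is a placeholder file name.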

## Wrappers

### Data Loaders
## Data Loaders

The primary use of `Unstructured` is in data loaders.

### UnstructuredAPIFileIOLoader

See a [usage example](/docs/integrations/document_loaders/unstructured_file#unstructured-api).

```python
from langchain_community.document_loaders import UnstructuredAPIFileIOLoader
```

### UnstructuredAPIFileLoader

See a [usage example](/docs/integrations/document_loaders/unstructured_file#unstructured-api).

```python
from langchain_community.document_loaders import UnstructuredAPIFileLoader
```

### UnstructuredCHMLoader

`CHM` means `Microsoft Compiled HTML Help`.

See a usage example in the API documentation.

```python
from langchain_community.document_loaders import UnstructuredCHMLoader
```

### UnstructuredCSVLoader

A `comma-separated values` (`CSV`) file is a delimited text file that uses
a comma to separate values. Each line of the file is a data record.
Each record consists of one or more fields, separated by commas.

See a [usage example](/docs/integrations/document_loaders/csv#unstructuredcsvloader).

```python
from langchain_community.document_loaders import UnstructuredCSVLoader
```
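
To make the record and field structure concrete, here is a small sketch that parses CSV text with Python's standard `csv` module. It does not use `UnstructuredCSVLoader`; the sample data and the `parse_csv` helper name are made up for illustration.

```python
import csv
import io


def parse_csv(text: str) -> list[list[str]]:
    """Split CSV text into records (one per line) and fields (comma-separated)."""
    return list(csv.reader(io.StringIO(text)))


sample = "name,format\nreport,pdf\nnotes,txt\n"
records = parse_csv(sample)
print(records[0])  # the header record
print(len(records) - 1, "data records")
```

Quoted fields may themselves contain commas, which is why a CSV parser is usually preferable to splitting each line on `,`.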

### UnstructuredEmailLoader

See a [usage example](/docs/integrations/document_loaders/email).

```python
from langchain_community.document_loaders import UnstructuredEmailLoader
```

### UnstructuredEPubLoader

[EPUB](https://en.wikipedia.org/wiki/EPUB) is an `e-book file format` that uses
the “.epub” file extension. The term is short for electronic publication and
is sometimes styled `ePub`. `EPUB` is supported by many e-readers, and compatible
software is available for most smartphones, tablets, and computers.

See a [usage example](/docs/integrations/document_loaders/epub).

```python
from langchain_community.document_loaders import UnstructuredEPubLoader
```

### UnstructuredExcelLoader

See a [usage example](/docs/integrations/document_loaders/microsoft_excel).

```python
from langchain_community.document_loaders import UnstructuredExcelLoader
```

### UnstructuredFileIOLoader

See a [usage example](/docs/integrations/document_loaders/google_drive#passing-in-optional-file-loaders).

```python
from langchain_community.document_loaders import UnstructuredFileIOLoader
```

### UnstructuredFileLoader

See a [usage example](/docs/integrations/document_loaders/unstructured_file).

The primary `unstructured` wrappers within `langchain` are data loaders. The following
shows how to use the most basic unstructured data loader. There are other file-specific
data loaders available in the `langchain_community.document_loaders` module.

```python
from langchain_community.document_loaders import UnstructuredFileLoader

loader = UnstructuredFileLoader("state_of_the_union.txt")
loader.load()
```

If you instantiate the loader with `UnstructuredFileLoader(mode="elements")`, the loader
will track additional metadata like the page number and text type (e.g. title, narrative text)
when that information is available.
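
A minimal sketch of working with `mode="elements"` output follows. It assumes each returned document carries a `category` entry in its metadata (for example `Title` or `NarrativeText`) and, where available, a `page_number`; the helper names `load_elements` and `summarize_elements` are illustrative only.

```python
from collections import Counter


def load_elements(path: str):
    """Load `path` with one document per detected element."""
    # Lazy import: requires `langchain-community` and `unstructured`.
    from langchain_community.document_loaders import UnstructuredFileLoader

    return UnstructuredFileLoader(path, mode="elements").load()


def summarize_elements(metadatas) -> Counter:
    """Count elements per (page_number, category) pair from metadata dicts."""
    return Counter(
        (meta.get("page_number"), meta.get("category", "unknown"))
        for meta in metadatas
    )
```

For example, `summarize_elements(doc.metadata for doc in load_elements("state_of_the_union.txt"))` would tally how many elements of each type were found per page. Plain-text files typically have no page numbers, so that part of the key would be `None`.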

### UnstructuredHTMLLoader

See a [usage example](/docs/modules/data_connection/document_loaders/html).

```python
from langchain_community.document_loaders import UnstructuredHTMLLoader
```

### UnstructuredImageLoader

See a [usage example](/docs/integrations/document_loaders/image).

```python
from langchain_community.document_loaders import UnstructuredImageLoader
```

### UnstructuredMarkdownLoader

See a [usage example](/docs/integrations/vectorstores/starrocks).

```python
from langchain_community.document_loaders import UnstructuredMarkdownLoader
```

### UnstructuredODTLoader

The `Open Document Format for Office Applications (ODF)`, also known as `OpenDocument`,
is an open file format for word processing documents, spreadsheets, presentations,
and graphics that uses ZIP-compressed XML files. It was developed with the aim of
providing an open, XML-based file format specification for office applications.

See a [usage example](/docs/integrations/document_loaders/odt).

```python
from langchain_community.document_loaders import UnstructuredODTLoader
```
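
The ZIP-plus-XML structure can be checked with Python's standard `zipfile` module. The sketch below builds a deliberately tiny ODT-like archive in memory (a `mimetype` entry and a placeholder `content.xml`) and lists its members. It is not a valid ODF document, real files contain more entries, and the helper names are hypothetical.

```python
import io
import zipfile

ODT_MIMETYPE = "application/vnd.oasis.opendocument.text"


def build_minimal_odt() -> bytes:
    """Build an in-memory ZIP archive with the two entries this sketch inspects."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # The mimetype entry is written first and stored uncompressed.
        archive.writestr("mimetype", ODT_MIMETYPE, compress_type=zipfile.ZIP_STORED)
        # Placeholder XML only; a real content.xml follows the ODF schema.
        archive.writestr(
            "content.xml",
            "<?xml version='1.0' encoding='UTF-8'?><document/>",
            compress_type=zipfile.ZIP_DEFLATED,
        )
    return buffer.getvalue()


def list_members(data: bytes) -> list[str]:
    """Return the member names of a ZIP archive held in memory."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return archive.namelist()


odt_bytes = build_minimal_odt()
print(zipfile.is_zipfile(io.BytesIO(odt_bytes)))
print(list_members(odt_bytes))
```

Running `list_members` on the bytes of a real `.odt` file should show the same kind of layout, with XML members such as `content.xml` alongside the `mimetype` entry.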

### UnstructuredOrgModeLoader

[Org Mode](https://en.wikipedia.org/wiki/Org-mode) is a document editing, formatting, and organizing mode, designed for notes, planning, and authoring within the free software text editor Emacs.

See a [usage example](/docs/integrations/document_loaders/org_mode).

```python
from langchain_community.document_loaders import UnstructuredOrgModeLoader
```

### UnstructuredPDFLoader

See a [usage example](/docs/modules/data_connection/document_loaders/pdf#using-unstructured).

```python
from langchain_community.document_loaders import UnstructuredPDFLoader
```

### UnstructuredPowerPointLoader

See a [usage example](/docs/integrations/document_loaders/microsoft_powerpoint).

```python
from langchain_community.document_loaders import UnstructuredPowerPointLoader
```

### UnstructuredRSTLoader

`reStructuredText` (`RST`) is a file format for textual data
used primarily in the Python programming language community for technical documentation.

See a [usage example](/docs/integrations/document_loaders/rst).

```python
from langchain_community.document_loaders import UnstructuredRSTLoader
```

### UnstructuredRTFLoader

See a usage example in the API documentation.

```python
from langchain_community.document_loaders import UnstructuredRTFLoader
```

### UnstructuredTSVLoader

A `tab-separated values` (`TSV`) file is a simple, text-based file format for storing tabular data.
Records are separated by newlines, and values within a record are separated by tab characters.

See a [usage example](/docs/integrations/document_loaders/tsv).

```python
from langchain_community.document_loaders import UnstructuredTSVLoader
```
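
As a quick illustration of the TSV layout, the sketch below reads tab-separated text into one dictionary per record using the standard `csv` module with `delimiter="\t"`. It is independent of `UnstructuredTSVLoader`; the sample data and the `parse_tsv` helper name are invented for this example.

```python
import csv
import io


def parse_tsv(text: str) -> list[dict[str, str]]:
    """Read TSV text into one dict per record, keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [dict(row) for row in reader]


sample = "loader\tformat\nUnstructuredTSVLoader\ttsv\n"
rows = parse_tsv(sample)
print(rows)
```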

### UnstructuredURLLoader

See a [usage example](/docs/integrations/document_loaders/url).

```python
from langchain_community.document_loaders import UnstructuredURLLoader
```

### UnstructuredWordDocumentLoader

See a [usage example](/docs/integrations/document_loaders/microsoft_word#using-unstructured).

```python
from langchain_community.document_loaders import UnstructuredWordDocumentLoader
```

### UnstructuredXMLLoader

See a [usage example](/docs/integrations/document_loaders/xml).

```python
from langchain_community.document_loaders import UnstructuredXMLLoader
```