[docs]: doc loader changes (#25417)

This commit is contained in:
Isaac Francisco 2024-08-14 19:46:33 -07:00 committed by GitHub
parent bd261456f6
commit 966b408634
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
10 changed files with 1064 additions and 22 deletions

View File

@ -182,7 +182,7 @@ pprint(data)
</CodeOutputBlock>
Another option is set `jq_schema='.'` and provide `content_key`:
Another option is to set `jq_schema='.'` and provide `content_key`:
```python
loader = JSONLoader(

View File

@ -0,0 +1,243 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# BSHTMLLoader\n",
"\n",
"\n",
"This notebook provides a quick overview for getting started with BeautifulSoup4 [document loader](https://python.langchain.com/v0.2/docs/concepts/#document-loaders). For detailed documentation of all __ModuleName__Loader features and configurations head to the [API reference](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.html_bs.BSHTMLLoader.html).\n",
"\n",
"\n",
"## Overview\n",
"### Integration details\n",
"\n",
"\n",
"| Class | Package | Local | Serializable | JS support|\n",
"| :--- | :--- | :---: | :---: | :---: |\n",
"| [BSHTMLLoader](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.html_bs.BSHTMLLoader.html) | [langchain_community](https://api.python.langchain.com/en/latest/community_api_reference.html) | ✅ | ❌ | ❌ | \n",
"### Loader features\n",
"| Source | Document Lazy Loading | Native Async Support\n",
"| :---: | :---: | :---: | \n",
"| BSHTMLLoader | ✅ | ❌ | \n",
"\n",
"## Setup\n",
"\n",
"To access BSHTMLLoader document loader you'll need to install the `langchain-community` integration package and the `bs4` python package.\n",
"\n",
"### Credentials\n",
"\n",
"No credentials are needed to use the `BSHTMLLoader` class."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to get automated best in-class tracing of your model calls you can also set your [LangSmith](https://docs.smith.langchain.com/) API key by uncommenting below:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# os.environ[\"LANGSMITH_API_KEY\"] = getpass.getpass(\"Enter your LangSmith API key: \")\n",
"# os.environ[\"LANGSMITH_TRACING\"] = \"true\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Installation\n",
"\n",
"Install **langchain_community** and **bs4**."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%pip install -qU langchain_community bs4"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initialization\n",
"\n",
"Now we can instantiate our model object and load documents:\n",
"\n",
"- TODO: Update model instantiation with relevant params."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"from langchain_community.document_loaders import BSHTMLLoader\n",
"\n",
"loader = BSHTMLLoader(\n",
" file_path=\"./example_data/fake-content.html\",\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Document(metadata={'source': './example_data/fake-content.html', 'title': 'Test Title'}, page_content='\\nTest Title\\n\\n\\nMy First Heading\\nMy first paragraph.\\n\\n\\n')"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs = loader.load()\n",
"docs[0]"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'source': './example_data/fake-content.html', 'title': 'Test Title'}\n"
]
}
],
"source": [
"print(docs[0].metadata)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Lazy Load"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Document(metadata={'source': './example_data/fake-content.html', 'title': 'Test Title'}, page_content='\\nTest Title\\n\\n\\nMy First Heading\\nMy first paragraph.\\n\\n\\n')"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"page = []\n",
"for doc in loader.lazy_load():\n",
" page.append(doc)\n",
" if len(page) >= 10:\n",
" # do some paged operation, e.g.\n",
" # index.upsert(page)\n",
"\n",
" page = []\n",
"page[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Adding separator to BS4\n",
"\n",
"We can also pass a separator to use when calling get_text on the soup"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"page_content='\n",
", Test Title, \n",
", \n",
", \n",
", My First Heading, \n",
", My first paragraph., \n",
", \n",
", \n",
"' metadata={'source': './example_data/fake-content.html', 'title': 'Test Title'}\n"
]
}
],
"source": [
"loader = BSHTMLLoader(\n",
" file_path=\"./example_data/fake-content.html\", get_text_separator=\", \"\n",
")\n",
"\n",
"docs = loader.load()\n",
"print(docs[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## API reference\n",
"\n",
"For detailed documentation of all BSHTMLLoader features and configurations head to the API reference: https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.html_bs.BSHTMLLoader.html"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

View File

@ -0,0 +1,55 @@
# Sample Markdown Document
## Introduction
Welcome to this sample Markdown document. Markdown is a lightweight markup language used for formatting text. It's widely used for documentation, readme files, and more.
## Features
### Headers
Markdown supports multiple levels of headers:
- **Header 1**: `# Header 1`
- **Header 2**: `## Header 2`
- **Header 3**: `### Header 3`
### Lists
#### Unordered List
- Item 1
- Item 2
- Subitem 2.1
- Subitem 2.2
#### Ordered List
1. First item
2. Second item
3. Third item
### Links
[OpenAI](https://www.openai.com) is an AI research organization.
### Images
Here's an example image:
![Sample Image](https://via.placeholder.com/150)
### Code
#### Inline Code
Use `code` for inline code snippets.
#### Code Block
```python
def greet(name):
return f"Hello, {name}!"
print(greet("World"))
```

View File

@ -30,6 +30,7 @@
{
"sender_name": "User 2",
"timestamp_ms": 1675595060730,
"content": "",
"photos": [
{"uri": "url_of_some_picture.jpg", "creation_timestamp": 1675595059}
]

View File

@ -21,24 +21,24 @@ loader = CSVLoader(
data = loader.load()
```
## Common File Types
The below document loaders allow you to load data from common data formats.
<CategoryTable category="common_loaders" />
## PDFs
The below document loaders allow you to load documents.
<CategoryTable category="pdf_loaders" />
## Webpages
The below document loaders allow you to load webpages.
<CategoryTable category="webpage_loaders" />
## PDFs
The below document loaders allow you to load PDF documents.
<CategoryTable category="pdf_loaders" />
## Common File Types
The below document loaders allow you to load data from common data formats.
<CategoryTable category="common_loaders" />
## All document loaders

View File

@ -0,0 +1,348 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# JSONLoader\n",
"\n",
"This notebook provides a quick overview for getting started with JSON [document loader](https://python.langchain.com/v0.2/docs/concepts/#document-loaders). For detailed documentation of all JSONLoader features and configurations head to the [API reference](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.json_loader.JSONLoader.html).\n",
"\n",
"- TODO: Add any other relevant links, like information about underlying API, etc.\n",
"\n",
"## Overview\n",
"### Integration details\n",
"\n",
"| Class | Package | Local | Serializable | [JS support](https://js.langchain.com/v0.2/docs/integrations/document_loaders/file_loaders/json/)|\n",
"| :--- | :--- | :---: | :---: | :---: |\n",
"| [JSONLoader](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.json_loader.JSONLoader.html) | [langchain_community](https://api.python.langchain.com/en/latest/community_api_reference.html) | ✅ | ❌ | ✅ | \n",
"### Loader features\n",
"| Source | Document Lazy Loading | Native Async Support\n",
"| :---: | :---: | :---: | \n",
"| JSONLoader | ✅ | ❌ | \n",
"\n",
"## Setup\n",
"\n",
"To access JSON document loader you'll need to install the `langchain-community` integration package as well as the ``jq`` python package.\n",
"\n",
"### Credentials\n",
"\n",
"No credentials are required to use the `JSONLoader` class."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to get automated best in-class tracing of your model calls you can also set your [LangSmith](https://docs.smith.langchain.com/) API key by uncommenting below:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# os.environ[\"LANGSMITH_API_KEY\"] = getpass.getpass(\"Enter your LangSmith API key: \")\n",
"# os.environ[\"LANGSMITH_TRACING\"] = \"true\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Installation\n",
"\n",
"Install **langchain_community** and **jq**:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%pip install -qU langchain_community jq "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initialization\n",
"\n",
"Now we can instantiate our model object and load documents:\n",
"\n",
"- TODO: Update model instantiation with relevant params."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from langchain_community.document_loaders import JSONLoader\n",
"\n",
"loader = JSONLoader(\n",
" file_path=\"./example_data/facebook_chat.json\",\n",
" jq_schema=\".messages[].content\",\n",
" text_content=False,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Document(metadata={'source': '/Users/isaachershenson/Documents/langchain/docs/docs/integrations/document_loaders/example_data/facebook_chat.json', 'seq_num': 1}, page_content='Bye!')"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs = loader.load()\n",
"docs[0]"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'source': '/Users/isaachershenson/Documents/langchain/docs/docs/integrations/document_loaders/example_data/facebook_chat.json', 'seq_num': 1}\n"
]
}
],
"source": [
"print(docs[0].metadata)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Lazy Load"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"pages = []\n",
"for doc in loader.lazy_load():\n",
" pages.append(doc)\n",
" if len(pages) >= 10:\n",
" # do some paged operation, e.g.\n",
" # index.upsert(pages)\n",
"\n",
" pages = []"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Read from JSON Lines file\n",
"\n",
"If you want to load documents from a JSON Lines file, you pass `json_lines=True`\n",
"and specify `jq_schema` to extract `page_content` from a single JSON object."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"page_content='Bye!' metadata={'source': '/Users/isaachershenson/Documents/langchain/docs/docs/integrations/document_loaders/example_data/facebook_chat_messages.jsonl', 'seq_num': 1}\n"
]
}
],
"source": [
"loader = JSONLoader(\n",
" file_path=\"./example_data/facebook_chat_messages.jsonl\",\n",
" jq_schema=\".content\",\n",
" text_content=False,\n",
" json_lines=True,\n",
")\n",
"\n",
"docs = loader.load()\n",
"print(docs[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Read specific content keys\n",
"\n",
"Another option is to set `jq_schema='.'` and provide a `content_key` in order to only load specific content:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"page_content='User 2' metadata={'source': '/Users/isaachershenson/Documents/langchain/docs/docs/integrations/document_loaders/example_data/facebook_chat_messages.jsonl', 'seq_num': 1}\n"
]
}
],
"source": [
"loader = JSONLoader(\n",
" file_path=\"./example_data/facebook_chat_messages.jsonl\",\n",
" jq_schema=\".\",\n",
" content_key=\"sender_name\",\n",
" json_lines=True,\n",
")\n",
"\n",
"docs = loader.load()\n",
"print(docs[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## JSON file with jq schema `content_key`\n",
"\n",
"To load documents from a JSON file using the `content_key` within the jq schema, set `is_content_key_jq_parsable=True`. Ensure that `content_key` is compatible and can be parsed using the jq schema."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"page_content='Bye!' metadata={'source': '/Users/isaachershenson/Documents/langchain/docs/docs/integrations/document_loaders/example_data/facebook_chat.json', 'seq_num': 1}\n"
]
}
],
"source": [
"loader = JSONLoader(\n",
" file_path=\"./example_data/facebook_chat.json\",\n",
" jq_schema=\".messages[]\",\n",
" content_key=\".content\",\n",
" is_content_key_jq_parsable=True,\n",
")\n",
"\n",
"docs = loader.load()\n",
"print(docs[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Extracting metadata\n",
"\n",
"Generally, we want to include metadata available in the JSON file into the documents that we create from the content.\n",
"\n",
"The following demonstrates how metadata can be extracted using the `JSONLoader`.\n",
"\n",
"There are some key changes to be noted. In the previous example where we didn't collect the metadata, we managed to directly specify in the schema where the value for the `page_content` can be extracted from.\n",
"\n",
"In this example, we have to tell the loader to iterate over the records in the `messages` field. The jq_schema then has to be `.messages[]`\n",
"\n",
"This allows us to pass the records (dict) into the `metadata_func` that has to be implemented. The `metadata_func` is responsible for identifying which pieces of information in the record should be included in the metadata stored in the final `Document` object.\n",
"\n",
"Additionally, we now have to explicitly specify in the loader, via the `content_key` argument, the key from the record where the value for the `page_content` needs to be extracted from."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'source': '/Users/isaachershenson/Documents/langchain/docs/docs/integrations/document_loaders/example_data/facebook_chat.json', 'seq_num': 1, 'sender_name': 'User 2', 'timestamp_ms': 1675597571851}\n"
]
}
],
"source": [
"# Define the metadata extraction function.\n",
"def metadata_func(record: dict, metadata: dict) -> dict:\n",
" metadata[\"sender_name\"] = record.get(\"sender_name\")\n",
" metadata[\"timestamp_ms\"] = record.get(\"timestamp_ms\")\n",
"\n",
" return metadata\n",
"\n",
"\n",
"loader = JSONLoader(\n",
" file_path=\"./example_data/facebook_chat.json\",\n",
" jq_schema=\".messages[]\",\n",
" content_key=\"content\",\n",
" metadata_func=metadata_func,\n",
")\n",
"\n",
"docs = loader.load()\n",
"print(docs[0].metadata)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## API reference\n",
"\n",
"For detailed documentation of all JSONLoader features and configurations head to the API reference: https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.json_loader.JSONLoader.html"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

View File

@ -0,0 +1,269 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# UnstructuredMarkdownLoader\n",
"\n",
"This notebook provides a quick overview for getting started with UnstructuredMarkdown [document loader](https://python.langchain.com/v0.2/docs/concepts/#document-loaders). For detailed documentation of all __ModuleName__Loader features and configurations head to the [API reference](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.markdown.UnstructuredMarkdownLoader.html).\n",
"\n",
"## Overview\n",
"### Integration details\n",
"\n",
"\n",
"| Class | Package | Local | Serializable | [JS support](https://js.langchain.com/v0.2/docs/integrations/document_loaders/file_loaders/unstructured/)|\n",
"| :--- | :--- | :---: | :---: | :---: |\n",
"| [UnstructuredMarkdownLoader](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.markdown.UnstructuredMarkdownLoader.html) | [langchain_community](https://api.python.langchain.com/en/latest/community_api_reference.html) | ❌ | ❌ | ✅ | \n",
"### Loader features\n",
"| Source | Document Lazy Loading | Native Async Support\n",
"| :---: | :---: | :---: | \n",
"| UnstructuredMarkdownLoader | ✅ | ❌ | \n",
"\n",
"## Setup\n",
"\n",
"To access UnstructuredMarkdownLoader document loader you'll need to install the `langchain-community` integration package and the `unstructured` python package.\n",
"\n",
"### Credentials\n",
"\n",
"No credentials are needed to use this loader."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to get automated best in-class tracing of your model calls you can also set your [LangSmith](https://docs.smith.langchain.com/) API key by uncommenting below:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# os.environ[\"LANGSMITH_API_KEY\"] = getpass.getpass(\"Enter your LangSmith API key: \")\n",
"# os.environ[\"LANGSMITH_TRACING\"] = \"true\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Installation\n",
"\n",
"Install **langchain_community** and **unstructured**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%pip install -qU langchain_community unstructured"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initialization\n",
"\n",
"Now we can instantiate our model object and load documents. \n",
"\n",
"You can run the loader in one of two modes: \"single\" and \"elements\". If you use \"single\" mode, the document will be returned as a single `Document` object. If you use \"elements\" mode, the unstructured library will split the document into elements such as `Title` and `NarrativeText`. You can pass in additional `unstructured` kwargs after mode to apply different `unstructured` settings."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"from langchain_community.document_loaders import UnstructuredMarkdownLoader\n",
"\n",
"loader = UnstructuredMarkdownLoader(\n",
" \"./example_data/example.md\",\n",
" mode=\"single\",\n",
" strategy=\"fast\",\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Document(metadata={'source': './example_data/example.md'}, page_content='Sample Markdown Document\\n\\nIntroduction\\n\\nWelcome to this sample Markdown document. Markdown is a lightweight markup language used for formatting text. It\\'s widely used for documentation, readme files, and more.\\n\\nFeatures\\n\\nHeaders\\n\\nMarkdown supports multiple levels of headers:\\n\\nHeader 1: # Header 1\\n\\nHeader 2: ## Header 2\\n\\nHeader 3: ### Header 3\\n\\nLists\\n\\nUnordered List\\n\\nItem 1\\n\\nItem 2\\n\\nSubitem 2.1\\n\\nSubitem 2.2\\n\\nOrdered List\\n\\nFirst item\\n\\nSecond item\\n\\nThird item\\n\\nLinks\\n\\nOpenAI is an AI research organization.\\n\\nImages\\n\\nHere\\'s an example image:\\n\\nCode\\n\\nInline Code\\n\\nUse code for inline code snippets.\\n\\nCode Block\\n\\n```python def greet(name): return f\"Hello, {name}!\"\\n\\nprint(greet(\"World\")) ```')"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs = loader.load()\n",
"docs[0]"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'source': './example_data/example.md'}\n"
]
}
],
"source": [
"print(docs[0].metadata)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Lazy Load"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Document(metadata={'source': './example_data/example.md', 'link_texts': ['OpenAI'], 'link_urls': ['https://www.openai.com'], 'last_modified': '2024-08-14T15:04:18', 'languages': ['eng'], 'parent_id': 'de1f74bf226224377ab4d8b54f215bb9', 'filetype': 'text/markdown', 'file_directory': './example_data', 'filename': 'example.md', 'category': 'NarrativeText', 'element_id': '898a542a261f7dc65e0072d1e847d535'}, page_content='OpenAI is an AI research organization.')"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"page = []\n",
"for doc in loader.lazy_load():\n",
" page.append(doc)\n",
" if len(page) >= 10:\n",
" # do some paged operation, e.g.\n",
" # index.upsert(page)\n",
"\n",
" page = []\n",
"page[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load Elements\n",
"\n",
"In this example we will load in the `elements` mode, which will return a list of the different elements in the markdown document:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"29"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain_community.document_loaders import UnstructuredMarkdownLoader\n",
"\n",
"loader = UnstructuredMarkdownLoader(\n",
" \"./example_data/example.md\",\n",
" mode=\"elements\",\n",
" strategy=\"fast\",\n",
")\n",
"\n",
"docs = loader.load()\n",
"len(docs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you see there are 29 elements that were pulled from the `example.md` file. The first element is the title of the document as expected:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Sample Markdown Document'"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs[0].page_content"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## API reference\n",
"\n",
"For detailed documentation of all UnstructuredMarkdownLoader features and configurations head to the API reference: https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.markdown.UnstructuredMarkdownLoader.html"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

View File

@ -540,6 +540,24 @@ const FEATURE_TABLES = {
source: "All file types",
apiLink: "https://api.python.langchain.com/en/latest/document_loaders/langchain_unstructured.document_loaders.UnstructuredLoader.html"
},
{
name: "JSONLoader",
link: "json",
source: "JSON files",
apiLink: "https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.json_loader.JSONLoader.html"
},
{
name: "UnstructuredMarkdownLoader",
link: "unstructured_markdown",
source: "Markdown files",
apiLink: "https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.markdown.UnstructuredMarkdownLoader.html"
},
{
name: "BSHTMLLoader",
link: "bshtml",
source: "HTML files",
apiLink: "https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.html_bs.BSHTMLLoader.html"
}
]
},
vectorstores: {

View File

@ -10,7 +10,74 @@ logger = logging.getLogger(__name__)
class BSHTMLLoader(BaseLoader):
"""Load `HTML` files and parse them with `beautiful soup`."""
"""
__ModuleName__ document loader integration
Setup:
Install ``langchain-community`` and ``bs4``.
.. code-block:: bash
pip install -U langchain-community bs4
Instantiate:
.. code-block:: python
from langchain_community.document_loaders import BSHTMLLoader
loader = BSHTMLLoader(
file_path="./example_data/fake-content.html",
)
Lazy load:
.. code-block:: python
docs = []
docs_lazy = loader.lazy_load()
# async variant:
# docs_lazy = await loader.alazy_load()
for doc in docs_lazy:
docs.append(doc)
print(docs[0].page_content[:100])
print(docs[0].metadata)
.. code-block:: python
Test Title
My First Heading
My first paragraph.
{'source': './example_data/fake-content.html', 'title': 'Test Title'}
Async load:
.. code-block:: python
docs = await loader.aload()
print(docs[0].page_content[:100])
print(docs[0].metadata)
.. code-block:: python
Test Title
My First Heading
My first paragraph.
{'source': './example_data/fake-content.html', 'title': 'Test Title'}
""" # noqa: E501
def __init__(
self,

View File

@ -13,19 +13,60 @@ class UnstructuredMarkdownLoader(UnstructuredFileLoader):
You can pass in additional unstructured kwargs after mode to apply
different unstructured settings.
Examples
--------
from langchain_community.document_loaders import UnstructuredMarkdownLoader
Setup:
Install ``langchain-community``.
loader = UnstructuredMarkdownLoader(
"example.md", mode="elements", strategy="fast",
)
docs = loader.load()
.. code-block:: bash
pip install -U langchain-community
Instantiate:
.. code-block:: python
from langchain_community.document_loaders import UnstructuredMarkdownLoader
loader = UnstructuredMarkdownLoader(
"./example_data/example.md",
mode="elements",
strategy="fast",
)
Lazy load:
.. code-block:: python
docs = []
docs_lazy = loader.lazy_load()
# async variant:
# docs_lazy = await loader.alazy_load()
for doc in docs_lazy:
docs.append(doc)
print(docs[0].page_content[:100])
print(docs[0].metadata)
.. code-block:: python
Sample Markdown Document
{'source': './example_data/example.md', 'category_depth': 0, 'last_modified': '2024-08-14T15:04:18', 'languages': ['eng'], 'filetype': 'text/markdown', 'file_directory': './example_data', 'filename': 'example.md', 'category': 'Title', 'element_id': '3d0b313864598e704aa26c728ecb61e5'}
Async load:
.. code-block:: python
docs = await loader.aload()
print(docs[0].page_content[:100])
print(docs[0].metadata)
.. code-block:: python
Sample Markdown Document
{'source': './example_data/example.md', 'category_depth': 0, 'last_modified': '2024-08-14T15:04:18', 'languages': ['eng'], 'filetype': 'text/markdown', 'file_directory': './example_data', 'filename': 'example.md', 'category': 'Title', 'element_id': '3d0b313864598e704aa26c728ecb61e5'}
References
----------
https://unstructured-io.github.io/unstructured/core/partition.html#partition-md
"""
""" # noqa: E501
def _get_elements(self) -> List:
from unstructured.__version__ import __version__ as __unstructured_version__