mirror of
https://github.com/hwchase17/langchain.git
synced 2026-04-03 02:44:10 +00:00
163 lines
5.1 KiB
Plaintext
163 lines
5.1 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "d836a98a-ad14-4bed-af76-e1877f7ef8a4",
|
|
"metadata": {},
|
|
"source": [
|
|
"# How to load Markdown\n",
|
|
"\n",
|
|
"[Markdown](https://en.wikipedia.org/wiki/Markdown) is a lightweight markup language for creating formatted text using a plain-text editor.\n",
|
|
"\n",
|
|
"Here we cover how to load `Markdown` documents into LangChain [Document](https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html#langchain_core.documents.base.Document) objects that we can use downstream.\n",
|
|
"\n",
|
|
"We will cover:\n",
|
|
"\n",
|
|
"- Basic usage;\n",
|
|
"- Parsing of Markdown into elements such as titles, list items, and text.\n",
|
|
"\n",
|
|
"LangChain implements an [UnstructuredMarkdownLoader](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.markdown.UnstructuredMarkdownLoader.html) object which requires the [Unstructured](https://unstructured-io.github.io/unstructured/) package. First we install it:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 19,
|
|
"id": "c8b147fb-6877-4f7a-b2ee-ee971c7bc662",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# !pip install \"unstructured[md]\""
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "ea8c41f8-a8dc-48cc-b78d-7b3e2427a34c",
|
|
"metadata": {},
|
|
"source": [
|
|
"Basic usage will ingest a Markdown file to a single document. Here we demonstrate on LangChain's readme:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 1,
|
|
"id": "80c50cc4-7ce9-4418-81b9-29c52c7b3627",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"🦜️🔗 LangChain\n",
|
|
"\n",
|
|
"⚡ Build context-aware reasoning applications ⚡\n",
|
|
"\n",
|
|
"Looking for the JS/TS library? Check out LangChain.js.\n",
|
|
"\n",
|
|
"To help you ship LangChain apps to production faster, check out LangSmith. \n",
|
|
"LangSmith is a unified developer platform for building,\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"from langchain_community.document_loaders import UnstructuredMarkdownLoader\n",
|
|
"from langchain_core.documents import Document\n",
|
|
"\n",
|
|
"markdown_path = \"../../../../README.md\"\n",
|
|
"loader = UnstructuredMarkdownLoader(markdown_path)\n",
|
|
"\n",
|
|
"data = loader.load()\n",
|
|
"assert len(data) == 1\n",
|
|
"assert isinstance(data[0], Document)\n",
|
|
"readme_content = data[0].page_content\n",
|
|
"print(readme_content[:250])"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "b7560a6e-ca5d-47e1-b176-a9c40e763ff3",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Retain Elements\n",
|
|
"\n",
|
|
"Under the hood, Unstructured creates different \"elements\" for different chunks of text. By default we combine those together, but you can easily keep that separation by specifying `mode=\"elements\"`."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 2,
|
|
"id": "a986bbce-7fd3-41d1-bc47-49f9f57c7cd1",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Number of documents: 65\n",
|
|
"\n",
|
|
"page_content='🦜️🔗 LangChain' metadata={'source': '../../../../README.md', 'last_modified': '2024-04-29T13:40:19', 'page_number': 1, 'languages': ['eng'], 'filetype': 'text/markdown', 'file_directory': '../../../..', 'filename': 'README.md', 'category': 'Title'}\n",
|
|
"\n",
|
|
"page_content='⚡ Build context-aware reasoning applications ⚡' metadata={'source': '../../../../README.md', 'last_modified': '2024-04-29T13:40:19', 'page_number': 1, 'languages': ['eng'], 'parent_id': 'c3223b6f7100be08a78f1e8c0c28fde1', 'filetype': 'text/markdown', 'file_directory': '../../../..', 'filename': 'README.md', 'category': 'NarrativeText'}\n",
|
|
"\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"loader = UnstructuredMarkdownLoader(markdown_path, mode=\"elements\")\n",
|
|
"\n",
|
|
"data = loader.load()\n",
|
|
"print(f\"Number of documents: {len(data)}\\n\")\n",
|
|
"\n",
|
|
"for document in data[:2]:\n",
|
|
" print(f\"{document}\\n\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "117dc6b0-9baa-44a2-9d1d-fc38ecf7a233",
|
|
"metadata": {},
|
|
"source": [
|
|
"Note that in this case we recover three distinct element types:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 3,
|
|
"id": "75abc139-3ded-4e8e-9f21-d0c8ec40fdfc",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"{'Title', 'NarrativeText', 'ListItem'}\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"print(set(document.metadata[\"category\"] for document in data))"
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3 (ipykernel)",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.10.4"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 5
|
|
}
|