{ "cells": [ { "cell_type": "markdown", "id": "d836a98a-ad14-4bed-af76-e1877f7ef8a4", "metadata": {}, "source": [ "# How to load Markdown\n", "\n", "[Markdown](https://en.wikipedia.org/wiki/Markdown) is a lightweight markup language for creating formatted text using a plain-text editor.\n", "\n", "Here we cover how to load Markdown documents into LangChain [Document](https://python.langchain.com/api_reference/core/documents/langchain_core.documents.base.Document.html#langchain_core.documents.base.Document) objects that we can use downstream.\n", "\n", "We will cover:\n", "\n", "- Basic usage;\n", "- Parsing of Markdown into elements such as titles, list items, and text.\n", "\n", "LangChain implements an [UnstructuredMarkdownLoader](https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.markdown.UnstructuredMarkdownLoader.html), which requires the [Unstructured](https://unstructured-io.github.io/unstructured/) package. First, install it:" ] }, { "cell_type": "code", "execution_count": null, "id": "c8b147fb-6877-4f7a-b2ee-ee971c7bc662", "metadata": {}, "outputs": [], "source": [ "%pip install \"unstructured[md]\" nltk" ] }, { "cell_type": "markdown", "id": "ea8c41f8-a8dc-48cc-b78d-7b3e2427a34c", "metadata": {}, "source": [ "Basic usage loads a Markdown file into a single document. Here we demonstrate on LangChain's README:" ] }, { "cell_type": "code", "execution_count": 4, "id": "80c50cc4-7ce9-4418-81b9-29c52c7b3627", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "🦜️🔗 LangChain\n", "\n", "⚡ Build context-aware reasoning applications ⚡\n", "\n", "Looking for the JS/TS library? Check out LangChain.js.\n", "\n", "To help you ship LangChain apps to production faster, check out LangSmith. 
\n", "LangSmith is a unified developer platform for building,\n" ] } ], "source": [ "from langchain_community.document_loaders import UnstructuredMarkdownLoader\n", "from langchain_core.documents import Document\n", "\n", "markdown_path = \"../../../README.md\"\n", "loader = UnstructuredMarkdownLoader(markdown_path)\n", "\n", "# In the default mode, the whole file is returned as a single Document.\n", "data = loader.load()\n", "assert len(data) == 1\n", "assert isinstance(data[0], Document)\n", "readme_content = data[0].page_content\n", "print(readme_content[:250])" ] }, { "cell_type": "markdown", "id": "b7560a6e-ca5d-47e1-b176-a9c40e763ff3", "metadata": {}, "source": [ "## Retain Elements\n", "\n", "Under the hood, Unstructured creates different \"elements\" for different chunks of text. By default the loader combines them into a single document, but you can keep them separate by specifying `mode=\"elements\"`." ] }, { "cell_type": "code", "execution_count": 5, "id": "a986bbce-7fd3-41d1-bc47-49f9f57c7cd1", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of documents: 66\n", "\n", "page_content='🦜️🔗 LangChain' metadata={'source': '../../../README.md', 'category_depth': 0, 'last_modified': '2024-06-28T15:20:01', 'languages': ['eng'], 'filetype': 'text/markdown', 'file_directory': '../../..', 'filename': 'README.md', 'category': 'Title'}\n", "\n", "page_content='⚡ Build context-aware reasoning applications ⚡' metadata={'source': '../../../README.md', 'last_modified': '2024-06-28T15:20:01', 'languages': ['eng'], 'parent_id': '200b8a7d0dd03f66e4f13456566d2b3a', 'filetype': 'text/markdown', 'file_directory': '../../..', 'filename': 'README.md', 'category': 'NarrativeText'}\n", "\n" ] } ], "source": [ "# With mode=\"elements\", each Unstructured element becomes its own Document.\n", "loader = UnstructuredMarkdownLoader(markdown_path, mode=\"elements\")\n", "\n", "data = loader.load()\n", "print(f\"Number of documents: {len(data)}\\n\")\n", "\n", "# Show the first two elements along with their metadata.\n", "for document in data[:2]:\n", "    print(f\"{document}\\n\")" ] }, { "cell_type": "markdown", "id": "117dc6b0-9baa-44a2-9d1d-fc38ecf7a233", "metadata": {}, 
"source": [ "Note that in this case the loader recovers three distinct element types, recorded under each document's `category` metadata key:" ] }, { "cell_type": "code", "execution_count": 6, "id": "75abc139-3ded-4e8e-9f21-d0c8ec40fdfc", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'ListItem', 'NarrativeText', 'Title'}\n" ] } ], "source": [ "# Collect the distinct element categories found across all documents.\n", "print({document.metadata[\"category\"] for document in data})" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.5" } }, "nbformat": 4, "nbformat_minor": 5 }