docs: add langchain-pull-md Markdown loader (#29024)

- [x] **PR title**: "docs: add langchain-pull-md Markdown loader"

- [x] **PR message**: 
- **Description:** This PR introduces the `langchain-pull-md` package to
the LangChain community. It includes a new document loader that utilizes
the pull.md service to convert URLs into Markdown format, particularly
useful for handling web pages rendered with JavaScript frameworks like
React, Angular, or Vue.js. This loader helps in efficient and reliable
Markdown conversion directly from URLs without local rendering, reducing
server load.
    - **Issue:** NA
    - **Dependencies:** requests >=2.25.1
    - **Twitter handle:** https://x.com/eugeneevstafev?s=21

- [x] **Add tests and docs**: 
1. Added unit tests to verify URL checking and conversion
functionalities.
2. Created a comprehensive example notebook detailing the usage of the
new loader.

- [x] **Lint and test**: 
- Completed local testing using `make format`, `make lint`, and `make
test` commands as per the LangChain contribution guidelines.


**Related Links:**
- [Package Repository](https://github.com/chigwell/langchain-pull-md)
- [PyPI Package](https://pypi.org/project/langchain-pull-md/)

---------

Co-authored-by: Erick Friis <erick@langchain.dev>
This commit is contained in:
Eugene Evstafiev 2025-01-06 19:32:43 +00:00 committed by GitHub
parent 20a715a103
commit 6a152ce245
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
3 changed files with 186 additions and 0 deletions

View File

@ -0,0 +1,140 @@
{
"cells": [
{
"cell_type": "raw",
"metadata": {},
"source": [
"---\n",
"sidebar_label: PullMdLoader\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# PullMdLoader\n",
"\n",
"Loader for converting URLs into Markdown using the pull.md service.\n",
"\n",
"This package implements a [document loader](/docs/concepts/document_loaders/) for web content. Unlike traditional web scrapers, PullMdLoader can handle web pages built with dynamic JavaScript frameworks like React, Angular, or Vue.js, converting them into Markdown without local rendering.\n",
"\n",
"## Overview\n",
"### Integration details\n",
"\n",
"| Class | Package | Local | Serializable | JS Support |\n",
"| :--- | :--- | :---: | :---: | :---: |\n",
"| PullMdLoader | langchain-pull-md | ✅ | ✅ | ❌ |\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup\n",
"\n",
"### Installation\n",
"\n",
"```bash\n",
"pip install langchain-pull-md\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Initialization"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"from langchain_pull_md.markdown_loader import PullMdLoader\n",
"\n",
"# Instantiate the loader with a URL\n",
"loader = PullMdLoader(url=\"https://example.com\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"documents = loader.load()"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'source': 'https://example.com',\n",
" 'page_content': '# Example Domain\\nThis domain is used for illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.'}"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"documents[0].metadata"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Lazy Load\n",
"\n",
"No lazy loading is implemented. `PullMdLoader` performs a real-time conversion of the provided URL into Markdown format whenever the `load` method is called."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## API reference:\n",
"\n",
"- [GitHub](https://github.com/chigwell/langchain-pull-md)\n",
"- [PyPi](https://pypi.org/project/langchain-pull-md/)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@ -0,0 +1,42 @@
# PullMd Loader
>[PullMd](https://pull.md/) is a service that converts web pages into Markdown format. The `langchain-pull-md` package utilizes this service to convert URLs, especially those rendered with JavaScript frameworks like React, Angular, or Vue.js, into Markdown without the need for local rendering.
## Installation and Setup
To get started with `langchain-pull-md`, you need to install the package via pip:
```bash
pip install langchain-pull-md
```
See the [usage example](/docs/integrations/document_loaders/pull_md) for detailed integration and usage instructions.
## Document Loader
The `PullMdLoader` class in `langchain-pull-md` provides an easy way to convert URLs to Markdown. It's particularly useful for loading content from modern web applications for use within LangChain's processing capabilities.
```python
from langchain_pull_md import PullMdLoader
# Initialize the loader with a URL of a JavaScript-rendered webpage
loader = PullMdLoader(url='https://example.com')
# Load the content as a Document
documents = loader.load()
# Access the Markdown content
for document in documents:
print(document.page_content)
```
This loader supports any URL and is particularly adept at handling sites built with dynamic JavaScript, making it a versatile tool for markdown extraction in data processing workflows.
## API Reference
For a comprehensive guide to all available functions and their parameters, visit the [API reference](https://github.com/chigwell/langchain-pull-md).
## Additional Resources
- [GitHub Repository](https://github.com/chigwell/langchain-pull-md)
- [PyPi Package](https://pypi.org/project/langchain-pull-md/)

View File

@ -320,4 +320,8 @@ packages:
- name: langchain-dappier
path: .
repo: DappierAI/langchain-dappier
downloads: 0
- name: langchain-pull-md
path: .
repo: chigwell/langchain-pull-md
downloads: 0