mirror of
https://github.com/hwchase17/langchain.git
synced 2025-06-22 14:49:29 +00:00
docs: add langchain-pull-md Markdown loader (#29024)
- [x] **PR title**: "docs: add langchain-pull-md Markdown loader"
- [x] **PR message**:
  - **Description:** This PR introduces the `langchain-pull-md` package to the LangChain community. It includes a new document loader that utilizes the pull.md service to convert URLs into Markdown format, particularly useful for handling web pages rendered with JavaScript frameworks like React, Angular, or Vue.js. This loader helps in efficient and reliable Markdown conversion directly from URLs without local rendering, reducing server load.
  - **Issue:** NA
  - **Dependencies:** requests >=2.25.1
  - **Twitter handle:** https://x.com/eugeneevstafev?s=21
- [x] **Add tests and docs**:
  1. Added unit tests to verify URL checking and conversion functionalities.
  2. Created a comprehensive example notebook detailing the usage of the new loader.
- [x] **Lint and test**:
  - Completed local testing using `make format`, `make lint`, and `make test` commands as per the LangChain contribution guidelines.

**Related Links:**
- [Package Repository](https://github.com/chigwell/langchain-pull-md)
- [PyPI Package](https://pypi.org/project/langchain-pull-md/)

---------

Co-authored-by: Erick Friis <erick@langchain.dev>
This commit is contained in:
parent
20a715a103
commit
6a152ce245
140
docs/docs/integrations/document_loaders/pull_md.ipynb
Normal file
@ -0,0 +1,140 @@
{
 "cells": [
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "---\n",
    "sidebar_label: PullMdLoader\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# PullMdLoader\n",
    "\n",
    "Loader for converting URLs into Markdown using the pull.md service.\n",
    "\n",
    "This package implements a [document loader](/docs/concepts/document_loaders/) for web content. Unlike traditional web scrapers, PullMdLoader can handle web pages built with dynamic JavaScript frameworks like React, Angular, or Vue.js, converting them into Markdown without local rendering.\n",
    "\n",
    "## Overview\n",
    "### Integration details\n",
    "\n",
    "| Class | Package | Local | Serializable | JS Support |\n",
    "| :--- | :--- | :---: | :---: | :---: |\n",
    "| PullMdLoader | langchain-pull-md | ✅ | ✅ | ❌ |\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup\n",
    "\n",
    "### Installation\n",
    "\n",
    "```bash\n",
    "pip install langchain-pull-md\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Initialization"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_pull_md.markdown_loader import PullMdLoader\n",
    "\n",
    "# Instantiate the loader with a URL\n",
    "loader = PullMdLoader(url=\"https://example.com\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "documents = loader.load()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'source': 'https://example.com',\n",
       " 'page_content': '# Example Domain\\nThis domain is used for illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.'}"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "documents[0].metadata"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Lazy Load\n",
    "\n",
    "No lazy loading is implemented. `PullMdLoader` performs a real-time conversion of the provided URL into Markdown format whenever the `load` method is called."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## API reference\n",
    "\n",
    "- [GitHub](https://github.com/chigwell/langchain-pull-md)\n",
    "- [PyPI](https://pypi.org/project/langchain-pull-md/)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
42
docs/docs/integrations/providers/pull-md.mdx
Normal file
@ -0,0 +1,42 @@
# PullMd Loader

>[PullMd](https://pull.md/) is a service that converts web pages into Markdown format. The `langchain-pull-md` package utilizes this service to convert URLs, especially those rendered with JavaScript frameworks like React, Angular, or Vue.js, into Markdown without the need for local rendering.

## Installation and Setup

To get started with `langchain-pull-md`, you need to install the package via pip:

```bash
pip install langchain-pull-md
```

See the [usage example](/docs/integrations/document_loaders/pull_md) for detailed integration and usage instructions.

## Document Loader

The `PullMdLoader` class in `langchain-pull-md` provides an easy way to convert URLs to Markdown. It's particularly useful for loading content from modern web applications for use within LangChain's processing capabilities.

```python
from langchain_pull_md.markdown_loader import PullMdLoader

# Initialize the loader with a URL of a JavaScript-rendered webpage
loader = PullMdLoader(url='https://example.com')

# Load the content as a Document
documents = loader.load()

# Access the Markdown content
for document in documents:
    print(document.page_content)
```

This loader supports any URL and is particularly adept at handling sites built with dynamic JavaScript, making it a versatile tool for markdown extraction in data processing workflows.
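
In a data processing workflow, several URLs are often loaded in one pass. The following is a minimal sketch of that pattern, not part of `langchain-pull-md`: `load_all` and `StubLoader` are hypothetical names introduced here, and the only assumption about the real loader is what the example above shows, namely that it is constructed with one `url` and exposes `load()`. The stub exists so the sketch runs offline.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple


def load_all(
    urls: Iterable[str],
    make_loader: Callable[[str], Any],
) -> Tuple[List[Any], Dict[str, str]]:
    """Load every URL, returning the documents plus per-URL error messages."""
    documents: List[Any] = []
    errors: Dict[str, str] = {}
    for url in urls:
        try:
            # One loader per URL, mirroring PullMdLoader(url=...).
            documents.extend(make_loader(url).load())
        except Exception as exc:
            # A failure for one URL should not abort the whole batch.
            errors[url] = str(exc)
    return documents, errors


class StubLoader:
    """Hypothetical offline stand-in for PullMdLoader; not part of the package."""

    def __init__(self, url: str) -> None:
        self.url = url

    def load(self) -> List[str]:
        if "bad" in self.url:
            raise ValueError(f"cannot convert {self.url}")
        return [f"# Markdown for {self.url}"]


docs, errors = load_all(["https://ok.test", "https://bad.test"], StubLoader)
print(len(docs), "loaded;", len(errors), "failed")

# With the real package (requires network access), the factory would be:
#   from langchain_pull_md.markdown_loader import PullMdLoader
#   docs, errors = load_all(urls, lambda u: PullMdLoader(url=u))
```

Collecting errors per URL rather than raising keeps one unreachable page from discarding the documents that did convert.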

## API Reference

For a comprehensive guide to all available functions and their parameters, visit the [API reference](https://github.com/chigwell/langchain-pull-md).

## Additional Resources

- [GitHub Repository](https://github.com/chigwell/langchain-pull-md)
- [PyPI Package](https://pypi.org/project/langchain-pull-md/)
@ -320,4 +320,8 @@ packages:
- name: langchain-dappier
  path: .
  repo: DappierAI/langchain-dappier
  downloads: 0
- name: langchain-pull-md
  path: .
  repo: chigwell/langchain-pull-md
  downloads: 0