diff --git a/docs/docs/integrations/document_loaders/pull_md.ipynb b/docs/docs/integrations/document_loaders/pull_md.ipynb
new file mode 100644
index 00000000000..af81b738c4a
--- /dev/null
+++ b/docs/docs/integrations/document_loaders/pull_md.ipynb
@@ -0,0 +1,140 @@
+{
+ "cells": [
+  {
+   "cell_type": "raw",
+   "metadata": {},
+   "source": [
+    "---\n",
+    "sidebar_label: PullMdLoader\n",
+    "---"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# PullMdLoader\n",
+    "\n",
+    "Loader for converting URLs into Markdown using the pull.md service.\n",
+    "\n",
+    "This package implements a [document loader](/docs/concepts/document_loaders/) for web content. Unlike traditional web scrapers, PullMdLoader can handle web pages built with dynamic JavaScript frameworks like React, Angular, or Vue.js, converting them into Markdown without local rendering.\n",
+    "\n",
+    "## Overview\n",
+    "### Integration details\n",
+    "\n",
+    "| Class | Package | Local | Serializable | JS Support |\n",
+    "| :--- | :--- | :---: | :---: | :---: |\n",
+    "| PullMdLoader | langchain-pull-md | ✅ | ✅ | ❌ |\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Setup\n",
+    "\n",
+    "### Installation\n",
+    "\n",
+    "```bash\n",
+    "pip install langchain-pull-md\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Initialization"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain_pull_md.markdown_loader import PullMdLoader\n",
+    "\n",
+    "# Instantiate the loader with a URL\n",
+    "loader = PullMdLoader(url=\"https://example.com\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Load"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "documents = loader.load()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "metadata": {},
+   "outputs": [
+    {
"data": { + "text/plain": [ + "{'source': 'https://example.com',\n", + " 'page_content': '# Example Domain\\nThis domain is used for illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.'}" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "documents[0].metadata" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Lazy Load\n", + "\n", + "No lazy loading is implemented. `PullMdLoader` performs a real-time conversion of the provided URL into Markdown format whenever the `load` method is called." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## API reference:\n", + "\n", + "- [GitHub](https://github.com/chigwell/langchain-pull-md)\n", + "- [PyPi](https://pypi.org/project/langchain-pull-md/)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/docs/docs/integrations/providers/pull-md.mdx b/docs/docs/integrations/providers/pull-md.mdx new file mode 100644 index 00000000000..b7384a3eda4 --- /dev/null +++ b/docs/docs/integrations/providers/pull-md.mdx @@ -0,0 +1,42 @@ +# PullMd Loader + +>[PullMd](https://pull.md/) is a service that converts web pages into Markdown format. The `langchain-pull-md` package utilizes this service to convert URLs, especially those rendered with JavaScript frameworks like React, Angular, or Vue.js, into Markdown without the need for local rendering. 
+
+## Installation and Setup
+
+To get started with `langchain-pull-md`, install the package with pip:
+
+```bash
+pip install langchain-pull-md
+```
+
+See the [usage example](/docs/integrations/document_loaders/pull_md) for detailed integration and usage instructions.
+
+## Document Loader
+
+The `PullMdLoader` class in `langchain-pull-md` provides an easy way to convert URLs to Markdown. It's particularly useful for loading content from modern web applications for use within LangChain's processing capabilities.
+
+```python
+from langchain_pull_md import PullMdLoader
+
+# Initialize the loader with a URL of a JavaScript-rendered webpage
+loader = PullMdLoader(url='https://example.com')
+
+# Load the content as a Document
+documents = loader.load()
+
+# Access the Markdown content
+for document in documents:
+    print(document.page_content)
+```
+
+This loader supports any URL and is particularly adept at handling sites built with dynamic JavaScript, making it a versatile tool for Markdown extraction in data processing workflows.
+
+## API Reference
+
+For a comprehensive guide to all available functions and their parameters, visit the [API reference](https://github.com/chigwell/langchain-pull-md).
+
+## Additional Resources
+
+- [GitHub Repository](https://github.com/chigwell/langchain-pull-md)
+- [PyPI Package](https://pypi.org/project/langchain-pull-md/)
diff --git a/libs/packages.yml b/libs/packages.yml
index 34d8184fad1..5fd27064a36 100644
--- a/libs/packages.yml
+++ b/libs/packages.yml
@@ -320,4 +320,8 @@ packages:
 - name: langchain-dappier
   path: .
   repo: DappierAI/langchain-dappier
+  downloads: 0
+- name: langchain-pull-md
+  path: .
+  repo: chigwell/langchain-pull-md
   downloads: 0
\ No newline at end of file