docs, cli[patch]: document loaders doc template (#22862)

From: https://github.com/langchain-ai/langchain/pull/22290 --------- Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
2025-10-09 07:05:38 +00:00 · 2024-06-13 19:28:57 -07:00
parent d1cdde267a
commit 75e966a2fa
4 changed files with 295 additions and 31 deletions
--- a/libs/cli/langchain_cli/integration_template/docs/document_loaders.ipynb
+++ b/libs/cli/langchain_cli/integration_template/docs/document_loaders.ipynb
@@ -0,0 +1,210 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "---\n",
+    "sidebar_label: __ModuleName__\n",
+    "---"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# __ModuleName__Loader\n",
+    "\n",
+    "- TODO: Make sure API reference link is correct.\n",
+    "\n",
+    "This notebook provides a quick overview for getting started with __ModuleName__ [document loader](/docs/integrations/document_loaders/). For detailed documentation of all __ModuleName__Loader features and configurations head to the [API reference](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.__module_name___loader.__ModuleName__Loader.html).\n",
+    "\n",
+    "- TODO: Add any other relevant links, like information about underlying API, etc.\n",
+    "\n",
+    "## Overview\n",
+    "### Integration details\n",
+    "\n",
+    "- TODO: Fill in table features.\n",
+    "- TODO: Remove JS support link if not relevant, otherwise ensure link is correct.\n",
+    "- TODO: Make sure API reference links are correct.\n",
+    "\n",
+    "| Class | Package | Local | Serializable | [JS support](https://js.langchain.com/v0.2/docs/integrations/document_loaders/web_loaders/__module_name___loader)|\n",
+    "| :--- | :--- | :---: | :---: |  :---: |\n",
+    "| [__ModuleName__Loader](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.__module_name__loader.__ModuleName__Loader.html) | [langchain_community](https://api.python.langchain.com/en/latest/community_api_reference.html) | ✅/❌ | beta/❌ | ✅/❌ | \n",
+    "### Loader features\n",
+    "| Source | Document Lazy Loading | Async Support\n",
+    "| :---: | :---: | :---: | \n",
+    "| __ModuleName__Loader | ✅/❌ | ✅/❌ | \n",
+    "\n",
+    "## Setup\n",
+    "\n",
+    "- TODO: Update with relevant info.\n",
+    "\n",
+    "To access __ModuleName__ document loader you'll need to install the `__package_name__` integration package, and create a **ModuleName** account and get an API key.\n",
+    "\n",
+    "### Credentials\n",
+    "\n",
+    "- TODO: Update with relevant info.\n",
+    "\n",
+    "Head to (TODO: link) to sign up to __ModuleName__ and generate an API key. Once you've done this set the __MODULE_NAME___API_KEY environment variable:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import getpass\n",
+    "import os\n",
+    "\n",
+    "os.environ[\"__MODULE_NAME___API_KEY\"] = getpass.getpass(\"Enter your __ModuleName__ API key: \")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If you want to get automated tracing of your model calls you can also set your [LangSmith](https://docs.smith.langchain.com/) API key by uncommenting below:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# os.environ[\"LANGSMITH_API_KEY\"] = getpass.getpass(\"Enter your LangSmith API key: \")\n",
+    "# os.environ[\"LANGSMITH_TRACING\"] = \"true\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Installation\n",
+    "\n",
+    "Install **langchain_community**.\n",
+    "\n",
+    "- TODO: Add any other required packages"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%pip install -qU langchain_community"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Instantiation\n",
+    "\n",
+    "Now we can instantiate our model object and load documents:\n",
+    "\n",
+    "- TODO: Update model instantiation with relevant params."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain_community.document_loaders import __ModuleName__Loader\n",
+    "\n",
+    "loader = __ModuleName__Loader(\n",
+    "    # required params = ...\n",
+    "    # optional params = ...\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Load\n",
+    "\n",
+    "- TODO: Run cells to show loading capabilities"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "docs = loader.load()\n",
+    "docs[0]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(docs[0].metadata)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Lazy Load\n",
+    "\n",
+    "- TODO: Run cells to show lazy loading capabilities. Delete if lazy loading is not implemented."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "page = []\n",
+    "for doc in loader.lazy_load():\n",
+    "    page.append(doc)\n",
+    "    if len(page) >= 10:\n",
+    "        # do some paged operation, e.g.\n",
+    "        # index.upsert(page)\n",
+    "\n",
+    "        page = []"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## TODO: Any functionality specific to this document loader\n",
+    "\n",
+    "E.g. using specific configs for different loading behavior. Delete if not relevant."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## API reference\n",
+    "\n",
+    "For detailed documentation of all __ModuleName__Loader features and configurations head to the API reference: https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.__module_name___loader.__ModuleName__Loader.html"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": []
+  }
+ ],
+ "metadata": {
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
--- a/libs/cli/langchain_cli/integration_template/integration_template/document_loaders.py
+++ b/libs/cli/langchain_cli/integration_template/integration_template/document_loaders.py
@@ -0,0 +1,71 @@
+"""__ModuleName__ document loader."""
+
+from typing import Iterator
+from langchain_core.document_loaders.base import BaseLoader
+from langchain_core.documents import Document
+
+
+class __ModuleName__Loader(BaseLoader):
+    # TODO: Replace all TODOs in docstring. See example docstring:
+    # https://github.com/langchain-ai/langchain/blob/869523ad728e6b76d77f170cce13925b4ebc3c1e/libs/community/langchain_community/document_loaders/recursive_url_loader.py#L54
+    """
+    __ModuleName__ document loader integration
+    
+    # TODO: Replace with relevant packages, env vars.
+    Setup:
+        Install ``__package_name__`` and set environment variable ``__MODULE_NAME___API_KEY``.
+
+        .. code-block:: bash
+
+            pip install -U __package_name__
+            export __MODULE_NAME___API_KEY="your-api-key"
+
+    # TODO: Replace with relevant init params.
+    Instantiate:
+        .. code-block:: python
+
+            from langchain_community.document_loaders import __ModuleName__Loader
+
+            loader = __ModuleName__Loader(
+                # required params = ...
+                # other params = ...
+            )
+
+    Lazy load:
+        .. code-block:: python
+
+            docs = []
+            docs_lazy = loader.lazy_load()
+
+            # async variant:
+            # docs_lazy = await loader.alazy_load()
+
+            for doc in docs_lazy:
+                docs.append(doc)
+            print(docs[0].page_content[:100])
+            print(docs[0].metadata)
+
+        .. code-block:: python
+
+            TODO: Example output  
+
+    # TODO: Delete if async load is not implemented
+    Async load:
+        .. code-block:: python
+
+            docs = await loader.aload()
+            print(docs[0].page_content[:100])
+            print(docs[0].metadata)
+
+        .. code-block:: python
+
+            TODO: Example output
+    """
+
+    # TODO: This method must be implemented to load documents.
+    # Do not implement load(), a default implementation is already available.
+    def lazy_load(self) -> Iterator[Document]:
+        raise NotImplementedError()
+    
+    # TODO: Implement if you would like to change default BaseLoader implementation
+    # async def alazy_load(self) -> AsyncIterator[Document]:
--- a/libs/cli/langchain_cli/namespaces/integration.py
+++ b/libs/cli/langchain_cli/namespaces/integration.py
@@ -153,7 +153,7 @@ def create_doc(
    component_type: Annotated[
        str,
        typer.Option(
-            help=("The type of component. Currently only 'ChatModel' supported."),
+            help=("The type of component. Currently only 'ChatModel', 'DocumentLoader' supported."),
        ),
    ] = "ChatModel",
    destination_dir: Annotated[
@@ -196,7 +196,10 @@ def create_doc(
    )

    # copy over template from ../integration_template
-    docs_template = Path(__file__).parents[1] / "integration_template/docs/chat.ipynb"
+    if component_type == "ChatModel":
+        docs_template = Path(__file__).parents[1] / "integration_template/docs/chat.ipynb"
+    elif component_type == "DocumentLoader":
+        docs_template = Path(__file__).parents[1] / "integration_template/docs/document_loaders.ipynb"
    shutil.copy(docs_template, destination_path)

    # replacements in file
--- a/libs/community/langchain_community/document_loaders/recursive_url_loader.py
+++ b/libs/community/langchain_community/document_loaders/recursive_url_loader.py
@@ -110,14 +110,17 @@ class RecursiveUrlLoader(BaseLoader):
                # ...
            )

-    Load:
-        Use ``.load()`` to synchronously load into memory all Documents, with one
-        Document per visited URL. Starting from the initial URL, we recurse through
-        all linked URLs up to the specified max_depth.
-
+    Lazy load:
        .. code-block:: python

-            docs = loader.load()
+            docs = []
+            docs_lazy = loader.lazy_load()
+
+            # async variant:
+            # docs_lazy = await loader.alazy_load()
+
+            for doc in docs_lazy:
+                docs.append(doc)
            print(docs[0].page_content[:100])
            print(docs[0].metadata)

@@ -146,29 +149,6 @@ class RecursiveUrlLoader(BaseLoader):
                <meta charset="utf-8" /><
            {'source': 'https://docs.python.org/3.9/', 'content_type': 'text/html', 'title': '3.9.19 Documentation', 'language': None}

-    Lazy load:
-        .. code-block:: python
-
-            docs = []
-            docs_lazy = loader.lazy_load()
-
-            # async variant:
-            # docs_lazy = await loader.alazy_load()
-
-            for doc in docs_lazy:
-                docs.append(doc)
-            print(docs[0].page_content[:100])
-            print(docs[0].metadata)
-
-        .. code-block:: python
-
-            <!DOCTYPE html>
-
-            <html xmlns="http://www.w3.org/1999/xhtml">
-            <head>
-                <meta charset="utf-8" /><
-            {'source': 'https://docs.python.org/3.9/', 'content_type': 'text/html', 'title': '3.9.19 Documentation', 'language': None}
-
    Content parsing / extraction:
        By default the loader sets the raw HTML from each link as the Document page
        content. To parse this HTML into a more human/LLM-friendly format you can pass