docs, cli[patch]: document loaders doc template (#22862)

From: https://github.com/langchain-ai/langchain/pull/22290

---------

Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
This commit is contained in:
Isaac Francisco 2024-06-13 19:28:57 -07:00 committed by GitHub
parent d1cdde267a
commit 75e966a2fa
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
4 changed files with 295 additions and 31 deletions

View File

@ -0,0 +1,210 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"sidebar_label: __ModuleName__\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# __ModuleName__Loader\n",
"\n",
"- TODO: Make sure API reference link is correct.\n",
"\n",
"This notebook provides a quick overview for getting started with __ModuleName__ [document loader](/docs/integrations/document_loaders/). For detailed documentation of all __ModuleName__Loader features and configurations head to the [API reference](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.__module_name___loader.__ModuleName__Loader.html).\n",
"\n",
"- TODO: Add any other relevant links, like information about underlying API, etc.\n",
"\n",
"## Overview\n",
"### Integration details\n",
"\n",
"- TODO: Fill in table features.\n",
"- TODO: Remove JS support link if not relevant, otherwise ensure link is correct.\n",
"- TODO: Make sure API reference links are correct.\n",
"\n",
"| Class | Package | Local | Serializable | [JS support](https://js.langchain.com/v0.2/docs/integrations/document_loaders/web_loaders/__module_name___loader)|\n",
"| :--- | :--- | :---: | :---: | :---: |\n",
"| [__ModuleName__Loader](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.__module_name__loader.__ModuleName__Loader.html) | [langchain_community](https://api.python.langchain.com/en/latest/community_api_reference.html) | ✅/❌ | beta/❌ | ✅/❌ | \n",
"### Loader features\n",
"| Source | Document Lazy Loading | Async Support\n",
"| :---: | :---: | :---: | \n",
"| __ModuleName__Loader | ✅/❌ | ✅/❌ | \n",
"\n",
"## Setup\n",
"\n",
"- TODO: Update with relevant info.\n",
"\n",
"To access __ModuleName__ document loader you'll need to install the `__package_name__` integration package, and create a **ModuleName** account and get an API key.\n",
"\n",
"### Credentials\n",
"\n",
"- TODO: Update with relevant info.\n",
"\n",
"Head to (TODO: link) to sign up to __ModuleName__ and generate an API key. Once you've done this set the __MODULE_NAME___API_KEY environment variable:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import getpass\n",
"import os\n",
"\n",
"os.environ[\"__MODULE_NAME___API_KEY\"] = getpass.getpass(\"Enter your __ModuleName__ API key: \")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to get automated tracing of your model calls you can also set your [LangSmith](https://docs.smith.langchain.com/) API key by uncommenting below:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# os.environ[\"LANGSMITH_API_KEY\"] = getpass.getpass(\"Enter your LangSmith API key: \")\n",
"# os.environ[\"LANGSMITH_TRACING\"] = \"true\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Installation\n",
"\n",
"Install **langchain_community**.\n",
"\n",
"- TODO: Add any other required packages"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%pip install -qU langchain_community"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Instantiation\n",
"\n",
"Now we can instantiate our model object and load documents:\n",
"\n",
"- TODO: Update model instantiation with relevant params."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain_community.document_loaders import __ModuleName__Loader\n",
"\n",
"loader = __ModuleName__Loader(\n",
" # required params = ...\n",
" # optional params = ...\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load\n",
"\n",
"- TODO: Run cells to show loading capabilities"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"docs = loader.load()\n",
"docs[0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(docs[0].metadata)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Lazy Load\n",
"\n",
"- TODO: Run cells to show lazy loading capabilities. Delete if lazy loading is not implemented."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"page = []\n",
"for doc in loader.lazy_load():\n",
" page.append(doc)\n",
" if len(page) >= 10:\n",
" # do some paged operation, e.g.\n",
" # index.upsert(page)\n",
"\n",
" page = []"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## TODO: Any functionality specific to this document loader\n",
"\n",
"E.g. using specific configs for different loading behavior. Delete if not relevant."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## API reference\n",
"\n",
"For detailed documentation of all __ModuleName__Loader features and configurations head to the API reference: https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.__module_name___loader.__ModuleName__Loader.html"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

View File

@ -0,0 +1,71 @@
"""__ModuleName__ document loader."""
from typing import Iterator
from langchain_core.document_loaders.base import BaseLoader
from langchain_core.documents import Document
class __ModuleName__Loader(BaseLoader):
# TODO: Replace all TODOs in docstring. See example docstring:
# https://github.com/langchain-ai/langchain/blob/869523ad728e6b76d77f170cce13925b4ebc3c1e/libs/community/langchain_community/document_loaders/recursive_url_loader.py#L54
"""
__ModuleName__ document loader integration
# TODO: Replace with relevant packages, env vars.
Setup:
Install ``__package_name__`` and set environment variable ``__MODULE_NAME___API_KEY``.
.. code-block:: bash
pip install -U __package_name__
export __MODULE_NAME___API_KEY="your-api-key"
# TODO: Replace with relevant init params.
Instantiate:
.. code-block:: python
from langchain_community.document_loaders import __ModuleName__Loader
loader = __ModuleName__Loader(
# required params = ...
# other params = ...
)
Lazy load:
.. code-block:: python
docs = []
docs_lazy = loader.lazy_load()
# async variant:
# docs_lazy = await loader.alazy_load()
for doc in docs_lazy:
docs.append(doc)
print(docs[0].page_content[:100])
print(docs[0].metadata)
.. code-block:: python
TODO: Example output
# TODO: Delete if async load is not implemented
Async load:
.. code-block:: python
docs = await loader.aload()
print(docs[0].page_content[:100])
print(docs[0].metadata)
.. code-block:: python
TODO: Example output
"""
# TODO: This method must be implemented to load documents.
# Do not implement load(), a default implementation is already available.
def lazy_load(self) -> Iterator[Document]:
raise NotImplementedError()
# TODO: Implement if you would like to change default BaseLoader implementation
# async def alazy_load(self) -> AsyncIterator[Document]:

View File

@ -153,7 +153,7 @@ def create_doc(
component_type: Annotated[ component_type: Annotated[
str, str,
typer.Option( typer.Option(
help=("The type of component. Currently only 'ChatModel' supported."), help=("The type of component. Currently only 'ChatModel', 'DocumentLoader' supported."),
), ),
] = "ChatModel", ] = "ChatModel",
destination_dir: Annotated[ destination_dir: Annotated[
@ -196,7 +196,10 @@ def create_doc(
) )
# copy over template from ../integration_template # copy over template from ../integration_template
if component_type == "ChatModel":
docs_template = Path(__file__).parents[1] / "integration_template/docs/chat.ipynb" docs_template = Path(__file__).parents[1] / "integration_template/docs/chat.ipynb"
elif component_type == "DocumentLoader":
docs_template = Path(__file__).parents[1] / "integration_template/docs/document_loaders.ipynb"
shutil.copy(docs_template, destination_path) shutil.copy(docs_template, destination_path)
# replacements in file # replacements in file

View File

@ -110,14 +110,17 @@ class RecursiveUrlLoader(BaseLoader):
# ... # ...
) )
Load: Lazy load:
Use ``.load()`` to synchronously load into memory all Documents, with one
Document per visited URL. Starting from the initial URL, we recurse through
all linked URLs up to the specified max_depth.
.. code-block:: python .. code-block:: python
docs = loader.load() docs = []
docs_lazy = loader.lazy_load()
# async variant:
# docs_lazy = await loader.alazy_load()
for doc in docs_lazy:
docs.append(doc)
print(docs[0].page_content[:100]) print(docs[0].page_content[:100])
print(docs[0].metadata) print(docs[0].metadata)
@ -146,29 +149,6 @@ class RecursiveUrlLoader(BaseLoader):
<meta charset="utf-8" />< <meta charset="utf-8" /><
{'source': 'https://docs.python.org/3.9/', 'content_type': 'text/html', 'title': '3.9.19 Documentation', 'language': None} {'source': 'https://docs.python.org/3.9/', 'content_type': 'text/html', 'title': '3.9.19 Documentation', 'language': None}
Lazy load:
.. code-block:: python
docs = []
docs_lazy = loader.lazy_load()
# async variant:
# docs_lazy = await loader.alazy_load()
for doc in docs_lazy:
docs.append(doc)
print(docs[0].page_content[:100])
print(docs[0].metadata)
.. code-block:: python
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8" /><
{'source': 'https://docs.python.org/3.9/', 'content_type': 'text/html', 'title': '3.9.19 Documentation', 'language': None}
Content parsing / extraction: Content parsing / extraction:
By default the loader sets the raw HTML from each link as the Document page By default the loader sets the raw HTML from each link as the Document page
content. To parse this HTML into a more human/LLM-friendly format you can pass content. To parse this HTML into a more human/LLM-friendly format you can pass