docs, cli[patch]: document loaders doc template (#22862)

From: https://github.com/langchain-ai/langchain/pull/22290

---------

Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
This commit is contained in:
Isaac Francisco 2024-06-13 19:28:57 -07:00 committed by GitHub
parent d1cdde267a
commit 75e966a2fa
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
4 changed files with 295 additions and 31 deletions

View File

@ -0,0 +1,210 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"sidebar_label: __ModuleName__\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# __ModuleName__Loader\n",
"\n",
"- TODO: Make sure API reference link is correct.\n",
"\n",
"This notebook provides a quick overview for getting started with __ModuleName__ [document loader](/docs/integrations/document_loaders/). For detailed documentation of all __ModuleName__Loader features and configurations head to the [API reference](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.__module_name___loader.__ModuleName__Loader.html).\n",
"\n",
"- TODO: Add any other relevant links, like information about underlying API, etc.\n",
"\n",
"## Overview\n",
"### Integration details\n",
"\n",
"- TODO: Fill in table features.\n",
"- TODO: Remove JS support link if not relevant, otherwise ensure link is correct.\n",
"- TODO: Make sure API reference links are correct.\n",
"\n",
"| Class | Package | Local | Serializable | [JS support](https://js.langchain.com/v0.2/docs/integrations/document_loaders/web_loaders/__module_name___loader)|\n",
"| :--- | :--- | :---: | :---: | :---: |\n",
"| [__ModuleName__Loader](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.__module_name__loader.__ModuleName__Loader.html) | [langchain_community](https://api.python.langchain.com/en/latest/community_api_reference.html) | ✅/❌ | beta/❌ | ✅/❌ | \n",
"### Loader features\n",
"| Source | Document Lazy Loading | Async Support\n",
"| :---: | :---: | :---: | \n",
"| __ModuleName__Loader | ✅/❌ | ✅/❌ | \n",
"\n",
"## Setup\n",
"\n",
"- TODO: Update with relevant info.\n",
"\n",
"To access __ModuleName__ document loader you'll need to install the `__package_name__` integration package, and create a **ModuleName** account and get an API key.\n",
"\n",
"### Credentials\n",
"\n",
"- TODO: Update with relevant info.\n",
"\n",
"Head to (TODO: link) to sign up to __ModuleName__ and generate an API key. Once you've done this set the __MODULE_NAME___API_KEY environment variable:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import getpass\n",
"import os\n",
"\n",
"os.environ[\"__MODULE_NAME___API_KEY\"] = getpass.getpass(\"Enter your __ModuleName__ API key: \")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to get automated tracing of your model calls you can also set your [LangSmith](https://docs.smith.langchain.com/) API key by uncommenting below:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# os.environ[\"LANGSMITH_API_KEY\"] = getpass.getpass(\"Enter your LangSmith API key: \")\n",
"# os.environ[\"LANGSMITH_TRACING\"] = \"true\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Installation\n",
"\n",
"Install **langchain_community**.\n",
"\n",
"- TODO: Add any other required packages"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%pip install -qU langchain_community"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Instantiation\n",
"\n",
"Now we can instantiate our model object and load documents:\n",
"\n",
"- TODO: Update model instantiation with relevant params."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain_community.document_loaders import __ModuleName__Loader\n",
"\n",
"loader = __ModuleName__Loader(\n",
" # required params = ...\n",
" # optional params = ...\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load\n",
"\n",
"- TODO: Run cells to show loading capabilities"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"docs = loader.load()\n",
"docs[0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(docs[0].metadata)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Lazy Load\n",
"\n",
"- TODO: Run cells to show lazy loading capabilities. Delete if lazy loading is not implemented."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"page = []\n",
"for doc in loader.lazy_load():\n",
" page.append(doc)\n",
" if len(page) >= 10:\n",
" # do some paged operation, e.g.\n",
" # index.upsert(page)\n",
"\n",
" page = []"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## TODO: Any functionality specific to this document loader\n",
"\n",
"E.g. using specific configs for different loading behavior. Delete if not relevant."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## API reference\n",
"\n",
"For detailed documentation of all __ModuleName__Loader features and configurations head to the API reference: https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.__module_name___loader.__ModuleName__Loader.html"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

View File

@ -0,0 +1,71 @@
"""__ModuleName__ document loader."""
from typing import Iterator
from langchain_core.document_loaders.base import BaseLoader
from langchain_core.documents import Document
class __ModuleName__Loader(BaseLoader):
# TODO: Replace all TODOs in docstring. See example docstring:
# https://github.com/langchain-ai/langchain/blob/869523ad728e6b76d77f170cce13925b4ebc3c1e/libs/community/langchain_community/document_loaders/recursive_url_loader.py#L54
"""
__ModuleName__ document loader integration
# TODO: Replace with relevant packages, env vars.
Setup:
Install ``__package_name__`` and set environment variable ``__MODULE_NAME___API_KEY``.
.. code-block:: bash
pip install -U __package_name__
export __MODULE_NAME___API_KEY="your-api-key"
# TODO: Replace with relevant init params.
Instantiate:
.. code-block:: python
from langchain_community.document_loaders import __ModuleName__Loader
loader = __ModuleName__Loader(
# required params = ...
# other params = ...
)
Lazy load:
.. code-block:: python
docs = []
docs_lazy = loader.lazy_load()
# async variant:
# docs_lazy = await loader.alazy_load()
for doc in docs_lazy:
docs.append(doc)
print(docs[0].page_content[:100])
print(docs[0].metadata)
.. code-block:: python
TODO: Example output
# TODO: Delete if async load is not implemented
Async load:
.. code-block:: python
docs = await loader.aload()
print(docs[0].page_content[:100])
print(docs[0].metadata)
.. code-block:: python
TODO: Example output
"""
# TODO: This method must be implemented to load documents.
# Do not implement load(), a default implementation is already available.
def lazy_load(self) -> Iterator[Document]:
raise NotImplementedError()
# TODO: Implement if you would like to change default BaseLoader implementation
# async def alazy_load(self) -> AsyncIterator[Document]:

View File

@ -153,7 +153,7 @@ def create_doc(
component_type: Annotated[
str,
typer.Option(
help=("The type of component. Currently only 'ChatModel' supported."),
help=("The type of component. Currently only 'ChatModel', 'DocumentLoader' supported."),
),
] = "ChatModel",
destination_dir: Annotated[
@ -196,7 +196,10 @@ def create_doc(
)
# copy over template from ../integration_template
docs_template = Path(__file__).parents[1] / "integration_template/docs/chat.ipynb"
if component_type == "ChatModel":
docs_template = Path(__file__).parents[1] / "integration_template/docs/chat.ipynb"
elif component_type == "DocumentLoader":
docs_template = Path(__file__).parents[1] / "integration_template/docs/document_loaders.ipynb"
shutil.copy(docs_template, destination_path)
# replacements in file

View File

@ -110,14 +110,17 @@ class RecursiveUrlLoader(BaseLoader):
# ...
)
Load:
Use ``.load()`` to synchronously load into memory all Documents, with one
Document per visited URL. Starting from the initial URL, we recurse through
all linked URLs up to the specified max_depth.
Lazy load:
.. code-block:: python
docs = loader.load()
docs = []
docs_lazy = loader.lazy_load()
# async variant:
# docs_lazy = await loader.alazy_load()
for doc in docs_lazy:
docs.append(doc)
print(docs[0].page_content[:100])
print(docs[0].metadata)
@ -146,29 +149,6 @@ class RecursiveUrlLoader(BaseLoader):
<meta charset="utf-8" /><
{'source': 'https://docs.python.org/3.9/', 'content_type': 'text/html', 'title': '3.9.19 Documentation', 'language': None}
Lazy load:
.. code-block:: python
docs = []
docs_lazy = loader.lazy_load()
# async variant:
# docs_lazy = await loader.alazy_load()
for doc in docs_lazy:
docs.append(doc)
print(docs[0].page_content[:100])
print(docs[0].metadata)
.. code-block:: python
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8" /><
{'source': 'https://docs.python.org/3.9/', 'content_type': 'text/html', 'title': '3.9.19 Documentation', 'language': None}
Content parsing / extraction:
By default the loader sets the raw HTML from each link as the Document page
content. To parse this HTML into a more human/LLM-friendly format you can pass