mirror of
https://github.com/hwchase17/langchain.git
synced 2025-08-09 21:08:59 +00:00
docs, cli[patch]: document loaders doc template (#22862)
From: https://github.com/langchain-ai/langchain/pull/22290 --------- Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
This commit is contained in:
parent
d1cdde267a
commit
75e966a2fa
@ -0,0 +1,210 @@
|
|||||||
|
{
|
||||||
|
"cells": [
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"---\n",
|
||||||
|
"sidebar_label: __ModuleName__\n",
|
||||||
|
"---"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"# __ModuleName__Loader\n",
|
||||||
|
"\n",
|
||||||
|
"- TODO: Make sure API reference link is correct.\n",
|
||||||
|
"\n",
|
||||||
|
"This notebook provides a quick overview for getting started with __ModuleName__ [document loader](/docs/integrations/document_loaders/). For detailed documentation of all __ModuleName__Loader features and configurations head to the [API reference](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.__module_name___loader.__ModuleName__Loader.html).\n",
|
||||||
|
"\n",
|
||||||
|
"- TODO: Add any other relevant links, like information about underlying API, etc.\n",
|
||||||
|
"\n",
|
||||||
|
"## Overview\n",
|
||||||
|
"### Integration details\n",
|
||||||
|
"\n",
|
||||||
|
"- TODO: Fill in table features.\n",
|
||||||
|
"- TODO: Remove JS support link if not relevant, otherwise ensure link is correct.\n",
|
||||||
|
"- TODO: Make sure API reference links are correct.\n",
|
||||||
|
"\n",
|
||||||
|
"| Class | Package | Local | Serializable | [JS support](https://js.langchain.com/v0.2/docs/integrations/document_loaders/web_loaders/__module_name___loader)|\n",
|
||||||
|
"| :--- | :--- | :---: | :---: | :---: |\n",
|
||||||
|
"| [__ModuleName__Loader](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.__module_name__loader.__ModuleName__Loader.html) | [langchain_community](https://api.python.langchain.com/en/latest/community_api_reference.html) | ✅/❌ | beta/❌ | ✅/❌ | \n",
|
||||||
|
"### Loader features\n",
|
||||||
|
"| Source | Document Lazy Loading | Async Support\n",
|
||||||
|
"| :---: | :---: | :---: | \n",
|
||||||
|
"| __ModuleName__Loader | ✅/❌ | ✅/❌ | \n",
|
||||||
|
"\n",
|
||||||
|
"## Setup\n",
|
||||||
|
"\n",
|
||||||
|
"- TODO: Update with relevant info.\n",
|
||||||
|
"\n",
|
||||||
|
"To access __ModuleName__ document loader you'll need to install the `__package_name__` integration package, and create a **ModuleName** account and get an API key.\n",
|
||||||
|
"\n",
|
||||||
|
"### Credentials\n",
|
||||||
|
"\n",
|
||||||
|
"- TODO: Update with relevant info.\n",
|
||||||
|
"\n",
|
||||||
|
"Head to (TODO: link) to sign up to __ModuleName__ and generate an API key. Once you've done this set the __MODULE_NAME___API_KEY environment variable:"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"import getpass\n",
|
||||||
|
"import os\n",
|
||||||
|
"\n",
|
||||||
|
"os.environ[\"__MODULE_NAME___API_KEY\"] = getpass.getpass(\"Enter your __ModuleName__ API key: \")"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"If you want to get automated tracing of your model calls you can also set your [LangSmith](https://docs.smith.langchain.com/) API key by uncommenting below:"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# os.environ[\"LANGSMITH_API_KEY\"] = getpass.getpass(\"Enter your LangSmith API key: \")\n",
|
||||||
|
"# os.environ[\"LANGSMITH_TRACING\"] = \"true\""
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"### Installation\n",
|
||||||
|
"\n",
|
||||||
|
"Install **langchain_community**.\n",
|
||||||
|
"\n",
|
||||||
|
"- TODO: Add any other required packages"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"%pip install -qU langchain_community"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"## Instantiation\n",
|
||||||
|
"\n",
|
||||||
|
"Now we can instantiate our model object and load documents:\n",
|
||||||
|
"\n",
|
||||||
|
"- TODO: Update model instantiation with relevant params."
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"from langchain_community.document_loaders import __ModuleName__Loader\n",
|
||||||
|
"\n",
|
||||||
|
"loader = __ModuleName__Loader(\n",
|
||||||
|
" # required params = ...\n",
|
||||||
|
" # optional params = ...\n",
|
||||||
|
")"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"## Load\n",
|
||||||
|
"\n",
|
||||||
|
"- TODO: Run cells to show loading capabilities"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"docs = loader.load()\n",
|
||||||
|
"docs[0]"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"print(docs[0].metadata)"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"## Lazy Load\n",
|
||||||
|
"\n",
|
||||||
|
"- TODO: Run cells to show lazy loading capabilities. Delete if lazy loading is not implemented."
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"page = []\n",
|
||||||
|
"for doc in loader.lazy_load():\n",
|
||||||
|
" page.append(doc)\n",
|
||||||
|
" if len(page) >= 10:\n",
|
||||||
|
" # do some paged operation, e.g.\n",
|
||||||
|
" # index.upsert(page)\n",
|
||||||
|
"\n",
|
||||||
|
" page = []"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"## TODO: Any functionality specific to this document loader\n",
|
||||||
|
"\n",
|
||||||
|
"E.g. using specific configs for different loading behavior. Delete if not relevant."
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"## API reference\n",
|
||||||
|
"\n",
|
||||||
|
"For detailed documentation of all __ModuleName__Loader features and configurations head to the API reference: https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.__module_name___loader.__ModuleName__Loader.html"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": []
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"metadata": {
|
||||||
|
"language_info": {
|
||||||
|
"name": "python"
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"nbformat": 4,
|
||||||
|
"nbformat_minor": 2
|
||||||
|
}
|
@ -0,0 +1,71 @@
|
|||||||
|
"""__ModuleName__ document loader."""
|
||||||
|
|
||||||
|
from typing import Iterator
|
||||||
|
from langchain_core.document_loaders.base import BaseLoader
|
||||||
|
from langchain_core.documents import Document
|
||||||
|
|
||||||
|
|
||||||
|
class __ModuleName__Loader(BaseLoader):
|
||||||
|
# TODO: Replace all TODOs in docstring. See example docstring:
|
||||||
|
# https://github.com/langchain-ai/langchain/blob/869523ad728e6b76d77f170cce13925b4ebc3c1e/libs/community/langchain_community/document_loaders/recursive_url_loader.py#L54
|
||||||
|
"""
|
||||||
|
__ModuleName__ document loader integration
|
||||||
|
|
||||||
|
# TODO: Replace with relevant packages, env vars.
|
||||||
|
Setup:
|
||||||
|
Install ``__package_name__`` and set environment variable ``__MODULE_NAME___API_KEY``.
|
||||||
|
|
||||||
|
.. code-block:: bash
|
||||||
|
|
||||||
|
pip install -U __package_name__
|
||||||
|
export __MODULE_NAME___API_KEY="your-api-key"
|
||||||
|
|
||||||
|
# TODO: Replace with relevant init params.
|
||||||
|
Instantiate:
|
||||||
|
.. code-block:: python
|
||||||
|
|
||||||
|
from langchain_community.document_loaders import __ModuleName__Loader
|
||||||
|
|
||||||
|
loader = __ModuleName__Loader(
|
||||||
|
# required params = ...
|
||||||
|
# other params = ...
|
||||||
|
)
|
||||||
|
|
||||||
|
Lazy load:
|
||||||
|
.. code-block:: python
|
||||||
|
|
||||||
|
docs = []
|
||||||
|
docs_lazy = loader.lazy_load()
|
||||||
|
|
||||||
|
# async variant:
|
||||||
|
# docs_lazy = await loader.alazy_load()
|
||||||
|
|
||||||
|
for doc in docs_lazy:
|
||||||
|
docs.append(doc)
|
||||||
|
print(docs[0].page_content[:100])
|
||||||
|
print(docs[0].metadata)
|
||||||
|
|
||||||
|
.. code-block:: python
|
||||||
|
|
||||||
|
TODO: Example output
|
||||||
|
|
||||||
|
# TODO: Delete if async load is not implemented
|
||||||
|
Async load:
|
||||||
|
.. code-block:: python
|
||||||
|
|
||||||
|
docs = await loader.aload()
|
||||||
|
print(docs[0].page_content[:100])
|
||||||
|
print(docs[0].metadata)
|
||||||
|
|
||||||
|
.. code-block:: python
|
||||||
|
|
||||||
|
TODO: Example output
|
||||||
|
"""
|
||||||
|
|
||||||
|
# TODO: This method must be implemented to load documents.
|
||||||
|
# Do not implement load(), a default implementation is already available.
|
||||||
|
def lazy_load(self) -> Iterator[Document]:
|
||||||
|
raise NotImplementedError()
|
||||||
|
|
||||||
|
# TODO: Implement if you would like to change default BaseLoader implementation
|
||||||
|
# async def alazy_load(self) -> AsyncIterator[Document]:
|
@ -153,7 +153,7 @@ def create_doc(
|
|||||||
component_type: Annotated[
|
component_type: Annotated[
|
||||||
str,
|
str,
|
||||||
typer.Option(
|
typer.Option(
|
||||||
help=("The type of component. Currently only 'ChatModel' supported."),
|
help=("The type of component. Currently only 'ChatModel', 'DocumentLoader' supported."),
|
||||||
),
|
),
|
||||||
] = "ChatModel",
|
] = "ChatModel",
|
||||||
destination_dir: Annotated[
|
destination_dir: Annotated[
|
||||||
@ -196,7 +196,10 @@ def create_doc(
|
|||||||
)
|
)
|
||||||
|
|
||||||
# copy over template from ../integration_template
|
# copy over template from ../integration_template
|
||||||
|
if component_type == "ChatModel":
|
||||||
docs_template = Path(__file__).parents[1] / "integration_template/docs/chat.ipynb"
|
docs_template = Path(__file__).parents[1] / "integration_template/docs/chat.ipynb"
|
||||||
|
elif component_type == "DocumentLoader":
|
||||||
|
docs_template = Path(__file__).parents[1] / "integration_template/docs/document_loaders.ipynb"
|
||||||
shutil.copy(docs_template, destination_path)
|
shutil.copy(docs_template, destination_path)
|
||||||
|
|
||||||
# replacements in file
|
# replacements in file
|
||||||
|
@ -110,14 +110,17 @@ class RecursiveUrlLoader(BaseLoader):
|
|||||||
# ...
|
# ...
|
||||||
)
|
)
|
||||||
|
|
||||||
Load:
|
Lazy load:
|
||||||
Use ``.load()`` to synchronously load into memory all Documents, with one
|
|
||||||
Document per visited URL. Starting from the initial URL, we recurse through
|
|
||||||
all linked URLs up to the specified max_depth.
|
|
||||||
|
|
||||||
.. code-block:: python
|
.. code-block:: python
|
||||||
|
|
||||||
docs = loader.load()
|
docs = []
|
||||||
|
docs_lazy = loader.lazy_load()
|
||||||
|
|
||||||
|
# async variant:
|
||||||
|
# docs_lazy = await loader.alazy_load()
|
||||||
|
|
||||||
|
for doc in docs_lazy:
|
||||||
|
docs.append(doc)
|
||||||
print(docs[0].page_content[:100])
|
print(docs[0].page_content[:100])
|
||||||
print(docs[0].metadata)
|
print(docs[0].metadata)
|
||||||
|
|
||||||
@ -146,29 +149,6 @@ class RecursiveUrlLoader(BaseLoader):
|
|||||||
<meta charset="utf-8" /><
|
<meta charset="utf-8" /><
|
||||||
{'source': 'https://docs.python.org/3.9/', 'content_type': 'text/html', 'title': '3.9.19 Documentation', 'language': None}
|
{'source': 'https://docs.python.org/3.9/', 'content_type': 'text/html', 'title': '3.9.19 Documentation', 'language': None}
|
||||||
|
|
||||||
Lazy load:
|
|
||||||
.. code-block:: python
|
|
||||||
|
|
||||||
docs = []
|
|
||||||
docs_lazy = loader.lazy_load()
|
|
||||||
|
|
||||||
# async variant:
|
|
||||||
# docs_lazy = await loader.alazy_load()
|
|
||||||
|
|
||||||
for doc in docs_lazy:
|
|
||||||
docs.append(doc)
|
|
||||||
print(docs[0].page_content[:100])
|
|
||||||
print(docs[0].metadata)
|
|
||||||
|
|
||||||
.. code-block:: python
|
|
||||||
|
|
||||||
<!DOCTYPE html>
|
|
||||||
|
|
||||||
<html xmlns="http://www.w3.org/1999/xhtml">
|
|
||||||
<head>
|
|
||||||
<meta charset="utf-8" /><
|
|
||||||
{'source': 'https://docs.python.org/3.9/', 'content_type': 'text/html', 'title': '3.9.19 Documentation', 'language': None}
|
|
||||||
|
|
||||||
Content parsing / extraction:
|
Content parsing / extraction:
|
||||||
By default the loader sets the raw HTML from each link as the Document page
|
By default the loader sets the raw HTML from each link as the Document page
|
||||||
content. To parse this HTML into a more human/LLM-friendly format you can pass
|
content. To parse this HTML into a more human/LLM-friendly format you can pass
|
||||||
|
Loading…
Reference in New Issue
Block a user