Files
langchain/docs/docs/integrations/document_loaders/box.ipynb
Oskar Stark 0d2cea747c docs: streamline LangSmith teasing (#30302)
This can only be reviewed by [hiding
whitespaces](https://github.com/langchain-ai/langchain/pull/30302/files?diff=unified&w=1).

The motivation behind this PR is to get my hands on the docs and make
the LangSmith teasing short and clear.

Right now I don't know how to do it, but this could be an include in the
future.

---------

Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
2025-03-28 15:13:22 -04:00

465 lines
17 KiB
Plaintext

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"sidebar_label: Box\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# BoxLoader and BoxBlobLoader\n",
"\n",
"\n",
"The `langchain-box` package provides two methods to index your files from Box: `BoxLoader` and `BoxBlobLoader`. `BoxLoader` allows you to ingest text representations of files that have a text representation in Box. The `BoxBlobLoader` allows you download the blob for any document or image file for processing with the blob parser of your choice.\n",
"\n",
"This notebook details getting started with both of these. For detailed documentation of all BoxLoader features and configurations head to the API Reference pages for [BoxLoader](https://python.langchain.com/api_reference/box/document_loaders/langchain_box.document_loaders.box.BoxLoader.html) and [BoxBlobLoader](https://python.langchain.com/api_reference/box/document_loaders/langchain_box.blob_loaders.box.BoxBlobLoader.html).\n",
"\n",
"## Overview\n",
"\n",
"The `BoxLoader` class helps you get your unstructured content from Box in Langchain's `Document` format. You can do this with either a `List[str]` containing Box file IDs, or with a `str` containing a Box folder ID.\n",
"\n",
"The `BoxBlobLoader` class helps you get your unstructured content from Box in Langchain's `Blob` format. You can do this with a `List[str]` containing Box file IDs, a `str` containing a Box folder ID, a search query, or a `BoxMetadataQuery`.\n",
"\n",
"If getting files from a folder with folder ID, you can also set a `Bool` to tell the loader to get all sub-folders in that folder, as well.\n",
"\n",
":::info\n",
"A Box instance can contain Petabytes of files, and folders can contain millions of files. Be intentional when choosing what folders you choose to index. And we recommend never getting all files from folder 0 recursively. Folder ID 0 is your root folder.\n",
":::\n",
"\n",
"The `BoxLoader` will skip files without a text representation, while the `BoxBlobLoader` will return blobs for all document and image files.\n",
"\n",
"### Integration details\n",
"\n",
"| Class | Package | Local | Serializable | JS support|\n",
"| :--- | :--- | :---: | :---: | :---: |\n",
"| [BoxLoader](https://python.langchain.com/api_reference/box/document_loaders/langchain_box.document_loaders.box.BoxLoader.html) | [langchain_box](https://python.langchain.com/api_reference/box/index.html) | ✅ | ❌ | ❌ |\n",
"| [BoxBlobLoader](https://python.langchain.com/api_reference/box/document_loaders/langchain_box.blob_loaders.box.BoxBlobLoader.html) | [langchain_box](https://python.langchain.com/api_reference/box/index.html) | ✅ | ❌ | ❌ |\n",
"### Loader features\n",
"| Source | Document Lazy Loading | Async Support\n",
"| :---: | :---: | :---: |\n",
"| BoxLoader | ✅ | ❌ |\n",
"| BoxBlobLoader | ✅ | ❌ |\n",
"\n",
"## Setup\n",
"\n",
"In order to use the Box package, you will need a few things:\n",
"\n",
"* A Box account — If you are not a current Box customer or want to test outside of your production Box instance, you can use a [free developer account](https://account.box.com/signup/n/developer#ty9l3).\n",
"* [A Box app](https://developer.box.com/guides/getting-started/first-application/) — This is configured in the [developer console](https://account.box.com/developers/console), and for Box AI, must have the `Manage AI` scope enabled. Here you will also select your authentication method\n",
"* The app must be [enabled by the administrator](https://developer.box.com/guides/authorization/custom-app-approval/#manual-approval). For free developer accounts, this is whomever signed up for the account.\n",
"\n",
"### Credentials\n",
"\n",
"For these examples, we will use [token authentication](https://developer.box.com/guides/authentication/tokens/developer-tokens). This can be used with any [authentication method](https://developer.box.com/guides/authentication/). Just get the token with whatever methodology. If you want to learn more about how to use other authentication types with `langchain-box`, visit the [Box provider](/docs/integrations/providers/box) document.\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdin",
"output_type": "stream",
"text": [
"Enter your Box Developer Token: ········\n"
]
}
],
"source": [
"import getpass\n",
"import os\n",
"\n",
"box_developer_token = getpass.getpass(\"Enter your Box Developer Token: \")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": "To enable automated tracing of your model calls, set your [LangSmith](https://docs.smith.langchain.com/) API key:"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# os.environ[\"LANGSMITH_API_KEY\"] = getpass.getpass(\"Enter your LangSmith API key: \")\n",
"# os.environ[\"LANGSMITH_TRACING\"] = \"true\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Installation\n",
"\n",
"Install **langchain_box**."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%pip install -qU langchain_box"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initialization\n",
"\n",
"### Load files\n",
"\n",
"If you wish to load files, you must provide the `List` of file ids at instantiation time.\n",
"\n",
"This requires 1 piece of information:\n",
"\n",
"* **box_file_ids** (`List[str]`)- A list of Box file IDs.\n",
"\n",
"#### BoxLoader"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"from langchain_box.document_loaders import BoxLoader\n",
"\n",
"box_file_ids = [\"1514555423624\", \"1514553902288\"]\n",
"\n",
"loader = BoxLoader(\n",
" box_developer_token=box_developer_token,\n",
" box_file_ids=box_file_ids,\n",
" character_limit=10000, # Optional. Defaults to no limit\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### BoxBlobLoader"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"from langchain_box.blob_loaders import BoxBlobLoader\n",
"\n",
"box_file_ids = [\"1514555423624\", \"1514553902288\"]\n",
"\n",
"loader = BoxBlobLoader(\n",
" box_developer_token=box_developer_token, box_file_ids=box_file_ids\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load from folder\n",
"\n",
"If you wish to load files from a folder, you must provide a `str` with the Box folder ID at instantiation time.\n",
"\n",
"This requires 1 piece of information:\n",
"\n",
"* **box_folder_id** (`str`)- A string containing a Box folder ID.\n",
"\n",
"#### BoxLoader"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain_box.document_loaders import BoxLoader\n",
"\n",
"box_folder_id = \"260932470532\"\n",
"\n",
"loader = BoxLoader(\n",
" box_folder_id=box_folder_id,\n",
" recursive=False, # Optional. return entire tree, defaults to False\n",
" character_limit=10000, # Optional. Defaults to no limit\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### BoxBlobLoader"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain_box.blob_loaders import BoxBlobLoader\n",
"\n",
"box_folder_id = \"260932470532\"\n",
"\n",
"loader = BoxBlobLoader(\n",
" box_folder_id=box_folder_id,\n",
" recursive=False, # Optional. return entire tree, defaults to False\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Search for files with BoxBlobLoader\n",
"\n",
"If you need to search for files, the `BoxBlobLoader` offers two methods. First you can perform a full text search with optional search options to narrow down that search.\n",
"\n",
"This requires 1 piece of information:\n",
"\n",
"* **query** (`str`)- A string containing the search query to perform.\n",
"\n",
"You can also provide a `BoxSearchOptions` object to narrow down that search\n",
"* **box_search_options** (`BoxSearchOptions`)\n",
"\n",
"#### BoxBlobLoader search"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"from langchain_box.blob_loaders import BoxBlobLoader\n",
"from langchain_box.utilities import BoxSearchOptions, DocumentFiles, SearchTypeFilter\n",
"\n",
"box_folder_id = \"260932470532\"\n",
"\n",
"box_search_options = BoxSearchOptions(\n",
" ancestor_folder_ids=[box_folder_id],\n",
" search_type_filter=[SearchTypeFilter.FILE_CONTENT],\n",
" created_date_range=[\"2023-01-01T00:00:00-07:00\", \"2024-08-01T00:00:00-07:00,\"],\n",
" file_extensions=[DocumentFiles.DOCX, DocumentFiles.PDF],\n",
" k=200,\n",
" size_range=[1, 1000000],\n",
" updated_data_range=None,\n",
")\n",
"\n",
"loader = BoxBlobLoader(\n",
" box_developer_token=box_developer_token,\n",
" query=\"Victor\",\n",
" box_search_options=box_search_options,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also search for content based on Box Metadata. If your Box instance uses Metadata, you can search for any documents that have a specific Metadata Template attached that meet a certain criteria, like returning any invoices with a total greater than or equal to $500 that were created last quarter.\n",
"\n",
"This requires 1 piece of information:\n",
"\n",
"* **query** (`str`)- A string containing the search query to perform.\n",
"\n",
"You can also provide a `BoxSearchOptions` object to narrow down that search\n",
"* **box_search_options** (`BoxSearchOptions`)\n",
"\n",
"#### BoxBlobLoader Metadata query"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain_box.blob_loaders import BoxBlobLoader\n",
"from langchain_box.utilities import BoxMetadataQuery\n",
"\n",
"query = BoxMetadataQuery(\n",
" template_key=\"enterprise_1234.myTemplate\",\n",
" query=\"total >= :value\",\n",
" query_params={\"value\": 100},\n",
" ancestor_folder_id=\"260932470532\",\n",
")\n",
"\n",
"loader = BoxBlobLoader(box_metadata_query=query)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load\n",
"\n",
"#### BoxLoader"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Document(metadata={'source': 'https://dl.boxcloud.com/api/2.0/internal_files/1514555423624/versions/1663171610024/representations/extracted_text/content/', 'title': 'Invoice-A5555_txt'}, page_content='Vendor: AstroTech Solutions\\nInvoice Number: A5555\\n\\nLine Items:\\n - Gravitational Wave Detector Kit: $800\\n - Exoplanet Terrarium: $120\\nTotal: $920')"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs = loader.load()\n",
"docs[0]"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'source': 'https://dl.boxcloud.com/api/2.0/internal_files/1514555423624/versions/1663171610024/representations/extracted_text/content/', 'title': 'Invoice-A5555_txt'}\n"
]
}
],
"source": [
"print(docs[0].metadata)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### BoxBlobLoader"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Blob(id='1514555423624' metadata={'source': 'https://app.box.com/0/260935730128/260931903795/Invoice-A5555.txt', 'name': 'Invoice-A5555.txt', 'file_size': 150} data=\"b'Vendor: AstroTech Solutions\\\\nInvoice Number: A5555\\\\n\\\\nLine Items:\\\\n - Gravitational Wave Detector Kit: $800\\\\n - Exoplanet Terrarium: $120\\\\nTotal: $920'\" mimetype='text/plain' path='https://app.box.com/0/260935730128/260931903795/Invoice-A5555.txt')\n",
"Blob(id='1514553902288' metadata={'source': 'https://app.box.com/0/260935730128/260931903795/Invoice-B1234.txt', 'name': 'Invoice-B1234.txt', 'file_size': 168} data=\"b'Vendor: Galactic Gizmos Inc.\\\\nInvoice Number: B1234\\\\nPurchase Order Number: 001\\\\nLine Items:\\\\n - Quantum Flux Capacitor: $500\\\\n - Anti-Gravity Pen Set: $75\\\\nTotal: $575'\" mimetype='text/plain' path='https://app.box.com/0/260935730128/260931903795/Invoice-B1234.txt')\n"
]
}
],
"source": [
"for blob in loader.yield_blobs():\n",
" print(f\"Blob({blob})\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Lazy Load\n",
"\n",
"#### BoxLoader only"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"page = []\n",
"for doc in loader.lazy_load():\n",
" page.append(doc)\n",
" if len(page) >= 10:\n",
" # do some paged operation, e.g.\n",
" # index.upsert(page)\n",
"\n",
" page = []"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Extra fields\n",
"\n",
"All Box connectors offer the ability to select additional fields from the Box `FileFull` object to return as custom LangChain metadata. Each object accepts an optional `List[str]` called `extra_fields` containing the json key from the return object, like `extra_fields=[\"shared_link\"]`.\n",
"\n",
"The connector will add this field to the list of fields the integration needs to function and then add the results to the metadata returned in the `Document` or `Blob`, like `\"metadata\" : { \"source\" : \"source, \"shared_link\" : \"shared_link\" }`. If the field is unavailable for that file, it will be returned as an empty string, like `\"shared_link\" : \"\"`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## API reference\n",
"\n",
"For detailed documentation of all BoxLoader features and configurations head to the [API reference](https://python.langchain.com/api_reference/box/document_loaders/langchain_box.document_loaders.box.BoxLoader.html)\n",
"\n",
"\n",
"## Help\n",
"\n",
"If you have questions, you can check out our [developer documentation](https://developer.box.com) or reach out to use in our [developer community](https://community.box.com)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.6"
}
},
"nbformat": 4,
"nbformat_minor": 4
}