box: add langchain box package and DocumentLoader (#25506)

Thank you for contributing to LangChain!

-Description: Adding new package: `langchain-box`:

* `langchain_box.document_loaders.BoxLoader` — DocumentLoader
functionality
* `langchain_box.utilities.BoxAPIWrapper` — Box-specific code
* `langchain_box.utilities.BoxAuth` — Helper class for Box
authentication
* `langchain_box.utilities.BoxAuthType` — enum used by BoxAuth class

- Twitter handle: @boxplatform


- [x] **Add tests and docs**: If you're adding a new integration, please
include
1. a test for the integration, preferably unit tests that do not rely on
network access,
2. an example notebook showing its use. It lives in
`docs/docs/integrations` directory.


- [x] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified. See contribution
guidelines for more: https://python.langchain.com/docs/contributing/

Additional guidelines:
- Make sure optional dependencies are imported within a function.
- Please do not add dependencies to pyproject.toml files (even optional
ones) unless they are required for unit tests.
- Most PRs should not touch more than one package.
- Changes should be backwards compatible.
- If you are adding something to community, do not re-import it in
langchain.

If no one reviews your PR within a few days, please @-mention one of
baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17.

---------

Co-authored-by: Erick Friis <erickfriis@gmail.com>
Co-authored-by: Erick Friis <erick@langchain.dev>
This commit is contained in:
Scott Hurrey
2024-08-20 22:23:43 -04:00
committed by GitHub
parent be27e1787f
commit 55fd2e2158
31 changed files with 2917 additions and 0 deletions

View File

@@ -0,0 +1,282 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"sidebar_label: Box\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# BoxLoader\n",
"\n",
"This notebook provides a quick overview for getting started with Box [document loader](/docs/integrations/document_loaders/). For detailed documentation of all BoxLoader features and configurations head to the [API reference](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.langchain_box_loader.BoxLoader.html).\n",
"\n",
"\n",
"## Overview\n",
"\n",
"The `BoxLoader` class helps you get your unstructured content from Box in Langchain's `Document` format. You can do this with either a `List[str]` containing Box file IDs, or with a `str` containing a Box folder ID. \n",
"\n",
"You must provide either a `List[str]` containing Box file Ids, or a `str` containing a folder ID. If getting files from a folder with folder ID, you can also set a `Bool` to tell the loader to get all sub-folders in that folder, as well. \n",
"\n",
":::info\n",
"A Box instance can contain Petabytes of files, and folders can contain millions of files. Be intentional when choosing what folders you choose to index. And we recommend never getting all files from folder 0 recursively. Folder ID 0 is your root folder.\n",
":::\n",
"\n",
"Files without a text representation will be skipped.\n",
"\n",
"### Integration details\n",
"\n",
"| Class | Package | Local | Serializable | JS support|\n",
"| :--- | :--- | :---: | :---: | :---: |\n",
"| [BoxLoader](https://api.python.langchain.com/en/latest/document_loaders/langchain_box.document_loaders.langchain_boxloader.BoxLoader.html) | [langchain_box](https://api.python.langchain.com/en/latest/box_api_reference.html) | ✅ | ❌ | ❌ | \n",
"### Loader features\n",
"| Source | Document Lazy Loading | Async Support\n",
"| :---: | :---: | :---: | \n",
"| BoxLoader | ✅ | ❌ | \n",
"\n",
"## Setup\n",
"\n",
"In order to use the Box package, you will need a few things:\n",
"\n",
"* A Box account — If you are not a current Box customer or want to test outside of your production Box instance, you can use a [free developer account](https://account.box.com/signup/n/developer#ty9l3).\n",
"* [A Box app](https://developer.box.com/guides/getting-started/first-application/) — This is configured in the [developer console](https://account.box.com/developers/console), and for Box AI, must have the `Manage AI` scope enabled. Here you will also select your authentication method\n",
"* The app must be [enabled by the administrator](https://developer.box.com/guides/authorization/custom-app-approval/#manual-approval). For free developer accounts, this is whomever signed up for the account.\n",
"\n",
"### Credentials\n",
"\n",
"For these examples, we will use [token authentication](https://developer.box.com/guides/authentication/tokens/developer-tokens). This can be used with any [authentication method](https://developer.box.com/guides/authentication/). Just get the token with whatever methodology. If you want to learn more about how to use other authentication types with `langchain-box`, visit the [Box provider](/docs/integrations/providers/box) document.\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Enter your Box Developer Token: ········\n"
]
}
],
"source": [
"import getpass\n",
"import os\n",
"\n",
"box_developer_token = getpass.getpass(\"Enter your Box Developer Token: \")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to get automated tracing of your model calls you can also set your [LangSmith](https://docs.smith.langchain.com/) API key by uncommenting below:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# os.environ[\"LANGSMITH_API_KEY\"] = getpass.getpass(\"Enter your LangSmith API key: \")\n",
"# os.environ[\"LANGSMITH_TRACING\"] = \"true\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Installation\n",
"\n",
"Install **langchain_box**."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%pip install -qU langchain_box"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initialization\n",
"\n",
"### Load files\n",
"\n",
"If you wish to load files, you must provide the `List` of file ids at instantiation time. \n",
"\n",
"This requires 1 piece of information:\n",
"\n",
"* **box_file_ids** (`List[str]`)- A list of Box file IDs. "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"from langchain_box.document_loaders import BoxLoader\n",
"\n",
"box_file_ids = [\"1514555423624\", \"1514553902288\"]\n",
"\n",
"loader = BoxLoader(\n",
" box_developer_token=box_developer_token,\n",
" box_file_ids=box_file_ids,\n",
" character_limit=10000, # Optional. Defaults to no limit\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load from folder\n",
"\n",
"If you wish to load files from a folder, you must provide a `str` with the Box folder ID at instantiation time. \n",
"\n",
"This requires 1 piece of information:\n",
"\n",
"* **box_folder_id** (`str`)- A string containing a Box folder ID. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain_box.document_loaders import BoxLoader\n",
"\n",
"box_folder_id = \"260932470532\"\n",
"\n",
"loader = BoxLoader(\n",
" box_folder_id=box_folder_id,\n",
" recursive=False, # Optional. return entire tree, defaults to False\n",
" character_limit=10000, # Optional. Defaults to no limit\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Document(metadata={'source': 'https://dl.boxcloud.com/api/2.0/internal_files/1514555423624/versions/1663171610024/representations/extracted_text/content/', 'title': 'Invoice-A5555_txt'}, page_content='Vendor: AstroTech Solutions\\nInvoice Number: A5555\\n\\nLine Items:\\n - Gravitational Wave Detector Kit: $800\\n - Exoplanet Terrarium: $120\\nTotal: $920')"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs = loader.load()\n",
"docs[0]"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'source': 'https://dl.boxcloud.com/api/2.0/internal_files/1514555423624/versions/1663171610024/representations/extracted_text/content/', 'title': 'Invoice-A5555_txt'}\n"
]
}
],
"source": [
"print(docs[0].metadata)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Lazy Load"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"page = []\n",
"for doc in loader.lazy_load():\n",
" page.append(doc)\n",
" if len(page) >= 10:\n",
" # do some paged operation, e.g.\n",
" # index.upsert(page)\n",
"\n",
" page = []"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## API reference\n",
"\n",
"For detailed documentation of all BoxLoader features and configurations head to the API reference: https://api.python.langchain.com/en/latest/document_loaders/langchain_box.document_loaders.langchain_box_loader.BoxLoader.html\n",
"\n",
"\n",
"## Help\n",
"\n",
"If you have questions, you can check out our [developer documentation](https://developer.box.com) or reach out to use in our [developer community](https://community.box.com)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.6"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

View File

@@ -0,0 +1,174 @@
# Box
[Box](https://box.com) is the Intelligent Content Cloud, a single platform that enables
organizations to fuel collaboration, manage the entire content lifecycle, secure critical content,
and transform business workflows with enterprise AI. Founded in 2005, Box simplifies work for
leading global organizations, including AstraZeneca, JLL, Morgan Stanley, and Nationwide.
In this package, we make available a number of ways to include Box content in your AI workflows.
### Installation and setup
```text
%pip install -U langchain-box
```
# langchain-box
This package contains the LangChain integration with Box. For more information about
Box, check out our [developer documentation](https://developer.box.com).
## Pre-requisites
In order to integrate with Box, you need a few things:
* A Box instance — if you are not a current Box customer, sign up for a
[free dev account](https://account.box.com/signup/n/developer#ty9l3).
* A Box app — more on how to
[create an app](https://developer.box.com/guides/getting-started/first-application/)
* Your app approved in your Box instance — This is done by your admin.
The good news is if you are using a free developer account, you are the admin.
[Authorize your app](https://developer.box.com/guides/authorization/custom-app-approval/#manual-approval)
## Installation
```bash
pip install -U langchain-box
```
## Authentication
The `box-langchain` package offers some flexibility to authentication. The
most basic authentication method is by using a developer token. This can be
found in the [Box developer console](https://account.box.com/developers/console)
on the configuration screen. This token is purposely short-lived (1 hour) and is
intended for development. With this token, you can add it to your environment as
`BOX_DEVELOPER_TOKEN`, you can pass it directly to the loader, or you can use the
`BoxAuth` authentication helper class.
We will cover passing it directly to the loader in the section below.
### BoxAuth helper class
`BoxAuth` supports the following authentication methods:
* Token — either a developer token or any token generated through the Box SDK
* JWT with a service account
* JWT with a specified user
* CCG with a service account
* CCG with a specified user
:::note
If using JWT authentication, you will need to download the configuration from the Box
developer console after generating your public/private key pair. Place this file in your
application directory structure somewhere. You will use the path to this file when using
the `BoxAuth` helper class.
:::
For more information, learn about how to
[set up a Box application](https://developer.box.com/guides/getting-started/first-application/),
and check out the
[Box authentication guide](https://developer.box.com/guides/authentication/select/)
for more about our different authentication options.
Examples:
**Token**
```python
from langchain_box.document_loaders import BoxLoader
from langchain_box.utilities import BoxAuth, BoxAuthType
auth = BoxAuth(
auth_type=BoxAuthType.TOKEN,
box_developer_token=box_developer_token
)
loader = BoxLoader(
box_auth=auth,
...
)
```
**JWT with a service account**
```python
from langchain_box.document_loaders import BoxLoader
from langchain_box.utilities import BoxAuth, BoxAuthType
auth = BoxAuth(
auth_type=BoxAuthType.JWT,
box_jwt_path=box_jwt_path
)
loader = BoxLoader(
box_auth=auth,
...
```
**JWT with a specified user**
```python
from langchain_box.document_loaders import BoxLoader
from langchain_box.utilities import BoxAuth, BoxAuthType
auth = BoxAuth(
auth_type=BoxAuthType.JWT,
box_jwt_path=box_jwt_path,
box_user_id=box_user_id
)
loader = BoxLoader(
box_auth=auth,
...
```
**CCG with a service account**
```python
from langchain_box.document_loaders import BoxLoader
from langchain_box.utilities import BoxAuth, BoxAuthType
auth = BoxAuth(
auth_type=BoxAuthType.CCG,
box_client_id=box_client_id,
box_client_secret=box_client_secret,
box_enterprise_id=box_enterprise_id
)
loader = BoxLoader(
box_auth=auth,
...
```
**CCG with a specified user**
```python
from langchain_box.document_loaders import BoxLoader
from langchain_box.utilities import BoxAuth, BoxAuthType
auth = BoxAuth(
auth_type=BoxAuthType.CCG,
box_client_id=box_client_id,
box_client_secret=box_client_secret,
box_user_id=box_user_id
)
loader = BoxLoader(
box_auth=auth,
...
```
If you wish to use OAuth2 with the authorization_code flow, please use `BoxAuthType.TOKEN` with the token you have acquired.
## Document Loaders
### BoxLoader
[See usage example](/docs/integrations/document_loaders/box)
```python
from langchain_box.document_loaders import BoxLoader
```