Compare commits

...

4 Commits

Author SHA1 Message Date
Bagatur
15951239df wip 2023-08-03 13:26:56 -07:00
Bagatur
6def0a4ed0 Merge branch 'master' into pprados/google_drive 2023-08-03 10:43:43 -07:00
Philippe Prados
80f5e05181 Resync in 3 august 2023-08-03 17:07:47 +02:00
Philippe Prados
7fe77245af Resynch in 3 august 2023-08-03 12:48:54 +02:00
27 changed files with 16532 additions and 489 deletions

View File

@@ -2,14 +2,11 @@
"cells": [
{
"cell_type": "markdown",
"id": "b0ed136e-6983-4893-ae1b-b75753af05f8",
"id": "0b02f34c",
"metadata": {},
"source": [
"# Google Drive\n",
"\n",
">[Google Drive](https://en.wikipedia.org/wiki/Google_Drive) is a file storage and synchronization service developed by Google.\n",
"\n",
"This notebook covers how to load documents from `Google Drive`. Currently, only `Google Docs` are supported.\n",
"# Google Drive Loader\n",
"This notebook covers how to retrieve documents from Google Drive.\n",
"\n",
"## Prerequisites\n",
"\n",
@@ -18,12 +15,21 @@
"1. [Authorize credentials for desktop app](https://developers.google.com/drive/api/quickstart/python#authorize_credentials_for_a_desktop_application)\n",
"1. `pip install --upgrade google-api-python-client google-auth-httplib2 google-auth-oauthlib`\n",
"\n",
"## 🧑 Instructions for ingesting your Google Docs data\n",
"By default, the `GoogleDriveLoader` expects the `credentials.json` file to be `~/.credentials/credentials.json`, but this is configurable using the `credentials_path` keyword argument. Same thing with `token.json` - `token_path`. Note that `token.json` will be created automatically the first time you use the loader.\n",
"\n",
"`GoogleDriveLoader` can load from a list of Google Docs document ids or a folder id. You can obtain your folder and document id from the URL:\n",
"## Instructions for retrieving your Google Docs data\n",
"By default, the `GoogleDriveLoader` expects the `credentials.json` file to be `~/.credentials/credentials.json`, but this is configurable using the `GOOGLE_ACCOUNT_FILE` environment variable. \n",
"The location of `token.json` use the same directory (or use the parameter `token_path`). Note that `token.json` will be created automatically the first time you use the loader.\n"
]
},
{
"cell_type": "markdown",
"id": "a03b9067",
"metadata": {},
"source": [
"You can obtain your folder and document id from the URL:\n",
"* Folder: https://drive.google.com/drive/u/0/folders/1yucgL9WGgWZdM1TOuKkeghlPizuzMYb5 -> folder id is `\"1yucgL9WGgWZdM1TOuKkeghlPizuzMYb5\"`\n",
"* Document: https://docs.google.com/document/d/1bfaMQ18_i56204VaQDVeAFpqEijJTgvurupdEDiaUQw/edit -> document id is `\"1bfaMQ18_i56204VaQDVeAFpqEijJTgvurupdEDiaUQw\"`"
"* Document: https://docs.google.com/document/d/1bfaMQ18_i56204VaQDVeAFpqEijJTgvurupdEDiaUQw/edit -> document id is `\"1bfaMQ18_i56204VaQDVeAFpqEijJTgvurupdEDiaUQw\"`\n",
"\n",
"The special value `root` is for your personal home."
]
},
{
@@ -33,12 +39,23 @@
"metadata": {},
"outputs": [],
"source": [
"!pip install --upgrade google-api-python-client google-auth-httplib2 google-auth-oauthlib"
"#!pip install --upgrade google-api-python-client google-auth-httplib2 google-auth-oauthlib"
]
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": null,
"id": "9bcb6cb1",
"metadata": {},
"outputs": [],
"source": [
"folder_id='root'\n",
"#folder_id='1yucgL9WGgWZdM1TOuKkeghlPizuzMYb5'"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "878928a6-a5ae-4f74-b351-64e3b01733fe",
"metadata": {
"tags": []
@@ -50,7 +67,7 @@
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": null,
"id": "2216c83f-68e4-4d2f-8ea2-5878fb18bbe7",
"metadata": {
"tags": []
@@ -58,174 +75,215 @@
"outputs": [],
"source": [
"loader = GoogleDriveLoader(\n",
" folder_id=\"1yucgL9WGgWZdM1TOuKkeghlPizuzMYb5\",\n",
" # Optional: configure whether to recursively fetch files from subfolders. Defaults to False.\n",
" folder_id=folder_id,\n",
" recursive=False,\n",
" num_results=2, # Maximum number of file to load\n",
")"
]
},
{
"cell_type": "markdown",
"id": "de5be5d4",
"metadata": {},
"source": [
"By default, all files with these mime-type can be converted to `Document`.\n",
"- text/text\n",
"- text/plain\n",
"- text/html\n",
"- text/csv\n",
"- text/markdown\n",
"- image/png\n",
"- image/jpeg\n",
"- application/epub+zip\n",
"- application/pdf\n",
"- application/rtf\n",
"- application/vnd.google-apps.document (GDoc)\n",
"- application/vnd.google-apps.presentation (GSlide)\n",
"- application/vnd.google-apps.spreadsheet (GSheet)\n",
"- application/vnd.google.colaboratory (Notebook colab)\n",
"- application/vnd.openxmlformats-officedocument.presentationml.presentation (PPTX)\n",
"- application/vnd.openxmlformats-officedocument.wordprocessingml.document (DOCX)\n",
"\n",
"It's possible to update or customize this. See the documentation of `GDriveLoader`.\n",
"\n",
"But, the corresponding packages must be installed."
]
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": null,
"id": "1bca45c9",
"metadata": {},
"outputs": [],
"source": [
"!pip install unstructured"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8f3b6aa0-b45d-4e37-8c50-5bebe70fdb9d",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"docs = loader.load()"
"for doc in loader.load():\n",
" print(\"---\")\n",
" print(doc.page_content.strip()[:60]+\"...\")"
]
},
{
"cell_type": "markdown",
"id": "2721ba8a",
"id": "31170e71",
"metadata": {},
"source": [
"When you pass a `folder_id` by default all files of type document, sheet and pdf are loaded. You can modify this behaviour by passing a `file_types` argument "
"# Customize the search pattern\n",
"\n",
"All parameter compatible with Google [`list()`](https://developers.google.com/drive/api/v3/reference/files/list)\n",
"API can be set.\n",
"\n",
"To specify the new pattern of the Google request, you can use a `PromptTemplate()`.\n",
"The variables for the prompt can be set with `kwargs` in the constructor.\n",
"Some pre-formated request are proposed (use `{query}`, `{folder_id}` and/or `{mime_type}`):\n",
"\n",
"You can customize the criteria to select the files. A set of predefined filter are proposed:\n",
"| template | description |\n",
"| -------------------------------------- | --------------------------------------------------------------------- |\n",
"| gdrive-all-in-folder | Return all compatible files from a `folder_id` |\n",
"| gdrive-query | Search `query` in all drives |\n",
"| gdrive-by-name | Search file with name `query` |\n",
"| gdrive-query-in-folder | Search `query` in `folder_id` (and sub-folders in `_recursive=true`) |\n",
"| gdrive-mime-type | Search a specific `mime_type` |\n",
"| gdrive-mime-type-in-folder | Search a specific `mime_type` in `folder_id` |\n",
"| gdrive-query-with-mime-type | Search `query` with a specific `mime_type` |\n",
"| gdrive-query-with-mime-type-and-folder | Search `query` with a specific `mime_type` and in `folder_id` |\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2ff83b4c",
"id": "0a47175f",
"metadata": {},
"outputs": [],
"source": [
"loader = GoogleDriveLoader(\n",
" folder_id=\"1yucgL9WGgWZdM1TOuKkeghlPizuzMYb5\",\n",
" file_types=[\"document\", \"sheet\"]\n",
" recursive=False\n",
")"
]
},
{
"cell_type": "markdown",
"id": "d6b80931",
"metadata": {},
"source": [
"## Passing in Optional File Loaders\n",
"\n",
"When processing files other than Google Docs and Google Sheets, it can be helpful to pass an optional file loader to `GoogleDriveLoader`. If you pass in a file loader, that file loader will be used on documents that do not have a Google Docs or Google Sheets MIME type. Here is an example of how to load an Excel document from Google Drive using a file loader. "
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "94207e39",
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders import GoogleDriveLoader\n",
"from langchain.document_loaders import UnstructuredFileIOLoader"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "a15fbee0",
"metadata": {},
"outputs": [],
"source": [
"file_id = \"1x9WBtFPWMEAdjcJzPScRsjpjQvpSo_kz\"\n",
"loader = GoogleDriveLoader(\n",
" file_ids=[file_id],\n",
" file_loader_cls=UnstructuredFileIOLoader,\n",
" file_loader_kwargs={\"mode\": \"elements\"},\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "98410bda",
"metadata": {},
"outputs": [],
"source": [
"docs = loader.load()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "e3e72221",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Document(page_content='\\n \\n \\n Team\\n Location\\n Stanley Cups\\n \\n \\n Blues\\n STL\\n 1\\n \\n \\n Flyers\\n PHI\\n 2\\n \\n \\n Maple Leafs\\n TOR\\n 13\\n \\n \\n', metadata={'filetype': 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet', 'page_number': 1, 'page_name': 'Stanley Cups', 'text_as_html': '<table border=\"1\" class=\"dataframe\">\\n <tbody>\\n <tr>\\n <td>Team</td>\\n <td>Location</td>\\n <td>Stanley Cups</td>\\n </tr>\\n <tr>\\n <td>Blues</td>\\n <td>STL</td>\\n <td>1</td>\\n </tr>\\n <tr>\\n <td>Flyers</td>\\n <td>PHI</td>\\n <td>2</td>\\n </tr>\\n <tr>\\n <td>Maple Leafs</td>\\n <td>TOR</td>\\n <td>13</td>\\n </tr>\\n </tbody>\\n</table>', 'category': 'Table', 'source': 'https://drive.google.com/file/d/1aA6L2AR3g0CR-PW03HEZZo4NaVlKpaP7/view'})"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs[0]"
]
},
{
"cell_type": "markdown",
"id": "238cd06f",
"metadata": {},
"source": [
"You can also process a folder with a mix of files and Google Docs/Sheets using the following pattern:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "0e2d093f",
"metadata": {},
"outputs": [],
"source": [
"folder_id = \"1asMOHY1BqBS84JcRbOag5LOJac74gpmD\"\n",
"loader = GoogleDriveLoader(\n",
" folder_id=folder_id,\n",
" file_loader_cls=UnstructuredFileIOLoader,\n",
" file_loader_kwargs={\"mode\": \"elements\"},\n",
" recursive=False,\n",
" template=\"gdrive-query\", # Default template to use\n",
" query=\"machine learning\",\n",
" num_results=2, # Maximum number of file to load\n",
" supportsAllDrives=False, # GDrive `list()` parameter\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "b35ddcc6",
"execution_count": null,
"id": "100cf361",
"metadata": {},
"outputs": [],
"source": [
"docs = loader.load()"
"for doc in loader.load():\n",
" print(\"---\")\n",
" print(doc.page_content.strip()[:60]+\"...\")"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "3cc141e0",
"cell_type": "markdown",
"id": "74e6e3aa",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Document(page_content='\\n \\n \\n Team\\n Location\\n Stanley Cups\\n \\n \\n Blues\\n STL\\n 1\\n \\n \\n Flyers\\n PHI\\n 2\\n \\n \\n Maple Leafs\\n TOR\\n 13\\n \\n \\n', metadata={'filetype': 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet', 'page_number': 1, 'page_name': 'Stanley Cups', 'text_as_html': '<table border=\"1\" class=\"dataframe\">\\n <tbody>\\n <tr>\\n <td>Team</td>\\n <td>Location</td>\\n <td>Stanley Cups</td>\\n </tr>\\n <tr>\\n <td>Blues</td>\\n <td>STL</td>\\n <td>1</td>\\n </tr>\\n <tr>\\n <td>Flyers</td>\\n <td>PHI</td>\\n <td>2</td>\\n </tr>\\n <tr>\\n <td>Maple Leafs</td>\\n <td>TOR</td>\\n <td>13</td>\\n </tr>\\n </tbody>\\n</table>', 'category': 'Table', 'source': 'https://drive.google.com/file/d/1aA6L2AR3g0CR-PW03HEZZo4NaVlKpaP7/view'})"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs[0]"
"You can customize your pattern."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e312268a",
"id": "dcf07ff7",
"metadata": {},
"outputs": [],
"source": []
"source": [
"from langchain.prompts.prompt import PromptTemplate\n",
"loader = GoogleDriveLoader(\n",
" folder_id=folder_id,\n",
" recursive=False,\n",
" template=PromptTemplate(\n",
" input_variables=[\"query\", \"query_name\"],\n",
" template=\"fullText contains '{query}' and name contains '{query_name}' and trashed=false\",\n",
" ), # Default template to use\n",
" query=\"machine learning\",\n",
" query_name=\"ML\", \n",
" num_results=2, # Maximum number of file to load\n",
")\n",
"for doc in loader.load():\n",
" print(\"---\")\n",
" print(doc.page_content.strip()[:60]+\"...\")"
]
},
{
"cell_type": "markdown",
"id": "8e404472",
"metadata": {},
"source": [
"# Modes for GSlide and GSheet\n",
"\n",
"The parameter `mode` accept differents values:\n",
"- `\"document\"`: return the body of each documents\n",
"- `\"snippets\"`: return the `description` of each files.\n",
"\n",
"\n",
"The parameter `gslide_mode` accept differents values:\n",
"- `\"single\"` : one document with `<PAGE BREAK>`\n",
"- `\"slide\"` : one document by slide\n",
"- `\"elements\"` : one document for each `elements`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b33d1a53",
"metadata": {},
"outputs": [],
"source": [
"loader = GoogleDriveLoader(\n",
" template=\"gdrive-mime-type\",\n",
" mime_type=\"application/vnd.google-apps.presentation\", # Only GSlide files\n",
" gslide_mode=\"slide\",\n",
" num_results=2, # Maximum number of file to load\n",
")\n",
"for doc in loader.load():\n",
" print(\"---\")\n",
" print(doc.page_content.strip()[:60]+\"...\")"
]
},
{
"cell_type": "markdown",
"id": "498f0451",
"metadata": {},
"source": [
"The parameter `gsheet_mode` accept differents values:\n",
"- `\"single\"`: Generate one document by line\n",
"- `\"elements\"` : one document with markdown array and `<PAGE BREAK>` tags."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "884c4ca6",
"metadata": {},
"outputs": [],
"source": [
"loader = GoogleDriveLoader(\n",
" template=\"gdrive-mime-type\",\n",
" mime_type=\"application/vnd.google-apps.spreadsheet\", # Only GSheet files\n",
" gsheet_mode=\"elements\",\n",
" num_results=2, # Maximum number of file to load\n",
")\n",
"for doc in loader.load():\n",
" print(\"---\")\n",
" print(doc.page_content.strip()[:60]+\"...\")"
]
}
],
"metadata": {
@@ -244,7 +302,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.13"
"version": "3.10.9"
}
},
"nbformat": 4,

View File

@@ -2,7 +2,7 @@
>[Google Drive](https://en.wikipedia.org/wiki/Google_Drive) is a file storage and synchronization service developed by Google.
Currently, only `Google Docs` are supported.
The full Google Drive API is supported, with integrations for Google Docs, Google Sheets and Google Slides.
## Installation and Setup
@@ -20,3 +20,22 @@ See a [usage example and authorizing instructions](/docs/integrations/document_l
```python
from langchain.document_loaders import GoogleDriveLoader
```
## Retriever
See a [usage example and authorizing instructions](/docs/modules/data_connection/retrievers/integrations/google_drive.html).
```python
from langchain.retrievers import GoogleDriveRetriever
```
## Tools
See a [usage example and authorizing instructions](/docs/modules/agents/tools/integrations/google_drive.html).
```python
from langchain.tools import GoogleDriveSearchTool
from langchain.utilities import GoogleDriveAPIWrapper
```

View File

@@ -0,0 +1,279 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "b0ed136e-6983-4893-ae1b-b75753af05f8",
"metadata": {},
"source": [
"# Google Drive Retriever\n",
"This notebook covers how to retrieve documents from Google Drive.\n",
"\n",
"## Prerequisites\n",
"\n",
"1. Create a Google Cloud project or use an existing project\n",
"1. Enable the [Google Drive API](https://console.cloud.google.com/flows/enableapi?apiid=drive.googleapis.com)\n",
"1. [Authorize credentials for desktop app](https://developers.google.com/drive/api/quickstart/python#authorize_credentials_for_a_desktop_application)\n",
"1. `pip install --upgrade google-api-python-client google-auth-httplib2 google-auth-oauthlib`\n",
"\n",
"## Instructions for retrieving your Google Docs data\n",
"By default, the `GoogleDriveRetriever` expects the `credentials.json` file to be `~/.credentials/credentials.json`, but this is configurable using the `GOOGLE_ACCOUNT_FILE` environment variable. \n",
"The location of `token.json` use the same directory (or use the parameter `token_path`). Note that `token.json` will be created automatically the first time you use the retriever.\n",
"\n",
"`GoogleDriveRetriever` can retrieve a selection of files with some requests. \n",
"\n",
"By default, If you use a `folder_id`, all the files inside this folder can be retrieved to `Document`.\n"
]
},
{
"cell_type": "markdown",
"id": "35b94a93-97de-4af8-9cca-de9ffb7930c3",
"metadata": {},
"source": [
"You can obtain your folder and document id from the URL:\n",
"* Folder: https://drive.google.com/drive/u/0/folders/1yucgL9WGgWZdM1TOuKkeghlPizuzMYb5 -> folder id is `\"1yucgL9WGgWZdM1TOuKkeghlPizuzMYb5\"`\n",
"* Document: https://docs.google.com/document/d/1bfaMQ18_i56204VaQDVeAFpqEijJTgvurupdEDiaUQw/edit -> document id is `\"1bfaMQ18_i56204VaQDVeAFpqEijJTgvurupdEDiaUQw\"`\n",
"\n",
"The special value `root` is for your personal home."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9c9665c9-a023-4078-9d95-e43021cecb6f",
"metadata": {},
"outputs": [],
"source": [
"#!pip install --upgrade google-api-python-client google-auth-httplib2 google-auth-oauthlib"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "878928a6-a5ae-4f74-b351-64e3b01733fe",
"metadata": {
"ExecuteTime": {
"end_time": "2023-05-09T10:45:59.438650905Z",
"start_time": "2023-05-09T10:45:57.955900302Z"
},
"tags": []
},
"outputs": [],
"source": [
"from langchain.retrievers import GoogleDriveRetriever"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "755907c2-145d-4f0f-9b15-07a628a2d2d2",
"metadata": {
"ExecuteTime": {
"end_time": "2023-05-09T10:45:59.442890834Z",
"start_time": "2023-05-09T10:45:59.440941528Z"
},
"tags": []
},
"outputs": [],
"source": [
"folder_id=\"root\"\n",
"#folder_id='1yucgL9WGgWZdM1TOuKkeghlPizuzMYb5'"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2216c83f-68e4-4d2f-8ea2-5878fb18bbe7",
"metadata": {
"ExecuteTime": {
"end_time": "2023-05-09T10:45:59.795842403Z",
"start_time": "2023-05-09T10:45:59.445262457Z"
},
"tags": []
},
"outputs": [],
"source": [
"retriever = GoogleDriveRetriever(\n",
" num_results=2,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "fa339ca0-f478-440c-ba80-0e5f41a19ce1",
"metadata": {},
"source": [
"By default, all files with these mime-type can be converted to `Document`.\n",
"- text/text\n",
"- text/plain\n",
"- text/html\n",
"- text/csv\n",
"- text/markdown\n",
"- image/png\n",
"- image/jpeg\n",
"- application/epub+zip\n",
"- application/pdf\n",
"- application/rtf\n",
"- application/vnd.google-apps.document (GDoc)\n",
"- application/vnd.google-apps.presentation (GSlide)\n",
"- application/vnd.google-apps.spreadsheet (GSheet)\n",
"- application/vnd.google.colaboratory (Notebook colab)\n",
"- application/vnd.openxmlformats-officedocument.presentationml.presentation (PPTX)\n",
"- application/vnd.openxmlformats-officedocument.wordprocessingml.document (DOCX)\n",
"\n",
"It's possible to update or customize this. See the documentation of `GDriveRetriever`.\n",
"\n",
"But, the corresponding packages must be installed."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9dadec48",
"metadata": {},
"outputs": [],
"source": [
"#!pip install unstructured"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8f3b6aa0-b45d-4e37-8c50-5bebe70fdb9d",
"metadata": {
"ExecuteTime": {
"end_time": "2023-05-09T10:46:00.990310466Z",
"start_time": "2023-05-09T10:45:59.798774595Z"
},
"tags": []
},
"outputs": [],
"source": [
"retriever.get_relevant_documents(\"machine learning\")"
]
},
{
"cell_type": "markdown",
"id": "8ff33817-8619-4897-8742-2216b9934d2a",
"metadata": {},
"source": [
"You can customize the criteria to select the files. A set of predefined filter are proposed:\n",
"| template | description |\n",
"| -------------------------------------- | --------------------------------------------------------------------- |\n",
"| gdrive-all-in-folder | Return all compatible files from a `folder_id` |\n",
"| gdrive-query | Search `query` in all drives |\n",
"| gdrive-by-name | Search file with name `query`) |\n",
"| gdrive-query-in-folder | Search `query` in `folder_id` (and sub-folders in `_recursive=true`) |\n",
"| gdrive-mime-type | Search a specific `mime_type` |\n",
"| gdrive-mime-type-in-folder | Search a specific `mime_type` in `folder_id` |\n",
"| gdrive-query-with-mime-type | Search `query` with a specific `mime_type` |\n",
"| gdrive-query-with-mime-type-and-folder | Search `query` with a specific `mime_type` and in `folder_id` |"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9977c712-9659-4959-b508-f59cc7d49d44",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"retriever = GoogleDriveRetriever(\n",
" template=\"gdrive-query\", # Search everywhere\n",
" num_results=2, # But take only 2 documents\n",
")\n",
"for doc in retriever.get_relevant_documents(\"machine learning\"):\n",
" print(\"---\")\n",
" print(doc.page_content.strip()[:60]+\"...\")"
]
},
{
"cell_type": "markdown",
"id": "a5a0f3ef-26fb-4a5c-85f0-5aba90b682b1",
"metadata": {},
"source": [
"Else, you can customize the prompt with a specialized `PromptTemplate`"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b0bbebde-0487-4d20-9d77-8070e4f0e0d6",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain import PromptTemplate\n",
"retriever = GoogleDriveRetriever(\n",
" template=PromptTemplate(input_variables=['query'],\n",
" # See https://developers.google.com/drive/api/guides/search-files\n",
" template=\"(fullText contains '{query}') \"\n",
" \"and mimeType='application/vnd.google-apps.document' \"\n",
" \"and modifiedTime > '2000-01-01T00:00:00' \"\n",
" \"and trashed=false\"),\n",
" num_results=2,\n",
" # See https://developers.google.com/drive/api/v3/reference/files/list\n",
" includeItemsFromAllDrives=False,\n",
" supportsAllDrives=False,\n",
")\n",
"for doc in retriever.get_relevant_documents(\"machine learning\"):\n",
" print(f\"{doc.metadata['name']}:\")\n",
" print(\"---\")\n",
" print(doc.page_content.strip()[:60]+\"...\")"
]
},
{
"cell_type": "markdown",
"id": "9b6fed29-1666-452e-b677-401613270388",
"metadata": {},
"source": [
"# Use GDrive 'description' metadata\n",
"Each Google Drive has a `description` field in metadata (see the *details of a file*).\n",
"Use the `snippets` mode to return the description of selected files.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "342dbe12-ed83-40f4-8957-0cc8c4609542",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"retriever = GoogleDriveRetriever(\n",
" template='gdrive-mime-type-in-folder',\n",
" folder_id=folder_id,\n",
" mime_type='application/vnd.google-apps.document', # Only Google Docs\n",
" num_results=2,\n",
" mode='snippets',\n",
" includeItemsFromAllDrives=False,\n",
" supportsAllDrives=False,\n",
")\n",
"retriever.get_relevant_documents(\"machine learning\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
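
The retriever notebook above fills a search pattern such as `fullText contains '{query}'` from a user query. The sketch below shows how such a pattern could be filled with plain `str.format` while escaping single quotes, which the Drive search guide asks for inside quoted values. `escape_drive_value` and `build_query` are hypothetical helpers for illustration; they do not describe how `PromptTemplate` or the retriever behaves internally.

```python
def escape_drive_value(value: str) -> str:
    """Escape single quotes so the value can sit inside '...' in a Drive query."""
    return value.replace("'", "\\'")


def build_query(template: str, **variables: str) -> str:
    """Fill `template` with escaped variables using plain str.format."""
    escaped = {name: escape_drive_value(value) for name, value in variables.items()}
    return template.format(**escaped)


# Same pattern as the custom template in the notebook.
TEMPLATE = (
    "(fullText contains '{query}') "
    "and mimeType='application/vnd.google-apps.document' "
    "and modifiedTime > '2000-01-01T00:00:00' "
    "and trashed=false"
)

print(build_query(TEMPLATE, query="machine learning"))
print(build_query("name contains '{query}'", query="Valentine's Day"))
```

Escaping before substitution keeps a query containing an apostrophe from terminating the quoted value early.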

View File

@@ -0,0 +1,215 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Google Drive tool\n",
"\n",
"This notebook walks through connecting a LangChain to the Google Drive API.\n",
"\n",
"## Prerequisites\n",
"\n",
"1. Create a Google Cloud project or use an existing project\n",
"1. Enable the [Google Drive API](https://console.cloud.google.com/flows/enableapi?apiid=drive.googleapis.com)\n",
"1. [Authorize credentials for desktop app](https://developers.google.com/drive/api/quickstart/python#authorize_credentials_for_a_desktop_application)\n",
"1. `pip install --upgrade google-api-python-client google-auth-httplib2 google-auth-oauthlib`\n",
"\n",
"## Instructions for retrieving your Google Docs data\n",
"By default, the `GoogleDriveTools` and `GoogleDriveWrapper` expects the `credentials.json` file to be `~/.credentials/credentials.json`, but this is configurable using the `GOOGLE_ACCOUNT_FILE` environment variable. \n",
"The location of `token.json` use the same directory (or use the parameter `token_path`). Note that `token.json` will be created automatically the first time you use the tool.\n",
"\n",
"`GoogleDriveSearchTool` can retrieve a selection of files with some requests. \n",
"\n",
"By default, If you use a `folder_id`, all the files inside this folder can be retrieved to `Document`, if the name match the query.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#!pip install --upgrade google-api-python-client google-auth-httplib2 google-auth-oauthlib"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can obtain your folder and document id from the URL:\n",
"* Folder: https://drive.google.com/drive/u/0/folders/1yucgL9WGgWZdM1TOuKkeghlPizuzMYb5 -> folder id is `\"1yucgL9WGgWZdM1TOuKkeghlPizuzMYb5\"`\n",
"* Document: https://docs.google.com/document/d/1bfaMQ18_i56204VaQDVeAFpqEijJTgvurupdEDiaUQw/edit -> document id is `\"1bfaMQ18_i56204VaQDVeAFpqEijJTgvurupdEDiaUQw\"`\n",
"\n",
"The special value `root` is for your personal home."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"folder_id=\"root\"\n",
"#folder_id='1yucgL9WGgWZdM1TOuKkeghlPizuzMYb5'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By default, all files with these mime-type can be converted to `Document`.\n",
"- text/text\n",
"- text/plain\n",
"- text/html\n",
"- text/csv\n",
"- text/markdown\n",
"- image/png\n",
"- image/jpeg\n",
"- application/epub+zip\n",
"- application/pdf\n",
"- application/rtf\n",
"- application/vnd.google-apps.document (GDoc)\n",
"- application/vnd.google-apps.presentation (GSlide)\n",
"- application/vnd.google-apps.spreadsheet (GSheet)\n",
"- application/vnd.google.colaboratory (Notebook colab)\n",
"- application/vnd.openxmlformats-officedocument.presentationml.presentation (PPTX)\n",
"- application/vnd.openxmlformats-officedocument.wordprocessingml.document (DOCX)\n",
"\n",
"It's possible to update or customize this. See the documentation of `GoogleDriveAPIWrapper`.\n",
"\n",
"But, the corresponding packages must installed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#!pip install unstructured"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain.utilities.google_drive import GoogleDriveAPIWrapper\n",
"from langchain.tools.google_drive.tool import GoogleDriveSearchTool\n",
"\n",
"# By default, search only in the filename.\n",
"tool = GoogleDriveSearchTool(\n",
" api_wrapper=GoogleDriveAPIWrapper(\n",
" folder_id=folder_id,\n",
" num_results=2,\n",
" template=\"gdrive-query-in-folder\", # Search in the body of documents\n",
" )\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import logging\n",
"logging.basicConfig(level=logging.INFO)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tool.run(\"machine learning\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tool.description"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain.agents import load_tools\n",
"tools = load_tools([\"google-drive-search\"],\n",
" folder_id=folder_id,\n",
" template=\"gdrive-query-in-folder\",\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Use within an Agent"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain import OpenAI\n",
"from langchain.agents import initialize_agent, AgentType\n",
"llm = OpenAI(temperature=0)\n",
"agent = initialize_agent(\n",
" tools=tools,\n",
" llm=llm,\n",
" agent=AgentType.STRUCTURED_CHAT_ZERO_SHOT_REACT_DESCRIPTION,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"agent.run(\n",
" \"Search in google drive, who is 'Yann LeCun' ?\"\n",
")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.9"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
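
The notebooks above describe where the credentials file is looked up: an explicit setting, then the `GOOGLE_ACCOUNT_FILE` environment variable, then a default under `~/.credentials`. The following is a rough sketch of that lookup order only. `resolve_credentials` and `token_path_for` are hypothetical names, and the default file name is taken from the notebook text as an assumption.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Default location quoted in the notebooks; treated here as an assumption.
DEFAULT_CREDENTIALS = Path.home() / ".credentials" / "credentials.json"


def resolve_credentials(
    explicit: Optional[Path] = None,
    env: Mapping[str, str] = os.environ,
) -> Path:
    """Pick the credentials file: explicit argument, then env var, then default."""
    if explicit is not None:
        return Path(explicit)
    if "GOOGLE_ACCOUNT_FILE" in env:
        return Path(env["GOOGLE_ACCOUNT_FILE"])
    return DEFAULT_CREDENTIALS


def token_path_for(credentials: Path) -> Path:
    """`token.json` is assumed to sit next to the credentials file."""
    return credentials.parent / "token.json"


creds = resolve_credentials(env={"GOOGLE_ACCOUNT_FILE": "/tmp/key.json"})
print(creds, token_path_for(creds))
```

Passing `env` as a parameter keeps the lookup easy to exercise without touching the real process environment.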

View File

@@ -57,6 +57,10 @@ from langchain.utilities.wikipedia import WikipediaAPIWrapper
from langchain.utilities.wolfram_alpha import WolframAlphaAPIWrapper
from langchain.utilities.openweathermap import OpenWeatherMapAPIWrapper
from langchain.utilities.dataforseo_api_search import DataForSeoAPIWrapper
from langchain.tools.google_drive.tool import (
GoogleDriveSearchTool,
GoogleDriveAPIWrapper,
)
def _get_python_repl() -> BaseTool:
@@ -180,6 +184,10 @@ def _get_wolfram_alpha(**kwargs: Any) -> BaseTool:
    return WolframAlphaQueryRun(api_wrapper=WolframAlphaAPIWrapper(**kwargs))


def _get_google_drive_search(**kwargs: Any) -> BaseTool:
    return GoogleDriveSearchTool(api_wrapper=GoogleDriveAPIWrapper(**kwargs))


def _get_google_search(**kwargs: Any) -> BaseTool:
    return GoogleSearchRun(api_wrapper=GoogleSearchAPIWrapper(**kwargs))
@@ -287,6 +295,15 @@ _EXTRA_LLM_TOOLS: Dict[
_EXTRA_OPTIONAL_TOOLS: Dict[str, Tuple[Callable[[KwArg(Any)], BaseTool], List[str]]] = {
"wolfram-alpha": (_get_wolfram_alpha, ["wolfram_alpha_appid"]),
"google-drive-search": (
_get_google_drive_search,
[
"gdrive_api_file",
"folder_id",
"mime_type",
"template",
],
),
"google-search": (_get_google_search, ["google_api_key", "google_cse_id"]),
"google-search-results-json": (
_get_google_search_results_json,
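
The hunk above registers `google-drive-search` as a factory plus a list of accepted keyword names. As a sketch of how such a registry could be consumed, the snippet below forwards only the declared keys to the factory. `FakeSearchTool`, `REGISTRY` and `build_tool` are hypothetical stand-ins for illustration, and the filtering step is an assumption about the surrounding loader code, which is not shown in this diff.

```python
from typing import Any, Callable, Dict, List, Tuple


class FakeSearchTool:
    """Stand-in for a real tool class; hypothetical, for illustration only."""

    def __init__(self, **kwargs: Any) -> None:
        self.kwargs = kwargs


def _get_fake_search(**kwargs: Any) -> FakeSearchTool:
    return FakeSearchTool(**kwargs)


# name -> (factory, keyword arguments the factory accepts)
REGISTRY: Dict[str, Tuple[Callable[..., FakeSearchTool], List[str]]] = {
    "fake-drive-search": (_get_fake_search, ["folder_id", "mime_type", "template"]),
}


def build_tool(name: str, **kwargs: Any) -> FakeSearchTool:
    """Look up `name` and forward only the keyword arguments it declares."""
    factory, accepted = REGISTRY[name]
    selected = {key: kwargs[key] for key in accepted if key in kwargs}
    return factory(**selected)


tool = build_tool(
    "fake-drive-search",
    folder_id="root",
    template="gdrive-query-in-folder",
    verbose=True,  # not declared for this tool, so it is dropped
)
print(tool.kwargs)
```

Declaring the accepted keys next to the factory lets one call site pass a shared bag of options to several tools without each tool receiving arguments it does not understand.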

View File

@@ -74,7 +74,7 @@ from langchain.document_loaders.geodataframe import GeoDataFrameLoader
from langchain.document_loaders.git import GitLoader
from langchain.document_loaders.gitbook import GitbookLoader
from langchain.document_loaders.github import GitHubIssuesLoader
from langchain.document_loaders.googledrive import GoogleDriveLoader
from langchain.document_loaders.google_drive import GoogleDriveLoader
from langchain.document_loaders.gutenberg import GutenbergLoader
from langchain.document_loaders.hn import HNLoader
from langchain.document_loaders.html import UnstructuredHTMLLoader

View File

@@ -0,0 +1,216 @@
"""Loads data from Google Drive.
Prerequisites:
1. Create a Google Cloud project
2. Enable the Google Drive API:
https://console.cloud.google.com/flows/enableapi?apiid=drive.googleapis.com
3. Authorize credentials for desktop app:
https://developers.google.com/drive/api/quickstart/python#authorize_credentials_for_a_desktop_application
4. For service accounts visit
https://cloud.google.com/iam/docs/service-accounts-create
""" # noqa: E501
import itertools
import logging
import os
import warnings
from pathlib import Path
from typing import (
Any,
Dict,
Iterator,
List,
Optional,
Sequence,
)
from pydantic.class_validators import root_validator
from langchain.base_language import BaseLanguageModel
from langchain.chains.summarize import load_summarize_chain
from langchain.document_loaders.base import BaseLoader
from langchain.prompts import PromptTemplate
from langchain.schema import Document
from langchain.utilities.google_drive import (
GoogleDriveUtilities,
get_template,
)
logger = logging.getLogger(__name__)
class GoogleDriveLoader(BaseLoader, GoogleDriveUtilities):
"""Loads data from Google Drive."""
document_ids: Optional[Sequence[str]] = None
"""A list of IDs of Google Drive documents to load."""
file_ids: Optional[Sequence[str]] = None
"""A list of IDs of Google Drive files to load."""
@root_validator(pre=True)
def validate_older_api_and_new_environment_variable(
cls, v: Dict[str, Any]
) -> Dict[str, Any]:
service_account_key = v.get("service_account_key")
credentials_path = v.get("credentials_path")
api_file = v.get("gdrive_api_file")
if service_account_key:
warnings.warn(
"service_account_key is deprecated. Use the GOOGLE_ACCOUNT_FILE env "
"variable.",
DeprecationWarning,
)
if credentials_path:
warnings.warn(
"credentials_path is deprecated. Use the GOOGLE_ACCOUNT_FILE env "
"variable.",
DeprecationWarning,
)
if service_account_key and credentials_path:
raise ValueError("Select only service_account_key or credentials_path")
folder_id = v.get("folder_id")
document_ids = v.get("document_ids")
file_ids = v.get("file_ids")
if folder_id and (document_ids or file_ids):
raise ValueError(
"Cannot specify both folder_id and document_ids nor "
"folder_id and file_ids"
)
# To be compatible with the old approach
if not api_file:
api_file = (
Path(os.environ["GOOGLE_ACCOUNT_FILE"])
if "GOOGLE_ACCOUNT_FILE" in os.environ
else None
)
# Deprecated: To be compatible with the old approach of authentication
if service_account_key:
api_file = service_account_key
elif credentials_path:
api_file = credentials_path
elif not api_file:
api_file = Path.home() / ".credentials" / "keys.json"
v["gdrive_api_file"] = api_file
if not v.get("template"):
if folder_id:
template = get_template("gdrive-all-in-folder")
elif "document_ids" in v or "file_ids" in v:
template = PromptTemplate(input_variables=[], template="")
else:
raise ValueError(
"No template given: set template, folder_id, document_ids or file_ids"
)
v["template"] = template
return v
def lazy_load(self) -> Iterator[Document]:
ids = self.document_ids or self.file_ids
if ids:
yield from (self.load_document_from_id(_id) for _id in ids)
else:
yield from self.lazy_get_relevant_documents()
def load(self) -> List[Document]:
return list(self.lazy_load())
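
Note on `lazy_load`: because one branch uses `yield from`, the whole function is a generator, and a generator only hands items to its caller through `yield` or `yield from`. A minimal sketch of the difference, using a hypothetical `_fallback` generator as a stand-in for `lazy_get_relevant_documents()` and plain strings in place of `Document` objects:

```python
from typing import Iterator, List, Optional


def _fallback() -> Iterator[str]:
    # Hypothetical stand-in for lazy_get_relevant_documents().
    yield "doc-a"
    yield "doc-b"


def load_with_return(ids: Optional[List[str]]) -> Iterator[str]:
    if ids:
        yield from (f"doc-{i}" for i in ids)
    else:
        # Inside a generator function, `return value` only sets
        # StopIteration.value; the caller receives no items.
        return _fallback()


def load_with_yield_from(ids: Optional[List[str]]) -> Iterator[str]:
    if ids:
        yield from (f"doc-{i}" for i in ids)
    else:
        # Delegation: every item produced by the fallback reaches the caller.
        yield from _fallback()


if __name__ == "__main__":
    print(list(load_with_return(None)))
    print(list(load_with_yield_from(None)))
```

Both variants behave the same when `ids` is given; they differ only in the fallback branch.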
def lazy_update_description_with_summary(
loader: GoogleDriveLoader,
llm: BaseLanguageModel,
*,
force: bool = False,
query: str = "",
**kwargs: Any,
) -> Iterator[Document]:
"""Summarize all documents, and update the GDrive metadata `description`.
Requires `write` access: set scopes=["https://www.googleapis.com/auth/drive"].
Note: Updates the description of a shortcut without touching the target
file's description.
Args:
loader: The GoogleDriveLoader used to find and update the files.
llm: Language model to use.
force: True to update all files; otherwise update only files whose
description is empty.
query: Query passed to the document search; may be empty.
kwargs: Other parameters for the summarize chain and the template
(verbose, prompt, etc.).
"""
try:
from googleapiclient.errors import HttpError
except ImportError as e:
raise ImportError(
"Could not import googleapiclient. "
"Please install it with `pip install google-api-python-client`."
) from e
if "https://www.googleapis.com/auth/drive" not in loader._creds.scopes:
raise ValueError(
f"Remove the file 'token.json' and "
f"initialize the {loader.__class__.__name__} with "
f"scopes=['https://www.googleapis.com/auth/drive']"
)
chain = load_summarize_chain(llm, chain_type="stuff", **kwargs)
updated_files = set() # Never update the same document twice (if it's split)
for document in loader.lazy_get_relevant_documents(query, **kwargs):
try:
file_id = document.metadata["gdriveId"]
if file_id not in updated_files:
file = loader.files.get(
fileId=file_id,
fields=loader.fields,
supportsAllDrives=True,
).execute()
if force or not file.get("description", "").strip():
summary = chain.run([document]).strip()
if summary:
loader.files.update(
fileId=file_id,
supportsAllDrives=True,
body={"description": summary},
).execute()
logger.info(
f"For the file '{file['name']}', add description "
f"'{summary[:40]}...'"
)
metadata = loader._extract_meta_data(file)
if "summary" in metadata:
del metadata["summary"]
yield Document(page_content=summary, metadata=metadata)
updated_files.add(file_id)
except HttpError:
logger.warning(
f"Impossible to update the description of file "
f"'{document.metadata['name']}'"
)
def update_description_with_summary(
loader: GoogleDriveLoader,
llm: BaseLanguageModel,
*,
force: bool = False,
query: str = "",
**kwargs: Any,
) -> List[Document]:
"""Summarize all documents, and update the GDrive metadata `description`.
Requires `write` access: set scopes=["https://www.googleapis.com/auth/drive"].
Note: Updates the description of a shortcut without touching the target
file's description.
Args:
loader: The GoogleDriveLoader used to find and update the files.
llm: Language model to use.
force: True to update all files; otherwise update only files whose
description is empty.
query: Query passed to the document search; may be empty.
kwargs: Other parameters for the summarize chain and the template
(verbose, prompt, etc.).
"""
return list(
lazy_update_description_with_summary(
loader, llm, force=force, query=query, **kwargs
)
)
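
Note on the update loop: stripped of the Drive calls, `lazy_update_description_with_summary` visits each file at most once and writes a summary only when `force` is set or the current description is empty. A minimal sketch of that control flow, assuming an in-memory `files` dict and a `summarize` callable as hypothetical stand-ins for the Drive `files` resource and the summarize chain (marking a file as seen on first encounter is also an assumption of this sketch):

```python
from typing import Callable, Dict, Iterable, Iterator, Set, Tuple


def update_descriptions(
    chunks: Iterable[Tuple[str, str]],
    files: Dict[str, Dict[str, str]],
    summarize: Callable[[str], str],
    force: bool = False,
) -> Iterator[Tuple[str, str]]:
    """Yield (file_id, summary) for each file whose description was written."""
    seen: Set[str] = set()
    for file_id, text in chunks:
        if file_id in seen:
            continue  # a split document produces several chunks per file
        seen.add(file_id)
        current = files[file_id].get("description", "").strip()
        if force or not current:
            summary = summarize(text).strip()
            if summary:
                files[file_id]["description"] = summary
                yield file_id, summary


if __name__ == "__main__":
    store = {"1": {"name": "a"}, "2": {"name": "b", "description": "kept"}}
    parts = [("1", "x"), ("1", "y"), ("2", "z")]
    print(list(update_descriptions(parts, store, str.upper)))
```

With `force=False`, a second pass over the same files yields nothing, because every description is then non-empty.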

@@ -1,353 +1,4 @@
"""Loads data from Google Drive."""
"""DEPRECATED: Kept for backwards compatibility."""
from langchain.document_loaders.google_drive import GoogleDriveLoader
# Prerequisites:
# 1. Create a Google Cloud project
# 2. Enable the Google Drive API:
# https://console.cloud.google.com/flows/enableapi?apiid=drive.googleapis.com
# 3. Authorize credentials for desktop app:
# https://developers.google.com/drive/api/quickstart/python#authorize_credentials_for_a_desktop_application # noqa: E501
# 4. For service accounts visit
# https://cloud.google.com/iam/docs/service-accounts-create
import os
from pathlib import Path
from typing import Any, Dict, List, Optional, Sequence, Union
from pydantic import BaseModel, root_validator, validator
from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader
SCOPES = ["https://www.googleapis.com/auth/drive.readonly"]
class GoogleDriveLoader(BaseLoader, BaseModel):
"""Loads Google Docs from Google Drive."""
service_account_key: Path = Path.home() / ".credentials" / "keys.json"
"""Path to the service account key file."""
credentials_path: Path = Path.home() / ".credentials" / "credentials.json"
"""Path to the credentials file."""
token_path: Path = Path.home() / ".credentials" / "token.json"
"""Path to the token file."""
folder_id: Optional[str] = None
"""The folder id to load from."""
document_ids: Optional[List[str]] = None
"""The document ids to load from."""
file_ids: Optional[List[str]] = None
"""The file ids to load from."""
recursive: bool = False
"""Whether to load recursively. Only applies when folder_id is given."""
file_types: Optional[Sequence[str]] = None
"""The file types to load. Only applies when folder_id is given."""
load_trashed_files: bool = False
"""Whether to load trashed files. Only applies when folder_id is given."""
# NOTE(MthwRobinson) - changing the file_loader_cls to type here currently
# results in pydantic validation errors
file_loader_cls: Any = None
"""The file loader class to use."""
file_loader_kwargs: Dict["str", Any] = {}
"""The file loader kwargs to use."""
@root_validator
def validate_inputs(cls, values: Dict[str, Any]) -> Dict[str, Any]:
"""Validate that either folder_id or document_ids is set, but not both."""
if values.get("folder_id") and (
values.get("document_ids") or values.get("file_ids")
):
raise ValueError(
"Cannot specify both folder_id and document_ids nor "
"folder_id and file_ids"
)
if (
not values.get("folder_id")
and not values.get("document_ids")
and not values.get("file_ids")
):
raise ValueError("Must specify either folder_id, document_ids, or file_ids")
file_types = values.get("file_types")
if file_types:
if values.get("document_ids") or values.get("file_ids"):
raise ValueError(
"file_types can only be given when folder_id is given,"
" (not when document_ids or file_ids are given)."
)
type_mapping = {
"document": "application/vnd.google-apps.document",
"sheet": "application/vnd.google-apps.spreadsheet",
"pdf": "application/pdf",
}
allowed_types = list(type_mapping.keys()) + list(type_mapping.values())
short_names = ", ".join([f"'{x}'" for x in type_mapping.keys()])
full_names = ", ".join([f"'{x}'" for x in type_mapping.values()])
for file_type in file_types:
if file_type not in allowed_types:
raise ValueError(
f"Given file type {file_type} is not supported. "
f"Supported values are: {short_names}; and "
f"their full-form names: {full_names}"
)
# replace short-form file types by full-form file types
def full_form(x: str) -> str:
return type_mapping[x] if x in type_mapping else x
values["file_types"] = [full_form(file_type) for file_type in file_types]
return values
@validator("credentials_path")
def validate_credentials_path(cls, v: Any, **kwargs: Any) -> Any:
"""Validate that credentials_path exists."""
if not v.exists():
raise ValueError(f"credentials_path {v} does not exist")
return v
def _load_credentials(self) -> Any:
"""Load credentials."""
# Adapted from https://developers.google.com/drive/api/v3/quickstart/python
try:
from google.auth import default
from google.auth.transport.requests import Request
from google.oauth2 import service_account
from google.oauth2.credentials import Credentials
from google_auth_oauthlib.flow import InstalledAppFlow
except ImportError:
raise ImportError(
"You must run "
"`pip install --upgrade "
"google-api-python-client google-auth-httplib2 "
"google-auth-oauthlib` "
"to use the Google Drive loader."
)
creds = None
if self.service_account_key.exists():
return service_account.Credentials.from_service_account_file(
str(self.service_account_key), scopes=SCOPES
)
if self.token_path.exists():
creds = Credentials.from_authorized_user_file(str(self.token_path), SCOPES)
if not creds or not creds.valid:
if creds and creds.expired and creds.refresh_token:
creds.refresh(Request())
elif "GOOGLE_APPLICATION_CREDENTIALS" not in os.environ:
creds, project = default()
creds = creds.with_scopes(SCOPES)
# no need to write to file
if creds:
return creds
else:
flow = InstalledAppFlow.from_client_secrets_file(
str(self.credentials_path), SCOPES
)
creds = flow.run_local_server(port=0)
with open(self.token_path, "w") as token:
token.write(creds.to_json())
return creds
def _load_sheet_from_id(self, id: str) -> List[Document]:
"""Load a sheet and all tabs from an ID."""
from googleapiclient.discovery import build
creds = self._load_credentials()
sheets_service = build("sheets", "v4", credentials=creds)
spreadsheet = sheets_service.spreadsheets().get(spreadsheetId=id).execute()
sheets = spreadsheet.get("sheets", [])
documents = []
for sheet in sheets:
sheet_name = sheet["properties"]["title"]
result = (
sheets_service.spreadsheets()
.values()
.get(spreadsheetId=id, range=sheet_name)
.execute()
)
values = result.get("values", [])
header = values[0]
for i, row in enumerate(values[1:], start=1):
metadata = {
"source": (
f"https://docs.google.com/spreadsheets/d/{id}/"
f"edit?gid={sheet['properties']['sheetId']}"
),
"title": f"{spreadsheet['properties']['title']} - {sheet_name}",
"row": i,
}
content = []
for j, v in enumerate(row):
title = header[j].strip() if len(header) > j else ""
content.append(f"{title}: {v.strip()}")
page_content = "\n".join(content)
documents.append(Document(page_content=page_content, metadata=metadata))
return documents
def _load_document_from_id(self, id: str) -> Document:
"""Load a document from an ID."""
from io import BytesIO
from googleapiclient.discovery import build
from googleapiclient.errors import HttpError
from googleapiclient.http import MediaIoBaseDownload
creds = self._load_credentials()
service = build("drive", "v3", credentials=creds)
file = service.files().get(fileId=id, supportsAllDrives=True).execute()
request = service.files().export_media(fileId=id, mimeType="text/plain")
fh = BytesIO()
downloader = MediaIoBaseDownload(fh, request)
done = False
try:
while done is False:
status, done = downloader.next_chunk()
except HttpError as e:
if e.resp.status == 404:
print("File not found: {}".format(id))
else:
print("An error occurred: {}".format(e))
text = fh.getvalue().decode("utf-8")
metadata = {
"source": f"https://docs.google.com/document/d/{id}/edit",
"title": f"{file.get('name')}",
}
return Document(page_content=text, metadata=metadata)
def _load_documents_from_folder(
self, folder_id: str, *, file_types: Optional[Sequence[str]] = None
) -> List[Document]:
"""Load documents from a folder."""
from googleapiclient.discovery import build
creds = self._load_credentials()
service = build("drive", "v3", credentials=creds)
files = self._fetch_files_recursive(service, folder_id)
# If file types filter is provided, we'll filter by the file type.
if file_types:
_files = [f for f in files if f["mimeType"] in file_types] # type: ignore
else:
_files = files
returns = []
for file in _files:
if file["trashed"] and not self.load_trashed_files:
continue
elif file["mimeType"] == "application/vnd.google-apps.document":
returns.append(self._load_document_from_id(file["id"])) # type: ignore
elif file["mimeType"] == "application/vnd.google-apps.spreadsheet":
returns.extend(self._load_sheet_from_id(file["id"])) # type: ignore
elif (
file["mimeType"] == "application/pdf"
or self.file_loader_cls is not None
):
returns.extend(self._load_file_from_id(file["id"])) # type: ignore
else:
pass
return returns
def _fetch_files_recursive(
self, service: Any, folder_id: str
) -> List[Dict[str, Union[str, List[str]]]]:
"""Fetch all files and subfolders recursively."""
results = (
service.files()
.list(
q=f"'{folder_id}' in parents",
pageSize=1000,
includeItemsFromAllDrives=True,
supportsAllDrives=True,
fields="nextPageToken, files(id, name, mimeType, parents, trashed)",
)
.execute()
)
files = results.get("files", [])
returns = []
for file in files:
if file["mimeType"] == "application/vnd.google-apps.folder":
if self.recursive:
returns.extend(self._fetch_files_recursive(service, file["id"]))
else:
returns.append(file)
return returns
def _load_documents_from_ids(self) -> List[Document]:
"""Load documents from a list of IDs."""
if not self.document_ids:
raise ValueError("document_ids must be set")
return [self._load_document_from_id(doc_id) for doc_id in self.document_ids]
def _load_file_from_id(self, id: str) -> List[Document]:
"""Load a file from an ID."""
from io import BytesIO
from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseDownload
creds = self._load_credentials()
service = build("drive", "v3", credentials=creds)
file = service.files().get(fileId=id, supportsAllDrives=True).execute()
request = service.files().get_media(fileId=id)
fh = BytesIO()
downloader = MediaIoBaseDownload(fh, request)
done = False
while done is False:
status, done = downloader.next_chunk()
if self.file_loader_cls is not None:
fh.seek(0)
loader = self.file_loader_cls(file=fh, **self.file_loader_kwargs)
docs = loader.load()
for doc in docs:
doc.metadata["source"] = f"https://drive.google.com/file/d/{id}/view"
return docs
else:
from PyPDF2 import PdfReader
content = fh.getvalue()
pdf_reader = PdfReader(BytesIO(content))
return [
Document(
page_content=page.extract_text(),
metadata={
"source": f"https://drive.google.com/file/d/{id}/view",
"title": f"{file.get('name')}",
"page": i,
},
)
for i, page in enumerate(pdf_reader.pages)
]
def _load_file_from_ids(self) -> List[Document]:
"""Load files from a list of IDs."""
if not self.file_ids:
raise ValueError("file_ids must be set")
docs = []
for file_id in self.file_ids:
docs.extend(self._load_file_from_id(file_id))
return docs
def load(self) -> List[Document]:
"""Load documents."""
if self.folder_id:
return self._load_documents_from_folder(
self.folder_id, file_types=self.file_types
)
elif self.document_ids:
return self._load_documents_from_ids()
else:
return self._load_file_from_ids()
__all__ = ["GoogleDriveLoader"]

@@ -30,6 +30,7 @@ from langchain.retrievers.ensemble import EnsembleRetriever
from langchain.retrievers.google_cloud_enterprise_search import (
GoogleCloudEnterpriseSearchRetriever,
)
from langchain.retrievers.google_drive import GoogleDriveRetriever
from langchain.retrievers.kendra import AmazonKendraRetriever
from langchain.retrievers.knn import KNNRetriever
from langchain.retrievers.llama_index import (
@@ -65,6 +66,7 @@ __all__ = [
"ChaindeskRetriever",
"ElasticSearchBM25Retriever",
"GoogleCloudEnterpriseSearchRetriever",
"GoogleDriveRetriever",
"KNNRetriever",
"LlamaIndexGraphRetriever",
"LlamaIndexRetriever",

@@ -0,0 +1,92 @@
from typing import Any, Dict, List, Literal, Optional
from pydantic.class_validators import root_validator
from pydantic.config import Extra
from langchain.callbacks.manager import Callbacks
from langchain.schema import BaseRetriever, Document
from ..utilities.google_drive import (
GoogleDriveUtilities,
get_template,
)
class GoogleDriveRetriever(GoogleDriveUtilities, BaseRetriever):
"""Wrapper around Google Drive API.
The application must be authenticated with a json file.
The format may be for a user or for an application via a service account.
The environment variable `GOOGLE_ACCOUNT_FILE` may be set to reference this file.
For more information, see [here]
(https://developers.google.com/workspace/guides/auth-overview).
"""
class Config:
extra = Extra.allow
allow_mutation = False
underscore_attrs_are_private = True
mode: Literal[
"snippets", "snippets-markdown", "documents", "documents-markdown"
] = "snippets-markdown"
@root_validator(pre=True)
def validate_template(cls, v: Dict[str, Any]) -> Dict[str, Any]:
folder_id = v.get("folder_id")
if not v.get("template"):
if folder_id:
template = get_template("gdrive-query-in-folder")
else:
template = get_template("gdrive-query")
v["template"] = template
return v
def get_relevant_documents(
self,
query: str,
*,
callbacks: Callbacks = None,
tags: Optional[List[str]] = None,
metadata: Optional[Dict[str, Any]] = None,
**kwargs: Any,
) -> List[Document]:
"""Get documents relevant for a query.
Args:
query: string to find relevant documents for
Returns:
List of relevant documents
"""
return list(
self.lazy_get_relevant_documents(
query=query,
callbacks=callbacks,
tags=tags,
metadata=metadata,
**kwargs,
)
)
async def aget_relevant_documents(
self,
query: str,
*,
callbacks: Callbacks = None,
tags: Optional[List[str]] = None,
metadata: Optional[Dict[str, Any]] = None,
**kwargs: Any,
) -> List[Document]:
"""Get documents relevant for a query.
NOT IMPLEMENTED
Args:
query: string to find relevant documents for
Returns:
List of relevant documents
"""
raise NotImplementedError("GoogleDriveRetriever does not support async")
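
Note on async use: since `aget_relevant_documents` is not implemented, a caller inside an event loop would have to run the synchronous method on a worker thread. A minimal sketch of that workaround using only the standard library; `run_sync_in_thread` is a hypothetical helper, not part of this PR:

```python
import asyncio
from typing import Any, Callable, List


async def run_sync_in_thread(
    func: Callable[..., List[Any]], *args: Any, **kwargs: Any
) -> List[Any]:
    """Run a blocking callable in the default executor and await its result."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, lambda: func(*args, **kwargs))


if __name__ == "__main__":
    def fake_get_relevant_documents(query: str) -> List[str]:
        # Hypothetical stand-in for retriever.get_relevant_documents.
        return [query.upper()]

    print(asyncio.run(run_sync_in_thread(fake_get_relevant_documents, "ml")))
```

With a real retriever the call would look like `await run_sync_in_thread(retriever.get_relevant_documents, "machine learning")`, assuming the underlying client tolerates being called from another thread.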

@@ -45,6 +45,7 @@ from langchain.tools.gmail import (
GmailSearch,
GmailSendMessage,
)
from langchain.tools.google_drive.tool import GoogleDriveSearchTool
from langchain.tools.google_places.tool import GooglePlacesTool
from langchain.tools.google_search.tool import GoogleSearchResults, GoogleSearchRun
from langchain.tools.google_serper.tool import GoogleSerperResults, GoogleSerperRun
@@ -148,6 +149,7 @@ __all__ = [
"GmailGetThread",
"GmailSearch",
"GmailSendMessage",
"GoogleDriveSearchTool",
"GooglePlacesTool",
"GoogleSearchResults",
"GoogleSearchRun",

@@ -0,0 +1,41 @@
import logging
from typing import Optional
from langchain.callbacks.manager import (
AsyncCallbackManagerForToolRun,
CallbackManagerForToolRun,
)
from langchain.tools import BaseTool
from ...utilities.google_drive import FORMAT_INSTRUCTION, GoogleDriveAPIWrapper
logger = logging.getLogger(__name__)
class GoogleDriveSearchTool(BaseTool):
"""Tool that adds the capability to query the Google Drive search API."""
name = "Google Drive Search"
description = (
"A wrapper around Google Drive Search. "
"Useful for when you need to find a document in google drive. "
f"{FORMAT_INSTRUCTION}"
)
api_wrapper: GoogleDriveAPIWrapper
def _run(
self,
query: str,
run_manager: Optional[CallbackManagerForToolRun] = None,
) -> str:
"""Use the tool."""
logger.info(f"{query=}")
return self.api_wrapper.run(query)
async def _arun(
self,
query: str,
run_manager: Optional[AsyncCallbackManagerForToolRun] = None,
) -> str:
"""Use the tool asynchronously."""
raise NotImplementedError("GoogleDriveSearchTool does not support async")

@@ -11,6 +11,7 @@ from langchain.utilities.bing_search import BingSearchAPIWrapper
from langchain.utilities.brave_search import BraveSearchWrapper
from langchain.utilities.duckduckgo_search import DuckDuckGoSearchAPIWrapper
from langchain.utilities.golden_query import GoldenQueryAPIWrapper
from langchain.utilities.google_drive import GoogleDriveAPIWrapper
from langchain.utilities.google_places_api import GooglePlacesAPIWrapper
from langchain.utilities.google_search import GoogleSearchAPIWrapper
from langchain.utilities.google_serper import GoogleSerperAPIWrapper
@@ -42,6 +43,7 @@ __all__ = [
"BraveSearchWrapper",
"DuckDuckGoSearchAPIWrapper",
"GoldenQueryAPIWrapper",
"GoogleDriveAPIWrapper",
"GooglePlacesAPIWrapper",
"GoogleSearchAPIWrapper",
"GoogleSerperAPIWrapper",

File diff suppressed because it is too large

@@ -0,0 +1,161 @@
import unittest
from pathlib import Path
from unittest.mock import MagicMock
import pytest
from pytest_mock import MockerFixture
from langchain.document_loaders.google_drive import GoogleDriveLoader
from tests.unit_tests.llms.fake_llm import FakeLLM
from tests.unit_tests.utilities.test_google_drive import (
gdrive_docs,
google_workspace_installed,
patch_google_workspace,
)
@pytest.fixture
def google_workspace(mocker: MockerFixture) -> MagicMock:
return patch_google_workspace(
mocker, [{"nextPageToken": None, "files": gdrive_docs}]
)
@unittest.skipIf(not google_workspace_installed, "Google api not installed")
def test_load_returns_list_of_google_documents_single(
google_workspace: MagicMock,
) -> None:
loader = GoogleDriveLoader(
api_file=Path(__file__).parent.parent
/ "utilities"
/ "examples"
/ "gdrive_credentials.json",
folder_id="999",
)
assert loader.mode == "documents" # Check default value
assert loader.gsheet_mode == "single" # Check default value
assert loader.gslide_mode == "single" # Check default value
@unittest.skipIf(not google_workspace_installed, "Google api not installed")
def test_service_account_key(google_workspace: MagicMock) -> None:
loader = GoogleDriveLoader(
service_account_key=Path(__file__).parent.parent
/ "utilities"
/ "examples"
/ "gdrive_service.json",
template="gdrive-all-in-folder",
)
assert (
loader.gdrive_api_file
== Path(__file__).parent.parent
/ "utilities"
/ "examples"
/ "gdrive_service.json"
)
# @unittest.skipIf(not google_workspace_installed, "Google api not installed")
# def test_no_path(mocker,google_workspace) -> None:
# import os
# mocker.patch.dict(os.environ,{},clear=True)
# loader = GoogleDriveLoader(
# template="gdrive-all-in-folder",
# )
# assert loader.gdrive_api_file == Path.home() / ".credentials" / "keys.json"
@unittest.skipIf(not google_workspace_installed, "Google api not installed")
def test_credentials_path(mocker: MockerFixture, google_workspace: MagicMock) -> None:
loader = GoogleDriveLoader(
credentials_path=Path(__file__).parent.parent
/ "utilities"
/ "examples"
/ "gdrive_credentials.json",
template="gdrive-all-in-folder",
)
assert (
loader.gdrive_api_file
== Path(__file__).parent.parent
/ "utilities"
/ "examples"
/ "gdrive_credentials.json"
)
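
Note on the two tests above: both check that a deprecated keyword ends up in `gdrive_api_file`. The precedence applied by the loader's root validator can be sketched as a standalone function. `resolve_api_file` is a hypothetical name, wrapping every result in `Path` is an assumption of the sketch, and the validator's check that rejects passing both deprecated keywords is omitted:

```python
import os
from pathlib import Path
from typing import Any, Mapping, Optional


def resolve_api_file(
    values: Mapping[str, Any], environ: Optional[Mapping[str, str]] = None
) -> Path:
    """Return the credential file selected for the given constructor kwargs."""
    env = os.environ if environ is None else environ
    # Deprecated keywords win over gdrive_api_file, as in the validator.
    for key in ("service_account_key", "credentials_path", "gdrive_api_file"):
        if values.get(key):
            return Path(values[key])
    # Next, the environment variable.
    if "GOOGLE_ACCOUNT_FILE" in env:
        return Path(env["GOOGLE_ACCOUNT_FILE"])
    # Finally, the legacy default location.
    return Path.home() / ".credentials" / "keys.json"


if __name__ == "__main__":
    print(resolve_api_file({"credentials_path": "c.json"}, {}))
```

Passing `environ` explicitly keeps the sketch testable without touching the real process environment.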
@unittest.skipIf(not google_workspace_installed, "Google api not installed")
def test_folder_id(google_workspace: MagicMock) -> None:
loader = GoogleDriveLoader(
api_file=Path(__file__).parent.parent
/ "utilities"
/ "examples"
/ "gdrive_credentials.json",
folder_id="999",
)
docs = loader.load()
assert len(docs) == 3
@unittest.skipIf(not google_workspace_installed, "Google api not installed")
def test_query(google_workspace: MagicMock) -> None:
loader = GoogleDriveLoader(
api_file=Path(__file__).parent.parent
/ "utilities"
/ "examples"
/ "gdrive_credentials.json",
query="",
template="gdrive-query",
)
docs = loader.load()
assert len(docs) == 3
@unittest.skipIf(not google_workspace_installed, "Google api not installed")
def test_document_ids(google_workspace: MagicMock) -> None:
loader = GoogleDriveLoader(
api_file=Path(__file__).parent.parent
/ "utilities"
/ "examples"
/ "gdrive_credentials.json",
document_ids=["1", "1"],
)
docs = loader.load()
assert len(docs) == 2
@unittest.skipIf(not google_workspace_installed, "Google api not installed")
def test_files_ids(google_workspace: MagicMock) -> None:
loader = GoogleDriveLoader(
api_file=Path(__file__).parent.parent
/ "utilities"
/ "examples"
/ "gdrive_credentials.json",
file_ids=["1", "2"],
)
docs = loader.load()
assert len(docs) == 2
@unittest.skipIf(not google_workspace_installed, "Google api not installed")
def test_update_description_with_summary(google_workspace: MagicMock) -> None:
loader = GoogleDriveLoader(
api_file=Path(__file__).parent.parent
/ "utilities"
/ "examples"
/ "gdrive_credentials.json",
file_ids=["1", "2"],
scopes=["https://www.googleapis.com/auth/drive"],
)
result = list(
loader.lazy_update_description_with_summary(
llm=FakeLLM(), force=True, prompt=None, verbose=True, query=""
)
)
assert len(result) == 2
result = list(
loader.lazy_update_description_with_summary(
llm=FakeLLM(), force=False, prompt=None, query=""
)
)
assert len(result) == 0

@@ -0,0 +1,53 @@
import unittest
from pathlib import Path
from unittest.mock import MagicMock
import pytest
from pytest_mock import MockerFixture
from langchain.retrievers.google_drive import GoogleDriveRetriever
from tests.unit_tests.utilities.test_google_drive import (
_text_text,
gdrive_docs,
google_workspace_installed,
patch_google_workspace,
)
@pytest.fixture
def google_workspace(mocker: MockerFixture) -> MagicMock:
return patch_google_workspace(
mocker, [{"nextPageToken": None, "files": gdrive_docs}]
)
@unittest.skipIf(not google_workspace_installed, "Google api not installed")
def test_get_relevant_documents(
mocker: MockerFixture,
) -> None:
patch_google_workspace(mocker, [{"nextPageToken": None, "files": [_text_text]}])
retriever = GoogleDriveRetriever(
api_file=Path(__file__).parent.parent
/ "utilities"
/ "examples"
/ "gdrive_credentials.json",
)
docs = retriever.get_relevant_documents("machine learning")
assert len(docs) == 1
@unittest.skipIf(not google_workspace_installed, "Google api not installed")
def test_extra_parameters(
mocker: MockerFixture,
) -> None:
patch_google_workspace(mocker, [{"nextPageToken": None, "files": [_text_text]}])
retriever = GoogleDriveRetriever(
template="gdrive-mime-type-in-folders",
folder_id="root",
mime_type="application/vnd.google-apps.document", # Only Google Docs
num_results=2,
mode="snippets",
includeItemsFromAllDrives=False,
supportsAllDrives=False,
)
retriever.get_relevant_documents("machine learning")

@@ -0,0 +1,40 @@
import unittest
from pathlib import Path
from unittest.mock import MagicMock
import pytest
from pytest_mock import MockerFixture
from langchain.tools.google_drive.tool import GoogleDriveSearchTool
from langchain.utilities import GoogleDriveAPIWrapper
from tests.unit_tests.utilities.test_google_drive import (
gdrive_docs,
google_workspace_installed,
patch_google_workspace,
)
@pytest.fixture
def google_workspace(mocker: MockerFixture) -> MagicMock:
return patch_google_workspace(
mocker, [{"nextPageToken": None, "files": gdrive_docs}]
)
@unittest.skipIf(not google_workspace_installed, "Google api not installed")
def test_run(google_workspace: MagicMock) -> None:
tool = GoogleDriveSearchTool(
api_wrapper=GoogleDriveAPIWrapper(
api_file=(
Path(__file__).parent.parent
/ "utilities"
/ "examples"
/ "gdrive_credentials.json"
)
)
)
result = tool._run("machine learning")
assert result.startswith(
"[vnd.google-apps.document](https://docs.google.com/document/d/1/edit?usp=drivesdk)<br/>\n"
"It is a doc summary\n\n"
)

@@ -32,6 +32,7 @@ _EXPECTED = [
"GmailGetThread",
"GmailSearch",
"GmailSendMessage",
"GoogleDriveSearchTool",
"GooglePlacesTool",
"GoogleSearchResults",
"GoogleSearchRun",

File diff suppressed because it is too large

@@ -0,0 +1,161 @@
{
"spreadsheetId": "1iuGLyUDgw6mCjyXnaqNtpXS2-ALJbZ4wq1cWBuCfTRg",
"properties": {
"title": "vnd.google-apps.spreadsheet",
"locale": "fr_FR",
"autoRecalc": "ON_CHANGE",
"timeZone": "Europe/Paris",
"defaultFormat": {
"backgroundColor": {
"red": 1,
"green": 1,
"blue": 1
},
"padding": {
"top": 2,
"right": 3,
"bottom": 2,
"left": 3
},
"verticalAlignment": "BOTTOM",
"wrapStrategy": "OVERFLOW_CELL",
"textFormat": {
"foregroundColor": {},
"fontFamily": "arial,sans,sans-serif",
"fontSize": 10,
"bold": false,
"italic": false,
"strikethrough": false,
"underline": false,
"foregroundColorStyle": {
"rgbColor": {}
}
},
"backgroundColorStyle": {
"rgbColor": {
"red": 1,
"green": 1,
"blue": 1
}
}
},
"spreadsheetTheme": {
"primaryFontFamily": "Arial",
"themeColors": [
{
"colorType": "TEXT",
"color": {
"rgbColor": {}
}
},
{
"colorType": "BACKGROUND",
"color": {
"rgbColor": {
"red": 1,
"green": 1,
"blue": 1
}
}
},
{
"colorType": "ACCENT1",
"color": {
"rgbColor": {
"red": 0.25882354,
"green": 0.52156866,
"blue": 0.95686275
}
}
},
{
"colorType": "ACCENT2",
"color": {
"rgbColor": {
"red": 0.91764706,
"green": 0.2627451,
"blue": 0.20784314
}
}
},
{
"colorType": "ACCENT3",
"color": {
"rgbColor": {
"red": 0.9843137,
"green": 0.7372549,
"blue": 0.015686275
}
}
},
{
"colorType": "ACCENT4",
"color": {
"rgbColor": {
"red": 0.20392157,
"green": 0.65882355,
"blue": 0.3254902
}
}
},
{
"colorType": "ACCENT5",
"color": {
"rgbColor": {
"red": 1,
"green": 0.42745098,
"blue": 0.003921569
}
}
},
{
"colorType": "ACCENT6",
"color": {
"rgbColor": {
"red": 0.27450982,
"green": 0.7411765,
"blue": 0.7764706
}
}
},
{
"colorType": "LINK",
"color": {
"rgbColor": {
"red": 0.06666667,
"green": 0.33333334,
"blue": 0.8
}
}
}
]
}
},
"sheets": [
{
"properties": {
"sheetId": 0,
"title": "Feuille 1",
"index": 0,
"sheetType": "GRID",
"gridProperties": {
"rowCount": 1000,
"columnCount": 26
}
}
},
{
"properties": {
"sheetId": 831511404,
"title": "Feuille 2",
"index": 1,
"sheetType": "GRID",
"gridProperties": {
"rowCount": 1000,
"columnCount": 26
}
}
}
],
"spreadsheetUrl": "https://docs.google.com/spreadsheets/d/1iuGLyUDgw6mCjyXnaqNtpXS2-ALJbZ4wq1cWBuCfTRg/edit?ouid=109055472267306456451"
}

File diff suppressed because it is too large

@@ -0,0 +1,13 @@
{
"installed": {
"client_id": "",
"project_id": "",
"auth_uri": "https://accounts.google.com/o/oauth2/auth",
"token_uri": "https://oauth2.googleapis.com/token",
"auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
"client_secret": "",
"redirect_uris": [
"http://localhost"
]
}
}

@@ -0,0 +1,12 @@
{
"type": "service_account",
"project_id": "lanchain",
"private_key_id": "",
"private_key": "",
"client_email": "a@a.com",
"client_id": "",
"auth_uri": "https://accounts.google.com/o/oauth2/auth",
"token_uri": "https://oauth2.googleapis.com/token",
"auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
"client_x509_cert_url": ""
}

@@ -0,0 +1 @@
The body of a text file

@@ -0,0 +1,12 @@
{
"token": "MockToken",
"refresh_token": "",
"token_uri": "https://oauth2.googleapis.com/token",
"client_id": "",
"client_secret": "",
"scopes": [
"https://www.googleapis.com/auth/drive.readonly",
"https://www.googleapis.com/auth/drive"
],
"expiry": "9999-01-01T00:00:00.0Z"
}

File diff suppressed because it is too large