langchain/docs/extras/integrations/document_loaders/google_drive.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "b0ed136e-6983-4893-ae1b-b75753af05f8",
   "metadata": {},
   "source": [
    "# Google Drive\n",
    "\n",
    ">[Google Drive](https://en.wikipedia.org/wiki/Google_Drive) is a file storage and synchronization service developed by Google.\n",
    "\n",
    "This notebook covers how to load documents from `Google Drive`. Currently, only `Google Docs` are supported.\n",
    "\n",
    "## Prerequisites\n",
    "\n",
    "1. Create a Google Cloud project or use an existing project\n",
    "1. Enable the [Google Drive API](https://console.cloud.google.com/flows/enableapi?apiid=drive.googleapis.com)\n",
    "1. [Authorize credentials for desktop app](https://developers.google.com/drive/api/quickstart/python#authorize_credentials_for_a_desktop_application)\n",
    "1. `pip install --upgrade google-api-python-client google-auth-httplib2 google-auth-oauthlib`\n",
    "\n",
    "## 🧑 Instructions for ingesting your Google Docs data\n",
    "By default, the `GoogleDriveLoader` expects the `credentials.json` file to be `~/.credentials/credentials.json`, but this is configurable using the `credentials_path` keyword argument. Same thing with `token.json` - `token_path`. Note that `token.json` will be created automatically the first time you use the loader.\n",
    "\n",
    "The first time you use GoogleDriveLoader, you will be displayed with the consent screen in your browser. If this doesn't happen and you get a `RefreshError`, do not use `credentials_path` in your `GoogleDriveLoader` constructor call. Instead, put that path in a `GOOGLE_APPLICATION_CREDENTIALS` environmental variable.\n",
    "\n",
    "`GoogleDriveLoader` can load from a list of Google Docs document ids or a folder id. You can obtain your folder and document id from the URL:\n",
    "* Folder: https://drive.google.com/drive/u/0/folders/1yucgL9WGgWZdM1TOuKkeghlPizuzMYb5 -> folder id is `\"1yucgL9WGgWZdM1TOuKkeghlPizuzMYb5\"`\n",
    "* Document: https://docs.google.com/document/d/1bfaMQ18_i56204VaQDVeAFpqEijJTgvurupdEDiaUQw/edit -> document id is `\"1bfaMQ18_i56204VaQDVeAFpqEijJTgvurupdEDiaUQw\"`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6e40071c-3a65-4e26-b497-3e2be0bd86b9",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install --upgrade google-api-python-client google-auth-httplib2 google-auth-oauthlib"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "878928a6-a5ae-4f74-b351-64e3b01733fe",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from langchain.document_loaders import GoogleDriveLoader"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2216c83f-68e4-4d2f-8ea2-5878fb18bbe7",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "loader = GoogleDriveLoader(\n",
    "    folder_id=\"1yucgL9WGgWZdM1TOuKkeghlPizuzMYb5\",\n",
    "    token_path='/path/where/you/want/token/to/be/created/google_token.json'\n",
    "    # Optional: configure whether to recursively fetch files from subfolders. Defaults to False.\n",
    "    recursive=False,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8f3b6aa0-b45d-4e37-8c50-5bebe70fdb9d",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "docs = loader.load()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2721ba8a",
   "metadata": {},
   "source": [
    "When you pass a `folder_id` by default all files of type document, sheet and pdf are loaded. You can modify this behaviour by passing a `file_types` argument "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2ff83b4c",
   "metadata": {},
   "outputs": [],
   "source": [
    "loader = GoogleDriveLoader(\n",
    "    folder_id=\"1yucgL9WGgWZdM1TOuKkeghlPizuzMYb5\",\n",
    "    file_types=[\"document\", \"sheet\"],\n",
    "    recursive=False\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d6b80931",
   "metadata": {},
   "source": [
    "## Passing in Optional File Loaders\n",
    "\n",
    "When processing files other than Google Docs and Google Sheets, it can be helpful to pass an optional file loader to `GoogleDriveLoader`. If you pass in a file loader, that file loader will be used on documents that do not have a Google Docs or Google Sheets MIME type. Here is an example of how to load an Excel document from Google Drive using a file loader. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "94207e39",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.document_loaders import GoogleDriveLoader\n",
    "from langchain.document_loaders import UnstructuredFileIOLoader"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a15fbee0",
   "metadata": {},
   "outputs": [],
   "source": [
    "file_id = \"1x9WBtFPWMEAdjcJzPScRsjpjQvpSo_kz\"\n",
    "loader = GoogleDriveLoader(\n",
    "    file_ids=[file_id],\n",
    "    file_loader_cls=UnstructuredFileIOLoader,\n",
    "    file_loader_kwargs={\"mode\": \"elements\"},\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "98410bda",
   "metadata": {},
   "outputs": [],
   "source": [
    "docs = loader.load()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e3e72221",
   "metadata": {},
   "outputs": [],
   "source": [
    "docs[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "238cd06f",
   "metadata": {},
   "source": [
    "You can also process a folder with a mix of files and Google Docs/Sheets using the following pattern:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0e2d093f",
   "metadata": {},
   "outputs": [],
   "source": [
    "folder_id = \"1asMOHY1BqBS84JcRbOag5LOJac74gpmD\"\n",
    "loader = GoogleDriveLoader(\n",
    "    folder_id=folder_id,\n",
    "    file_loader_cls=UnstructuredFileIOLoader,\n",
    "    file_loader_kwargs={\"mode\": \"elements\"},\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b35ddcc6",
   "metadata": {},
   "outputs": [],
   "source": [
    "docs = loader.load()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3cc141e0",
   "metadata": {},
   "outputs": [],
   "source": [
    "docs[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e312268a",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "83ac576b-48c9-4aad-a35e-e978ea32f746",
   "metadata": {},
   "source": [
    "## Extended usage\n",
    "An external component can manage the complexity of Google Drive : `langchain-googledrive`\n",
    "It's compatible with the ̀`langchain.document_loaders.GoogleDriveLoader` and can be used\n",
    "in its place.\n",
    "\n",
    "To be compatible with containers, the authentication uses an environment variable ̀GOOGLE_ACCOUNT_FILE` to credential file (for user or service)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b94f7119-bc1e-4ca3-907f-9d81e837ac59",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install langchain-googledrive"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c4c7474e-49cb-48a1-b3a0-77fba8e2dd70",
   "metadata": {},
   "outputs": [],
   "source": [
    "folder_id='root'\n",
    "#folder_id='1yucgL9WGgWZdM1TOuKkeghlPizuzMYb5'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8357f7f1-e2b1-41ef-8e38-48fcc3897dba",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Use the advanced version.\n",
    "from langchain_googledrive.document_loaders import GoogleDriveLoader"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "16ab9d3d-1782-4cb9-ab56-d87edbb25a18",
   "metadata": {},
   "outputs": [],
   "source": [
    "loader = GoogleDriveLoader(\n",
    "    folder_id=folder_id,\n",
    "    recursive=False,\n",
    "    num_results=2,  # Maximum number of file to load\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ebac43aa-dd64-4964-802a-a90172415fd1",
   "metadata": {},
   "source": [
    "By default, all files with these mime-type can be converted to `Document`.\n",
    "- text/text\n",
    "- text/plain\n",
    "- text/html\n",
    "- text/csv\n",
    "- text/markdown\n",
    "- image/png\n",
    "- image/jpeg\n",
    "- application/epub+zip\n",
    "- application/pdf\n",
    "- application/rtf\n",
    "- application/vnd.google-apps.document (GDoc)\n",
    "- application/vnd.google-apps.presentation (GSlide)\n",
    "- application/vnd.google-apps.spreadsheet (GSheet)\n",
    "- application/vnd.google.colaboratory (Notebook colab)\n",
    "- application/vnd.openxmlformats-officedocument.presentationml.presentation (PPTX)\n",
    "- application/vnd.openxmlformats-officedocument.wordprocessingml.document (DOCX)\n",
    "\n",
    "It's possible to update or customize this. See the documentation of `GDriveLoader`.\n",
    "\n",
    "But, the corresponding packages must be installed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b4560f35-a37d-44e2-be0b-adaa245b3b3d",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install unstructured"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6cb08da3-27df-46de-b60e-583bb7e31af4",
   "metadata": {},
   "outputs": [],
   "source": [
    "for doc in loader.load():\n",
    "    print(\"---\")\n",
    "    print(doc.page_content.strip()[:60]+\"...\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cd13d7d1-db7a-498d-ac98-76ccd9ad9019",
   "metadata": {},
   "source": [
    "### Customize the search pattern\n",
    "\n",
    "All parameter compatible with Google [`list()`](https://developers.google.com/drive/api/v3/reference/files/list)\n",
    "API can be set.\n",
    "\n",
    "To specify the new pattern of the Google request, you can use a `PromptTemplate()`.\n",
    "The variables for the prompt can be set with `kwargs` in the constructor.\n",
    "Some pre-formated request are proposed (use `{query}`, `{folder_id}` and/or `{mime_type}`):\n",
    "\n",
    "You can customize the criteria to select the files. A set of predefined filter are proposed:\n",
    "| template                               | description                                                           |\n",
    "| -------------------------------------- | --------------------------------------------------------------------- |\n",
    "| gdrive-all-in-folder                   | Return all compatible files from a `folder_id`                        |\n",
    "| gdrive-query                           | Search `query` in all drives                                          |\n",
    "| gdrive-by-name                         | Search file with name `query`                                        |\n",
    "| gdrive-query-in-folder                 | Search `query` in `folder_id` (and sub-folders if `recursive=true`)  |\n",
    "| gdrive-mime-type                       | Search a specific `mime_type`                                         |\n",
    "| gdrive-mime-type-in-folder             | Search a specific `mime_type` in `folder_id`                          |\n",
    "| gdrive-query-with-mime-type            | Search `query` with a specific `mime_type`                            |\n",
    "| gdrive-query-with-mime-type-and-folder | Search `query` with a specific `mime_type` and in `folder_id`         |\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "81348d59-8fd6-45d4-9de3-5df5cff5c7e2",
   "metadata": {},
   "outputs": [],
   "source": [
    "loader = GoogleDriveLoader(\n",
    "    folder_id=folder_id,\n",
    "    recursive=False,\n",
    "    template=\"gdrive-query\",  # Default template to use\n",
    "    query=\"machine learning\",\n",
    "    num_results=2,            # Maximum number of file to load\n",
    "    supportsAllDrives=False,  # GDrive `list()` parameter\n",
    ")\n",
    "for doc in loader.load():\n",
    "    print(\"---\")\n",
    "    print(doc.page_content.strip()[:60]+\"...\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "46c6ba5b-d4b1-4f0f-9801-5c1314021605",
   "metadata": {},
   "source": [
    "You can customize your pattern."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5a5a323b-8d96-46b7-b46a-fd69bd2c8e04",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.prompts.prompt import PromptTemplate\n",
    "loader = GoogleDriveLoader(\n",
    "    folder_id=folder_id,\n",
    "    recursive=False,\n",
    "    template=PromptTemplate(\n",
    "        input_variables=[\"query\", \"query_name\"],\n",
    "        template=\"fullText contains '{query}' and name contains '{query_name}' and trashed=false\",\n",
    "        ),  # Default template to use\n",
    "    query=\"machine learning\",\n",
    "    query_name=\"ML\",    \n",
    "    num_results=2,  # Maximum number of file to load\n",
    ")\n",
    "for doc in loader.load():\n",
    "    print(\"---\")\n",
    "    print(doc.page_content.strip()[:60]+\"...\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "375bb465-8f69-407b-94bd-ffa3718ef500",
   "metadata": {},
   "source": [
    "#### Modes for GSlide and GSheet\n",
    "The parameter mode accepts different values:\n",
    "\n",
    "- \"document\": return the body of each document\n",
    "- \"snippets\": return the description of each file (set in metadata of Google Drive files).\n",
    "\n",
    "\n",
    "The conversion can manage in Markdown format:\n",
    "- bullet\n",
    "- link\n",
    "- table\n",
    "- titles\n",
    "\n",
    "The parameter `gslide_mode` accepts different values:\n",
    "\n",
    "- \"single\" : one document with &lt;PAGE BREAK&gt;\n",
    "- \"slide\" : one document by slide\n",
    "- \"elements\" : one document for each elements.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7493d7b0-0600-49af-8107-7f4597c92de7",
   "metadata": {},
   "outputs": [],
   "source": [
    "loader = GoogleDriveLoader(\n",
    "    template=\"gdrive-mime-type\",\n",
    "    mime_type=\"application/vnd.google-apps.presentation\", # Only GSlide files\n",
    "    gslide_mode=\"slide\",\n",
    "    num_results=2,  # Maximum number of file to load\n",
    ")\n",
    "for doc in loader.load():\n",
    "    print(\"---\")\n",
    "    print(doc.page_content.strip()[:60]+\"...\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9bf338fb-02d7-452f-8679-c50419b13464",
   "metadata": {},
   "source": [
    "The parameter `gsheet_mode` accepts different values:\n",
    "- `\"single\"`: Generate one document by line\n",
    "- `\"elements\"` : one document with markdown array and &lt;PAGE BREAK&gt; tags."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "469f5af0-67db-4f15-8aee-88cde480729b",
   "metadata": {},
   "outputs": [],
   "source": [
    "loader = GoogleDriveLoader(\n",
    "    template=\"gdrive-mime-type\",\n",
    "    mime_type=\"application/vnd.google-apps.spreadsheet\", # Only GSheet files\n",
    "    gsheet_mode=\"elements\",\n",
    "    num_results=2,  # Maximum number of file to load\n",
    ")\n",
    "for doc in loader.load():\n",
    "    print(\"---\")\n",
    "    print(doc.page_content.strip()[:60]+\"...\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "09acb864-e919-4add-9e06-deba6f7f0cd8",
   "metadata": {},
   "source": [
    "### Advanced usage\n",
    "All Google File have a 'description' in the metadata. This field can be used to memorize a summary of the document or others indexed tags (See method `lazy_update_description_with_summary()`).\n",
    "\n",
    "If you use the `mode=\"snippet\"`, only the description will be used for the body. Else, the `metadata['summary']` has the field.\n",
    "\n",
    "Sometime, a specific filter can be used to extract some information from the filename, to select some files with specific criteria. You can use a filter.\n",
    "\n",
    "Sometimes, many documents are returned. It's not necessary to have all documents in memory at the same time. You can use the lazy versions of methods, to get one document at a time. It's better to use a complex query in place of a recursive search. For each folder, a query must be applied if you activate `recursive=True`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a5e9c8eb-a266-4ae6-a760-d7826a0aa7c5",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "loader = GoogleDriveLoader(\n",
    "                gdrive_api_file=os.environ[\"GOOGLE_ACCOUNT_FILE\"],\n",
    "                num_results=2,\n",
    "                template=\"gdrive-query\",\n",
    "                filter=lambda search, file: \"#test\" not in file.get('description',''),\n",
    "                query='machine learning',\n",
    "                supportsAllDrives=False,\n",
    "                )\n",
    "for doc in loader.load():\n",
    "    print(\"---\")\n",
    "    print(doc.page_content.strip()[:60]+\"...\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "51efa73a-4e2d-4f9c-abaf-6c9bde2ff69d",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}