langchain/docs/docs/integrations/document_loaders/azure_ai_data.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "a634365e",
   "metadata": {},
   "source": [
    "# Azure AI Data\n",
    "\n",
    ">[Azure AI Studio](https://ai.azure.com/) provides the capability to upload data assets to cloud storage and register existing data assets from the following sources:\n",
    "\n",
    "- Microsoft OneLake\n",
    "- Azure Blob Storage\n",
    "- Azure Data Lake gen 2\n",
    "\n",
    "The benefit of this approach over `AzureBlobStorageContainerLoader` and `AzureBlobStorageFileLoader` is that authentication is handled seamlessly to cloud storage. You can use either *identity-based* data access control to the data or *credential-based* (e.g. SAS token, account key). In the case of credential-based data access you do not need to specify secrets in your code or set up key vaults - the system handles that for you.\n",
    "\n",
    "This notebook covers how to load document objects from a data asset in AI Studio."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "49815096",
   "metadata": {},
   "outputs": [],
   "source": [
    "#!pip install azureml-fsspec, azure-ai-generative"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "2f0cd6a5",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from azure.ai.resources.client import AIClient\n",
    "from azure.identity import DefaultAzureCredential\n",
    "from langchain.document_loaders import AzureAIDataLoader"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "08d40b11-e87a-426e-a6b0-89f24e47ce2c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a connection to your project\n",
    "client = AIClient(\n",
    "    credential=DefaultAzureCredential(),\n",
    "    subscription_id=\"<subscription_id>\",\n",
    "    resource_group_name=\"<resource_group_name>\",\n",
    "    project_name=\"<project_name>\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "321cc7f1",
   "metadata": {},
   "outputs": [],
   "source": [
    "# get the latest version of your data asset\n",
    "data_asset = client.data.get(name=\"<data_asset_name>\", label=\"latest\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "25d91cea-c5f2-4a53-ac19-442810451ec6",
   "metadata": {},
   "outputs": [],
   "source": [
    "# load the data asset\n",
    "loader = AzureAIDataLoader(url=data_asset.path)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "2b11d155",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(page_content='Lorem ipsum dolor sit amet.', lookup_str='', metadata={'source': '/var/folders/y6/8_bzdg295ld6s1_97_12m4lr0000gn/T/tmpaa9xl6ch/fake.docx'}, lookup_index=0)]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "loader.load()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0690c40a",
   "metadata": {},
   "source": [
    "## Specifying a glob pattern\n",
    "You can also specify a glob pattern for more finegrained control over what files to load. In the example below, only files with a `pdf` extension will be loaded."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "72d44781",
   "metadata": {},
   "outputs": [],
   "source": [
    "loader = AzureAIDataLoader(url=data_asset.path, glob=\"*.pdf\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "2d3c32db",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(page_content='Lorem ipsum dolor sit amet.', lookup_str='', metadata={'source': '/var/folders/y6/8_bzdg295ld6s1_97_12m4lr0000gn/T/tmpujbkzf_l/fake.docx'}, lookup_index=0)]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "loader.load()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "885dc280",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}