community[patch]: Support Streaming in Azure Machine Learning (#18246)

- [x] **PR title**: "community: Support streaming in Azure ML and few naming changes" - [x] **PR message**: - **Description:** Added support for streaming for azureml_endpoint. Also, renamed and AzureMLEndpointApiType.realtime to AzureMLEndpointApiType.dedicated. Also, added new classes CustomOpenAIChatContentFormatter and CustomOpenAIContentFormatter and updated the classes LlamaChatContentFormatter and LlamaContentFormatter to now show a deprecated warning message when instantiated. --------- Co-authored-by: Sachin Paryani <saparan@microsoft.com> Co-authored-by: Bagatur <baskaryan@gmail.com>
2025-09-07 14:03:26 +00:00 · 2024-03-28 16:38:20 -07:00
parent ecb11a4a32
commit 25c9f3d1d1
6 changed files with 285 additions and 76 deletions
--- a/docs/docs/integrations/chat/azureml_chat_endpoint.ipynb
+++ b/docs/docs/integrations/chat/azureml_chat_endpoint.ipynb
@@ -40,7 +40,7 @@
    "You must [deploy a model on Azure ML](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-use-foundation-models?view=azureml-api-2#deploying-foundation-models-to-endpoints-for-inferencing) or [to Azure AI studio](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/deploy-models-open) and obtain the following parameters:\n",
    "\n",
    "* `endpoint_url`: The REST endpoint url provided by the endpoint.\n",
-    "* `endpoint_api_type`: Use `endpoint_type='realtime'` when deploying models to **Realtime endpoints** (hosted managed infrastructure). Use `endpoint_type='serverless'` when deploying models using the **Pay-as-you-go** offering (model as a service).\n",
+    "* `endpoint_api_type`: Use `endpoint_type='dedicated'` when deploying models to **Dedicated endpoints** (hosted managed infrastructure). Use `endpoint_type='serverless'` when deploying models using the **Pay-as-you-go** offering (model as a service).\n",
    "* `endpoint_api_key`: The API key provided by the endpoint"
   ]
  },
@@ -52,9 +52,9 @@
    "\n",
    "The `content_formatter` parameter is a handler class for transforming the request and response of an AzureML endpoint to match with required schema. Since there are a wide range of models in the model catalog, each of which may process data differently from one another, a `ContentFormatterBase` class is provided to allow users to transform data to their liking. The following content formatters are provided:\n",
    "\n",
-    "* `LLamaChatContentFormatter`: Formats request and response data for LLaMa2-chat\n",
+    "* `CustomOpenAIChatContentFormatter`: Formats request and response data for models like LLaMa2-chat that follow the OpenAI API spec for request and response.\n",
    "\n",
-    "*Note: `langchain.chat_models.azureml_endpoint.LLamaContentFormatter` is being deprecated and replaced with `langchain.chat_models.azureml_endpoint.LLamaChatContentFormatter`.*\n",
+    "*Note: `langchain.chat_models.azureml_endpoint.LlamaChatContentFormatter` is being deprecated and replaced with `langchain.chat_models.azureml_endpoint.CustomOpenAIChatContentFormatter`.*\n",
    "\n",
    "You can implement custom content formatters specific for your model deriving from the class `langchain_community.llms.azureml_endpoint.ContentFormatterBase`."
   ]
@@ -65,20 +65,7 @@
   "source": [
    "## Examples\n",
    "\n",
-    "The following section cotain examples about how to use this class:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "from langchain_community.chat_models.azureml_endpoint import (\n",
-    "    AzureMLEndpointApiType,\n",
-    "    LlamaChatContentFormatter,\n",
-    ")\n",
-    "from langchain_core.messages import HumanMessage"
+    "The following section contains examples about how to use this class:"
   ]
  },
  {
@@ -105,14 +92,17 @@
    }
   ],
   "source": [
-    "from langchain_community.chat_models.azureml_endpoint import LlamaContentFormatter\n",
+    "from langchain_community.chat_models.azureml_endpoint import (\n",
+    "    AzureMLEndpointApiType,\n",
+    "    CustomOpenAIChatContentFormatter,\n",
+    ")\n",
    "from langchain_core.messages import HumanMessage\n",
    "\n",
    "chat = AzureMLChatOnlineEndpoint(\n",
    "    endpoint_url=\"https://<your-endpoint>.<your_region>.inference.ml.azure.com/score\",\n",
-    "    endpoint_api_type=AzureMLEndpointApiType.realtime,\n",
+    "    endpoint_api_type=AzureMLEndpointApiType.dedicated,\n",
    "    endpoint_api_key=\"my-api-key\",\n",
-    "    content_formatter=LlamaChatContentFormatter(),\n",
+    "    content_formatter=CustomOpenAIChatContentFormatter(),\n",
    ")\n",
    "response = chat.invoke(\n",
    "    [HumanMessage(content=\"Will the Collatz conjecture ever be solved?\")]\n",
@@ -137,7 +127,7 @@
    "    endpoint_url=\"https://<your-endpoint>.<your_region>.inference.ml.azure.com/v1/chat/completions\",\n",
    "    endpoint_api_type=AzureMLEndpointApiType.serverless,\n",
    "    endpoint_api_key=\"my-api-key\",\n",
-    "    content_formatter=LlamaChatContentFormatter,\n",
+    "    content_formatter=CustomOpenAIChatContentFormatter,\n",
    ")\n",
    "response = chat.invoke(\n",
    "    [HumanMessage(content=\"Will the Collatz conjecture ever be solved?\")]\n",
@@ -149,7 +139,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "If you need to pass additional parameters to the model, use `model_kwards` argument:"
+    "If you need to pass additional parameters to the model, use `model_kwargs` argument:"
   ]
  },
  {
@@ -162,7 +152,7 @@
    "    endpoint_url=\"https://<your-endpoint>.<your_region>.inference.ml.azure.com/v1/chat/completions\",\n",
    "    endpoint_api_type=AzureMLEndpointApiType.serverless,\n",
    "    endpoint_api_key=\"my-api-key\",\n",
-    "    content_formatter=LlamaChatContentFormatter,\n",
+    "    content_formatter=CustomOpenAIChatContentFormatter,\n",
    "    model_kwargs={\"temperature\": 0.8},\n",
    ")"
   ]
@@ -204,7 +194,7 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-   "version": "3.10.12"
+   "version": "3.9.1"
  }
 },
 "nbformat": 4,
--- a/docs/docs/integrations/llms/azure_ml.ipynb
+++ b/docs/docs/integrations/llms/azure_ml.ipynb
@@ -29,7 +29,7 @@
    "You must [deploy a model on Azure ML](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-use-foundation-models?view=azureml-api-2#deploying-foundation-models-to-endpoints-for-inferencing) or [to Azure AI studio](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/deploy-models-open) and obtain the following parameters:\n",
    "\n",
    "* `endpoint_url`: The REST endpoint url provided by the endpoint.\n",
-    "* `endpoint_api_type`: Use `endpoint_type='realtime'` when deploying models to **Realtime endpoints** (hosted managed infrastructure). Use `endpoint_type='serverless'` when deploying models using the **Pay-as-you-go** offering (model as a service).\n",
+    "* `endpoint_api_type`: Use `endpoint_type='dedicated'` when deploying models to **Dedicated endpoints** (hosted managed infrastructure). Use `endpoint_type='serverless'` when deploying models using the **Pay-as-you-go** offering (model as a service).\n",
    "* `endpoint_api_key`: The API key provided by the endpoint.\n",
    "* `deployment_name`: (Optional) The deployment name of the model using the endpoint."
   ]
@@ -45,7 +45,7 @@
    "* `GPT2ContentFormatter`: Formats request and response data for GPT2\n",
    "* `DollyContentFormatter`: Formats request and response data for the Dolly-v2\n",
    "* `HFContentFormatter`: Formats request and response data for text-generation Hugging Face models\n",
-    "* `LLamaContentFormatter`: Formats request and response data for LLaMa2\n",
+    "* `CustomOpenAIContentFormatter`: Formats request and response data for models like LLaMa2 that follow OpenAI API compatible scheme.\n",
    "\n",
    "*Note: `OSSContentFormatter` is being deprecated and replaced with `GPT2ContentFormatter`. The logic is the same but `GPT2ContentFormatter` is a more suitable name. You can still continue to use `OSSContentFormatter` as the changes are backwards compatible.*"
   ]
@@ -72,15 +72,15 @@
   "source": [
    "from langchain_community.llms.azureml_endpoint import (\n",
    "    AzureMLEndpointApiType,\n",
-    "    LlamaContentFormatter,\n",
+    "    CustomOpenAIContentFormatter,\n",
    ")\n",
    "from langchain_core.messages import HumanMessage\n",
    "\n",
    "llm = AzureMLOnlineEndpoint(\n",
    "    endpoint_url=\"https://<your-endpoint>.<your_region>.inference.ml.azure.com/score\",\n",
-    "    endpoint_api_type=AzureMLEndpointApiType.realtime,\n",
+    "    endpoint_api_type=AzureMLEndpointApiType.dedicated,\n",
    "    endpoint_api_key=\"my-api-key\",\n",
-    "    content_formatter=LlamaContentFormatter(),\n",
+    "    content_formatter=CustomOpenAIContentFormatter(),\n",
    "    model_kwargs={\"temperature\": 0.8, \"max_new_tokens\": 400},\n",
    ")\n",
    "response = llm.invoke(\"Write me a song about sparkling water:\")\n",
@@ -119,7 +119,7 @@
   "source": [
    "from langchain_community.llms.azureml_endpoint import (\n",
    "    AzureMLEndpointApiType,\n",
-    "    LlamaContentFormatter,\n",
+    "    CustomOpenAIContentFormatter,\n",
    ")\n",
    "from langchain_core.messages import HumanMessage\n",
    "\n",
@@ -127,7 +127,7 @@
    "    endpoint_url=\"https://<your-endpoint>.<your_region>.inference.ml.azure.com/v1/completions\",\n",
    "    endpoint_api_type=AzureMLEndpointApiType.serverless,\n",
    "    endpoint_api_key=\"my-api-key\",\n",
-    "    content_formatter=LlamaContentFormatter(),\n",
+    "    content_formatter=CustomOpenAIContentFormatter(),\n",
    "    model_kwargs={\"temperature\": 0.8, \"max_new_tokens\": 400},\n",
    ")\n",
    "response = llm.invoke(\"Write me a song about sparkling water:\")\n",
@@ -181,7 +181,7 @@
    "content_formatter = CustomFormatter()\n",
    "\n",
    "llm = AzureMLOnlineEndpoint(\n",
-    "    endpoint_api_type=\"realtime\",\n",
+    "    endpoint_api_type=\"dedicated\",\n",
    "    endpoint_api_key=os.getenv(\"BART_ENDPOINT_API_KEY\"),\n",
    "    endpoint_url=os.getenv(\"BART_ENDPOINT_URL\"),\n",
    "    model_kwargs={\"temperature\": 0.8, \"max_new_tokens\": 400},\n",