docs: Added Deploying LLMs into production + a new ecosystem (#4047)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Co-authored-by: Kamil Kaczmarek <kaczmarek.poczta@gmail.com>
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
2025-09-05 13:06:03 +00:00 · 2023-06-05 12:47:27 -07:00
parent 74f8e603d9
commit 625717daa8
4 changed files with 379 additions and 0 deletions
--- a/docs/ecosystem/deployments.md
+++ b/docs/ecosystem/deployments.md
@@ -6,6 +6,11 @@ This section covers several options for that. Note that these options are meant

 What follows is a list of template GitHub repositories designed to be easily forked and modified to use your chain. This list is far from exhaustive, and we are EXTREMELY open to contributions here.

+## [Anyscale](https://www.anyscale.com/model-serving)
+
+Anyscale is a unified compute platform that makes it easy to develop, deploy, and manage scalable LLM applications in production using Ray.
+With Anyscale you can scale the most challenging LLM-based workloads and both develop and deploy LLM-based apps on a single compute platform.
+
 ## [Streamlit](https://github.com/hwchase17/langchain-streamlit-template)

 This repo serves as a template for how to deploy a LangChain with Streamlit.
--- a/docs/ecosystem/ray_serve.ipynb
+++ b/docs/ecosystem/ray_serve.ipynb
@@ -0,0 +1,233 @@
+{
+ "cells": [
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Ray Serve\n",
+    "\n",
+    "[Ray Serve](https://docs.ray.io/en/latest/serve/index.html) is a scalable model serving library for building online inference APIs. Serve is particularly well suited for system composition, enabling you to build a complex inference service consisting of multiple chains and business logic all in Python code. "
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Goal of this notebook\n",
+    "This notebook shows a simple example of how to deploy an OpenAI chain into production. You can extend it to deploy your own self-hosted models, where you can easily define the amount of hardware resources (GPUs and CPUs) needed to run your model efficiently in production. Read more about the available options, including autoscaling, in the Ray Serve [documentation](https://docs.ray.io/en/latest/serve/getting_started.html).\n"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Set up Ray Serve\n",
+    "Install Ray Serve with `pip install \"ray[serve]\"` (the quotes keep shells such as zsh from expanding the brackets)."
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## General Skeleton"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The general skeleton for deploying a service is the following:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# 0: Import ray serve and request from starlette\n",
+    "from ray import serve\n",
+    "from starlette.requests import Request\n",
+    "\n",
+    "# 1: Define a Ray Serve deployment.\n",
+    "@serve.deployment\n",
+    "class LLMServe:\n",
+    "\n",
+    "    def __init__(self) -> None:\n",
+    "        # All the initialization code goes here\n",
+    "        pass\n",
+    "\n",
+    "    async def __call__(self, request: Request) -> str:\n",
+    "        # You can parse the request here\n",
+    "        # and return a response\n",
+    "        return \"Hello World\"\n",
+    "\n",
+    "# 2: Bind the model to deployment\n",
+    "deployment = LLMServe.bind()\n",
+    "\n",
+    "# 3: Run the deployment\n",
+    "serve.run(deployment)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Shutdown the deployment\n",
+    "serve.shutdown()"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Example of deploying an OpenAI chain with custom prompts"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Get an OpenAI API key from [here](https://platform.openai.com/account/api-keys). When you run the following cells, you will be prompted for your API key."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain.llms import OpenAI\n",
+    "from langchain import PromptTemplate, LLMChain"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from getpass import getpass\n",
+    "OPENAI_API_KEY = getpass()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "@serve.deployment\n",
+    "class DeployLLM:\n",
+    "\n",
+    "    def __init__(self):\n",
+    "        # We initialize the LLM, template and the chain here\n",
+    "        llm = OpenAI(openai_api_key=OPENAI_API_KEY)\n",
+    "        template = \"Question: {question}\\n\\nAnswer: Let's think step by step.\"\n",
+    "        prompt = PromptTemplate(template=template, input_variables=[\"question\"])\n",
+    "        self.chain = LLMChain(llm=llm, prompt=prompt)\n",
+    "\n",
+    "    def _run_chain(self, text: str):\n",
+    "        return self.chain(text)\n",
+    "\n",
+    "    async def __call__(self, request: Request):\n",
+    "        # 1. Parse the request\n",
+    "        text = request.query_params[\"text\"]\n",
+    "        # 2. Run the chain\n",
+    "        resp = self._run_chain(text)\n",
+    "        # 3. Return the response\n",
+    "        return resp[\"text\"]"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now we can bind the deployment."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Bind the model to deployment\n",
+    "deployment = DeployLLM.bind()"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We can assign the port number and host when we want to run the deployment. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Example port number\n",
+    "PORT_NUMBER = 8282\n",
+    "# Run the deployment\n",
+    "serve.run(deployment, port=PORT_NUMBER)"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now that the service is deployed at `localhost:8282`, we can send a POST request to get the results back."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import requests\n",
+    "\n",
+    "text = \"What NFL team won the Super Bowl in the year Justin Bieber was born?\"\n",
+    "response = requests.post(f\"http://localhost:{PORT_NUMBER}/\", params={\"text\": text})\n",
+    "print(response.content.decode())"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "ray",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.9"
+  },
+  "orig_nbformat": 4
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
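
Note on the request cell in `ray_serve.ipynb`: the client passes the question as a `text` query parameter, and the deployment reads it back from `request.query_params["text"]`. The sketch below is not part of this commit. It uses only the standard library to show how such a parameter could be percent-encoded on the client and decoded again, so that spaces, `?`, and `&` in the question survive the round trip. The helper names `build_request_url` and `read_text_param` are hypothetical, and the decoding step only approximates what a server-side parser does.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

PORT_NUMBER = 8282  # example port, as in the notebook


def build_request_url(port: int, text: str) -> str:
    """Return the service URL with `text` percent-encoded as a query parameter."""
    return f"http://localhost:{port}/?{urlencode({'text': text})}"


def read_text_param(url: str) -> str:
    """Decode the `text` query parameter from a URL (approximates server-side parsing)."""
    return parse_qs(urlsplit(url).query)["text"][0]


question = "What NFL team won the Super Bowl in the year Justin Bieber was born?"
url = build_request_url(PORT_NUMBER, question)

# The encoded URL contains no raw spaces, and decoding recovers the original text.
print(url)
print(read_text_param(url) == question)
```

With `requests`, passing `params={"text": question}` gives the same encoding without building the URL by hand.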