langchain/docs/versioned_docs/version-0.2.x/integrations/llms/vllm.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "499c3142-2033-437d-a60a-731988ac6074",
   "metadata": {},
   "source": [
    "# vLLM\n",
    "\n",
    "[vLLM](https://vllm.readthedocs.io/en/latest/index.html) is a fast and easy-to-use library for LLM inference and serving, offering:\n",
    "\n",
    "* State-of-the-art serving throughput \n",
    "* Efficient management of attention key and value memory with PagedAttention\n",
    "* Continuous batching of incoming requests\n",
    "* Optimized CUDA kernels\n",
    "\n",
    "This notebooks goes over how to use a LLM with langchain and vLLM.\n",
    "\n",
    "To use, you should have the `vllm` python package installed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "8a3f2666-5c75-4797-967a-7915a247bf33",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "%pip install --upgrade --quiet  vllm -q"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "84e350f7-21f6-455b-b1f0-8b0116a2fd49",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "INFO 08-06 11:37:33 llm_engine.py:70] Initializing an LLM engine with config: model='mosaicml/mpt-7b', tokenizer='mosaicml/mpt-7b', tokenizer_mode=auto, trust_remote_code=True, dtype=torch.bfloat16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=1, seed=0)\n",
      "INFO 08-06 11:37:41 llm_engine.py:196] # GPU blocks: 861, # CPU blocks: 512\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Processed prompts: 100%|██████████| 1/1 [00:00<00:00,  2.00it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "What is the capital of France ? The capital of France is Paris.\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "from langchain_community.llms import VLLM\n",
    "\n",
    "llm = VLLM(\n",
    "    model=\"mosaicml/mpt-7b\",\n",
    "    trust_remote_code=True,  # mandatory for hf models\n",
    "    max_new_tokens=128,\n",
    "    top_k=10,\n",
    "    top_p=0.95,\n",
    "    temperature=0.8,\n",
    ")\n",
    "\n",
    "print(llm.invoke(\"What is the capital of France ?\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "94a3b41d-8329-4f8f-94f9-453d7f132214",
   "metadata": {},
   "source": [
    "## Integrate the model in an LLMChain"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "5605b7a1-fa63-49c1-934d-8b4ef8d71dd5",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Processed prompts: 100%|██████████| 1/1 [00:01<00:00,  1.34s/it]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "1. The first Pokemon game was released in 1996.\n",
      "2. The president was Bill Clinton.\n",
      "3. Clinton was president from 1993 to 2001.\n",
      "4. The answer is Clinton.\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "from langchain.chains import LLMChain\n",
    "from langchain_core.prompts import PromptTemplate\n",
    "\n",
    "template = \"\"\"Question: {question}\n",
    "\n",
    "Answer: Let's think step by step.\"\"\"\n",
    "prompt = PromptTemplate.from_template(template)\n",
    "\n",
    "llm_chain = LLMChain(prompt=prompt, llm=llm)\n",
    "\n",
    "question = \"Who was the US president in the year the first Pokemon game was released?\"\n",
    "\n",
    "print(llm_chain.invoke(question))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "56826aba-d08b-4838-8bfa-ca96e463b25d",
   "metadata": {},
   "source": [
    "## Distributed Inference\n",
    "\n",
    "vLLM supports distributed tensor-parallel inference and serving. \n",
    "\n",
    "To run multi-GPU inference with the LLM class, set the `tensor_parallel_size` argument to the number of GPUs you want to use. For example, to run inference on 4 GPUs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f8c25c35-47b5-459d-9985-3cf546e9ac16",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_community.llms import VLLM\n",
    "\n",
    "llm = VLLM(\n",
    "    model=\"mosaicml/mpt-30b\",\n",
    "    tensor_parallel_size=4,\n",
    "    trust_remote_code=True,  # mandatory for hf models\n",
    ")\n",
    "\n",
    "llm.invoke(\"What is the future of AI?\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d6ca8fd911d25faa",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "## Quantization\n",
    "\n",
    "vLLM supports `awq` quantization. To enable it, pass `quantization` to `vllm_kwargs`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2cada3174c46a0ea",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "llm_q = VLLM(\n",
    "    model=\"TheBloke/Llama-2-7b-Chat-AWQ\",\n",
    "    trust_remote_code=True,\n",
    "    max_new_tokens=512,\n",
    "    vllm_kwargs={\"quantization\": \"awq\"},\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "64e89be0-6ad7-43a8-9dac-1324dcd4e851",
   "metadata": {
    "tags": []
   },
   "source": [
    "## OpenAI-Compatible Server\n",
    "\n",
    "vLLM can be deployed as a server that mimics the OpenAI API protocol. This allows vLLM to be used as a drop-in replacement for applications using OpenAI API.\n",
    "\n",
    "This server can be queried in the same format as OpenAI API.\n",
    "\n",
    "### OpenAI-Compatible Completion"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "c3cbc428-0bb8-422a-913e-1c6fef8b89d4",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " a city that is filled with history, ancient buildings, and art around every corner\n"
     ]
    }
   ],
   "source": [
    "from langchain_community.llms import VLLMOpenAI\n",
    "\n",
    "llm = VLLMOpenAI(\n",
    "    openai_api_key=\"EMPTY\",\n",
    "    openai_api_base=\"http://localhost:8000/v1\",\n",
    "    model_name=\"tiiuae/falcon-7b\",\n",
    "    model_kwargs={\"stop\": [\".\"]},\n",
    ")\n",
    "print(llm.invoke(\"Rome is\"))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "conda_pytorch_p310",
   "language": "python",
   "name": "conda_pytorch_p310"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}