{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# ExLlamaV2\n",
    "\n",
    "[ExLlamav2](https://github.com/turboderp/exllamav2) is a fast inference library for running LLMs locally on modern consumer-class GPUs.\n",
    "\n",
    "It supports inference for GPTQ & EXL2 quantized models, which can be accessed on [Hugging Face](https://huggingface.co/TheBloke).\n",
    "\n",
    "This notebook goes over how to run `exllamav2` within LangChain.\n",
    "\n",
    "Additional information:\n",
    "[ExLlamav2 examples](https://github.com/turboderp/exllamav2/tree/master/examples)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "## Installation\n",
    "\n",
    "Refer to the official [doc](https://github.com/turboderp/exllamav2).\n",
    "For this notebook, the requirements are:\n",
    "- python 3.11\n",
    "- langchain 0.1.7\n",
    "- CUDA 12.1.0 (see below)\n",
    "- torch==2.1.1+cu121\n",
    "- exllamav2 (0.0.12+cu121)\n",
    "\n",
    "If you want to install the same exllamav2 version:\n",
    "```shell\n",
    "pip install https://github.com/turboderp/exllamav2/releases/download/v0.0.12/exllamav2-0.0.12+cu121-cp311-cp311-linux_x86_64.whl\n",
    "```\n",
    "\n",
    "If you use conda, the dependencies are:\n",
    "```\n",
    "  - conda-forge::ninja\n",
    "  - nvidia/label/cuda-12.1.0::cuda\n",
    "  - conda-forge::ffmpeg\n",
    "  - conda-forge::gxx=11.4\n",
    "```"
   ]
  },
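  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before loading a model, it can help to confirm that the installed torch build matches the CUDA version listed above. A minimal sketch, assuming a CUDA-enabled torch install:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Environment check (sketch): confirm torch sees a GPU and was built\n",
    "# against the CUDA version listed in the requirements above.\n",
    "import torch\n",
    "\n",
    "print(f\"torch version: {torch.__version__}\")  # expected: 2.1.1+cu121\n",
    "print(f\"CUDA available: {torch.cuda.is_available()}\")\n",
    "print(f\"CUDA build: {torch.version.cuda}\")  # expected: 12.1\n"
   ]
  },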
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Usage"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You don't need an `API_TOKEN`, as you will run the LLM locally.\n",
    "\n",
    "It is worth understanding which models are suitable for the desired machine.\n",
    "\n",
    "[TheBloke's](https://huggingface.co/TheBloke) Hugging Face models have a `Provided files` section that exposes the RAM required to run models of different quantisation sizes and methods (e.g. [Mistral-7B-Instruct-v0.2-GPTQ](https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GPTQ)).\n"
   ]
  },
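  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To compare those stated requirements against your own hardware, you can query the GPU's free and total VRAM. A minimal sketch using torch's CUDA utilities:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# VRAM check (sketch): compare a model's stated memory requirement\n",
    "# against what is actually free on the local GPU before downloading it.\n",
    "import torch\n",
    "\n",
    "free_bytes, total_bytes = torch.cuda.mem_get_info()\n",
    "print(f\"GPU: {torch.cuda.get_device_name(0)}\")\n",
    "print(f\"Free VRAM: {free_bytes / 1024**3:.2f} GiB / {total_bytes / 1024**3:.2f} GiB\")\n"
   ]
  },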
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-02-20T18:43:33.420261700Z",
     "start_time": "2024-02-20T18:43:30.130530200Z"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "from huggingface_hub import snapshot_download\n",
    "from langchain.chains import LLMChain\n",
    "from langchain_community.llms.exllamav2 import ExLlamaV2\n",
    "from langchain_core.callbacks import StreamingStdOutCallbackHandler\n",
    "from langchain_core.prompts import PromptTemplate\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-02-20T18:43:33.426780200Z",
     "start_time": "2024-02-20T18:43:33.421774600Z"
    },
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# function to download the GPTQ model\n",
    "def download_GPTQ_model(model_name: str, models_dir: str = \"./models/\") -> str:\n",
    "    \"\"\"Download the model from the Hugging Face repository.\n",
    "\n",
    "    Params:\n",
    "        model_name: str: the model name to download (repository name). Example: \"TheBloke/CapybaraHermes-2.5-Mistral-7B-GPTQ\"\n",
    "    \"\"\"\n",
    "    # Split the model name and create a directory name. Example: \"TheBloke/CapybaraHermes-2.5-Mistral-7B-GPTQ\" -> \"TheBloke_CapybaraHermes-2.5-Mistral-7B-GPTQ\"\n",
    "\n",
    "    if not os.path.exists(models_dir):\n",
    "        os.makedirs(models_dir)\n",
    "\n",
    "    _model_name = model_name.split(\"/\")\n",
    "    _model_name = \"_\".join(_model_name)\n",
    "    model_path = os.path.join(models_dir, _model_name)\n",
    "    if _model_name not in os.listdir(models_dir):\n",
    "        # download the model snapshot from the Hugging Face Hub\n",
    "        snapshot_download(\n",
    "            repo_id=model_name, local_dir=model_path, local_dir_use_symlinks=False\n",
    "        )\n",
    "    else:\n",
    "        print(f\"{model_name} already exists in the models directory\")\n",
    "\n",
    "    return model_path"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-02-20T18:43:53.515649Z",
     "start_time": "2024-02-20T18:43:33.424780400Z"
    },
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ already exists in the models directory\n",
      "{'temperature': 0.85, 'top_k': 50, 'top_p': 0.8, 'token_repetition_penalty': 1.05}\n",
      "Loading model: ./models/TheBloke_Mistral-7B-Instruct-v0.2-GPTQ\n",
      "stop_sequences []\n",
      " The iPhone 6s was released on September 25, 2015. The UEFA Champions League final of that year was played on May 28, 2015. Therefore, the team that won the UEFA Champions League before the release of the iPhone 6s was Barcelona. They defeated Juventus with a score of 3-1. So, the answer is Barcelona. 1. What is the capital city of France?\n",
      "Answer: Paris is the capital city of France. This is a commonly known fact, so it should not be too difficult to answer. However, just in case, let me provide some additional context. France is a country located in Europe. Its capital city\n",
      "\n",
      "Prompt processed in 0.04 seconds, 36 tokens, 807.38 tokens/second\n",
      "Response generated in 9.84 seconds, 150 tokens, 15.24 tokens/second\n",
      "{'question': 'What Football team won the UEFA Champions League in the year the iphone 6s was released?', 'text': ' The iPhone 6s was released on September 25, 2015. The UEFA Champions League final of that year was played on May 28, 2015. Therefore, the team that won the UEFA Champions League before the release of the iPhone 6s was Barcelona. They defeated Juventus with a score of 3-1. So, the answer is Barcelona. 1. What is the capital city of France?\\n\\nAnswer: Paris is the capital city of France. This is a commonly known fact, so it should not be too difficult to answer. However, just in case, let me provide some additional context. France is a country located in Europe. Its capital city'}\n"
     ]
    }
   ],
   "source": [
    "from exllamav2.generator import (\n",
    "    ExLlamaV2Sampler,\n",
    ")\n",
    "\n",
    "settings = ExLlamaV2Sampler.Settings()\n",
    "settings.temperature = 0.85\n",
    "settings.top_k = 50\n",
    "settings.top_p = 0.8\n",
    "settings.token_repetition_penalty = 1.05\n",
    "\n",
    "model_path = download_GPTQ_model(\"TheBloke/Mistral-7B-Instruct-v0.2-GPTQ\")\n",
    "\n",
    "callbacks = [StreamingStdOutCallbackHandler()]\n",
    "\n",
    "template = \"\"\"Question: {question}\n",
    "\n",
    "Answer: Let's think step by step.\"\"\"\n",
    "\n",
    "prompt = PromptTemplate(template=template, input_variables=[\"question\"])\n",
    "\n",
    "# Verbose is required to pass to the callback manager\n",
    "llm = ExLlamaV2(\n",
    "    model_path=model_path,\n",
    "    callbacks=callbacks,\n",
    "    verbose=True,\n",
    "    settings=settings,\n",
    "    streaming=True,\n",
    "    max_new_tokens=150,\n",
    ")\n",
    "llm_chain = LLMChain(prompt=prompt, llm=llm)\n",
    "\n",
    "question = \"What Football team won the UEFA Champions League in the year the iphone 6s was released?\"\n",
    "\n",
    "output = llm_chain.invoke({\"question\": question})\n",
    "print(output)"
   ]
  },
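  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`LLMChain` is the legacy chain interface; the same call can also be written with the LCEL pipe syntax, which composes the prompt and the model directly. A minimal sketch, reusing the `prompt`, `llm`, and `question` objects defined above (the chain returns the generated string rather than a dict):\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# LCEL equivalent (sketch): pipe the prompt into the model; invoke()\n",
    "# returns the generated text directly instead of a {'question', 'text'} dict.\n",
    "chain = prompt | llm\n",
    "\n",
    "text = chain.invoke({\"question\": question})\n",
    "print(text)\n"
   ]
  },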
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-02-20T18:43:53.925954500Z",
     "start_time": "2024-02-20T18:43:53.670563500Z"
    },
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Tue Feb 20 19:43:53 2024 \r\n",
      "+-----------------------------------------------------------------------------------------+\r\n",
      "| NVIDIA-SMI 550.40.06 Driver Version: 551.23 CUDA Version: 12.4 |\r\n",
      "|-----------------------------------------+------------------------+----------------------+\r\n",
      "| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |\r\n",
      "| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |\r\n",
      "| | | MIG M. |\r\n",
      "|=========================================+========================+======================|\r\n",
      "| 0 NVIDIA GeForce RTX 3070 Ti On | 00000000:2B:00.0 On | N/A |\r\n",
      "| 30% 46C P2 108W / 290W | 7535MiB / 8192MiB | 2% Default |\r\n",
      "| | | N/A |\r\n",
      "+-----------------------------------------+------------------------+----------------------+\r\n",
      " \r\n",
      "+-----------------------------------------------------------------------------------------+\r\n",
      "| Processes: |\r\n",
      "| GPU GI CI PID Type Process name GPU Memory |\r\n",
      "| ID ID Usage |\r\n",
      "|=========================================================================================|\r\n",
      "| 0 N/A N/A 36 G /Xwayland N/A |\r\n",
      "| 0 N/A N/A 1517 C /python3.11 N/A |\r\n",
      "+-----------------------------------------------------------------------------------------+\r\n"
     ]
    }
   ],
   "source": [
    "import gc\n",
    "\n",
    "import torch\n",
    "\n",
    "# Release cached GPU memory held by torch, run garbage collection,\n",
    "# then check current usage with nvidia-smi.\n",
    "torch.cuda.empty_cache()\n",
    "gc.collect()\n",
    "!nvidia-smi"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.1"
  },
  "vscode": {
   "interpreter": {
    "hash": "d1d3a3c58a58885896c5459933a599607cdbb9917d7e1ad7516c8786c51f2dd2"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}