langchain/docs/versioned_docs/version-0.2.x/integrations/llms/llamacpp.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Llama.cpp\n",
    "\n",
    "[llama-cpp-python](https://github.com/abetlen/llama-cpp-python) is a Python binding for [llama.cpp](https://github.com/ggerganov/llama.cpp).\n",
    "\n",
    "It supports inference for [many LLMs](https://github.com/ggerganov/llama.cpp#description) models, which can be accessed on [Hugging Face](https://huggingface.co/TheBloke).\n",
    "\n",
    "This notebook goes over how to run `llama-cpp-python` within LangChain.\n",
    "\n",
    "**Note: new versions of `llama-cpp-python` use GGUF model files (see [here](https://github.com/abetlen/llama-cpp-python/pull/633)).**\n",
    "\n",
    "This is a breaking change.\n",
    " \n",
    "To convert existing GGML models to GGUF you can run the following in [llama.cpp](https://github.com/ggerganov/llama.cpp):\n",
    "\n",
    "```\n",
    "python ./convert-llama-ggmlv3-to-gguf.py --eps 1e-5 --input models/openorca-platypus2-13b.ggmlv3.q4_0.bin --output models/openorca-platypus2-13b.gguf.q4_0.bin\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Installation\n",
    "\n",
    "There are different options on how to install the llama-cpp package: \n",
    "- CPU usage\n",
    "- CPU + GPU (using one of many BLAS backends)\n",
    "- Metal GPU (MacOS with Apple Silicon Chip) \n",
    "\n",
    "### CPU only installation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "%pip install --upgrade --quiet  llama-cpp-python"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Installation with OpenBLAS / cuBLAS / CLBlast\n",
    "\n",
    "`llama.cpp` supports multiple BLAS backends for faster processing. Use the `FORCE_CMAKE=1` environment variable to force the use of cmake and install the pip package for the desired BLAS backend ([source](https://github.com/abetlen/llama-cpp-python#installation-with-openblas--cublas--clblast)).\n",
    "\n",
    "Example installation with cuBLAS backend:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!CMAKE_ARGS=\"-DLLAMA_CUBLAS=on\" FORCE_CMAKE=1 pip install llama-cpp-python"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**IMPORTANT**: If you have already installed the CPU only version of the package, you need to reinstall it from scratch. Consider the following command: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!CMAKE_ARGS=\"-DLLAMA_CUBLAS=on\" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Installation with Metal\n",
    "\n",
    "`llama.cpp` supports Apple silicon first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks. Use the `FORCE_CMAKE=1` environment variable to force the use of cmake and install the pip package for the Metal support ([source](https://github.com/abetlen/llama-cpp-python/blob/main/docs/install/macos.md)).\n",
    "\n",
    "Example installation with Metal Support:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!CMAKE_ARGS=\"-DLLAMA_METAL=on\" FORCE_CMAKE=1 pip install llama-cpp-python"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**IMPORTANT**: If you have already installed a cpu only version of the package, you need to reinstall it from scratch: consider the following command: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!CMAKE_ARGS=\"-DLLAMA_METAL=on\" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Installation with Windows\n",
    "\n",
    "It is stable to install the `llama-cpp-python` library by compiling from the source. You can follow most of the instructions in the repository itself but there are some windows specific instructions which might be useful.\n",
    "\n",
    "Requirements to install the `llama-cpp-python`,\n",
    "\n",
    "- git\n",
    "- python\n",
    "- cmake\n",
    "- Visual Studio Community (make sure you install this with the following settings)\n",
    "    - Desktop development with C++\n",
    "    - Python development\n",
    "    - Linux embedded development with C++\n",
    "\n",
    "1. Clone git repository recursively to get `llama.cpp` submodule as well \n",
    "\n",
    "```\n",
    "git clone --recursive -j8 https://github.com/abetlen/llama-cpp-python.git\n",
    "```\n",
    "\n",
    "2. Open up a command Prompt and set the following environment variables.\n",
    "\n",
    "\n",
    "```\n",
    "set FORCE_CMAKE=1\n",
    "set CMAKE_ARGS=-DLLAMA_CUBLAS=OFF\n",
    "```\n",
    "If you have an NVIDIA GPU make sure `DLLAMA_CUBLAS` is set to `ON`\n",
    "\n",
    "#### Compiling and installing\n",
    "\n",
    "Now you can `cd` into the `llama-cpp-python` directory and install the package\n",
    "\n",
    "```\n",
    "python -m pip install -e .\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**IMPORTANT**: If you have already installed a cpu only version of the package, you need to reinstall it from scratch: consider the following command: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!python -m pip install -e . --force-reinstall --no-cache-dir"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Usage"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Make sure you are following all instructions to [install all necessary model files](https://github.com/ggerganov/llama.cpp).\n",
    "\n",
    "You don't need an `API_TOKEN` as you will run the LLM locally.\n",
    "\n",
    "It is worth understanding which models are suitable to be used on the desired machine.\n",
    "\n",
    "[TheBloke's](https://huggingface.co/TheBloke) Hugging Face models have a `Provided files` section that exposes the RAM required to run models of different quantisation sizes and methods (eg: [Llama2-7B-Chat-GGUF](https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF#provided-files)).\n",
    "\n",
    "This [github issue](https://github.com/facebookresearch/llama/issues/425) is also relevant to find the right model for your machine."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from langchain_community.llms import LlamaCpp\n",
    "from langchain_core.callbacks import CallbackManager, StreamingStdOutCallbackHandler\n",
    "from langchain_core.prompts import PromptTemplate"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Consider using a template that suits your model! Check the models page on Hugging Face etc. to get a correct prompting template.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "template = \"\"\"Question: {question}\n",
    "\n",
    "Answer: Let's work this out in a step by step way to be sure we have the right answer.\"\"\"\n",
    "\n",
    "prompt = PromptTemplate.from_template(template)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Callbacks support token-wise streaming\n",
    "callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### CPU"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Example using a LLaMA 2 7B model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Make sure the model path is correct for your system!\n",
    "llm = LlamaCpp(\n",
    "    model_path=\"/Users/rlm/Desktop/Code/llama.cpp/models/openorca-platypus2-13b.gguf.q4_0.bin\",\n",
    "    temperature=0.75,\n",
    "    max_tokens=2000,\n",
    "    top_p=1,\n",
    "    callback_manager=callback_manager,\n",
    "    verbose=True,  # Verbose is required to pass to the callback manager\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Stephen Colbert:\n",
      "Yo, John, I heard you've been talkin' smack about me on your show.\n",
      "Let me tell you somethin', pal, I'm the king of late-night TV\n",
      "My satire is sharp as a razor, it cuts deeper than a knife\n",
      "While you're just a british bloke tryin' to be funny with your accent and your wit.\n",
      "John Oliver:\n",
      "Oh Stephen, don't be ridiculous, you may have the ratings but I got the real talk.\n",
      "My show is the one that people actually watch and listen to, not just for the laughs but for the facts.\n",
      "While you're busy talkin' trash, I'm out here bringing the truth to light.\n",
      "Stephen Colbert:\n",
      "Truth? Ha! You think your show is about truth? Please, it's all just a joke to you.\n",
      "You're just a fancy-pants british guy tryin' to be funny with your news and your jokes.\n",
      "While I'm the one who's really makin' a difference, with my sat"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "llama_print_timings:        load time =   358.60 ms\n",
      "llama_print_timings:      sample time =   172.55 ms /   256 runs   (    0.67 ms per token,  1483.59 tokens per second)\n",
      "llama_print_timings: prompt eval time =   613.36 ms /    16 tokens (   38.33 ms per token,    26.09 tokens per second)\n",
      "llama_print_timings:        eval time = 10151.17 ms /   255 runs   (   39.81 ms per token,    25.12 tokens per second)\n",
      "llama_print_timings:       total time = 11332.41 ms\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "\"\\nStephen Colbert:\\nYo, John, I heard you've been talkin' smack about me on your show.\\nLet me tell you somethin', pal, I'm the king of late-night TV\\nMy satire is sharp as a razor, it cuts deeper than a knife\\nWhile you're just a british bloke tryin' to be funny with your accent and your wit.\\nJohn Oliver:\\nOh Stephen, don't be ridiculous, you may have the ratings but I got the real talk.\\nMy show is the one that people actually watch and listen to, not just for the laughs but for the facts.\\nWhile you're busy talkin' trash, I'm out here bringing the truth to light.\\nStephen Colbert:\\nTruth? Ha! You think your show is about truth? Please, it's all just a joke to you.\\nYou're just a fancy-pants british guy tryin' to be funny with your news and your jokes.\\nWhile I'm the one who's really makin' a difference, with my sat\""
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "question = \"\"\"\n",
    "Question: A rap battle between Stephen Colbert and John Oliver\n",
    "\"\"\"\n",
    "llm.invoke(question)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Example using a LLaMA v1 model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Make sure the model path is correct for your system!\n",
    "llm = LlamaCpp(\n",
    "    model_path=\"./ggml-model-q4_0.bin\", callback_manager=callback_manager, verbose=True\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "llm_chain = prompt | llm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "1. First, find out when Justin Bieber was born.\n",
      "2. We know that Justin Bieber was born on March 1, 1994.\n",
      "3. Next, we need to look up when the Super Bowl was played in that year.\n",
      "4. The Super Bowl was played on January 28, 1995.\n",
      "5. Finally, we can use this information to answer the question. The NFL team that won the Super Bowl in the year Justin Bieber was born is the San Francisco 49ers."
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "llama_print_timings:        load time =   434.15 ms\n",
      "llama_print_timings:      sample time =    41.81 ms /   121 runs   (    0.35 ms per token)\n",
      "llama_print_timings: prompt eval time =  2523.78 ms /    48 tokens (   52.58 ms per token)\n",
      "llama_print_timings:        eval time = 23971.57 ms /   121 runs   (  198.11 ms per token)\n",
      "llama_print_timings:       total time = 28945.95 ms\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "'\\n\\n1. First, find out when Justin Bieber was born.\\n2. We know that Justin Bieber was born on March 1, 1994.\\n3. Next, we need to look up when the Super Bowl was played in that year.\\n4. The Super Bowl was played on January 28, 1995.\\n5. Finally, we can use this information to answer the question. The NFL team that won the Super Bowl in the year Justin Bieber was born is the San Francisco 49ers.'"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "question = \"What NFL team won the Super Bowl in the year Justin Bieber was born?\"\n",
    "llm_chain.invoke({\"question\": question})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### GPU\n",
    "\n",
    "If the installation with BLAS backend was correct, you will see a `BLAS = 1` indicator in model properties.\n",
    "\n",
    "Two of the most important parameters for use with GPU are:\n",
    "\n",
    "- `n_gpu_layers` - determines how many layers of the model are offloaded to your GPU.\n",
    "- `n_batch` - how many tokens are processed in parallel. \n",
    "\n",
    "Setting these parameters correctly will dramatically improve the evaluation speed (see [wrapper code](https://github.com/langchain-ai/langchain/blob/master/libs/community/langchain_community/llms/llamacpp.py) for more details)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "n_gpu_layers = -1  # The number of layers to put on the GPU. The rest will be on the CPU. If you don't know how many layers there are, you can use -1 to move all to GPU.\n",
    "n_batch = 512  # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.\n",
    "\n",
    "# Make sure the model path is correct for your system!\n",
    "llm = LlamaCpp(\n",
    "    model_path=\"/Users/rlm/Desktop/Code/llama.cpp/models/openorca-platypus2-13b.gguf.q4_0.bin\",\n",
    "    n_gpu_layers=n_gpu_layers,\n",
    "    n_batch=n_batch,\n",
    "    callback_manager=callback_manager,\n",
    "    verbose=True,  # Verbose is required to pass to the callback manager\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "1. Identify Justin Bieber's birth date: Justin Bieber was born on March 1, 1994.\n",
      "\n",
      "2. Find the Super Bowl winner of that year: The NFL season of 1993 with the Super Bowl being played in January or of 1994.\n",
      "\n",
      "3. Determine which team won the game: The Dallas Cowboys faced the Buffalo Bills in Super Bowl XXVII on January 31, 1993 (as the year is mis-labelled due to a error). The Dallas Cowboys won this matchup.\n",
      "\n",
      "So, Justin Bieber was born when the Dallas Cowboys were the reigning NFL Super Bowl."
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "llama_print_timings:        load time =   427.63 ms\n",
      "llama_print_timings:      sample time =   115.85 ms /   164 runs   (    0.71 ms per token,  1415.67 tokens per second)\n",
      "llama_print_timings: prompt eval time =   427.53 ms /    45 tokens (    9.50 ms per token,   105.26 tokens per second)\n",
      "llama_print_timings:        eval time =  4526.53 ms /   163 runs   (   27.77 ms per token,    36.01 tokens per second)\n",
      "llama_print_timings:       total time =  5293.77 ms\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "\"\\n\\n1. Identify Justin Bieber's birth date: Justin Bieber was born on March 1, 1994.\\n\\n2. Find the Super Bowl winner of that year: The NFL season of 1993 with the Super Bowl being played in January or of 1994.\\n\\n3. Determine which team won the game: The Dallas Cowboys faced the Buffalo Bills in Super Bowl XXVII on January 31, 1993 (as the year is mis-labelled due to a error). The Dallas Cowboys won this matchup.\\n\\nSo, Justin Bieber was born when the Dallas Cowboys were the reigning NFL Super Bowl.\""
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "llm_chain = prompt | llm\n",
    "question = \"What NFL team won the Super Bowl in the year Justin Bieber was born?\"\n",
    "llm_chain.invoke({\"question\": question})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Metal\n",
    "\n",
    "If the installation with Metal was correct, you will see a `NEON = 1` indicator in model properties.\n",
    "\n",
    "Two of the most important GPU parameters are:\n",
    "\n",
    "- `n_gpu_layers` - determines how many layers of the model are offloaded to your Metal GPU.\n",
    "- `n_batch` - how many tokens are processed in parallel, default is 8, set to bigger number.\n",
    "- `f16_kv` - for some reason, Metal only support `True`, otherwise you will get error such as `Asserting on type 0\n",
    "GGML_ASSERT: .../ggml-metal.m:706: false && \"not implemented\"`\n",
    "\n",
    "Setting these parameters correctly will dramatically improve the evaluation speed (see [wrapper code](https://github.com/langchain-ai/langchain/blob/master/libs/community/langchain_community/llms/llamacpp.py) for more details)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "n_gpu_layers = 1  # The number of layers to put on the GPU. The rest will be on the CPU. If you don't know how many layers there are, you can use -1 to move all to GPU.\n",
    "n_batch = 512  # Should be between 1 and n_ctx, consider the amount of RAM of your Apple Silicon Chip.\n",
    "# Make sure the model path is correct for your system!\n",
    "llm = LlamaCpp(\n",
    "    model_path=\"/Users/rlm/Desktop/Code/llama.cpp/models/openorca-platypus2-13b.gguf.q4_0.bin\",\n",
    "    n_gpu_layers=n_gpu_layers,\n",
    "    n_batch=n_batch,\n",
    "    f16_kv=True,  # MUST set to True, otherwise you will run into problem after a couple of calls\n",
    "    callback_manager=callback_manager,\n",
    "    verbose=True,  # Verbose is required to pass to the callback manager\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The console log will show the following log to indicate Metal was enable properly.\n",
    "\n",
    "```\n",
    "ggml_metal_init: allocating\n",
    "ggml_metal_init: using MPS\n",
    "...\n",
    "```\n",
    "\n",
    "You also could check `Activity Monitor` by watching the GPU usage of the process, the CPU usage will drop dramatically after turn on `n_gpu_layers=1`. \n",
    "\n",
    "For the first call to the LLM, the performance may be slow due to the model compilation in Metal GPU."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Grammars\n",
    "\n",
    "We can use [grammars](https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md) to constrain model outputs and sample tokens based on the rules defined in them.\n",
    "\n",
    "To demonstrate this concept, we've included [sample grammar files](https://github.com/langchain-ai/langchain/tree/master/libs/langchain/langchain/llms/grammars), that will be used in the examples below.\n",
    "\n",
    "Creating gbnf grammar files can be time-consuming, but if you have a use-case where output schemas are important, there are two tools that can help:\n",
    "- [Online grammar generator app](https://grammar.intrinsiclabs.ai/) that converts TypeScript interface definitions to gbnf file.\n",
    "- [Python script](https://github.com/ggerganov/llama.cpp/blob/master/examples/json-schema-to-grammar.py) for converting json schema to gbnf file. You can for example create `pydantic` object, generate its JSON schema using `.schema_json()` method, and then use this script to convert it to gbnf file."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the first example, supply the path to the specified `json.gbnf` file in order to produce JSON:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "n_gpu_layers = 1  # The number of layers to put on the GPU. The rest will be on the CPU. If you don't know how many layers there are, you can use -1 to move all to GPU.\n",
    "n_batch = 512  # Should be between 1 and n_ctx, consider the amount of RAM of your Apple Silicon Chip.\n",
    "# Make sure the model path is correct for your system!\n",
    "llm = LlamaCpp(\n",
    "    model_path=\"/Users/rlm/Desktop/Code/llama.cpp/models/openorca-platypus2-13b.gguf.q4_0.bin\",\n",
    "    n_gpu_layers=n_gpu_layers,\n",
    "    n_batch=n_batch,\n",
    "    f16_kv=True,  # MUST set to True, otherwise you will run into problem after a couple of calls\n",
    "    callback_manager=callback_manager,\n",
    "    verbose=True,  # Verbose is required to pass to the callback manager\n",
    "    grammar_path=\"/Users/rlm/Desktop/Code/langchain-main/langchain/libs/langchain/langchain/llms/grammars/json.gbnf\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "  \"name\": \"John Doe\",\n",
      "  \"age\": 34,\n",
      "  \"\": {\n",
      "    \"title\": \"Software Developer\",\n",
      "    \"company\": \"Google\"\n",
      "  },\n",
      "  \"interests\": [\n",
      "    \"Sports\",\n",
      "    \"Music\",\n",
      "    \"Cooking\"\n",
      "  ],\n",
      "  \"address\": {\n",
      "    \"street_number\": 123,\n",
      "    \"street_name\": \"Oak Street\",\n",
      "    \"city\": \"Mountain View\",\n",
      "    \"state\": \"California\",\n",
      "    \"postal_code\": 94040\n",
      "  }}"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "llama_print_timings:        load time =   357.51 ms\n",
      "llama_print_timings:      sample time =  1213.30 ms /   144 runs   (    8.43 ms per token,   118.68 tokens per second)\n",
      "llama_print_timings: prompt eval time =   356.78 ms /     9 tokens (   39.64 ms per token,    25.23 tokens per second)\n",
      "llama_print_timings:        eval time =  3947.16 ms /   143 runs   (   27.60 ms per token,    36.23 tokens per second)\n",
      "llama_print_timings:       total time =  5846.21 ms\n"
     ]
    }
   ],
   "source": [
    "%%capture captured --no-stdout\n",
    "result = llm.invoke(\"Describe a person in JSON format:\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also supply `list.gbnf` to return a list:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "n_gpu_layers = 1\n",
    "n_batch = 512\n",
    "llm = LlamaCpp(\n",
    "    model_path=\"/Users/rlm/Desktop/Code/llama.cpp/models/openorca-platypus2-13b.gguf.q4_0.bin\",\n",
    "    n_gpu_layers=n_gpu_layers,\n",
    "    n_batch=n_batch,\n",
    "    f16_kv=True,  # MUST set to True, otherwise you will run into problem after a couple of calls\n",
    "    callback_manager=callback_manager,\n",
    "    verbose=True,\n",
    "    grammar_path=\"/Users/rlm/Desktop/Code/langchain-main/langchain/libs/langchain/langchain/llms/grammars/list.gbnf\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[\"The Catcher in the Rye\", \"Wuthering Heights\", \"Anna Karenina\"]\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "llama_print_timings:        load time =   322.34 ms\n",
      "llama_print_timings:      sample time =   232.60 ms /    26 runs   (    8.95 ms per token,   111.78 tokens per second)\n",
      "llama_print_timings: prompt eval time =   321.90 ms /    11 tokens (   29.26 ms per token,    34.17 tokens per second)\n",
      "llama_print_timings:        eval time =   680.82 ms /    25 runs   (   27.23 ms per token,    36.72 tokens per second)\n",
      "llama_print_timings:       total time =  1295.27 ms\n"
     ]
    }
   ],
   "source": [
    "%%capture captured --no-stdout\n",
    "result = llm.invoke(\"List of top-3 my favourite books:\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.1"
  },
  "vscode": {
   "interpreter": {
    "hash": "d1d3a3c58a58885896c5459933a599607cdbb9917d7e1ad7516c8786c51f2dd2"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}