community[minor]: add chat model llamacpp (#22589)

- **PR title**: [community] add chat model llamacpp


- **PR message**:
- **Description:** This PR introduces a new chat model integration with
llama-cpp-python, designed to work similarly to the existing ChatOpenAI
model.
      + Works well with instruction-tuned chat, chains, and function/tool calling.
      + Works with LangGraph (persistent memory, tool calling); an update is coming soon.

- **Dependencies:** This change requires the llama-cpp-python library to
be installed.
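
A minimal usage sketch (the model path and parameters below are placeholders, not from this PR):

```python
from langchain_community.chat_models import ChatLlamaCpp

# Placeholder path to any llama.cpp-compatible GGUF model file
llm = ChatLlamaCpp(model_path="./path/to/model.gguf", temperature=0.5)

print(llm.invoke("I love programming.").content)
```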
    
@baskaryan

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Thanh Nguyen
2024-06-14 21:51:43 +07:00
committed by GitHub
parent e4279f80cd
commit b5e2ba3a47
5 changed files with 1417 additions and 0 deletions


@@ -0,0 +1,595 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# ChatLlamaCpp\n",
"\n",
"This notebook provides a quick overview for getting started with chat model intergrated with [llama cpp python](https://github.com/abetlen/llama-cpp-python)\n",
"\n",
"An example below demonstrating how to implement with the open-source Llama3 Instruct 8B"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Overview\n",
"\n",
"### Integration details\n",
"| Class | Package | Local | Serializable | JS support |\n",
"| :--- | :--- | :---: | :---: | :---: |\n",
"| [ChatLlamaCpp](https://api.python.langchain.com/en/latest/chat_models/langchain_community.chat_models.llamacpp.ChatLlamaCpp.html) | [langchain-community](https://api.python.langchain.com/en/latest/community_api_reference.html) | ✅ | ❌ | ❌ |\n",
"\n",
"### Model features\n",
"| [Tool calling](/docs/how_to/tool_calling/) | [Structured output](/docs/how_to/structured_output/) | JSON mode | Image input | Audio input | Video input | [Token-level streaming](/docs/how_to/chat_streaming/) | Native async | [Token usage](/docs/how_to/chat_token_usage_tracking/) | [Logprobs](/docs/how_to/logprobs/) |\n",
"| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |\n",
"| ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ | ✅ | \n",
"\n",
"## Setup\n",
"\n",
"### Installation\n",
"\n",
"The LangChain OpenAI integration lives in the `langchain-community` and `llama-cpp-python` packages:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%pip install -qU langchain-community llama-cpp-python"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Instantiation\n",
"\n",
"Now we can instantiate our model object and generate chat completions:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /home/tni5hc/Documents/langchain_llamacpp/SanctumAI-meta-llama-3-8b-instruct.Q8_0.gguf (version GGUF V3 (latest))\n",
"llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.\n",
"llama_model_loader: - kv 0: general.architecture str = llama\n",
"llama_model_loader: - kv 1: general.name str = Meta-Llama-3-8B-Instruct\n",
"llama_model_loader: - kv 2: llama.block_count u32 = 32\n",
"llama_model_loader: - kv 3: llama.context_length u32 = 8192\n",
"llama_model_loader: - kv 4: llama.embedding_length u32 = 4096\n",
"llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336\n",
"llama_model_loader: - kv 6: llama.attention.head_count u32 = 32\n",
"llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 8\n",
"llama_model_loader: - kv 8: llama.rope.freq_base f32 = 500000.000000\n",
"llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010\n",
"llama_model_loader: - kv 10: general.file_type u32 = 7\n",
"llama_model_loader: - kv 11: llama.vocab_size u32 = 128256\n",
"llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 128\n",
"llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2\n",
"llama_model_loader: - kv 14: tokenizer.ggml.pre str = llama-bpe\n",
"llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,128256] = [\"!\", \"\\\"\", \"#\", \"$\", \"%\", \"&\", \"'\", ...\n",
"llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...\n",
"llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = [\"Ġ Ġ\", \"Ġ ĠĠĠ\", \"ĠĠ ĠĠ\", \"...\n",
"llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000\n",
"llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128009\n",
"llama_model_loader: - kv 20: tokenizer.chat_template str = {% set loop_messages = messages %}{% ...\n",
"llama_model_loader: - kv 21: general.quantization_version u32 = 2\n",
"llama_model_loader: - type f32: 65 tensors\n",
"llama_model_loader: - type q8_0: 226 tensors\n",
"llm_load_vocab: special tokens definition check successful ( 256/128256 ).\n",
"llm_load_print_meta: format = GGUF V3 (latest)\n",
"llm_load_print_meta: arch = llama\n",
"llm_load_print_meta: vocab type = BPE\n",
"llm_load_print_meta: n_vocab = 128256\n",
"llm_load_print_meta: n_merges = 280147\n",
"llm_load_print_meta: n_ctx_train = 8192\n",
"llm_load_print_meta: n_embd = 4096\n",
"llm_load_print_meta: n_head = 32\n",
"llm_load_print_meta: n_head_kv = 8\n",
"llm_load_print_meta: n_layer = 32\n",
"llm_load_print_meta: n_rot = 128\n",
"llm_load_print_meta: n_embd_head_k = 128\n",
"llm_load_print_meta: n_embd_head_v = 128\n",
"llm_load_print_meta: n_gqa = 4\n",
"llm_load_print_meta: n_embd_k_gqa = 1024\n",
"llm_load_print_meta: n_embd_v_gqa = 1024\n",
"llm_load_print_meta: f_norm_eps = 0.0e+00\n",
"llm_load_print_meta: f_norm_rms_eps = 1.0e-05\n",
"llm_load_print_meta: f_clamp_kqv = 0.0e+00\n",
"llm_load_print_meta: f_max_alibi_bias = 0.0e+00\n",
"llm_load_print_meta: f_logit_scale = 0.0e+00\n",
"llm_load_print_meta: n_ff = 14336\n",
"llm_load_print_meta: n_expert = 0\n",
"llm_load_print_meta: n_expert_used = 0\n",
"llm_load_print_meta: causal attn = 1\n",
"llm_load_print_meta: pooling type = 0\n",
"llm_load_print_meta: rope type = 0\n",
"llm_load_print_meta: rope scaling = linear\n",
"llm_load_print_meta: freq_base_train = 500000.0\n",
"llm_load_print_meta: freq_scale_train = 1\n",
"llm_load_print_meta: n_yarn_orig_ctx = 8192\n",
"llm_load_print_meta: rope_finetuned = unknown\n",
"llm_load_print_meta: ssm_d_conv = 0\n",
"llm_load_print_meta: ssm_d_inner = 0\n",
"llm_load_print_meta: ssm_d_state = 0\n",
"llm_load_print_meta: ssm_dt_rank = 0\n",
"llm_load_print_meta: model type = 7B\n",
"llm_load_print_meta: model ftype = Q8_0\n",
"llm_load_print_meta: model params = 8.03 B\n",
"llm_load_print_meta: model size = 7.95 GiB (8.50 BPW) \n",
"llm_load_print_meta: general.name = Meta-Llama-3-8B-Instruct\n",
"llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'\n",
"llm_load_print_meta: EOS token = 128009 '<|eot_id|>'\n",
"llm_load_print_meta: LF token = 128 'Ä'\n",
"ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no\n",
"ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes\n",
"ggml_cuda_init: found 1 CUDA devices:\n",
" Device 0: NVIDIA RTX A2000 12GB, compute capability 8.6, VMM: yes\n",
"llm_load_tensors: ggml ctx size = 0.22 MiB\n",
"llm_load_tensors: offloading 8 repeating layers to GPU\n",
"llm_load_tensors: offloaded 8/33 layers to GPU\n",
"llm_load_tensors: CPU buffer size = 8137.64 MiB\n",
"llm_load_tensors: CUDA0 buffer size = 1768.25 MiB\n",
".........................................................................................\n",
"llama_new_context_with_model: n_ctx = 10016\n",
"llama_new_context_with_model: n_batch = 300\n",
"llama_new_context_with_model: n_ubatch = 300\n",
"llama_new_context_with_model: freq_base = 10000.0\n",
"llama_new_context_with_model: freq_scale = 1\n",
"llama_kv_cache_init: CUDA_Host KV buffer size = 939.00 MiB\n",
"llama_kv_cache_init: CUDA0 KV buffer size = 313.00 MiB\n",
"llama_new_context_with_model: KV self size = 1252.00 MiB, K (f16): 626.00 MiB, V (f16): 626.00 MiB\n",
"llama_new_context_with_model: CUDA_Host output buffer size = 0.49 MiB\n",
"llama_new_context_with_model: CUDA0 compute buffer size = 683.78 MiB\n",
"llama_new_context_with_model: CUDA_Host compute buffer size = 16.15 MiB\n",
"llama_new_context_with_model: graph nodes = 1030\n",
"llama_new_context_with_model: graph splits = 268\n",
"AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | \n",
"Model metadata: {'tokenizer.chat_template': \"{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\\n\\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\\n\\n' }}{% endif %}\", 'tokenizer.ggml.eos_token_id': '128009', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'gpt2', 'general.architecture': 'llama', 'llama.rope.freq_base': '500000.000000', 'tokenizer.ggml.pre': 'llama-bpe', 'llama.context_length': '8192', 'general.name': 'Meta-Llama-3-8B-Instruct', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '14336', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'tokenizer.ggml.bos_token_id': '128000', 'llama.attention.head_count': '32', 'llama.block_count': '32', 'llama.attention.head_count_kv': '8', 'general.file_type': '7', 'llama.vocab_size': '128256', 'llama.rope.dimension_count': '128'}\n",
"Using gguf chat template: {% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n",
"\n",
"'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n",
"\n",
"' }}{% endif %}\n",
"Using chat eos_token: <|eot_id|>\n",
"Using chat bos_token: <|begin_of_text|>\n"
]
}
],
"source": [
"import multiprocessing\n",
"\n",
"from langchain_community.chat_models import ChatLlamaCpp\n",
"\n",
"llm = ChatLlamaCpp(\n",
" temperature=0.5,\n",
" model_path=\"./SanctumAI-meta-llama-3-8b-instruct.Q8_0.gguf\",\n",
" n_ctx=10000,\n",
" n_gpu_layers=8,\n",
" n_batch=300, # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.\n",
" max_tokens=512,\n",
" n_threads=multiprocessing.cpu_count() - 1,\n",
" repeat_penalty=1.5,\n",
" top_p=0.5,\n",
" verbose=True,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Invocation"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n",
"llama_print_timings: load time = 1077.71 ms\n",
"llama_print_timings: sample time = 21.82 ms / 39 runs ( 0.56 ms per token, 1787.35 tokens per second)\n",
"llama_print_timings: prompt eval time = 1077.65 ms / 37 tokens ( 29.13 ms per token, 34.33 tokens per second)\n",
"llama_print_timings: eval time = 8403.75 ms / 38 runs ( 221.15 ms per token, 4.52 tokens per second)\n",
"llama_print_timings: total time = 9689.66 ms / 75 tokens\n"
]
},
{
"data": {
"text/plain": [
"AIMessage(content='Je adore le programmation.\\n\\n(Note: \"programmation\" is used in both formal and informal contexts, but it\\'s generally accepted as equivalent of saying you like computer science or coding.)', response_metadata={'finish_reason': 'stop'}, id='run-e9e03b94-f29f-4c1d-8483-e23a46acb556-0')"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"messages = [\n",
" (\n",
" \"system\",\n",
" \"You are a helpful assistant that translates English to French. Translate the user sentence.\",\n",
" ),\n",
" (\"human\", \"I love programming.\"),\n",
"]\n",
"\n",
"ai_msg = llm.invoke(messages)\n",
"ai_msg"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Je adore le programmation.\n",
"\n",
"(Note: \"programmation\" is used in both formal and informal contexts, but it's generally accepted as equivalent of saying you like computer science or coding.)\n"
]
}
],
"source": [
"print(ai_msg.content)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Chaining\n",
"\n",
"We can [chain](/docs/how_to/sequence/) our model with a prompt template like so:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Llama.generate: prefix-match hit\n",
"\n",
"llama_print_timings: load time = 1077.71 ms\n",
"llama_print_timings: sample time = 29.23 ms / 52 runs ( 0.56 ms per token, 1778.75 tokens per second)\n",
"llama_print_timings: prompt eval time = 869.38 ms / 17 tokens ( 51.14 ms per token, 19.55 tokens per second)\n",
"llama_print_timings: eval time = 6694.18 ms / 51 runs ( 131.26 ms per token, 7.62 tokens per second)\n",
"llama_print_timings: total time = 7830.86 ms / 68 tokens\n"
]
},
{
"data": {
"text/plain": [
"AIMessage(content='Ich liebe auch Programmieren! (Translation: I also like coding!) Do you have any favorite languages or projects? Ich bin hier, um dir zu helfen und über deine Lieblingsprogrammierthemen sprechen können wir gerne weiter machen... !)', response_metadata={'finish_reason': 'stop'}, id='run-922c4cad-368f-41ba-9db9-eacb41d37cb2-0')"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain_core.prompts import ChatPromptTemplate\n",
"\n",
"prompt = ChatPromptTemplate.from_messages(\n",
" [\n",
" (\n",
" \"system\",\n",
" \"You are a helpful assistant that translates {input_language} to {output_language}.\",\n",
" ),\n",
" (\"human\", \"{input}\"),\n",
" ]\n",
")\n",
"\n",
"chain = prompt | llm\n",
"chain.invoke(\n",
" {\n",
" \"input_language\": \"English\",\n",
" \"output_language\": \"German\",\n",
" \"input\": \"I love programming.\",\n",
" }\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Tool calling\n",
"\n",
"Firstly, it works mostly the same as OpenAI Function Calling\n",
"\n",
"OpenAI has a [tool calling](https://platform.openai.com/docs/guides/function-calling) (we use \"tool calling\" and \"function calling\" interchangeably here) API that lets you describe tools and their arguments, and have the model return a JSON object with a tool to invoke and the inputs to that tool. tool-calling is extremely useful for building tool-using chains and agents, and for getting structured outputs from models more generally.\n",
"\n",
"With `ChatLlamaCpp.bind_tools`, we can easily pass in Pydantic classes, dict schemas, LangChain tools, or even functions as tools to the model. Under the hood these are converted to an OpenAI tool schemas, which looks like:\n",
"```\n",
"{\n",
" \"name\": \"...\",\n",
" \"description\": \"...\",\n",
" \"parameters\": {...} # JSONSchema\n",
"}\n",
"```\n",
"and passed in every model invocation.\n",
"\n",
"\n",
"However, it cannot automatically trigger a function/tool, we need to force it by specifying the 'tool choice' parameter. This parameter is typically formatted as described below.\n",
"\n",
"```{\"type\": \"function\", \"function\": {\"name\": <<tool_name>>}}.```"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"from langchain.tools import tool\n",
"from langchain_core.pydantic_v1 import BaseModel, Field\n",
"\n",
"\n",
"class WeatherInput(BaseModel):\n",
" location: str = Field(description=\"The city and state, e.g. San Francisco, CA\")\n",
" unit: str = Field(enum=[\"celsius\", \"fahrenheit\"])\n",
"\n",
"\n",
"@tool(\"get_current_weather\", args_schema=WeatherInput)\n",
"def get_weather(location: str, unit: str):\n",
" \"\"\"Get the current weather in a given location\"\"\"\n",
" return f\"Now the weather in {location} is 22 {unit}\"\n",
"\n",
"\n",
"llm_with_tools = llm.bind_tools(\n",
" tools=[get_weather],\n",
" tool_choice={\"type\": \"function\", \"function\": {\"name\": \"get_current_weather\"}},\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Llama.generate: prefix-match hit\n",
"\n",
"llama_print_timings: load time = 1077.71 ms\n",
"llama_print_timings: sample time = 853.67 ms / 20 runs ( 42.68 ms per token, 23.43 tokens per second)\n",
"llama_print_timings: prompt eval time = 1060.96 ms / 21 tokens ( 50.52 ms per token, 19.79 tokens per second)\n",
"llama_print_timings: eval time = 2754.74 ms / 19 runs ( 144.99 ms per token, 6.90 tokens per second)\n",
"llama_print_timings: total time = 4817.07 ms / 40 tokens\n"
]
},
{
"data": {
"text/plain": [
"AIMessage(content='', additional_kwargs={'function_call': {'name': 'get_current_weather', 'arguments': '{ \"location\": \"Ho Chi Minh City\", \"unit\" : \"celsius\"}'}, 'tool_calls': [{'id': 'call__0_get_current_weather_cmpl-3e329fde-4fa6-41b9-837c-131fa9494554', 'type': 'function', 'function': {'name': 'get_current_weather', 'arguments': '{ \"location\": \"Ho Chi Minh City\", \"unit\" : \"celsius\"}'}}]}, response_metadata={'token_usage': {'prompt_tokens': 23, 'completion_tokens': 19, 'total_tokens': 42}, 'finish_reason': 'tool_calls', 'logprobs': None}, id='run-9d35869c-36fe-4f4a-835e-089a3f3aba3c-0', tool_calls=[{'name': 'get_current_weather', 'args': {'location': 'Ho Chi Minh City', 'unit': 'celsius'}, 'id': 'call__0_get_current_weather_cmpl-3e329fde-4fa6-41b9-837c-131fa9494554'}])"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ai_msg = llm_with_tools.invoke(\n",
" \"what is the weather like in HCMC in celsius\",\n",
")\n",
"ai_msg"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[{'name': 'get_current_weather',\n",
" 'args': {'location': 'Ho Chi Minh City', 'unit': 'celsius'},\n",
" 'id': 'call__0_get_current_weather_cmpl-3e329fde-4fa6-41b9-837c-131fa9494554'}]"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ai_msg.tool_calls"
]
},
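{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a possible next step (a sketch, not part of the original example), the returned tool call can be executed locally and its output passed back to the model as a `ToolMessage`. The unbound `llm` is used for the final turn so the forced `tool_choice` does not trigger another tool call:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain_core.messages import HumanMessage, ToolMessage\n",
"\n",
"tool_call = ai_msg.tool_calls[0]\n",
"# Run the tool locally with the arguments the model produced\n",
"tool_output = get_weather.invoke(tool_call[\"args\"])\n",
"\n",
"final_response = llm.invoke(\n",
"    [\n",
"        HumanMessage(\"what is the weather like in HCMC in celsius\"),\n",
"        ai_msg,\n",
"        ToolMessage(tool_output, tool_call_id=tool_call[\"id\"]),\n",
"    ]\n",
")\n",
"final_response.content"
]
},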
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Structured output"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Llama.generate: prefix-match hit\n",
"\n",
"llama_print_timings: load time = 1077.71 ms\n",
"llama_print_timings: sample time = 1964.76 ms / 44 runs ( 44.65 ms per token, 22.39 tokens per second)\n",
"llama_print_timings: prompt eval time = 914.34 ms / 18 tokens ( 50.80 ms per token, 19.69 tokens per second)\n",
"llama_print_timings: eval time = 7903.81 ms / 43 runs ( 183.81 ms per token, 5.44 tokens per second)\n",
"llama_print_timings: total time = 11065.60 ms / 61 tokens\n"
]
}
],
"source": [
"from langchain_core.pydantic_v1 import BaseModel\n",
"from langchain_core.utils.function_calling import convert_to_openai_tool\n",
"\n",
"\n",
"class AnswerWithJustification(BaseModel):\n",
" \"\"\"An answer to the user question along with justification for the answer.\"\"\"\n",
"\n",
" answer: str\n",
" justification: str\n",
"\n",
"\n",
"dict_schema = convert_to_openai_tool(AnswerWithJustification)\n",
"\n",
"structured_llm = llm.with_structured_output(dict_schema)\n",
"\n",
"result = structured_llm.invoke(\n",
" \"What weighs more a pound of bricks or a pound of feathers ?\"\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'answer': \"a pound is always the same weight, regardless of what it's made up off. So both options are equal in terms of their mass.\", 'justification': ''}\n"
]
}
],
"source": [
"print(result)"
]
},
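{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch (assuming the Pydantic class is accepted directly, as with other LangChain chat models), `with_structured_output` can also be given `AnswerWithJustification` itself, in which case the result is a parsed object rather than a dict:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: pass the Pydantic class directly instead of a converted dict schema\n",
"pydantic_llm = llm.with_structured_output(AnswerWithJustification)\n",
"\n",
"obj = pydantic_llm.invoke(\n",
"    \"What weighs more a pound of bricks or a pound of feathers ?\"\n",
")\n",
"print(obj.answer)"
]
},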
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Streaming\n"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Llama.generate: prefix-match hit\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"The\n",
" answer\n",
" to\n",
" the\n",
" multiplication\n",
" problem\n",
" \"\n",
"What\n",
"'s\n",
" \n",
"25\n",
" x\n",
" \n",
"5\n",
"?\"\n",
" would\n",
" be\n",
":\n",
"\n",
"\n",
"125\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n",
"llama_print_timings: load time = 1077.71 ms\n",
"llama_print_timings: sample time = 10.60 ms / 20 runs ( 0.53 ms per token, 1886.26 tokens per second)\n",
"llama_print_timings: prompt eval time = 3661.75 ms / 12 tokens ( 305.15 ms per token, 3.28 tokens per second)\n",
"llama_print_timings: eval time = 2468.01 ms / 19 runs ( 129.90 ms per token, 7.70 tokens per second)\n",
"llama_print_timings: total time = 3133.11 ms / 31 tokens\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n"
]
}
],
"source": [
"for chunk in llm.stream(\"what is 25x5\"):\n",
" print(chunk.content, end=\"\\n\", flush=True)"
]
},
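{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch of a common pattern, message chunks support the `+` operator, so the stream can be accumulated into a single message:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: accumulate streamed chunks into one message via the `+` operator\n",
"full = None\n",
"for chunk in llm.stream(\"what is 25x5\"):\n",
"    full = chunk if full is None else full + chunk\n",
"print(full.content)"
]
},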
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## API reference\n",
"\n",
"For detailed documentation of all ChatLlamaCpp features and configurations head to the API reference: https://api.python.langchain.com/en/latest/chat_models/langchain_community.chat_models.llamacpp.ChatLlamaCpp.html"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
}
},
"nbformat": 4,
"nbformat_minor": 4
}


@@ -112,6 +112,13 @@ CHAT_MODEL_FEAT_TABLE = {
"package": "langchain-community",
"link": "/docs/integrations/chat/edenai/",
},
"ChatLlamaCpp": {
"tool_calling": True,
"structured_output": True,
"local": True,
"package": "langchain-community",
"link": "/docs/integrations/chat/llamacpp",
},
}