Adds ChatOllama (#9628)

@rlancemartin --------- Co-authored-by: Adilkhan Sarsen <54854336+adolkhan@users.noreply.github.com> Co-authored-by: Kim Minjong <make.dirty.code@gmail.com> Co-authored-by: Harrison Chase <hw.chase.17@gmail.com> Co-authored-by: Lance Martin <lance@langchain.dev> Co-authored-by: Bagatur <baskaryan@gmail.com>
2025-08-11 13:55:03 +00:00 · 2023-08-23 13:02:26 -07:00 · 2023-08-23 13:02:26 -07:00 · 278ef0bdcf
commit 278ef0bdcf
parent fa05e18278
6 changed files with 550 additions and 16 deletions
--- a/docs/extras/integrations/chat/ollama.ipynb
+++ b/docs/extras/integrations/chat/ollama.ipynb
@ -0,0 +1,382 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Ollama\n",
+    "\n",
+    "[Ollama](https://ollama.ai/) allows you to run open-source large language models, such as LLaMA2, locally.\n",
+    "\n",
+    "Ollama bundles model weights, configuration, and data into a single package, defined by a Modelfile. \n",
+    "\n",
+    "It optimizes setup and configuration details, including GPU usage.\n",
+    "\n",
+    "For a complete list of supported models and model variants, see the [Ollama model library](https://ollama.ai/library).\n",
+    "\n",
+    "## Setup\n",
+    "\n",
+    "First, follow [these instructions](https://github.com/jmorganca/ollama) to set up and run a local Ollama instance:\n",
+    "\n",
+    "* [Download](https://ollama.ai/download)\n",
+    "* Fetch a model via `ollama pull <model family>`\n",
+    "* e.g., for `Llama-7b`: `ollama pull llama2`\n",
+    "* This will download the most basic version of the model (e.g., minimum # parameters and 4-bit quantization)\n",
+    "* On Mac, it will download to:\n",
+    "\n",
+    "`~/.ollama/models/manifests/registry.ollama.ai/library/<model family>/latest`\n",
+    "\n",
+    "* And we can specify a particular version, e.g., for `ollama pull vicuna:13b-v1.5-16k-q4_0`\n",
+    "* The file is here with the model version in place of `latest`\n",
+    "\n",
+    "`~/.ollama/models/manifests/registry.ollama.ai/library/vicuna/13b-v1.5-16k-q4_0`\n",
+    "\n",
+    "You can easily access models in a few ways:\n",
+    "\n",
+    "1/ if the app is running:\n",
+    "* All of your local models are automatically served on `localhost:11434`\n",
+    "* Select your model when setting `llm = Ollama(..., model=\"<model family>:<version>\")`\n",
+    "* If you set `llm = Ollama(..., model=\"<model family\")` withoout a version it will simply look for `latest`\n",
+    "\n",
+    "2/ if building from source or just running the binary: \n",
+    "* Then you must run `ollama serve`\n",
+    "* All of your local models are automatically served on `localhost:11434`\n",
+    "* Then, select as shown above\n",
+    "\n",
+    "\n",
+    "## Usage\n",
+    "\n",
+    "You can see a full list of supported parameters on the [API reference page](https://api.python.langchain.com/en/latest/llms/langchain.llms.ollama.Ollama.html).\n",
+    "\n",
+    "If you are using a LLaMA `chat` model (e.g., `ollama pull llama2:7b-chat`) then you can use the `ChatOllama` interface.\n",
+    "\n",
+    "This includes [special tokens](https://huggingface.co/blog/llama2#how-to-prompt-llama-2) for system message and user input."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain.chat_models import ChatOllama\n",
+    "from langchain.callbacks.manager import CallbackManager\n",
+    "from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler                                  \n",
+    "chat_model = ChatOllama(model=\"llama2:7b-chat\", \n",
+    "                        callback_manager = CallbackManager([StreamingStdOutCallbackHandler()]))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "With `StreamingStdOutCallbackHandler`, you will see tokens streamed."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      " Artificial intelligence (AI) has a rich and varied history that spans several decades. Hinweis: The following is a brief overview of the major milestones in the history of AI, but it is by no means exhaustive.\n",
+      "\n",
+      "1. Early Beginnings (1950s-1960s): The term \"Artificial Intelligence\" was coined in 1956 by computer scientist John McCarthy. However, the concept of creating machines that can think and learn like humans dates back to ancient times. In the 1950s and 1960s, researchers began exploring the possibilities of AI using simple algorithms and machine learning techniques.\n",
+      "2. Rule-Based Systems (1970s-1980s): In the 1970s and 1980s, AI research focused on developing rule-based systems, which use predefined rules to reason and make decisions. This led to the development of expert systems, which were designed to mimic the decision-making abilities of human experts in specific domains.\n",
+      "3. Machine Learning (1980s-1990s): The 1980s saw a shift towards machine learning, which enables machines to learn from data without being explicitly programmed. This led to the development of algorithms such as decision trees, neural networks, and support vector machines.\n",
+      "4. Deep Learning (2000s-present): In the early 2000s, deep learning emerged as a subfield of machine learning, focusing on neural networks with multiple layers. These networks can learn complex representations of data, leading to breakthroughs in image and speech recognition, natural language processing, and other areas.\n",
+      "5. Natural Language Processing (NLP) (1980s-present): NLP has been an active area of research since the 1980s, with a focus on developing algorithms that can understand and generate human language. This has led to applications such as chatbots, voice assistants, and language translation systems.\n",
+      "6. Robotics (1970s-present): The development of robotics has been closely tied to AI research, with a focus on creating machines that can perform tasks that typically require human intelligence, such as manipulation and locomotion.\n",
+      "7. Computer Vision (1980s-present): Computer vision has been an active area of research since the 1980s, with a focus on enabling machines to interpret and understand visual data from the world around us. This has led to applications such as image recognition, object detection, and autonomous driving.\n",
+      "8. Ethics and Society (1990s-present): As AI technology has become more advanced and integrated into various aspects of society, there has been a growing concern about the ethical implications of AI. This includes issues related to privacy, bias, and job displacement.\n",
+      "9. Reinforcement Learning (2000s-present): Reinforcement learning is a subfield of machine learning that involves training machines to make decisions based on feedback from their environment. This has led to breakthroughs in areas such as game playing, robotics, and autonomous driving.\n",
+      "10. Generative Models (2010s-present): Generative models are a class of AI algorithms that can generate new data that is similar to a given dataset. This has led to applications such as image synthesis, music generation, and language creation.\n",
+      "\n",
+      "These are just a few of the many developments in the history of AI. As the field continues to evolve, we can expect even more exciting breakthroughs and innovations in the years to come."
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "AIMessage(content=' Artificial intelligence (AI) has a rich and varied history that spans several decades. Hinweis: The following is a brief overview of the major milestones in the history of AI, but it is by no means exhaustive.\\n\\n1. Early Beginnings (1950s-1960s): The term \"Artificial Intelligence\" was coined in 1956 by computer scientist John McCarthy. However, the concept of creating machines that can think and learn like humans dates back to ancient times. In the 1950s and 1960s, researchers began exploring the possibilities of AI using simple algorithms and machine learning techniques.\\n2. Rule-Based Systems (1970s-1980s): In the 1970s and 1980s, AI research focused on developing rule-based systems, which use predefined rules to reason and make decisions. This led to the development of expert systems, which were designed to mimic the decision-making abilities of human experts in specific domains.\\n3. Machine Learning (1980s-1990s): The 1980s saw a shift towards machine learning, which enables machines to learn from data without being explicitly programmed. This led to the development of algorithms such as decision trees, neural networks, and support vector machines.\\n4. Deep Learning (2000s-present): In the early 2000s, deep learning emerged as a subfield of machine learning, focusing on neural networks with multiple layers. These networks can learn complex representations of data, leading to breakthroughs in image and speech recognition, natural language processing, and other areas.\\n5. Natural Language Processing (NLP) (1980s-present): NLP has been an active area of research since the 1980s, with a focus on developing algorithms that can understand and generate human language. This has led to applications such as chatbots, voice assistants, and language translation systems.\\n6. Robotics (1970s-present): The development of robotics has been closely tied to AI research, with a focus on creating machines that can perform tasks that typically require human intelligence, such as manipulation and locomotion.\\n7. Computer Vision (1980s-present): Computer vision has been an active area of research since the 1980s, with a focus on enabling machines to interpret and understand visual data from the world around us. This has led to applications such as image recognition, object detection, and autonomous driving.\\n8. Ethics and Society (1990s-present): As AI technology has become more advanced and integrated into various aspects of society, there has been a growing concern about the ethical implications of AI. This includes issues related to privacy, bias, and job displacement.\\n9. Reinforcement Learning (2000s-present): Reinforcement learning is a subfield of machine learning that involves training machines to make decisions based on feedback from their environment. This has led to breakthroughs in areas such as game playing, robotics, and autonomous driving.\\n10. Generative Models (2010s-present): Generative models are a class of AI algorithms that can generate new data that is similar to a given dataset. This has led to applications such as image synthesis, music generation, and language creation.\\n\\nThese are just a few of the many developments in the history of AI. As the field continues to evolve, we can expect even more exciting breakthroughs and innovations in the years to come.', additional_kwargs={}, example=False)"
+      ]
+     },
+     "execution_count": 3,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from langchain.schema import HumanMessage\n",
+    "\n",
+    "messages = [\n",
+    "    HumanMessage(content=\"Tell me about the history of AI\")\n",
+    "]\n",
+    "chat_model(messages)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## RAG\n",
+    "\n",
+    "We can use Olama with RAG, [just as shown here](https://python.langchain.com/docs/use_cases/question_answering/how_to/local_retrieval_qa).\n",
+    "\n",
+    "Let's use the 13b model:\n",
+    "\n",
+    "```\n",
+    "ollama pull llama2:13b\n",
+    "```\n",
+    "\n",
+    "Or, the 13b-chat model:\n",
+    "\n",
+    "```\n",
+    "ollama pull llama2:13b-chat\n",
+    "```\n",
+    "\n",
+    "Let's also use local embeddings from `GPT4AllEmbeddings` and `Chroma`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "! pip install gpt4all chromadb"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain.document_loaders import WebBaseLoader\n",
+    "loader = WebBaseLoader(\"https://lilianweng.github.io/posts/2023-06-23-agent/\")\n",
+    "data = loader.load()\n",
+    "\n",
+    "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
+    "text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)\n",
+    "all_splits = text_splitter.split_documents(data)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Found model file at  /Users/rlm/.cache/gpt4all/ggml-all-MiniLM-L6-v2-f16.bin\n"
+     ]
+    }
+   ],
+   "source": [
+    "from langchain.vectorstores import Chroma\n",
+    "from langchain.embeddings import GPT4AllEmbeddings\n",
+    "\n",
+    "vectorstore = Chroma.from_documents(documents=all_splits, embedding=GPT4AllEmbeddings())"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "4"
+      ]
+     },
+     "execution_count": 7,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "question = \"What are the approaches to Task Decomposition?\"\n",
+    "docs = vectorstore.similarity_search(question)\n",
+    "len(docs)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain import PromptTemplate\n",
+    "\n",
+    "# Prompt\n",
+    "template = \"\"\"[INST] <<SYS>> Use the following pieces of context to answer the question at the end. \n",
+    "If you don't know the answer, just say that you don't know, don't try to make up an answer. \n",
+    "Use three sentences maximum and keep the answer as concise as possible. <</SYS>>\n",
+    "{context}\n",
+    "Question: {question}\n",
+    "Helpful Answer:[/INST]\"\"\"\n",
+    "QA_CHAIN_PROMPT = PromptTemplate(\n",
+    "    input_variables=[\"context\", \"question\"],\n",
+    "    template=template,\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Chat model\n",
+    "from langchain.chat_models import ChatOllama\n",
+    "from langchain.callbacks.manager import CallbackManager\n",
+    "from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler\n",
+    "chat_model = ChatOllama(model=\"llama2:13b-chat\",\n",
+    "                        verbose=True,\n",
+    "                        callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# QA chain\n",
+    "from langchain.chains import RetrievalQA\n",
+    "qa_chain = RetrievalQA.from_chain_type(\n",
+    "    chat_model,\n",
+    "    retriever=vectorstore.as_retriever(),\n",
+    "    chain_type_kwargs={\"prompt\": QA_CHAIN_PROMPT},\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      " Based on the provided context, there are three approaches to task decomposition for AI agents:\n",
+      "\n",
+      "1. LLM with simple prompting, such as \"Steps for XYZ.\" or \"What are the subgoals for achieving XYZ?\"\n",
+      "2. Task-specific instructions, such as \"Write a story outline\" for writing a novel.\n",
+      "3. Human inputs."
+     ]
+    }
+   ],
+   "source": [
+    "question = \"What are the various approaches to Task Decomposition for AI Agents?\"\n",
+    "result = qa_chain({\"query\": question})"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "You can also get logging for tokens."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 16,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      " Based on the given context, here is the answer to the question \"What are the approaches to Task Decomposition?\"\n",
+      "\n",
+      "There are three approaches to task decomposition:\n",
+      "\n",
+      "1. LLM with simple prompting, such as \"Steps for XYZ.\" or \"What are the subgoals for achieving XYZ?\"\n",
+      "2. Using task-specific instructions, like \"Write a story outline\" for writing a novel.\n",
+      "3. With human inputs.{'model': 'llama2:13b-chat', 'created_at': '2023-08-23T15:37:51.469127Z', 'done': True, 'context': [1, 29871, 1, 29961, 25580, 29962, 518, 25580, 29962, 518, 25580, 29962, 3532, 14816, 29903, 6778, 4803, 278, 1494, 12785, 310, 3030, 304, 1234, 278, 1139, 472, 278, 1095, 29889, 29871, 13, 3644, 366, 1016, 29915, 29873, 1073, 278, 1234, 29892, 925, 1827, 393, 366, 1016, 29915, 29873, 1073, 29892, 1016, 29915, 29873, 1018, 304, 1207, 701, 385, 1234, 29889, 29871, 13, 11403, 2211, 25260, 7472, 322, 3013, 278, 1234, 408, 3022, 895, 408, 1950, 29889, 529, 829, 14816, 29903, 6778, 13, 5398, 26227, 508, 367, 2309, 313, 29896, 29897, 491, 365, 26369, 411, 2560, 9508, 292, 763, 376, 7789, 567, 363, 1060, 29979, 29999, 7790, 29876, 29896, 19602, 376, 5618, 526, 278, 1014, 1484, 1338, 363, 3657, 15387, 1060, 29979, 29999, 29973, 613, 313, 29906, 29897, 491, 773, 3414, 29899, 14940, 11994, 29936, 321, 29889, 29887, 29889, 376, 6113, 263, 5828, 27887, 1213, 363, 5007, 263, 9554, 29892, 470, 313, 29941, 29897, 411, 5199, 10970, 29889, 13, 13, 5398, 26227, 508, 367, 2309, 313, 29896, 29897, 491, 365, 26369, 411, 2560, 9508, 292, 763, 376, 7789, 567, 363, 1060, 29979, 29999, 7790, 29876, 29896, 19602, 376, 5618, 526, 278, 1014, 1484, 1338, 363, 3657, 15387, 1060, 29979, 29999, 29973, 613, 313, 29906, 29897, 491, 773, 3414, 29899, 14940, 11994, 29936, 321, 29889, 29887, 29889, 376, 6113, 263, 5828, 27887, 1213, 363, 5007, 263, 9554, 29892, 470, 313, 29941, 29897, 411, 5199, 10970, 29889, 13, 13, 1451, 16047, 267, 297, 1472, 29899, 8489, 18987, 322, 3414, 26227, 29901, 1858, 9450, 975, 263, 3309, 29891, 4955, 322, 17583, 3902, 8253, 278, 1650, 2913, 3933, 18066, 292, 29889, 365, 26369, 29879, 21117, 304, 10365, 13900, 746, 20050, 411, 15668, 4436, 29892, 3907, 963, 3109, 16424, 9401, 304, 25618, 1058, 5110, 515, 14260, 322, 1059, 29889, 13, 13, 1451, 16047, 267, 297, 1472, 29899, 8489, 18987, 322, 3414, 26227, 29901, 1858, 9450, 975, 263, 3309, 29891, 4955, 322, 17583, 3902, 8253, 278, 1650, 2913, 3933, 18066, 292, 29889, 365, 26369, 29879, 21117, 304, 10365, 13900, 746, 20050, 411, 15668, 4436, 29892, 3907, 963, 3109, 16424, 9401, 304, 25618, 1058, 5110, 515, 14260, 322, 1059, 29889, 13, 16492, 29901, 1724, 526, 278, 13501, 304, 9330, 897, 510, 3283, 29973, 13, 29648, 1319, 673, 10834, 29914, 25580, 29962, 518, 29914, 25580, 29962, 518, 29914, 25580, 29962, 29871, 16564, 373, 278, 2183, 3030, 29892, 1244, 338, 278, 1234, 304, 278, 1139, 376, 5618, 526, 278, 13501, 304, 9330, 897, 510, 3283, 3026, 13, 13, 8439, 526, 2211, 13501, 304, 3414, 26227, 29901, 13, 13, 29896, 29889, 365, 26369, 411, 2560, 9508, 292, 29892, 1316, 408, 376, 7789, 567, 363, 1060, 29979, 29999, 1213, 470, 376, 5618, 526, 278, 1014, 1484, 1338, 363, 3657, 15387, 1060, 29979, 29999, 3026, 13, 29906, 29889, 5293, 3414, 29899, 14940, 11994, 29892, 763, 376, 6113, 263, 5828, 27887, 29908, 363, 5007, 263, 9554, 29889, 13, 29941, 29889, 2973, 5199, 10970, 29889, 2], 'total_duration': 9514823750, 'load_duration': 795542, 'sample_count': 99, 'sample_duration': 68732000, 'prompt_eval_count': 146, 'prompt_eval_duration': 6206275000, 'eval_count': 98, 'eval_duration': 3229641000}\n"
+     ]
+    }
+   ],
+   "source": [
+    "from langchain.schema import LLMResult\n",
+    "from langchain.callbacks.base import BaseCallbackHandler\n",
+    "\n",
+    "class GenerationStatisticsCallback(BaseCallbackHandler):\n",
+    "    def on_llm_end(self, response: LLMResult, **kwargs) -> None:\n",
+    "        print(response.generations[0][0].generation_info)\n",
+    "        \n",
+    "callback_manager = CallbackManager([StreamingStdOutCallbackHandler(), GenerationStatisticsCallback()])\n",
+    "\n",
+    "chat_model = ChatOllama(model=\"llama2:13b-chat\",\n",
+    "                        verbose=True,\n",
+    "                        callback_manager=callback_manager)\n",
+    "\n",
+    "qa_chain = RetrievalQA.from_chain_type(\n",
+    "    chat_model,\n",
+    "    retriever=vectorstore.as_retriever(),\n",
+    "    chain_type_kwargs={\"prompt\": QA_CHAIN_PROMPT},\n",
+    ")\n",
+    "\n",
+    "question = \"What are the approaches to Task Decomposition?\"\n",
+    "result = qa_chain({\"query\": question})"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "`eval_count` / (`eval_duration`/10e9)  gets `tok / s`"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 17,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "30.343929867127645"
+      ]
+     },
+     "execution_count": 17,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "98 / (3229641000/1000/1000/1000)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.16"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
--- a/libs/langchain/langchain/callbacks/streaming_stdout.py
+++ b/libs/langchain/langchain/callbacks/streaming_stdout.py
@ -4,6 +4,7 @@ from typing import Any, Dict, List, Union

 from langchain.callbacks.base import BaseCallbackHandler
 from langchain.schema import AgentAction, AgentFinish, LLMResult
+from langchain.schema.messages import BaseMessage


 class StreamingStdOutCallbackHandler(BaseCallbackHandler):
@ -14,6 +15,14 @@ class StreamingStdOutCallbackHandler(BaseCallbackHandler):
    ) -> None:
        """Run when LLM starts running."""

+    def on_chat_model_start(
+        self,
+        serialized: Dict[str, Any],
+        messages: List[List[BaseMessage]],
+        **kwargs: Any
+    ) -> None:
+        """Run when LLM starts running."""
+
    def on_llm_new_token(self, token: str, **kwargs: Any) -> None:
        """Run on new LLM token. Only available when streaming is enabled."""
        sys.stdout.write(token)
--- a/libs/langchain/langchain/chat_models/init.py
+++ b/libs/langchain/langchain/chat_models/init.py
@ -27,6 +27,7 @@ from langchain.chat_models.human import HumanInputChatModel
 from langchain.chat_models.jinachat import JinaChat
 from langchain.chat_models.litellm import ChatLiteLLM
 from langchain.chat_models.mlflow_ai_gateway import ChatMLflowAIGateway
+from langchain.chat_models.ollama import ChatOllama
 from langchain.chat_models.openai import ChatOpenAI
 from langchain.chat_models.promptlayer_openai import PromptLayerChatOpenAI
 from langchain.chat_models.vertexai import ChatVertexAI
@ -39,6 +40,7 @@ __all__ = [
    "ChatAnthropic",
    "ChatGooglePalm",
    "ChatMLflowAIGateway",
+    "ChatOllama",
    "ChatVertexAI",
    "JinaChat",
    "HumanInputChatModel",
--- a/libs/langchain/langchain/chat_models/anthropic.py
+++ b/libs/langchain/langchain/chat_models/anthropic.py
@ -32,7 +32,7 @@ class ChatAnthropic(BaseChatModel, _AnthropicCommon):
        .. code-block:: python

            import anthropic
-            from langchain.llms import Anthropic
+            from langchain.chat_models import ChatAnthropic
            model = ChatAnthropic(model="<model_name>", anthropic_api_key="my-api-key")
    """

--- a/libs/langchain/langchain/chat_models/ollama.py
+++ b/libs/langchain/langchain/chat_models/ollama.py
@ -0,0 +1,122 @@
+import json
+from typing import Any, Iterator, List, Optional
+
+from langchain.callbacks.manager import (
+    CallbackManagerForLLMRun,
+)
+from langchain.chat_models.base import BaseChatModel
+from langchain.llms.ollama import _OllamaCommon
+from langchain.schema import ChatResult
+from langchain.schema.messages import (
+    AIMessage,
+    AIMessageChunk,
+    BaseMessage,
+    ChatMessage,
+    HumanMessage,
+    SystemMessage,
+)
+from langchain.schema.output import ChatGeneration, ChatGenerationChunk
+
+
+def _stream_response_to_chat_generation_chunk(
+    stream_response: str,
+) -> ChatGenerationChunk:
+    """Convert a stream response to a generation chunk."""
+    parsed_response = json.loads(stream_response)
+    generation_info = parsed_response if parsed_response.get("done") is True else None
+    return ChatGenerationChunk(
+        message=AIMessageChunk(content=parsed_response.get("response", "")),
+        generation_info=generation_info,
+    )
+
+
+class ChatOllama(BaseChatModel, _OllamaCommon):
+    """Ollama locally runs large language models.
+
+    To use, follow the instructions at https://ollama.ai/.
+
+    Example:
+        .. code-block:: python
+
+            from langchain.chat_models import ChatOllama
+            ollama = ChatOllama(model="llama2")
+    """
+
+    @property
+    def _llm_type(self) -> str:
+        """Return type of chat model."""
+        return "ollama-chat"
+
+    @property
+    def lc_serializable(self) -> bool:
+        return True
+
+    def _format_message_as_text(self, message: BaseMessage) -> str:
+        if isinstance(message, ChatMessage):
+            message_text = f"\n\n{message.role.capitalize()}: {message.content}"
+        elif isinstance(message, HumanMessage):
+            message_text = f"[INST] {message.content} [/INST]"
+        elif isinstance(message, AIMessage):
+            message_text = f"{message.content}"
+        elif isinstance(message, SystemMessage):
+            message_text = f"<<SYS>> {message.content} <</SYS>>"
+        else:
+            raise ValueError(f"Got unknown type {message}")
+        return message_text
+
+    def _format_messages_as_text(self, messages: List[BaseMessage]) -> str:
+        return "\n".join(
+            [self._format_message_as_text(message) for message in messages]
+        )
+
+    def _generate(
+        self,
+        messages: List[BaseMessage],
+        stop: Optional[List[str]] = None,
+        run_manager: Optional[CallbackManagerForLLMRun] = None,
+        **kwargs: Any,
+    ) -> ChatResult:
+        """Call out to Ollama's generate endpoint.
+
+        Args:
+            messages: The list of base messages to pass into the model.
+            stop: Optional list of stop words to use when generating.
+
+        Returns:
+            Chat generations from the model
+
+        Example:
+            .. code-block:: python
+
+                response = ollama([
+                    HumanMessage(content="Tell me about the history of AI")
+                ])
+        """
+
+        prompt = self._format_messages_as_text(messages)
+        final_chunk = super()._stream_with_aggregation(
+            prompt, stop=stop, run_manager=run_manager, verbose=self.verbose, **kwargs
+        )
+        chat_generation = ChatGeneration(
+            message=AIMessage(content=final_chunk.text),
+            generation_info=final_chunk.generation_info,
+        )
+        return ChatResult(generations=[chat_generation])
+
+    def _stream(
+        self,
+        messages: List[BaseMessage],
+        stop: Optional[List[str]] = None,
+        run_manager: Optional[CallbackManagerForLLMRun] = None,
+        **kwargs: Any,
+    ) -> Iterator[ChatGenerationChunk]:
+        prompt = self._format_messages_as_text(messages)
+        for stream_resp in self._create_stream(prompt, stop, **kwargs):
+            if stream_resp:
+                chunk = _stream_response_to_chat_generation_chunk(stream_resp)
+                yield chunk
+                if run_manager:
+                    run_manager.on_llm_new_token(
+                        chunk.text,
+                        verbose=self.verbose,
+                    )
--- a/libs/langchain/langchain/llms/ollama.py
+++ b/libs/langchain/langchain/llms/ollama.py
@ -144,9 +144,35 @@ class _OllamaCommon(BaseLanguageModel):
            )
        return response.iter_lines(decode_unicode=True)

+    def _stream_with_aggregation(
+        self,
+        prompt: str,
+        stop: Optional[List[str]] = None,
+        run_manager: Optional[CallbackManagerForLLMRun] = None,
+        verbose: bool = False,
+        **kwargs: Any,
+    ) -> GenerationChunk:
+        final_chunk: Optional[GenerationChunk] = None
+        for stream_resp in self._create_stream(prompt, stop, **kwargs):
+            if stream_resp:
+                chunk = _stream_response_to_generation_chunk(stream_resp)
+                if final_chunk is None:
+                    final_chunk = chunk
+                else:
+                    final_chunk += chunk
+                if run_manager:
+                    run_manager.on_llm_new_token(
+                        chunk.text,
+                        verbose=verbose,
+                    )
+        if final_chunk is None:
+            raise ValueError("No data received from Ollama stream.")
+
+        return final_chunk
+

 class Ollama(BaseLLM, _OllamaCommon):
-    """Ollama locally run large language models.
+    """Ollama locally runs large language models.

    To use, follow the instructions at https://ollama.ai/.

@ -191,20 +217,13 @@ class Ollama(BaseLLM, _OllamaCommon):
        # TODO: add caching here.
        generations = []
        for prompt in prompts:
-            final_chunk: Optional[GenerationChunk] = None
-            for stream_resp in self._create_stream(prompt, stop, **kwargs):
-                if stream_resp:
-                    chunk = _stream_response_to_generation_chunk(stream_resp)
-                    if final_chunk is None:
-                        final_chunk = chunk
-                    else:
-                        final_chunk += chunk
-                    if run_manager:
-                        run_manager.on_llm_new_token(
-                            chunk.text,
-                            verbose=self.verbose,
-                        )
-
+            final_chunk = super()._stream_with_aggregation(
+                prompt,
+                stop=stop,
+                run_manager=run_manager,
+                verbose=self.verbose,
+                **kwargs,
+            )
            generations.append([final_chunk])
        return LLMResult(generations=generations)