{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Ollama\n",
    "\n",
    "[Ollama](https://ollama.ai/) allows you to run open-source large language models, such as Llama 2, locally.\n",
    "\n",
    "Ollama bundles model weights, configuration, and data into a single package, defined by a Modelfile.\n",
    "\n",
    "It optimizes setup and configuration details, including GPU usage.\n",
    "\n",
    "For a complete list of supported models and model variants, see the [Ollama model library](https://github.com/jmorganca/ollama#model-library).\n",
    "\n",
    "## Setup\n",
    "\n",
    "First, follow [these instructions](https://github.com/jmorganca/ollama) to set up and run a local Ollama instance:\n",
    "\n",
    "* [Download](https://ollama.ai/download)\n",
    "* Fetch a model via `ollama pull <model family>`\n",
    "* e.g., for Llama 2 7B: `ollama pull llama2` (see the full list [here](https://github.com/jmorganca/ollama))\n",
    "* This typically downloads the default variant of the model (e.g., the smallest parameter count with `q4_0` quantization)\n",
    "* On Mac, the manifest is downloaded to\n",
    "\n",
    "`~/.ollama/models/manifests/registry.ollama.ai/library/<model family>/latest`\n",
    "\n",
    "* If we pull a specific version, e.g., `ollama pull vicuna:13b-v1.5-16k-q4_0`\n",
    "* the manifest is stored in the same place, with the version tag in place of `latest`\n",
    "\n",
    "`~/.ollama/models/manifests/registry.ollama.ai/library/vicuna/13b-v1.5-16k-q4_0`\n",
    "\n",
    "You can access models in a few ways:\n",
    "\n",
    "1/ If the app is running:\n",
    "* All of your local models are automatically served on `localhost:11434`\n",
    "* Select your model when setting `llm = Ollama(..., model=\"<model family>:<version>\")`\n",
    "* If you set `llm = Ollama(..., model=\"<model family>\")` without a version, it will simply look for `latest`\n",
    "\n",
    "2/ If you are building from source or just running the binary:\n",
    "* You must first run `ollama serve`\n",
    "* All of your local models are then served on `localhost:11434`\n",
    "* Then, select the model as shown above\n",
    "\n",
    "## Usage\n",
    "\n",
    "You can see a full list of supported parameters on the [API reference page](https://api.python.langchain.com/en/latest/llms/langchain.llms.ollama.Ollama.html)."
   ]
  },
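  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before instantiating the LLM, you can optionally confirm that the local server is reachable and see which models have been pulled. This is a minimal sketch, assuming the default `localhost:11434` endpoint and Ollama's `/api/tags` listing route; it requires the `requests` package."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional sanity check: list the models served by the local Ollama instance.\n",
    "# Assumes the default endpoint (http://localhost:11434) and the /api/tags route.\n",
    "import requests\n",
    "\n",
    "resp = requests.get(\"http://localhost:11434/api/tags\")\n",
    "resp.raise_for_status()\n",
    "for m in resp.json().get(\"models\", []):\n",
    "    print(m.get(\"name\"))"
   ]
  },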
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.llms import Ollama\n",
    "from langchain.callbacks.manager import CallbackManager\n",
    "from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler\n",
    "\n",
    "llm = Ollama(model=\"llama2\",\n",
    "             callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With `StreamingStdOutCallbackHandler`, you will see tokens streamed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "llm(\"Tell me about the history of AI\")"
   ]
  },
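  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Beyond the model name, the wrapper exposes a number of generation parameters (see the API reference linked above). The sketch below sets a few commonly used options; `llm_tuned` is just an illustrative name, and the exact parameter set and defaults depend on your LangChain version."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# A sketch of tuning common generation parameters; see the API reference for the full list.\n",
    "llm_tuned = Ollama(\n",
    "    model=\"llama2\",\n",
    "    temperature=0.2,  # lower values give more deterministic output\n",
    "    top_k=40,         # sample from the 40 most likely tokens\n",
    "    top_p=0.9,        # nucleus sampling threshold\n",
    "    num_ctx=2048,     # context window size\n",
    ")\n",
    "llm_tuned(\"In one sentence, what is a llama?\")"
   ]
  },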
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ollama supports embeddings via `OllamaEmbeddings`:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.embeddings import OllamaEmbeddings\n",
    "\n",
    "oembed = OllamaEmbeddings(base_url=\"http://localhost:11434\", model=\"llama2\")\n",
    "oembed.embed_query(\"Llamas are social animals and live with others as a herd.\")"
   ]
  },
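  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The embeddings are plain lists of floats, so you can compare a query against a few documents directly, e.g., with cosine similarity. This is a minimal sketch that reuses the `oembed` instance above and assumes `numpy` is installed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compare a query embedding against document embeddings with cosine similarity.\n",
    "import numpy as np\n",
    "\n",
    "texts = [\n",
    "    \"Llamas are social animals and live with others as a herd.\",\n",
    "    \"The Eiffel Tower is located in Paris.\",\n",
    "]\n",
    "doc_vecs = np.array(oembed.embed_documents(texts))\n",
    "query_vec = np.array(oembed.embed_query(\"Where do llamas live?\"))\n",
    "\n",
    "# Higher cosine similarity means the text is semantically closer to the query.\n",
    "scores = doc_vecs @ query_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec))\n",
    "for text, score in zip(texts, scores):\n",
    "    print(f\"{score:.3f}  {text}\")"
   ]
  },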
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## RAG\n",
    "\n",
    "We can use Ollama with RAG, [just as shown here](https://python.langchain.com/docs/use_cases/question_answering/how_to/local_retrieval_qa).\n",
    "\n",
    "Let's use the 13b model, which you can fetch with:\n",
    "\n",
    "```\n",
    "ollama pull llama2:13b\n",
    "```\n",
    "\n",
    "(The cells below use the default `llama2` tag; pass `model=\"llama2:13b\"` to use the larger variant.)\n",
    "\n",
    "Let's also use local embeddings (`GPT4AllEmbeddings` below, though `OllamaEmbeddings` also works) and `Chroma` as the vector store."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "! pip install chromadb"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load web page\n",
    "from langchain.document_loaders import WebBaseLoader\n",
    "\n",
    "loader = WebBaseLoader(\"https://lilianweng.github.io/posts/2023-06-23-agent/\")\n",
    "data = loader.load()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Split into chunks\n",
    "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
    "\n",
    "text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=100)\n",
    "all_splits = text_splitter.split_documents(data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Found model file at /Users/rlm/.cache/gpt4all/ggml-all-MiniLM-L6-v2-f16.bin\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "objc[77472]: Class GGMLMetalClass is implemented in both /Users/rlm/miniforge3/envs/llama2/lib/python3.9/site-packages/gpt4all/llmodel_DO_NOT_MODIFY/build/libreplit-mainline-metal.dylib (0x17f754208) and /Users/rlm/miniforge3/envs/llama2/lib/python3.9/site-packages/gpt4all/llmodel_DO_NOT_MODIFY/build/libllamamodel-mainline-metal.dylib (0x17fb80208). One of the two will be used. Which one is undefined.\n"
     ]
    }
   ],
   "source": [
    "# Embed and store\n",
    "from langchain.vectorstores import Chroma\n",
    "from langchain.embeddings import GPT4AllEmbeddings\n",
    "from langchain.embeddings import OllamaEmbeddings  # We can also try Ollama embeddings\n",
    "\n",
    "vectorstore = Chroma.from_documents(documents=all_splits,\n",
    "                                    embedding=GPT4AllEmbeddings())"
   ]
  },
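  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As the comment above notes, you can also keep the embeddings on Ollama. A minimal sketch, assuming the `llama2` model is already pulled; `vectorstore_ollama` is an illustrative name so the store above is not overwritten."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Alternative: build the vector store with Ollama-served embeddings instead of GPT4All.\n",
    "vectorstore_ollama = Chroma.from_documents(\n",
    "    documents=all_splits,\n",
    "    embedding=OllamaEmbeddings(model=\"llama2\"),\n",
    ")"
   ]
  },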
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Retrieve\n",
    "question = \"How can Task Decomposition be done?\"\n",
    "docs = vectorstore.similarity_search(question)\n",
    "len(docs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "# RAG prompt\n",
    "from langchain import hub\n",
    "\n",
    "QA_CHAIN_PROMPT = hub.pull(\"rlm/rag-prompt-llama\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "# LLM\n",
    "from langchain.llms import Ollama\n",
    "from langchain.callbacks.manager import CallbackManager\n",
    "from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler\n",
    "\n",
    "llm = Ollama(model=\"llama2\",\n",
    "             verbose=True,\n",
    "             callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "# QA chain\n",
    "from langchain.chains import RetrievalQA\n",
    "\n",
    "qa_chain = RetrievalQA.from_chain_type(\n",
    "    llm,\n",
    "    retriever=vectorstore.as_retriever(),\n",
    "    chain_type_kwargs={\"prompt\": QA_CHAIN_PROMPT},\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " There are several approaches to task decomposition for AI agents, including:\n",
      "\n",
      "1. Chain of thought (CoT): This involves instructing the model to \"think step by step\" and use more test-time computation to decompose hard tasks into smaller and simpler steps.\n",
      "2. Tree of thoughts (ToT): This extends CoT by exploring multiple reasoning possibilities at each step, creating a tree structure. The search process can be BFS or DFS with each state evaluated by a classifier or majority vote.\n",
      "3. Using task-specific instructions: For example, \"Write a story outline.\" for writing a novel.\n",
      "4. Human inputs: The agent can receive input from a human operator to perform tasks that require creativity and domain expertise.\n",
      "\n",
      "These approaches allow the agent to break down complex tasks into manageable subgoals, enabling efficient handling of tasks and improving the quality of final results through self-reflection and refinement."
     ]
    }
   ],
   "source": [
    "question = \"What are the various approaches to Task Decomposition for AI Agents?\"\n",
    "result = qa_chain({\"query\": question})"
   ]
  },
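  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The chain streams its answer above and also returns a dict; with `RetrievalQA`'s default output key the generated answer is available as a string under `\"result\"`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The answer streamed above is also returned in the chain's output dict.\n",
    "result[\"result\"]"
   ]
  },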
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also log generation statistics (such as token counts and timings) with a custom callback."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.schema import LLMResult\n",
    "from langchain.callbacks.base import BaseCallbackHandler\n",
    "\n",
    "\n",
    "class GenerationStatisticsCallback(BaseCallbackHandler):\n",
    "    \"\"\"Print the generation statistics Ollama returns (e.g., `eval_count`, `eval_duration`).\"\"\"\n",
    "\n",
    "    def on_llm_end(self, response: LLMResult, **kwargs) -> None:\n",
    "        print(response.generations[0][0].generation_info)\n",
    "\n",
    "\n",
    "callback_manager = CallbackManager([StreamingStdOutCallbackHandler(), GenerationStatisticsCallback()])\n",
    "\n",
    "llm = Ollama(base_url=\"http://localhost:11434\",\n",
    "             model=\"llama2\",\n",
    "             verbose=True,\n",
    "             callback_manager=callback_manager)\n",
    "\n",
    "qa_chain = RetrievalQA.from_chain_type(\n",
    "    llm,\n",
    "    retriever=vectorstore.as_retriever(),\n",
    "    chain_type_kwargs={\"prompt\": QA_CHAIN_PROMPT},\n",
    ")\n",
    "\n",
    "question = \"What are the approaches to Task Decomposition?\"\n",
    "result = qa_chain({\"query\": question})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`eval_count` / (`eval_duration` / 1e9) gives tokens per second (`eval_duration` is reported in nanoseconds)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "47.22003469910937"
      ]
     },
     "execution_count": 57,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# e.g., eval_count = 62 tokens over eval_duration = 1313002000 ns\n",
    "62 / (1313002000 / 1000 / 1000 / 1000)"
   ]
  },
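  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Rather than computing this by hand, the statistics callback above can be extended to report tokens per second directly. A minimal sketch; it assumes `generation_info` contains the `eval_count` and `eval_duration` keys used above, and `TokensPerSecondCallback` / `llm_tps` are illustrative names."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# A variant of the callback above that reports tokens per second directly.\n",
    "class TokensPerSecondCallback(BaseCallbackHandler):\n",
    "    def on_llm_end(self, response: LLMResult, **kwargs) -> None:\n",
    "        info = response.generations[0][0].generation_info or {}\n",
    "        if \"eval_count\" in info and \"eval_duration\" in info:\n",
    "            # eval_duration is reported in nanoseconds\n",
    "            print(f\"{info['eval_count'] / (info['eval_duration'] / 1e9):.1f} tok/s\")\n",
    "\n",
    "\n",
    "llm_tps = Ollama(model=\"llama2\",\n",
    "                 callback_manager=CallbackManager([StreamingStdOutCallbackHandler(),\n",
    "                                                   TokensPerSecondCallback()]))\n",
    "llm_tps(\"Tell me a one-line fact about llamas.\")"
   ]
  },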
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using the Hub for prompt management\n",
    "\n",
    "Open-source models often benefit from specific prompts.\n",
    "\n",
    "For example, [Mistral 7b](https://mistral.ai/news/announcing-mistral-7b/) was fine-tuned for chat using the prompt format shown [here](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1).\n",
    "\n",
    "Get the model: `ollama pull mistral:7b-instruct`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "# LLM\n",
    "from langchain.llms import Ollama\n",
    "from langchain.callbacks.manager import CallbackManager\n",
    "from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler\n",
    "\n",
    "llm = Ollama(model=\"mistral:7b-instruct\",\n",
    "             verbose=True,\n",
    "             callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain import hub\n",
    "\n",
    "QA_CHAIN_PROMPT = hub.pull(\"rlm/rag-prompt-mistral\")\n",
    "\n",
    "# QA chain\n",
    "from langchain.chains import RetrievalQA\n",
    "\n",
    "qa_chain = RetrievalQA.from_chain_type(\n",
    "    llm,\n",
    "    retriever=vectorstore.as_retriever(),\n",
    "    chain_type_kwargs={\"prompt\": QA_CHAIN_PROMPT},\n",
    ")"
   ]
  },
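  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you want to see how the Mistral-specific prompt is structured (e.g., its instruction formatting), you can inspect the object pulled from the hub; printing it shows the underlying template for most prompt types."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Inspect the prompt pulled from the hub to see the Mistral-specific formatting.\n",
    "print(QA_CHAIN_PROMPT)"
   ]
  },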
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "There are different approaches to Task Decomposition for AI Agents such as Chain of thought (CoT) and Tree of Thoughts (ToT). CoT breaks down big tasks into multiple manageable tasks and generates multiple thoughts per step, while ToT explores multiple reasoning possibilities at each step. Task decomposition can be done by LLM with simple prompting or using task-specific instructions or human inputs."
     ]
    }
   ],
   "source": [
    "question = \"What are the various approaches to Task Decomposition for AI Agents?\"\n",
    "result = qa_chain({\"query\": question})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.16"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}