langchain/docs/extras/use_cases/code_understanding.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Code Understanding\n",
    "\n",
    "[![Open In Collab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/extras/use_cases/code_understanding.ipynb)\n",
    "\n",
    "## Use case\n",
    "\n",
    "Source code analysis is one of the most popular LLM applications (e.g., [GitHub Co-Pilot](https://github.com/features/copilot), [Code Interpreter](https://chat.openai.com/auth/login?next=%2F%3Fmodel%3Dgpt-4-code-interpreter), [Codium](https://www.codium.ai/), and [Codeium](https://codeium.com/about)) for use-cases such as:\n",
    "\n",
    "- Q&A over the code base to understand how it works\n",
    "- Using LLMs for suggesting refactors or improvements\n",
    "- Using LLMs for documenting the code\n",
    "\n",
    "![Image description](/img/code_understanding.png)\n",
    "\n",
    "## Overview\n",
    "\n",
    "The pipeline for QA over code follows [the steps we do for document question answering](/docs/extras/use_cases/question_answering), with some differences:\n",
    "\n",
    "In particular, we can employ a [splitting strategy](https://python.langchain.com/docs/integrations/document_loaders/source_code) that does a few things:\n",
    "\n",
    "* Keeps each top-level function and class in the code is loaded into separate documents. \n",
    "* Puts remaining into a seperate document.\n",
    "* Retains metadata about where each split comes from\n",
    "\n",
    "## Quickstart"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install openai tiktoken chromadb langchain\n",
    "\n",
    "# Set env var OPENAI_API_KEY or load from a .env file\n",
    "# import dotenv\n",
    "\n",
    "# dotenv.load_env()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'lll follow the structure of [this notebook](https://github.com/cristobalcl/LearningLangChain/blob/master/notebooks/04%20-%20QA%20with%20code.ipynb) and employ [context aware code splitting](https://python.langchain.com/docs/integrations/document_loaders/source_code)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Loading\n",
    "\n",
    "\n",
    "We will upload all python project files using the `langchain.document_loaders.TextLoader`.\n",
    "\n",
    "The following script iterates over the files in the LangChain repository and loads every `.py` file (a.k.a. **documents**):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "from git import Repo\n",
    "from langchain.text_splitter import Language\n",
    "from langchain.document_loaders.generic import GenericLoader\n",
    "from langchain.document_loaders.parsers import LanguageParser"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Clone\n",
    "repo_path = \"/Users/rlm/Desktop/test_repo\"\n",
    "repo = Repo.clone_from(\"https://github.com/hwchase17/langchain\", to_path=repo_path)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We load the py code using [`LanguageParser`](https://python.langchain.com/docs/integrations/document_loaders/source_code), which will:\n",
    "\n",
    "* Keep top-level functions and classes together (into a single document)\n",
    "* Put remaining code into a seperate document\n",
    "* Retains metadata about where each split comes from"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1293"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Load\n",
    "loader = GenericLoader.from_filesystem(\n",
    "    repo_path+\"/libs/langchain/langchain\",\n",
    "    glob=\"**/*\",\n",
    "    suffixes=[\".py\"],\n",
    "    parser=LanguageParser(language=Language.PYTHON, parser_threshold=500)\n",
    ")\n",
    "documents = loader.load()\n",
    "len(documents)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Splitting\n",
    "\n",
    "Split the `Document` into chunks for embedding and vector storage.\n",
    "\n",
    "We can use `RecursiveCharacterTextSplitter` w/ `language` specified."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3748"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
    "python_splitter = RecursiveCharacterTextSplitter.from_language(language=Language.PYTHON, \n",
    "                                                               chunk_size=2000, \n",
    "                                                               chunk_overlap=200)\n",
    "texts = python_splitter.split_documents(documents)\n",
    "len(texts)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### RetrievalQA\n",
    "\n",
    "We need to store the documents in a way we can semantically search for their content. \n",
    "\n",
    "The most common approach is to embed the contents of each document then store the embedding and document in a vector store. \n",
    "\n",
    "When setting up the vectorstore retriever:\n",
    "\n",
    "* We test [max marginal relevance](/docs/extras/use_cases/question_answering) for retrieval\n",
    "* And 8 documents returned\n",
    "\n",
    "#### Go deeper\n",
    "\n",
    "- Browse the > 40 vectorstores integrations [here](https://integrations.langchain.com/).\n",
    "- See further documentation on vectorstores [here](/docs/modules/data_connection/vectorstores/).\n",
    "- Browse the > 30 text embedding integrations [here](https://integrations.langchain.com/).\n",
    "- See further documentation on embedding models [here](/docs/modules/data_connection/text_embedding/)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.vectorstores import Chroma\n",
    "from langchain.embeddings.openai import OpenAIEmbeddings\n",
    "db = Chroma.from_documents(texts, OpenAIEmbeddings(disallowed_special=()))\n",
    "retriever = db.as_retriever(\n",
    "    search_type=\"mmr\", # Also test \"similarity\"\n",
    "    search_kwargs={\"k\": 8},\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Chat\n",
    "\n",
    "Test chat, just as we do for [chatbots](/docs/extras/use_cases/chatbots).\n",
    "\n",
    "#### Go deeper\n",
    "\n",
    "- Browse the > 55 LLM and chat model integrations [here](https://integrations.langchain.com/).\n",
    "- See further documentation on LLMs and chat models [here](/docs/modules/model_io/models/).\n",
    "- Use local LLMS: The popularity of [PrivateGPT](https://github.com/imartinez/privateGPT) and [GPT4All](https://github.com/nomic-ai/gpt4all) underscore the importance of running LLMs locally."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.chat_models import ChatOpenAI\n",
    "from langchain.memory import ConversationSummaryMemory\n",
    "from langchain.chains import ConversationalRetrievalChain\n",
    "llm = ChatOpenAI(model_name=\"gpt-4\") \n",
    "memory = ConversationSummaryMemory(llm=llm,memory_key=\"chat_history\",return_messages=True)\n",
    "qa = ConversationalRetrievalChain.from_llm(llm, retriever=retriever, memory=memory)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'To initialize a ReAct agent, you need to follow these steps:\\n\\n1. Initialize a language model `llm` of type `BaseLanguageModel`.\\n\\n2. Initialize a document store `docstore` of type `Docstore`.\\n\\n3. Create a `DocstoreExplorer` with the initialized `docstore`. The `DocstoreExplorer` is used to search for and look up terms in the document store.\\n\\n4. Create an array of `Tool` objects. The `Tool` objects represent the actions that the agent can perform. In the case of `ReActDocstoreAgent`, the tools must be \"Search\" and \"Lookup\" with their corresponding functions from the `DocstoreExplorer`.\\n\\n5. Initialize the `ReActDocstoreAgent` using the `from_llm_and_tools` method with the `llm` (language model) and `tools` as parameters.\\n\\n6. Initialize the `ReActChain` (which is the `AgentExecutor`) using the `ReActDocstoreAgent` and `tools` as parameters.\\n\\nHere is an example of how to do this:\\n\\n```python\\nfrom langchain import ReActChain, OpenAI\\nfrom langchain.docstore.base import Docstore\\nfrom langchain.docstore.document import Document\\nfrom langchain.tools.base import BaseTool\\n\\n# Initialize the LLM and a docstore\\nllm = OpenAI()\\ndocstore = Docstore()\\n\\ndocstore_explorer = DocstoreExplorer(docstore)\\ntools = [\\n    Tool(\\n        name=\"Search\",\\n        func=docstore_explorer.search,\\n        description=\"Search for a term in the docstore.\",\\n    ),\\n    Tool(\\n        name=\"Lookup\",\\n        func=docstore_explorer.lookup,\\n        description=\"Lookup a term in the docstore.\",\\n    ),\\n]\\nagent = ReActDocstoreAgent.from_llm_and_tools(llm, tools)\\nreact = ReActChain(agent=agent, tools=tools)\\n```\\n\\nKeep in mind that this is a simplified example and you might need to adapt it to your specific needs.'"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "question = \"How can I initialize a ReAct agent?\"\n",
    "result = qa(question)\n",
    "result['answer']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "-> **Question**: What is the class hierarchy? \n",
      "\n",
      "**Answer**: The class hierarchy in object-oriented programming is the structure that forms when classes are derived from other classes. The derived class is a subclass of the base class also known as the superclass. This hierarchy is formed based on the concept of inheritance in object-oriented programming where a subclass inherits the properties and functionalities of the superclass. \n",
      "\n",
      "In the given context, we have the following examples of class hierarchies:\n",
      "\n",
      "1. `BaseCallbackHandler --> <name>CallbackHandler` means `BaseCallbackHandler` is a base class and `<name>CallbackHandler` (like `AimCallbackHandler`, `ArgillaCallbackHandler` etc.) are derived classes that inherit from `BaseCallbackHandler`.\n",
      "\n",
      "2. `BaseLoader --> <name>Loader` means `BaseLoader` is a base class and `<name>Loader` (like `TextLoader`, `UnstructuredFileLoader` etc.) are derived classes that inherit from `BaseLoader`.\n",
      "\n",
      "3. `ToolMetaclass --> BaseTool --> <name>Tool` means `ToolMetaclass` is a base class, `BaseTool` is a derived class that inherits from `ToolMetaclass`, and `<name>Tool` (like `AIPluginTool`, `BaseGraphQLTool` etc.) are further derived classes that inherit from `BaseTool`. \n",
      "\n",
      "-> **Question**: What classes are derived from the Chain class? \n",
      "\n",
      "**Answer**: The classes that are derived from the Chain class are:\n",
      "\n",
      "1. LLMSummarizationCheckerChain\n",
      "2. MapReduceChain\n",
      "3. OpenAIModerationChain\n",
      "4. NatBotChain\n",
      "5. QAGenerationChain\n",
      "6. QAWithSourcesChain\n",
      "7. RetrievalQAWithSourcesChain\n",
      "8. VectorDBQAWithSourcesChain\n",
      "9. RetrievalQA\n",
      "10. VectorDBQA\n",
      "11. LLMRouterChain\n",
      "12. MultiPromptChain\n",
      "13. MultiRetrievalQAChain\n",
      "14. MultiRouteChain\n",
      "15. RouterChain\n",
      "16. SequentialChain\n",
      "17. SimpleSequentialChain\n",
      "18. TransformChain\n",
      "19. BaseConversationalRetrievalChain\n",
      "20. ConstitutionalChain \n",
      "\n",
      "-> **Question**: What one improvement do you propose in code in relation to the class herarchy for the Chain class? \n",
      "\n",
      "**Answer**: As an AI model, I don't have personal opinions. However, one suggestion could be to improve the documentation of the Chain class hierarchy. The current comments and docstrings provide some details but it could be helpful to include more explicit explanations about the hierarchy, roles of each subclass, and their relationships with one another. Also, incorporating UML diagrams or other visuals could help developers better understand the structure and interactions of the classes. \n",
      "\n"
     ]
    }
   ],
   "source": [
    "questions = [\n",
    "    \"What is the class hierarchy?\",\n",
    "    \"What classes are derived from the Chain class?\",\n",
    "    \"What one improvement do you propose in code in relation to the class herarchy for the Chain class?\",\n",
    "]\n",
    "\n",
    "for question in questions:\n",
    "    result = qa(question)\n",
    "    print(f\"-> **Question**: {question} \\n\")\n",
    "    print(f\"**Answer**: {result['answer']} \\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The can look at the [LangSmith trace](https://smith.langchain.com/public/2b23045f-4e49-4d2d-8980-dec85259af36/r) to see what is happening under the hood:\n",
    "\n",
    "* In particular, the code well structured and kept together in the retrival output\n",
    "* The retrieved code and chat history are passed to the LLM for answer distillation\n",
    "\n",
    "![Image description](/img/code_retrieval.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Private chat\n",
    "\n",
    "We can use [Code LLaMA](https://about.fb.com/news/2023/08/code-llama-ai-for-coding/) via the Ollama integration.\n",
    "\n",
    "`ollama pull codellama:7b-instruct`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.llms import Ollama\n",
    "from langchain.callbacks.manager import CallbackManager\n",
    "from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler                                  \n",
    "llm = Ollama(model=\"codellama:7b-instruct\", \n",
    "             callback_manager = CallbackManager([StreamingStdOutCallbackHandler()]))\n",
    "memory = ConversationSummaryMemory(llm=llm,memory_key=\"chat_history\",return_messages=True)\n",
    "qa_llama=ConversationalRetrievalChain.from_llm(llm, retriever=retriever, memory=memory)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " \"How can I initialize a ReAct agent?\" To initialize a ReAct agent, you can use the `ReActAgent.from_llm_and_tools()` class method. This method takes two arguments: the LLM and a list of tools.\n",
      "Here is an example of how to initialize a ReAct agent with the OpenAI language model and the \"Search\" tool:\n",
      "from langchain.agents.mrkl.base import ZeroShotAgent\n",
      "\n",
      "agent = ReActDocstoreAgent.from_llm_and_tools(OpenAIFunctionsAgent(), [Tool(\"Search\")]])\n",
      "\n",
      " The human asks what the AI thinks of artificial intelligence. The AI thinks artificial intelligence is a force for good because it will help humans reach their full potential."
     ]
    },
    {
     "data": {
      "text/plain": [
       "' To initialize a ReAct agent, you can use the `ReActAgent.from_llm_and_tools()` class method. This method takes two arguments: the LLM and a list of tools.\\nHere is an example of how to initialize a ReAct agent with the OpenAI language model and the \"Search\" tool:\\nfrom langchain.agents.mrkl.base import ZeroShotAgent\\n\\nagent = ReActDocstoreAgent.from_llm_and_tools(OpenAIFunctionsAgent(), [Tool(\"Search\")]])\\n\\n'"
      ]
     },
     "execution_count": 45,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "question = \"How can I initialize a ReAct agent?\"\n",
    "result = qa_llama(question)\n",
    "result['answer']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can view the [LangSmith trace](https://smith.langchain.com/public/fd24c734-e365-4a09-b883-cdbc7dcfa582/r) to sanity check the result relative to what was retrieved."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.16"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}