Doc refactor (#6300)

Co-authored-by: jacoblee93 <jacoblee93@gmail.com>
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
This commit is contained in:
Davis Chase
2023-06-16 11:52:56 -07:00
committed by GitHub
parent 94c82a189d
commit 87e502c6bc
1027 changed files with 23013 additions and 36747 deletions

View File

@@ -0,0 +1,666 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Use LangChain, GPT and Deep Lake to work with code base\n",
"In this tutorial, we are going to use Langchain + Deep Lake with GPT to analyze the code base of the LangChain itself. "
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Design"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"1. Prepare data:\n",
" 1. Upload all python project files using the `langchain.document_loaders.TextLoader`. We will call these files the **documents**.\n",
" 2. Split all documents to chunks using the `langchain.text_splitter.CharacterTextSplitter`.\n",
" 3. Embed chunks and upload them into the DeepLake using `langchain.embeddings.openai.OpenAIEmbeddings` and `langchain.vectorstores.DeepLake`\n",
"2. Question-Answering:\n",
" 1. Build a chain from `langchain.chat_models.ChatOpenAI` and `langchain.chains.ConversationalRetrievalChain`\n",
" 2. Prepare questions.\n",
" 3. Get answers running the chain.\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Implementation"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"### Integration preparations"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"We need to set up keys for external services and install necessary python libraries."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"#!python3 -m pip install --upgrade langchain deeplake openai"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Set up OpenAI embeddings, Deep Lake multi-modal vector store api and authenticate. \n",
"\n",
"For full documentation of Deep Lake please follow https://docs.activeloop.ai/ and API reference https://docs.deeplake.ai/en/latest/"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" ········\n"
]
}
],
"source": [
"import os\n",
"from getpass import getpass\n",
"\n",
"os.environ[\"OPENAI_API_KEY\"] = getpass()\n",
"# Please manually enter OpenAI Key"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Authenticate into Deep Lake if you want to create your own dataset and publish it. You can get an API key from the platform at [app.activeloop.ai](https://app.activeloop.ai)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" ········\n"
]
}
],
"source": [
"os.environ[\"ACTIVELOOP_TOKEN\"] = getpass.getpass(\"Activeloop Token:\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Prepare data "
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Load all repository files. Here we assume this notebook is downloaded as the part of the langchain fork and we work with the python files of the `langchain` repo.\n",
"\n",
"If you want to use files from different repo, change `root_dir` to the root dir of your repo."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1147\n"
]
}
],
"source": [
"from langchain.document_loaders import TextLoader\n",
"\n",
"root_dir = \"../../../..\"\n",
"\n",
"docs = []\n",
"for dirpath, dirnames, filenames in os.walk(root_dir):\n",
" for file in filenames:\n",
" if file.endswith(\".py\") and \"/.venv/\" not in dirpath:\n",
" try:\n",
" loader = TextLoader(os.path.join(dirpath, file), encoding=\"utf-8\")\n",
" docs.extend(loader.load_and_split())\n",
" except Exception as e:\n",
" pass\n",
"print(f\"{len(docs)}\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Then, chunk the files"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Created a chunk of size 1620, which is longer than the specified 1000\n",
"Created a chunk of size 1213, which is longer than the specified 1000\n",
"Created a chunk of size 1263, which is longer than the specified 1000\n",
"Created a chunk of size 1448, which is longer than the specified 1000\n",
"Created a chunk of size 1120, which is longer than the specified 1000\n",
"Created a chunk of size 1148, which is longer than the specified 1000\n",
"Created a chunk of size 1826, which is longer than the specified 1000\n",
"Created a chunk of size 1260, which is longer than the specified 1000\n",
"Created a chunk of size 1195, which is longer than the specified 1000\n",
"Created a chunk of size 2147, which is longer than the specified 1000\n",
"Created a chunk of size 1410, which is longer than the specified 1000\n",
"Created a chunk of size 1269, which is longer than the specified 1000\n",
"Created a chunk of size 1030, which is longer than the specified 1000\n",
"Created a chunk of size 1046, which is longer than the specified 1000\n",
"Created a chunk of size 1024, which is longer than the specified 1000\n",
"Created a chunk of size 1026, which is longer than the specified 1000\n",
"Created a chunk of size 1285, which is longer than the specified 1000\n",
"Created a chunk of size 1370, which is longer than the specified 1000\n",
"Created a chunk of size 1031, which is longer than the specified 1000\n",
"Created a chunk of size 1999, which is longer than the specified 1000\n",
"Created a chunk of size 1029, which is longer than the specified 1000\n",
"Created a chunk of size 1120, which is longer than the specified 1000\n",
"Created a chunk of size 1033, which is longer than the specified 1000\n",
"Created a chunk of size 1143, which is longer than the specified 1000\n",
"Created a chunk of size 1416, which is longer than the specified 1000\n",
"Created a chunk of size 2482, which is longer than the specified 1000\n",
"Created a chunk of size 1890, which is longer than the specified 1000\n",
"Created a chunk of size 1418, which is longer than the specified 1000\n",
"Created a chunk of size 1848, which is longer than the specified 1000\n",
"Created a chunk of size 1069, which is longer than the specified 1000\n",
"Created a chunk of size 2369, which is longer than the specified 1000\n",
"Created a chunk of size 1045, which is longer than the specified 1000\n",
"Created a chunk of size 1501, which is longer than the specified 1000\n",
"Created a chunk of size 1208, which is longer than the specified 1000\n",
"Created a chunk of size 1950, which is longer than the specified 1000\n",
"Created a chunk of size 1283, which is longer than the specified 1000\n",
"Created a chunk of size 1414, which is longer than the specified 1000\n",
"Created a chunk of size 1304, which is longer than the specified 1000\n",
"Created a chunk of size 1224, which is longer than the specified 1000\n",
"Created a chunk of size 1060, which is longer than the specified 1000\n",
"Created a chunk of size 2461, which is longer than the specified 1000\n",
"Created a chunk of size 1099, which is longer than the specified 1000\n",
"Created a chunk of size 1178, which is longer than the specified 1000\n",
"Created a chunk of size 1449, which is longer than the specified 1000\n",
"Created a chunk of size 1345, which is longer than the specified 1000\n",
"Created a chunk of size 3359, which is longer than the specified 1000\n",
"Created a chunk of size 2248, which is longer than the specified 1000\n",
"Created a chunk of size 1589, which is longer than the specified 1000\n",
"Created a chunk of size 2104, which is longer than the specified 1000\n",
"Created a chunk of size 1505, which is longer than the specified 1000\n",
"Created a chunk of size 1387, which is longer than the specified 1000\n",
"Created a chunk of size 1215, which is longer than the specified 1000\n",
"Created a chunk of size 1240, which is longer than the specified 1000\n",
"Created a chunk of size 1635, which is longer than the specified 1000\n",
"Created a chunk of size 1075, which is longer than the specified 1000\n",
"Created a chunk of size 2180, which is longer than the specified 1000\n",
"Created a chunk of size 1791, which is longer than the specified 1000\n",
"Created a chunk of size 1555, which is longer than the specified 1000\n",
"Created a chunk of size 1082, which is longer than the specified 1000\n",
"Created a chunk of size 1225, which is longer than the specified 1000\n",
"Created a chunk of size 1287, which is longer than the specified 1000\n",
"Created a chunk of size 1085, which is longer than the specified 1000\n",
"Created a chunk of size 1117, which is longer than the specified 1000\n",
"Created a chunk of size 1966, which is longer than the specified 1000\n",
"Created a chunk of size 1150, which is longer than the specified 1000\n",
"Created a chunk of size 1285, which is longer than the specified 1000\n",
"Created a chunk of size 1150, which is longer than the specified 1000\n",
"Created a chunk of size 1585, which is longer than the specified 1000\n",
"Created a chunk of size 1208, which is longer than the specified 1000\n",
"Created a chunk of size 1267, which is longer than the specified 1000\n",
"Created a chunk of size 1542, which is longer than the specified 1000\n",
"Created a chunk of size 1183, which is longer than the specified 1000\n",
"Created a chunk of size 2424, which is longer than the specified 1000\n",
"Created a chunk of size 1017, which is longer than the specified 1000\n",
"Created a chunk of size 1304, which is longer than the specified 1000\n",
"Created a chunk of size 1379, which is longer than the specified 1000\n",
"Created a chunk of size 1324, which is longer than the specified 1000\n",
"Created a chunk of size 1205, which is longer than the specified 1000\n",
"Created a chunk of size 1056, which is longer than the specified 1000\n",
"Created a chunk of size 1195, which is longer than the specified 1000\n",
"Created a chunk of size 3608, which is longer than the specified 1000\n",
"Created a chunk of size 1058, which is longer than the specified 1000\n",
"Created a chunk of size 1075, which is longer than the specified 1000\n",
"Created a chunk of size 1217, which is longer than the specified 1000\n",
"Created a chunk of size 1109, which is longer than the specified 1000\n",
"Created a chunk of size 1440, which is longer than the specified 1000\n",
"Created a chunk of size 1046, which is longer than the specified 1000\n",
"Created a chunk of size 1220, which is longer than the specified 1000\n",
"Created a chunk of size 1403, which is longer than the specified 1000\n",
"Created a chunk of size 1241, which is longer than the specified 1000\n",
"Created a chunk of size 1427, which is longer than the specified 1000\n",
"Created a chunk of size 1049, which is longer than the specified 1000\n",
"Created a chunk of size 1580, which is longer than the specified 1000\n",
"Created a chunk of size 1565, which is longer than the specified 1000\n",
"Created a chunk of size 1131, which is longer than the specified 1000\n",
"Created a chunk of size 1425, which is longer than the specified 1000\n",
"Created a chunk of size 1054, which is longer than the specified 1000\n",
"Created a chunk of size 1027, which is longer than the specified 1000\n",
"Created a chunk of size 2559, which is longer than the specified 1000\n",
"Created a chunk of size 1028, which is longer than the specified 1000\n",
"Created a chunk of size 1382, which is longer than the specified 1000\n",
"Created a chunk of size 1888, which is longer than the specified 1000\n",
"Created a chunk of size 1475, which is longer than the specified 1000\n",
"Created a chunk of size 1652, which is longer than the specified 1000\n",
"Created a chunk of size 1891, which is longer than the specified 1000\n",
"Created a chunk of size 1899, which is longer than the specified 1000\n",
"Created a chunk of size 1021, which is longer than the specified 1000\n",
"Created a chunk of size 1085, which is longer than the specified 1000\n",
"Created a chunk of size 1854, which is longer than the specified 1000\n",
"Created a chunk of size 1672, which is longer than the specified 1000\n",
"Created a chunk of size 2537, which is longer than the specified 1000\n",
"Created a chunk of size 1251, which is longer than the specified 1000\n",
"Created a chunk of size 1734, which is longer than the specified 1000\n",
"Created a chunk of size 1642, which is longer than the specified 1000\n",
"Created a chunk of size 1376, which is longer than the specified 1000\n",
"Created a chunk of size 1253, which is longer than the specified 1000\n",
"Created a chunk of size 1642, which is longer than the specified 1000\n",
"Created a chunk of size 1419, which is longer than the specified 1000\n",
"Created a chunk of size 1438, which is longer than the specified 1000\n",
"Created a chunk of size 1427, which is longer than the specified 1000\n",
"Created a chunk of size 1684, which is longer than the specified 1000\n",
"Created a chunk of size 1760, which is longer than the specified 1000\n",
"Created a chunk of size 1157, which is longer than the specified 1000\n",
"Created a chunk of size 2504, which is longer than the specified 1000\n",
"Created a chunk of size 1082, which is longer than the specified 1000\n",
"Created a chunk of size 2268, which is longer than the specified 1000\n",
"Created a chunk of size 1784, which is longer than the specified 1000\n",
"Created a chunk of size 1311, which is longer than the specified 1000\n",
"Created a chunk of size 2972, which is longer than the specified 1000\n",
"Created a chunk of size 1144, which is longer than the specified 1000\n",
"Created a chunk of size 1825, which is longer than the specified 1000\n",
"Created a chunk of size 1508, which is longer than the specified 1000\n",
"Created a chunk of size 2901, which is longer than the specified 1000\n",
"Created a chunk of size 1715, which is longer than the specified 1000\n",
"Created a chunk of size 1062, which is longer than the specified 1000\n",
"Created a chunk of size 1206, which is longer than the specified 1000\n",
"Created a chunk of size 1102, which is longer than the specified 1000\n",
"Created a chunk of size 1184, which is longer than the specified 1000\n",
"Created a chunk of size 1002, which is longer than the specified 1000\n",
"Created a chunk of size 1065, which is longer than the specified 1000\n",
"Created a chunk of size 1871, which is longer than the specified 1000\n",
"Created a chunk of size 1754, which is longer than the specified 1000\n",
"Created a chunk of size 2413, which is longer than the specified 1000\n",
"Created a chunk of size 1771, which is longer than the specified 1000\n",
"Created a chunk of size 2054, which is longer than the specified 1000\n",
"Created a chunk of size 2000, which is longer than the specified 1000\n",
"Created a chunk of size 2061, which is longer than the specified 1000\n",
"Created a chunk of size 1066, which is longer than the specified 1000\n",
"Created a chunk of size 1419, which is longer than the specified 1000\n",
"Created a chunk of size 1368, which is longer than the specified 1000\n",
"Created a chunk of size 1008, which is longer than the specified 1000\n",
"Created a chunk of size 1227, which is longer than the specified 1000\n",
"Created a chunk of size 1745, which is longer than the specified 1000\n",
"Created a chunk of size 2296, which is longer than the specified 1000\n",
"Created a chunk of size 1083, which is longer than the specified 1000\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"3477\n"
]
}
],
"source": [
"from langchain.text_splitter import CharacterTextSplitter\n",
"\n",
"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
"texts = text_splitter.split_documents(docs)\n",
"print(f\"{len(texts)}\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Then embed chunks and upload them to the DeepLake.\n",
"\n",
"This can take several minutes. "
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"OpenAIEmbeddings(client=<class 'openai.api_resources.embedding.Embedding'>, model='text-embedding-ada-002', document_model_name='text-embedding-ada-002', query_model_name='text-embedding-ada-002', embedding_ctx_length=8191, openai_api_key=None, openai_organization=None, allowed_special=set(), disallowed_special='all', chunk_size=1000, max_retries=6)"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain.embeddings.openai import OpenAIEmbeddings\n",
"\n",
"embeddings = OpenAIEmbeddings()\n",
"embeddings"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain.vectorstores import DeepLake\n",
"\n",
"db = DeepLake.from_documents(\n",
" texts, embeddings, dataset_path=f\"hub://{DEEPLAKE_ACCOUNT_NAME}/langchain-code\"\n",
")\n",
"db"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Question Answering\n",
"First load the dataset, construct the retriever, then construct the Conversational Chain"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"-"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/user_name/langchain-code\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"hub://user_name/langchain-code loaded successfully.\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Deep Lake Dataset in hub://user_name/langchain-code already exists, loading from the storage\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dataset(path='hub://user_name/langchain-code', read_only=True, tensors=['embedding', 'ids', 'metadata', 'text'])\n",
"\n",
" tensor htype shape dtype compression\n",
" ------- ------- ------- ------- ------- \n",
" embedding generic (3477, 1536) float32 None \n",
" ids text (3477, 1) str None \n",
" metadata json (3477, 1) str None \n",
" text text (3477, 1) str None \n"
]
}
],
"source": [
"db = DeepLake(\n",
" dataset_path=f\"hub://{DEEPLAKE_ACCOUNT_NAME}/langchain-code\",\n",
" read_only=True,\n",
" embedding_function=embeddings,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"retriever = db.as_retriever()\n",
"retriever.search_kwargs[\"distance_metric\"] = \"cos\"\n",
"retriever.search_kwargs[\"fetch_k\"] = 20\n",
"retriever.search_kwargs[\"maximal_marginal_relevance\"] = True\n",
"retriever.search_kwargs[\"k\"] = 20"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also specify user defined functions using [Deep Lake filters](https://docs.deeplake.ai/en/latest/deeplake.core.dataset.html#deeplake.core.dataset.Dataset.filter)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"def filter(x):\n",
" # filter based on source code\n",
" if \"something\" in x[\"text\"].data()[\"value\"]:\n",
" return False\n",
"\n",
" # filter based on path e.g. extension\n",
" metadata = x[\"metadata\"].data()[\"value\"]\n",
" return \"only_this\" in metadata[\"source\"] or \"also_that\" in metadata[\"source\"]\n",
"\n",
"\n",
"### turn on below for custom filtering\n",
"# retriever.search_kwargs['filter'] = filter"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain.chat_models import ChatOpenAI\n",
"from langchain.chains import ConversationalRetrievalChain\n",
"\n",
"model = ChatOpenAI(model_name=\"gpt-3.5-turbo\") # 'ada' 'gpt-3.5-turbo' 'gpt-4',\n",
"qa = ConversationalRetrievalChain.from_llm(model, retriever=retriever)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"questions = [\n",
" \"What is the class hierarchy?\",\n",
" # \"What classes are derived from the Chain class?\",\n",
" # \"What classes and functions in the ./langchain/utilities/ forlder are not covered by unit tests?\",\n",
" # \"What one improvement do you propose in code in relation to the class herarchy for the Chain class?\",\n",
"]\n",
"chat_history = []\n",
"\n",
"for question in questions:\n",
" result = qa({\"question\": question, \"chat_history\": chat_history})\n",
" chat_history.append((question, result[\"answer\"]))\n",
" print(f\"-> **Question**: {question} \\n\")\n",
" print(f\"**Answer**: {result['answer']} \\n\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"-> **Question**: What is the class hierarchy? \n",
"\n",
"**Answer**: There are several class hierarchies in the provided code, so I'll list a few:\n",
"\n",
"1. `BaseModel` -> `ConstitutionalPrinciple`: `ConstitutionalPrinciple` is a subclass of `BaseModel`.\n",
"2. `BasePromptTemplate` -> `StringPromptTemplate`, `AIMessagePromptTemplate`, `BaseChatPromptTemplate`, `ChatMessagePromptTemplate`, `ChatPromptTemplate`, `HumanMessagePromptTemplate`, `MessagesPlaceholder`, `SystemMessagePromptTemplate`, `FewShotPromptTemplate`, `FewShotPromptWithTemplates`, `Prompt`, `PromptTemplate`: All of these classes are subclasses of `BasePromptTemplate`.\n",
"3. `APIChain`, `Chain`, `MapReduceDocumentsChain`, `MapRerankDocumentsChain`, `RefineDocumentsChain`, `StuffDocumentsChain`, `HypotheticalDocumentEmbedder`, `LLMChain`, `LLMBashChain`, `LLMCheckerChain`, `LLMMathChain`, `LLMRequestsChain`, `PALChain`, `QAWithSourcesChain`, `VectorDBQAWithSourcesChain`, `VectorDBQA`, `SQLDatabaseChain`: All of these classes are subclasses of `Chain`.\n",
"4. `BaseLoader`: `BaseLoader` is a subclass of `ABC`.\n",
"5. `BaseTracer` -> `ChainRun`, `LLMRun`, `SharedTracer`, `ToolRun`, `Tracer`, `TracerException`, `TracerSession`: All of these classes are subclasses of `BaseTracer`.\n",
"6. `OpenAIEmbeddings`, `HuggingFaceEmbeddings`, `CohereEmbeddings`, `JinaEmbeddings`, `LlamaCppEmbeddings`, `HuggingFaceHubEmbeddings`, `TensorflowHubEmbeddings`, `SagemakerEndpointEmbeddings`, `HuggingFaceInstructEmbeddings`, `SelfHostedEmbeddings`, `SelfHostedHuggingFaceEmbeddings`, `SelfHostedHuggingFaceInstructEmbeddings`, `FakeEmbeddings`, `AlephAlphaAsymmetricSemanticEmbedding`, `AlephAlphaSymmetricSemanticEmbedding`: All of these classes are subclasses of `BaseLLM`. \n",
"\n",
"\n",
"-> **Question**: What classes are derived from the Chain class? \n",
"\n",
"**Answer**: There are multiple classes that are derived from the Chain class. Some of them are:\n",
"- APIChain\n",
"- AnalyzeDocumentChain\n",
"- ChatVectorDBChain\n",
"- CombineDocumentsChain\n",
"- ConstitutionalChain\n",
"- ConversationChain\n",
"- GraphQAChain\n",
"- HypotheticalDocumentEmbedder\n",
"- LLMChain\n",
"- LLMCheckerChain\n",
"- LLMRequestsChain\n",
"- LLMSummarizationCheckerChain\n",
"- MapReduceChain\n",
"- OpenAPIEndpointChain\n",
"- PALChain\n",
"- QAWithSourcesChain\n",
"- RetrievalQA\n",
"- RetrievalQAWithSourcesChain\n",
"- SequentialChain\n",
"- SQLDatabaseChain\n",
"- TransformChain\n",
"- VectorDBQA\n",
"- VectorDBQAWithSourcesChain\n",
"\n",
"There might be more classes that are derived from the Chain class as it is possible to create custom classes that extend the Chain class.\n",
"\n",
"\n",
"-> **Question**: What classes and functions in the ./langchain/utilities/ forlder are not covered by unit tests? \n",
"\n",
"**Answer**: All classes and functions in the `./langchain/utilities/` folder seem to have unit tests written for them. \n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

View File

@@ -0,0 +1,26 @@
# Code Understanding
Overview
LangChain is a useful tool designed to parse GitHub code repositories. By leveraging VectorStores, Conversational RetrieverChain, and GPT-4, it can answer questions in the context of an entire GitHub repository or generate new code. This documentation page outlines the essential components of the system and guides using LangChain for better code comprehension, contextual question answering, and code generation in GitHub repositories.
## Conversational Retriever Chain
Conversational RetrieverChain is a retrieval-focused system that interacts with the data stored in a VectorStore. Utilizing advanced techniques, like context-aware filtering and ranking, it retrieves the most relevant code snippets and information for a given user query. Conversational RetrieverChain is engineered to deliver high-quality, pertinent results while considering conversation history and context.
LangChain Workflow for Code Understanding and Generation
1. Index the code base: Clone the target repository, load all files within, chunk the files, and execute the indexing process. Optionally, you can skip this step and use an already indexed dataset.
2. Embedding and Code Store: Code snippets are embedded using a code-aware embedding model and stored in a VectorStore.
Query Understanding: GPT-4 processes user queries, grasping the context and extracting relevant details.
3. Construct the Retriever: Conversational RetrieverChain searches the VectorStore to identify the most relevant code snippets for a given query.
4. Build the Conversational Chain: Customize the retriever settings and define any user-defined filters as needed.
5. Ask questions: Define a list of questions to ask about the codebase, and then use the ConversationalRetrievalChain to generate context-aware answers. The LLM (GPT-4) generates comprehensive, context-aware answers based on retrieved code snippets and conversation history.
The full tutorial is available below.
- [Twitter the-algorithm codebase analysis with Deep Lake](code/twitter-the-algorithm-analysis-deeplake.html): A notebook walking through how to parse github source code and run queries conversation.
- [LangChain codebase analysis with Deep Lake](code/code-analysis-deeplake.html): A notebook walking through how to analyze and do question answering over THIS code base.

View File

@@ -0,0 +1,420 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Analysis of Twitter the-algorithm source code with LangChain, GPT4 and Deep Lake\n",
"In this tutorial, we are going to use Langchain + Deep Lake with GPT4 to analyze the code base of the twitter algorithm. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!python3 -m pip install --upgrade langchain deeplake openai tiktoken"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Define OpenAI embeddings, Deep Lake multi-modal vector store api and authenticate. For full documentation of Deep Lake please follow [docs](https://docs.activeloop.ai/) and [API reference](https://docs.deeplake.ai/en/latest/).\n",
"\n",
"Authenticate into Deep Lake if you want to create your own dataset and publish it. You can get an API key from the [platform](https://app.activeloop.ai)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import getpass\n",
"\n",
"from langchain.embeddings.openai import OpenAIEmbeddings\n",
"from langchain.vectorstores import DeepLake\n",
"\n",
"os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"OpenAI API Key:\")\n",
"os.environ[\"ACTIVELOOP_TOKEN\"] = getpass.getpass(\"Activeloop Token:\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"embeddings = OpenAIEmbeddings(disallowed_special=())"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"disallowed_special=() is required to avoid `Exception: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte` from tiktoken for some repositories"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1. Index the code base (optional)\n",
"You can directly skip this part and directly jump into using already indexed dataset. To begin with, first we will clone the repository, then parse and chunk the code base and use OpenAI indexing."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!git clone https://github.com/twitter/the-algorithm # replace any repository of your choice"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Load all files inside the repository"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from langchain.document_loaders import TextLoader\n",
"\n",
"root_dir = \"./the-algorithm\"\n",
"docs = []\n",
"for dirpath, dirnames, filenames in os.walk(root_dir):\n",
" for file in filenames:\n",
" try:\n",
" loader = TextLoader(os.path.join(dirpath, file), encoding=\"utf-8\")\n",
" docs.extend(loader.load_and_split())\n",
" except Exception as e:\n",
" pass"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then, chunk the files"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain.text_splitter import CharacterTextSplitter\n",
"\n",
"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
"texts = text_splitter.split_documents(docs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Execute the indexing. This will take about ~4 mins to compute embeddings and upload to Activeloop. You can then publish the dataset to be public."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"username = \"davitbun\" # replace with your username from app.activeloop.ai\n",
"db = DeepLake(\n",
" dataset_path=f\"hub://{username}/twitter-algorithm\",\n",
" embedding_function=embeddings,\n",
" public=True,\n",
") # dataset would be publicly available\n",
"db.add_documents(texts)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2. Question Answering on Twitter algorithm codebase\n",
"First load the dataset, construct the retriever, then construct the Conversational Chain"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"db = DeepLake(\n",
" dataset_path=\"hub://davitbun/twitter-algorithm\",\n",
" read_only=True,\n",
" embedding_function=embeddings,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"retriever = db.as_retriever()\n",
"retriever.search_kwargs[\"distance_metric\"] = \"cos\"\n",
"retriever.search_kwargs[\"fetch_k\"] = 100\n",
"retriever.search_kwargs[\"maximal_marginal_relevance\"] = True\n",
"retriever.search_kwargs[\"k\"] = 10"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also specify user defined functions using [Deep Lake filters](https://docs.deeplake.ai/en/latest/deeplake.core.dataset.html#deeplake.core.dataset.Dataset.filter)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def filter(x):\n",
" # filter based on source code\n",
" if \"com.google\" in x[\"text\"].data()[\"value\"]:\n",
" return False\n",
"\n",
" # filter based on path e.g. extension\n",
" metadata = x[\"metadata\"].data()[\"value\"]\n",
" return \"scala\" in metadata[\"source\"] or \"py\" in metadata[\"source\"]\n",
"\n",
"\n",
"### turn on below for custom filtering\n",
"# retriever.search_kwargs['filter'] = filter"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain.chat_models import ChatOpenAI\n",
"from langchain.chains import ConversationalRetrievalChain\n",
"\n",
"model = ChatOpenAI(model_name=\"gpt-3.5-turbo\") # switch to 'gpt-4'\n",
"qa = ConversationalRetrievalChain.from_llm(model, retriever=retriever)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"questions = [\n",
" \"What does favCountParams do?\",\n",
" \"is it Likes + Bookmarks, or not clear from the code?\",\n",
" \"What are the major negative modifiers that lower your linear ranking parameters?\",\n",
" \"How do you get assigned to SimClusters?\",\n",
" \"What is needed to migrate from one SimClusters to another SimClusters?\",\n",
" \"How much do I get boosted within my cluster?\",\n",
" \"How does Heavy ranker work. what are its main inputs?\",\n",
" \"How can one influence Heavy ranker?\",\n",
" \"why threads and long tweets do so well on the platform?\",\n",
" \"Are thread and long tweet creators building a following that reacts to only threads?\",\n",
" \"Do you need to follow different strategies to get most followers vs to get most likes and bookmarks per tweet?\",\n",
" \"Content meta data and how it impacts virality (e.g. ALT in images).\",\n",
" \"What are some unexpected fingerprints for spam factors?\",\n",
" \"Is there any difference between company verified checkmarks and blue verified individual checkmarks?\",\n",
"]\n",
"chat_history = []\n",
"\n",
"for question in questions:\n",
" result = qa({\"question\": question, \"chat_history\": chat_history})\n",
" chat_history.append((question, result[\"answer\"]))\n",
" print(f\"-> **Question**: {question} \\n\")\n",
" print(f\"**Answer**: {result['answer']} \\n\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"-> **Question**: What does favCountParams do? \n",
"\n",
"**Answer**: `favCountParams` is an optional ThriftLinearFeatureRankingParams instance that represents the parameters related to the \"favorite count\" feature in the ranking process. It is used to control the weight of the favorite count feature while ranking tweets. The favorite count is the number of times a tweet has been marked as a favorite by users, and it is considered an important signal in the ranking of tweets. By using `favCountParams`, the system can adjust the importance of the favorite count while calculating the final ranking score of a tweet. \n",
"\n",
"-> **Question**: is it Likes + Bookmarks, or not clear from the code?\n",
"\n",
"**Answer**: From the provided code, it is not clear if the favorite count metric is determined by the sum of likes and bookmarks. The favorite count is mentioned in the code, but there is no explicit reference to how it is calculated in terms of likes and bookmarks. \n",
"\n",
"-> **Question**: What are the major negative modifiers that lower your linear ranking parameters?\n",
"\n",
"**Answer**: In the given code, major negative modifiers that lower the linear ranking parameters are:\n",
"\n",
"1. `scoringData.querySpecificScore`: This score adjustment is based on the query-specific information. If its value is negative, it will lower the linear ranking parameters.\n",
"\n",
"2. `scoringData.authorSpecificScore`: This score adjustment is based on the author-specific information. If its value is negative, it will also lower the linear ranking parameters.\n",
"\n",
"Please note that I cannot provide more information on the exact calculations of these negative modifiers, as the code for their determination is not provided. \n",
"\n",
"-> **Question**: How do you get assigned to SimClusters?\n",
"\n",
"**Answer**: The assignment to SimClusters occurs through a Metropolis-Hastings sampling-based community detection algorithm that is run on the Producer-Producer similarity graph. This graph is created by computing the cosine similarity scores between the users who follow each producer. The algorithm identifies communities or clusters of Producers with similar followers, and takes a parameter *k* for specifying the number of communities to be detected.\n",
"\n",
"After the community detection, different users and content are represented as sparse, interpretable vectors within these identified communities (SimClusters). The resulting SimClusters embeddings can be used for various recommendation tasks. \n",
"\n",
"-> **Question**: What is needed to migrate from one SimClusters to another SimClusters?\n",
"\n",
"**Answer**: To migrate from one SimClusters representation to another, you can follow these general steps:\n",
"\n",
"1. **Prepare the new representation**: Create the new SimClusters representation using any necessary updates or changes in the clustering algorithm, similarity measures, or other model parameters. Ensure that this new representation is properly stored and indexed as needed.\n",
"\n",
"2. **Update the relevant code and configurations**: Modify the relevant code and configuration files to reference the new SimClusters representation. This may involve updating paths or dataset names to point to the new representation, as well as changing code to use the new clustering method or similarity functions if applicable.\n",
"\n",
"3. **Test the new representation**: Before deploying the changes to production, thoroughly test the new SimClusters representation to ensure its effectiveness and stability. This may involve running offline jobs like candidate generation and label candidates, validating the output, as well as testing the new representation in the evaluation environment using evaluation tools like TweetSimilarityEvaluationAdhocApp.\n",
"\n",
"4. **Deploy the changes**: Once the new representation has been tested and validated, deploy the changes to production. This may involve creating a zip file, uploading it to the packer, and then scheduling it with Aurora. Be sure to monitor the system to ensure a smooth transition between representations and verify that the new representation is being used in recommendations as expected.\n",
"\n",
"5. **Monitor and assess the new representation**: After the new representation has been deployed, continue to monitor its performance and impact on recommendations. Take note of any improvements or issues that arise and be prepared to iterate on the new representation if needed. Always ensure that the results and performance metrics align with the system's goals and objectives. \n",
"\n",
"-> **Question**: How much do I get boosted within my cluster?\n",
"\n",
"**Answer**: It's not possible to determine the exact amount your content is boosted within your cluster in the SimClusters representation without specific data about your content and its engagement metrics. However, a combination of factors, such as the favorite score and follow score, alongside other engagement signals and SimCluster calculations, influence the boosting of content. \n",
"\n",
"-> **Question**: How does Heavy ranker work. what are its main inputs?\n",
"\n",
"**Answer**: The Heavy Ranker is a machine learning model that plays a crucial role in ranking and scoring candidates within the recommendation algorithm. Its primary purpose is to predict the likelihood of a user engaging with a tweet or connecting with another user on the platform.\n",
"\n",
"Main inputs to the Heavy Ranker consist of:\n",
"\n",
"1. Static Features: These are features that can be computed directly from a tweet at the time it's created, such as whether it has a URL, has cards, has quotes, etc. These features are produced by the Index Ingester as the tweets are generated and stored in the index.\n",
"\n",
"2. Real-time Features: These per-tweet features can change after the tweet has been indexed. They mostly consist of social engagements like retweet count, favorite count, reply count, and some spam signals that are computed with later activities. The Signal Ingester, which is part of a Heron topology, processes multiple event streams to collect and compute these real-time features.\n",
"\n",
"3. User Table Features: These per-user features are obtained from the User Table Updater that processes a stream written by the user service. This input is used to store sparse real-time user information, which is later propagated to the tweet being scored by looking up the author of the tweet.\n",
"\n",
"4. Search Context Features: These features represent the context of the current searcher, like their UI language, their content consumption, and the current time (implied). They are combined with Tweet Data to compute some of the features used in scoring.\n",
"\n",
"These inputs are then processed by the Heavy Ranker to score and rank candidates based on their relevance and likelihood of engagement by the user. \n",
"\n",
"-> **Question**: How can one influence Heavy ranker?\n",
"\n",
"**Answer**: To influence the Heavy Ranker's output or ranking of content, consider the following actions:\n",
"\n",
"1. Improve content quality: Create high-quality and engaging content that is relevant, informative, and valuable to users. High-quality content is more likely to receive positive user engagement, which the Heavy Ranker considers when ranking content.\n",
"\n",
"2. Increase user engagement: Encourage users to interact with content through likes, retweets, replies, and comments. Higher engagement levels can lead to better ranking in the Heavy Ranker's output.\n",
"\n",
"3. Optimize your user profile: A user's reputation, based on factors such as their follower count and follower-to-following ratio, may impact the ranking of their content. Maintain a good reputation by following relevant users, keeping a reasonable follower-to-following ratio and engaging with your followers.\n",
"\n",
"4. Enhance content discoverability: Use relevant keywords, hashtags, and mentions in your tweets, making it easier for users to find and engage with your content. This increased discoverability may help improve the ranking of your content by the Heavy Ranker.\n",
"\n",
"5. Leverage multimedia content: Experiment with different content formats, such as videos, images, and GIFs, which may capture users' attention and increase engagement, resulting in better ranking by the Heavy Ranker.\n",
"\n",
"6. User feedback: Monitor and respond to feedback for your content. Positive feedback may improve your ranking, while negative feedback provides an opportunity to learn and improve.\n",
"\n",
"Note that the Heavy Ranker uses a combination of machine learning models and various features to rank the content. While the above actions may help influence the ranking, there are no guarantees as the ranking process is determined by a complex algorithm, which evolves over time. \n",
"\n",
"-> **Question**: why threads and long tweets do so well on the platform?\n",
"\n",
"**Answer**: Threads and long tweets perform well on the platform for several reasons:\n",
"\n",
"1. **More content and context**: Threads and long tweets provide more information and context about a topic, which can make the content more engaging and informative for users. People tend to appreciate a well-structured and detailed explanation of a subject or a story, and threads and long tweets can do that effectively.\n",
"\n",
"2. **Increased user engagement**: As threads and long tweets provide more content, they also encourage users to engage with the tweets through replies, retweets, and likes. This increased engagement can lead to better visibility of the content, as the Twitter algorithm considers user engagement when ranking and surfacing tweets.\n",
"\n",
"3. **Narrative structure**: Threads enable users to tell stories or present arguments in a step-by-step manner, making the information more accessible and easier to follow. This narrative structure can capture users' attention and encourage them to read through the entire thread and interact with the content.\n",
"\n",
"4. **Expanded reach**: When users engage with a thread, their interactions can bring the content to the attention of their followers, helping to expand the reach of the thread. This increased visibility can lead to more interactions and higher performance for the threaded tweets.\n",
"\n",
"5. **Higher content quality**: Generally, threads and long tweets require more thought and effort to create, which may lead to higher quality content. Users are more likely to appreciate and interact with high-quality, well-reasoned content, further improving the performance of these tweets within the platform.\n",
"\n",
"Overall, threads and long tweets perform well on Twitter because they encourage user engagement and provide a richer, more informative experience that users find valuable. \n",
"\n",
"-> **Question**: Are thread and long tweet creators building a following that reacts to only threads?\n",
"\n",
"**Answer**: Based on the provided code and context, there isn't enough information to conclude if the creators of threads and long tweets primarily build a following that engages with only thread-based content. The code provided is focused on Twitter's recommendation and ranking algorithms, as well as infrastructure components like Kafka, partitions, and the Follow Recommendations Service (FRS). To answer your question, data analysis of user engagement and results of specific edge cases would be required. \n",
"\n",
"-> **Question**: Do you need to follow different strategies to get most followers vs to get most likes and bookmarks per tweet?\n",
"\n",
"**Answer**: Yes, different strategies need to be followed to maximize the number of followers compared to maximizing likes and bookmarks per tweet. While there may be some overlap in the approaches, they target different aspects of user engagement.\n",
"\n",
"Maximizing followers: The primary focus is on growing your audience on the platform. Strategies include:\n",
"\n",
"1. Consistently sharing high-quality content related to your niche or industry.\n",
"2. Engaging with others on the platform by replying, retweeting, and mentioning other users.\n",
"3. Using relevant hashtags and participating in trending conversations.\n",
"4. Collaborating with influencers and other users with a large following.\n",
"5. Posting at optimal times when your target audience is most active.\n",
"6. Optimizing your profile by using a clear profile picture, catchy bio, and relevant links.\n",
"\n",
"Maximizing likes and bookmarks per tweet: The focus is on creating content that resonates with your existing audience and encourages engagement. Strategies include:\n",
"\n",
"1. Crafting engaging and well-written tweets that encourage users to like or save them.\n",
"2. Incorporating visually appealing elements, such as images, GIFs, or videos, that capture attention.\n",
"3. Asking questions, sharing opinions, or sparking conversations that encourage users to engage with your tweets.\n",
"4. Using analytics to understand the type of content that resonates with your audience and tailoring your tweets accordingly.\n",
"5. Posting a mix of educational, entertaining, and promotional content to maintain variety and interest.\n",
"6. Timing your tweets strategically to maximize engagement, likes, and bookmarks per tweet.\n",
"\n",
"Both strategies can overlap, and you may need to adapt your approach by understanding your target audience's preferences and analyzing your account's performance. However, it's essential to recognize that maximizing followers and maximizing likes and bookmarks per tweet have different focuses and require specific strategies. \n",
"\n",
"-> **Question**: Content meta data and how it impacts virality (e.g. ALT in images).\n",
"\n",
"**Answer**: There is no direct information in the provided context about how content metadata, such as ALT text in images, impacts the virality of a tweet or post. However, it's worth noting that including ALT text can improve the accessibility of your content for users who rely on screen readers, which may lead to increased engagement for a broader audience. Additionally, metadata can be used in search engine optimization, which might improve the visibility of the content, but the context provided does not mention any specific correlation with virality. \n",
"\n",
"-> **Question**: What are some unexpected fingerprints for spam factors?\n",
"\n",
"**Answer**: In the provided context, an unusual indicator of spam factors is when a tweet contains a non-media, non-news link. If the tweet has a link but does not have an image URL, video URL, or news URL, it is considered a potential spam vector, and a threshold for user reputation (tweepCredThreshold) is set to MIN_TWEEPCRED_WITH_LINK.\n",
"\n",
"While this rule may not cover all possible unusual spam indicators, it is derived from the specific codebase and logic shared in the context. \n",
"\n",
"-> **Question**: Is there any difference between company verified checkmarks and blue verified individual checkmarks?\n",
"\n",
"**Answer**: Yes, there is a distinction between the verified checkmarks for companies and blue verified checkmarks for individuals. The code snippet provided mentions \"Blue-verified account boost\" which indicates that there is a separate category for blue verified accounts. Typically, blue verified checkmarks are used to indicate notable individuals, while verified checkmarks are for companies or organizations. \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}