Harrison/docs cleanup (#2633)

Harrison Chase
2023-04-09 12:55:22 -07:00
committed by GitHub
parent e57f0e38c1
commit 7aba18ea77
6 changed files with 416 additions and 310 deletions

@@ -0,0 +1,409 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "c7ad998d",
"metadata": {},
"source": [
"# Natural Language APIs\n",
"\n",
"Natural Language API Toolkits (NLAToolkits) permit LangChain Agents to efficiently plan and combine calls across endpoints. This notebook demonstrates a sample composition of the Speak, Klarna, and Spoonacluar APIs.\n",
"\n",
"For a detailed walkthrough of the OpenAPI chains wrapped within the NLAToolkit, see the [OpenAPI Operation Chain](openapi.ipynb) notebook.\n",
"\n",
"### First, import dependencies and load the LLM"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "6593f793",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from typing import List, Optional\n",
"from langchain.chains import LLMChain\n",
"from langchain.llms import OpenAI\n",
"from langchain.prompts import PromptTemplate\n",
"from langchain.requests import Requests\n",
"from langchain.tools import APIOperation, OpenAPISpec\n",
"from langchain.agents import AgentType, Tool, initialize_agent\n",
"from langchain.agents.agent_toolkits import NLAToolkit"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "dd720860",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Select the LLM to use. Here, we use text-davinci-003\n",
"llm = OpenAI(temperature=0, max_tokens=700) # You can swap between different core LLM's here."
]
},
{
"cell_type": "markdown",
"id": "4cadac9d",
"metadata": {
"tags": []
},
"source": [
"### Next, load the Natural Language API Toolkits"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "6b208ab0",
"metadata": {
"scrolled": true,
"tags": []
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Attempting to load an OpenAPI 3.0.1 spec. This may result in degraded performance. Convert your OpenAPI spec to 3.1.* spec for better support.\n",
"Attempting to load an OpenAPI 3.0.1 spec. This may result in degraded performance. Convert your OpenAPI spec to 3.1.* spec for better support.\n",
"Attempting to load an OpenAPI 3.0.1 spec. This may result in degraded performance. Convert your OpenAPI spec to 3.1.* spec for better support.\n"
]
}
],
"source": [
"speak_toolkit = NLAToolkit.from_llm_and_url(llm, \"https://api.speak.com/openapi.yaml\")\n",
"klarna_toolkit = NLAToolkit.from_llm_and_url(llm, \"https://www.klarna.com/us/shopping/public/openai/v0/api-docs/\")"
]
},
{
"cell_type": "markdown",
"id": "16c7336f",
"metadata": {},
"source": [
"### Create the Agent"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "730a0dc2-b4d0-46d5-a1e9-583803220973",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Slightly tweak the instructions from the default agent\n",
"openapi_format_instructions = \"\"\"Use the following format:\n",
"\n",
"Question: the input question you must answer\n",
"Thought: you should always think about what to do\n",
"Action: the action to take, should be one of [{tool_names}]\n",
"Action Input: what to instruct the AI Action representative.\n",
"Observation: The Agent's response\n",
"... (this Thought/Action/Action Input/Observation can repeat N times)\n",
"Thought: I now know the final answer. User can't see any of my observations, API responses, links, or tools.\n",
"Final Answer: the final answer to the original input question with the right amount of detail\n",
"\n",
"When responding with your Final Answer, remember that the person you are responding to CANNOT see any of your Thought/Action/Action Input/Observations, so if there is any relevant information there you need to include it explicitly in your response.\"\"\""
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "40a979c3",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"natural_language_tools = speak_toolkit.get_tools() + klarna_toolkit.get_tools()\n",
"mrkl = initialize_agent(natural_language_tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, \n",
" verbose=True, agent_kwargs={\"format_instructions\":openapi_format_instructions})"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "794380ba",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"\u001b[1m> Entering new AgentExecutor chain...\u001b[0m\n",
"\u001b[32;1m\u001b[1;3m I need to find out what kind of Italian clothes are available\n",
"Action: Open_AI_Klarna_product_Api.productsUsingGET\n",
"Action Input: Italian clothes\u001b[0m\n",
"Observation: \u001b[31;1m\u001b[1;3mThe API response contains two products from the Alé brand in Italian Blue. The first is the Alé Colour Block Short Sleeve Jersey Men - Italian Blue, which costs $86.49, and the second is the Alé Dolid Flash Jersey Men - Italian Blue, which costs $40.00.\u001b[0m\n",
"Thought:\u001b[32;1m\u001b[1;3m I now know what kind of Italian clothes are available and how much they cost.\n",
"Final Answer: You can buy two products from the Alé brand in Italian Blue for your end of year party. The Alé Colour Block Short Sleeve Jersey Men - Italian Blue costs $86.49, and the Alé Dolid Flash Jersey Men - Italian Blue costs $40.00.\u001b[0m\n",
"\n",
"\u001b[1m> Finished chain.\u001b[0m\n"
]
},
{
"data": {
"text/plain": [
"'You can buy two products from the Alé brand in Italian Blue for your end of year party. The Alé Colour Block Short Sleeve Jersey Men - Italian Blue costs $86.49, and the Alé Dolid Flash Jersey Men - Italian Blue costs $40.00.'"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mrkl.run(\"I have an end of year party for my Italian class and have to buy some Italian clothes for it\")"
]
},
{
"cell_type": "markdown",
"id": "c61d92a8",
"metadata": {},
"source": [
"### Using Auth + Adding more Endpoints\n",
"\n",
"Some endpoints may require user authentication via things like access tokens. Here we show how to pass in the authentication information via the `Requests` wrapper object.\n",
"\n",
"Since each NLATool exposes a concisee natural language interface to its wrapped API, the top level conversational agent has an easier job incorporating each endpoint to satisfy a user's request."
]
},
{
"cell_type": "markdown",
"id": "f0d132cc",
"metadata": {},
"source": [
"**Adding the Spoonacular endpoints.**\n",
"\n",
"1. Go to the [Spoonacular API Console](https://spoonacular.com/food-api/console#Profile) and make a free account.\n",
"2. Click on `Profile` and copy your API key below."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "c2368b9c",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"spoonacular_api_key = \"\" # Copy from the API Console"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "fbd97c28-fef6-41b5-9600-a9611a32bfb3",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Attempting to load an OpenAPI 3.0.0 spec. This may result in degraded performance. Convert your OpenAPI spec to 3.1.* spec for better support.\n",
"Unsupported APIPropertyLocation \"header\" for parameter Content-Type. Valid values are ['path', 'query'] Ignoring optional parameter\n",
"Unsupported APIPropertyLocation \"header\" for parameter Accept. Valid values are ['path', 'query'] Ignoring optional parameter\n",
"Unsupported APIPropertyLocation \"header\" for parameter Content-Type. Valid values are ['path', 'query'] Ignoring optional parameter\n",
"Unsupported APIPropertyLocation \"header\" for parameter Accept. Valid values are ['path', 'query'] Ignoring optional parameter\n",
"Unsupported APIPropertyLocation \"header\" for parameter Content-Type. Valid values are ['path', 'query'] Ignoring optional parameter\n",
"Unsupported APIPropertyLocation \"header\" for parameter Accept. Valid values are ['path', 'query'] Ignoring optional parameter\n",
"Unsupported APIPropertyLocation \"header\" for parameter Content-Type. Valid values are ['path', 'query'] Ignoring optional parameter\n",
"Unsupported APIPropertyLocation \"header\" for parameter Accept. Valid values are ['path', 'query'] Ignoring optional parameter\n",
"Unsupported APIPropertyLocation \"header\" for parameter Content-Type. Valid values are ['path', 'query'] Ignoring optional parameter\n",
"Unsupported APIPropertyLocation \"header\" for parameter Content-Type. Valid values are ['path', 'query'] Ignoring optional parameter\n",
"Unsupported APIPropertyLocation \"header\" for parameter Content-Type. Valid values are ['path', 'query'] Ignoring optional parameter\n",
"Unsupported APIPropertyLocation \"header\" for parameter Content-Type. Valid values are ['path', 'query'] Ignoring optional parameter\n",
"Unsupported APIPropertyLocation \"header\" for parameter Accept. Valid values are ['path', 'query'] Ignoring optional parameter\n",
"Unsupported APIPropertyLocation \"header\" for parameter Content-Type. Valid values are ['path', 'query'] Ignoring optional parameter\n",
"Unsupported APIPropertyLocation \"header\" for parameter Accept. Valid values are ['path', 'query'] Ignoring optional parameter\n",
"Unsupported APIPropertyLocation \"header\" for parameter Accept. Valid values are ['path', 'query'] Ignoring optional parameter\n",
"Unsupported APIPropertyLocation \"header\" for parameter Accept. Valid values are ['path', 'query'] Ignoring optional parameter\n",
"Unsupported APIPropertyLocation \"header\" for parameter Content-Type. Valid values are ['path', 'query'] Ignoring optional parameter\n"
]
}
],
"source": [
"requests = Requests(headers={\"x-api-key\": spoonacular_api_key})\n",
"spoonacular_toolkit = NLAToolkit.from_llm_and_url(\n",
" llm, \n",
" \"https://spoonacular.com/application/frontend/downloads/spoonacular-openapi-3.json\",\n",
" requests=requests,\n",
" max_text_length=1800, # If you want to truncate the response text\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "81a6edac",
"metadata": {
"scrolled": true,
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"34 tools loaded.\n"
]
}
],
"source": [
"natural_language_api_tools = (speak_toolkit.get_tools() \n",
" + klarna_toolkit.get_tools() \n",
" + spoonacular_toolkit.get_tools()[:30]\n",
" )\n",
"print(f\"{len(natural_language_api_tools)} tools loaded.\")"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "831f772d-5cd1-4467-b494-a3172af2ff48",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Create an agent with the new tools\n",
"mrkl = initialize_agent(natural_language_api_tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, \n",
" verbose=True, agent_kwargs={\"format_instructions\":openapi_format_instructions})"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "0385e04b",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Make the query more complex!\n",
"user_input = (\n",
" \"I'm learning Italian, and my language class is having an end of year party... \"\n",
" \" Could you help me find an Italian outfit to wear and\"\n",
" \" an appropriate recipe to prepare so I can present for the class in Italian?\"\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "6ebd3f55",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"\u001b[1m> Entering new AgentExecutor chain...\u001b[0m\n",
"\u001b[32;1m\u001b[1;3m I need to find a recipe and an outfit that is Italian-themed.\n",
"Action: spoonacular_API.searchRecipes\n",
"Action Input: Italian\u001b[0m\n",
"Observation: \u001b[36;1m\u001b[1;3mThe API response contains 10 Italian recipes, including Turkey Tomato Cheese Pizza, Broccolini Quinoa Pilaf, Bruschetta Style Pork & Pasta, Salmon Quinoa Risotto, Italian Tuna Pasta, Roasted Brussels Sprouts With Garlic, Asparagus Lemon Risotto, Italian Steamed Artichokes, Crispy Italian Cauliflower Poppers Appetizer, and Pappa Al Pomodoro.\u001b[0m\n",
"Thought:\u001b[32;1m\u001b[1;3m I need to find an Italian-themed outfit.\n",
"Action: Open_AI_Klarna_product_Api.productsUsingGET\n",
"Action Input: Italian\u001b[0m\n",
"Observation: \u001b[31;1m\u001b[1;3mI found 10 products related to 'Italian' in the API response. These products include Italian Gold Sparkle Perfectina Necklace - Gold, Italian Design Miami Cuban Link Chain Necklace - Gold, Italian Gold Miami Cuban Link Chain Necklace - Gold, Italian Gold Herringbone Necklace - Gold, Italian Gold Claddagh Ring - Gold, Italian Gold Herringbone Chain Necklace - Gold, Garmin QuickFit 22mm Italian Vacchetta Leather Band, Macy's Italian Horn Charm - Gold, Dolce & Gabbana Light Blue Italian Love Pour Homme EdT 1.7 fl oz.\u001b[0m\n",
"Thought:\u001b[32;1m\u001b[1;3m I now know the final answer.\n",
"Final Answer: To present for your Italian language class, you could wear an Italian Gold Sparkle Perfectina Necklace - Gold, an Italian Design Miami Cuban Link Chain Necklace - Gold, or an Italian Gold Miami Cuban Link Chain Necklace - Gold. For a recipe, you could make Turkey Tomato Cheese Pizza, Broccolini Quinoa Pilaf, Bruschetta Style Pork & Pasta, Salmon Quinoa Risotto, Italian Tuna Pasta, Roasted Brussels Sprouts With Garlic, Asparagus Lemon Risotto, Italian Steamed Artichokes, Crispy Italian Cauliflower Poppers Appetizer, or Pappa Al Pomodoro.\u001b[0m\n",
"\n",
"\u001b[1m> Finished chain.\u001b[0m\n"
]
},
{
"data": {
"text/plain": [
"'To present for your Italian language class, you could wear an Italian Gold Sparkle Perfectina Necklace - Gold, an Italian Design Miami Cuban Link Chain Necklace - Gold, or an Italian Gold Miami Cuban Link Chain Necklace - Gold. For a recipe, you could make Turkey Tomato Cheese Pizza, Broccolini Quinoa Pilaf, Bruschetta Style Pork & Pasta, Salmon Quinoa Risotto, Italian Tuna Pasta, Roasted Brussels Sprouts With Garlic, Asparagus Lemon Risotto, Italian Steamed Artichokes, Crispy Italian Cauliflower Poppers Appetizer, or Pappa Al Pomodoro.'"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mrkl.run(user_input)"
]
},
{
"cell_type": "markdown",
"id": "a2959462",
"metadata": {},
"source": [
"## Thank you!"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "6fcda5f0",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"\"In Italian, you can say 'Buon appetito' to someone to wish them to enjoy their meal. This phrase is commonly used in Italy when someone is about to eat, often at the beginning of a meal. It's similar to saying 'Bon appétit' in French or 'Guten Appetit' in German.\""
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"natural_language_api_tools[1].run(\"Tell the LangChain audience to 'enjoy the meal' in Italian, please!\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ab366dc0",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
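
Note on the agent prompt in the notebook above: the `{tool_names}` placeholder in `openapi_format_instructions` has to be filled with the names of the loaded tools before the prompt reaches the LLM. The following is a minimal sketch of that substitution, assuming plain `str.format` with a comma-joined list of names. The helper `render_format_instructions` is hypothetical and not part of LangChain; the two tool names are taken from the agent traces in the notebook output.

```python
from typing import List

# Shortened copy of the format instructions used in the notebook.
FORMAT_INSTRUCTIONS = """Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: what to instruct the AI Action representative.
Observation: The Agent's response"""


def render_format_instructions(template: str, names: List[str]) -> str:
    """Fill the {tool_names} placeholder with a comma-separated list of names.

    Hypothetical helper for illustration only.
    """
    return template.format(tool_names=", ".join(names))


# Tool names as they appear in the agent traces of the notebook output.
tool_names = [
    "Open_AI_Klarna_product_Api.productsUsingGET",
    "spoonacular_API.searchRecipes",
]

rendered = render_format_instructions(FORMAT_INSTRUCTIONS, tool_names)
print(rendered)
```

With more tools loaded, the rendered `Action:` line grows accordingly, which is one reason the notebook caps the Spoonacular tools with `[:30]`.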

@@ -1,292 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "c7ad998d",
"metadata": {},
"source": [
"# Multi-hop Task Execution with the NLAToolkit\n",
"\n",
"Natural Language API Toolkits (NLAToolkits) permit LangChain Agents to efficiently plan and combine calls across endpoints. This notebook demonstrates a sample composition of the Speak, Klarna, and Spoonacluar APIs.\n",
"\n",
"For a detailed walkthrough of the OpenAPI chains wrapped within the NLAToolkit, see the [OpenAPI Operation Chain](openapi.ipynb) notebook.\n",
"\n",
"### First, import dependencies and load the LLM"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6593f793",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from typing import List, Optional\n",
"from langchain.chains import LLMChain\n",
"from langchain.llms import OpenAI\n",
"from langchain.prompts import PromptTemplate\n",
"from langchain.requests import Requests\n",
"from langchain.tools import APIOperation, OpenAPISpec\n",
"from langchain.agents import AgentType, Tool, initialize_agent\n",
"from langchain.agents.agent_toolkits import NLAToolkit"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dd720860",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Select the LLM to use. Here, we use text-davinci-003\n",
"llm = OpenAI(temperature=0, max_tokens=700) # You can swap between different core LLM's here."
]
},
{
"cell_type": "markdown",
"id": "4cadac9d",
"metadata": {
"tags": []
},
"source": [
"### Next, load the Natural Language API Toolkits"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6b208ab0",
"metadata": {
"scrolled": true,
"tags": []
},
"outputs": [],
"source": [
"speak_toolkit = NLAToolkit.from_llm_and_url(llm, \"https://api.speak.com/openapi.yaml\")\n",
"klarna_toolkit = NLAToolkit.from_llm_and_url(llm, \"https://www.klarna.com/us/shopping/public/openai/v0/api-docs/\")"
]
},
{
"cell_type": "markdown",
"id": "16c7336f",
"metadata": {},
"source": [
"### Create the Agent"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "730a0dc2-b4d0-46d5-a1e9-583803220973",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Slightly tweak the instructions from the default agent\n",
"openapi_format_instructions = \"\"\"Use the following format:\n",
"\n",
"Question: the input question you must answer\n",
"Thought: you should always think about what to do\n",
"Action: the action to take, should be one of [{tool_names}]\n",
"Action Input: what to instruct the AI Action representative.\n",
"Observation: The Agent's response\n",
"... (this Thought/Action/Action Input/Observation can repeat N times)\n",
"Thought: I now know the final answer. User can't see any of my observations, API responses, links, or tools.\n",
"Final Answer: the final answer to the original input question with the right amount of detail\"\"\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "40a979c3",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"natural_language_tools = speak_toolkit.get_tools() + klarna_toolkit.get_tools()\n",
"mrkl = initialize_agent(natural_language_tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, \n",
" verbose=True, agent_kwargs={\"format_instructions\":openapi_format_instructions})"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "794380ba",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"mrkl.run(\"I have an end of year party for my Italian class and have to buy some Italian clothes for it\")"
]
},
{
"cell_type": "markdown",
"id": "c61d92a8",
"metadata": {},
"source": [
"### Using Auth + Adding more Endpoints\n",
"\n",
"Some endpoints may require user authentication via things like access tokens. Here we show how to pass in the authentication information via the `Requests` wrapper object.\n",
"\n",
"Since each NLATool exposes a concisee natural language interface to its wrapped API, the top level conversational agent has an easier job incorporating each endpoint to satisfy a user's request."
]
},
{
"cell_type": "markdown",
"id": "f0d132cc",
"metadata": {},
"source": [
"**Adding the Spoonacular endpoints.**\n",
"\n",
"1. Go to the [Spoonacular API Console](https://spoonacular.com/food-api/console#Profile) and make a free account.\n",
"2. Click on `Profile` and copy your API key below."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c2368b9c",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"spoonacular_api_key = \"\" # Copy from the API Console"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fbd97c28-fef6-41b5-9600-a9611a32bfb3",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"requests = Requests(headers={\"x-api-key\": spoonacular_api_key})\n",
"spoonacular_toolkit = NLAToolkit.from_llm_and_url(\n",
" llm, \n",
" \"https://spoonacular.com/application/frontend/downloads/spoonacular-openapi-3.json\",\n",
" requests=requests,\n",
" max_text_length=1800, # If you want to truncate the response text\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "81a6edac",
"metadata": {
"scrolled": true,
"tags": []
},
"outputs": [],
"source": [
"natural_language_api_tools = (speak_toolkit.get_tools() \n",
" + klarna_toolkit.get_tools() \n",
" + spoonacular_toolkit.get_tools()[:30]\n",
" )\n",
"print(f\"{len(natural_language_api_tools)} tools loaded.\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "831f772d-5cd1-4467-b494-a3172af2ff48",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Create an agent with the new tools\n",
"mrkl = initialize_agent(natural_language_api_tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, \n",
" verbose=True, agent_kwargs={\"format_instructions\":openapi_format_instructions})"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0385e04b",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Make the query more complex!\n",
"user_input = (\n",
" \"I'm learning Italian, and my language class is having an end of year party... \"\n",
" \" Could you help me find an Italian outfit to wear and\"\n",
" \" an appropriate recipe to prepare so I can present for the class in Italian?\"\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6ebd3f55",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"mrkl.run(user_input)"
]
},
{
"cell_type": "markdown",
"id": "a2959462",
"metadata": {},
"source": [
"## Thank you!"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6fcda5f0",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"natural_language_api_tools[1].run(\"Tell the LangChain audience to 'enjoy the meal' in Italian, please!\")['output']"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ab366dc0",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.2"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
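
Note on the "Using Auth" step in the notebook above: it leaves `spoonacular_api_key` as an empty string and passes `max_text_length=1800` to the toolkit. The following is a minimal sketch of both ideas in plain Python, assuming the key is read from an environment variable (the name `SPOONACULAR_API_KEY` is hypothetical) and assuming truncation is a simple character slice. Neither helper is part of LangChain.

```python
import os
from typing import Dict, Optional


def build_auth_headers(api_key: str) -> Dict[str, str]:
    """Build the header dict that the notebook passes to Requests(headers=...)."""
    if not api_key:
        # Fail early instead of sending unauthenticated requests.
        raise ValueError("Spoonacular API key is empty; copy it from the API Console.")
    return {"x-api-key": api_key}


def truncate_response(text: str, max_text_length: Optional[int] = 1800) -> str:
    """Cut a response body to at most max_text_length characters (None = no limit)."""
    if max_text_length is None:
        return text
    return text[:max_text_length]


# Hypothetical environment variable; a placeholder keeps the demo runnable.
api_key = os.environ.get("SPOONACULAR_API_KEY") or "demo-key"
headers = build_auth_headers(api_key)
print(sorted(headers))                      # header names only, never the key
print(len(truncate_response("x" * 2000)))   # length after truncation
```

Reading the key from the environment keeps it out of the committed notebook, and failing on an empty key surfaces the problem before any API call is made.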

@@ -1,442 +0,0 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Analysis of Twitter the-algorithm source code with LangChain, GPT4 and Deep Lake\n",
"In this tutorial, we are going to use Langchain + Deep Lake with GPT4 to analyze the code base of the twitter algorithm. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!python3 -m pip install --upgrade langchain deeplake openai tiktoken"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Define OpenAI embeddings, Deep Lake multi-modal vector store api and authenticate. For full documentation of Deep Lake please follow https://docs.activeloop.ai/ and API reference https://docs.deeplake.ai/en/latest/"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from langchain.embeddings.openai import OpenAIEmbeddings\n",
"from langchain.vectorstores import DeepLake\n",
"\n",
"os.environ['OPENAI_API_KEY']='sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'\n",
"embeddings = OpenAIEmbeddings()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Authenticate into Deep Lake if you want to create your own dataset and publish it. You can get an API key from the platform at https://app.activeloop.ai"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!activeloop login -t <TOKEN>"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1. Index the code base (optional)\n",
"You can directly skip this part and directly jump into using already indexed dataset. To begin with, first we will clone the repository, then parse and chunk the code base and use OpenAI indexing."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!git clone https://github.com/twitter/the-algorithm # replace any repository of your choice "
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Load all files inside the repository"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from langchain.document_loaders import TextLoader\n",
"\n",
"root_dir = './the-algorithm'\n",
"docs = []\n",
"for dirpath, dirnames, filenames in os.walk(root_dir):\n",
" for file in filenames:\n",
" try: \n",
" loader = TextLoader(os.path.join(dirpath, file), encoding='utf-8')\n",
" docs.extend(loader.load_and_split())\n",
" except Exception as e: \n",
" pass"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Then, chunk the files"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain.text_splitter import CharacterTextSplitter\n",
"\n",
"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
"texts = text_splitter.split_documents(docs)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Execute the indexing. This will take about ~4 mins to compute embeddings and upload to Activeloop. You can then publish the dataset to be public."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"db = DeepLake.from_documents(texts, embeddings, dataset_path=\"hub://davitbun/twitter-algorithm\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2. Question Answering on Twitter algorithm codebase\n",
"First load the dataset, construct the retriever, then construct the Conversational Chain"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/davitbun/twitter-algorithm\n",
"\n",
"hub://davitbun/twitter-algorithm loaded successfully.\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Deep Lake Dataset in hub://davitbun/twitter-algorithm already exists, loading from the storage\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dataset(path='hub://davitbun/twitter-algorithm', read_only=True, tensors=['embedding', 'ids', 'metadata', 'text'])\n",
"\n",
" tensor htype shape dtype compression\n",
" ------- ------- ------- ------- ------- \n",
" embedding generic (23152, 1536) float32 None \n",
" ids text (23152, 1) str None \n",
" metadata json (23152, 1) str None \n",
" text text (23152, 1) str None \n"
]
}
],
"source": [
"db = DeepLake(dataset_path=\"hub://davitbun/twitter-algorithm\", read_only=True, embedding_function=embeddings)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"\n",
"retriever = db.as_retriever()\n",
"retriever.search_kwargs['distance_metric'] = 'cos'\n",
"retriever.search_kwargs['fetch_k'] = 100\n",
"retriever.search_kwargs['maximal_marginal_relevance'] = True\n",
"retriever.search_kwargs['k'] = 20"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also specify user defined functions using [Deep Lake filters](https://docs.deeplake.ai/en/latest/deeplake.core.dataset.html#deeplake.core.dataset.Dataset.filter)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"def filter(x):\n",
" # filter based on source code\n",
" if 'com.google' in x['text'].data()['value']:\n",
" return False\n",
" \n",
" # filter based on path e.g. extension\n",
" metadata = x['metadata'].data()['value']\n",
" return 'scala' in metadata['source'] or 'py' in metadata['source']\n",
"\n",
"### turn on below for custom filtering\n",
"# retriever.search_kwargs['filter'] = filter"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"from langchain.chat_models import ChatOpenAI\n",
"from langchain.chains import ConversationalRetrievalChain\n",
"\n",
"model = ChatOpenAI(model='gpt-4') # 'gpt-3.5-turbo',\n",
"qa = ConversationalRetrievalChain.from_llm(model,retriever=retriever)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"questions = [\n",
" \"What does favCountParams do?\",\n",
" \"is it Likes + Bookmarks, or not clear from the code?\",\n",
" \"What are the major negative modifiers that lower your linear ranking parameters?\", \n",
" \"How do you get assigned to SimClusters?\",\n",
" \"What is needed to migrate from one SimClusters to another SimClusters?\",\n",
" \"How much do I get boosted within my cluster?\", \n",
" \"How does Heavy ranker work. what are its main inputs?\",\n",
" \"How can one influence Heavy ranker?\",\n",
" \"why threads and long tweets do so well on the platform?\",\n",
" \"Are thread and long tweet creators building a following that reacts to only threads?\",\n",
" \"Do you need to follow different strategies to get most followers vs to get most likes and bookmarks per tweet?\",\n",
" \"Content meta data and how it impacts virality (e.g. ALT in images).\",\n",
" \"What are some unexpected fingerprints for spam factors?\",\n",
" \"Is there any difference between company verified checkmarks and blue verified individual checkmarks?\",\n",
"] \n",
"chat_history = []\n",
"\n",
"for question in questions: \n",
" result = qa({\"question\": question, \"chat_history\": chat_history})\n",
" chat_history.append((question, result['answer']))\n",
" print(f\"-> **Question**: {question} \\n\")\n",
" print(f\"**Answer**: {result['answer']} \\n\")\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"-> **Question**: is it Likes + Bookmarks, or not clear from the code?\n",
"\n",
"**Answer**: From the provided code, it is not clear if the favorite count metric is determined by the sum of likes and bookmarks. The favorite count is mentioned in the code, but there is no explicit reference to how it is calculated in terms of likes and bookmarks. \n",
"\n",
"-> **Question**: What are the major negative modifiers that lower your linear ranking parameters?\n",
"\n",
"**Answer**: In the given code, major negative modifiers that lower the linear ranking parameters are:\n",
"\n",
"1. `scoringData.querySpecificScore`: This score adjustment is based on the query-specific information. If its value is negative, it will lower the linear ranking parameters.\n",
"\n",
"2. `scoringData.authorSpecificScore`: This score adjustment is based on the author-specific information. If its value is negative, it will also lower the linear ranking parameters.\n",
"\n",
"Please note that I cannot provide more information on the exact calculations of these negative modifiers, as the code for their determination is not provided. \n",
"\n",
"-> **Question**: How do you get assigned to SimClusters?\n",
"\n",
"**Answer**: The assignment to SimClusters occurs through a Metropolis-Hastings sampling-based community detection algorithm that is run on the Producer-Producer similarity graph. This graph is created by computing the cosine similarity scores between the users who follow each producer. The algorithm identifies communities or clusters of Producers with similar followers, and takes a parameter *k* for specifying the number of communities to be detected.\n",
"\n",
"After the community detection, different users and content are represented as sparse, interpretable vectors within these identified communities (SimClusters). The resulting SimClusters embeddings can be used for various recommendation tasks. \n",
"\n",
"-> **Question**: What is needed to migrate from one SimClusters to another SimClusters?\n",
"\n",
"**Answer**: To migrate from one SimClusters representation to another, you can follow these general steps:\n",
"\n",
"1. **Prepare the new representation**: Create the new SimClusters representation using any necessary updates or changes in the clustering algorithm, similarity measures, or other model parameters. Ensure that this new representation is properly stored and indexed as needed.\n",
"\n",
"2. **Update the relevant code and configurations**: Modify the relevant code and configuration files to reference the new SimClusters representation. This may involve updating paths or dataset names to point to the new representation, as well as changing code to use the new clustering method or similarity functions if applicable.\n",
"\n",
"3. **Test the new representation**: Before deploying the changes to production, thoroughly test the new SimClusters representation to ensure its effectiveness and stability. This may involve running offline jobs like candidate generation and label candidates, validating the output, as well as testing the new representation in the evaluation environment using evaluation tools like TweetSimilarityEvaluationAdhocApp.\n",
"\n",
"4. **Deploy the changes**: Once the new representation has been tested and validated, deploy the changes to production. This may involve creating a zip file, uploading it to the packer, and then scheduling it with Aurora. Be sure to monitor the system to ensure a smooth transition between representations and verify that the new representation is being used in recommendations as expected.\n",
"\n",
"5. **Monitor and assess the new representation**: After the new representation has been deployed, continue to monitor its performance and impact on recommendations. Take note of any improvements or issues that arise and be prepared to iterate on the new representation if needed. Always ensure that the results and performance metrics align with the system's goals and objectives. \n",
"\n",
"-> **Question**: How much do I get boosted within my cluster?\n",
"\n",
"**Answer**: It's not possible to determine the exact amount your content is boosted within your cluster in the SimClusters representation without specific data about your content and its engagement metrics. However, a combination of factors, such as the favorite score and follow score, alongside other engagement signals and SimCluster calculations, influence the boosting of content. \n",
"\n",
"-> **Question**: How does Heavy ranker work. what are its main inputs?\n",
"\n",
"**Answer**: The Heavy Ranker is a machine learning model that plays a crucial role in ranking and scoring candidates within the recommendation algorithm. Its primary purpose is to predict the likelihood of a user engaging with a tweet or connecting with another user on the platform.\n",
"\n",
"Main inputs to the Heavy Ranker consist of:\n",
"\n",
"1. Static Features: These are features that can be computed directly from a tweet at the time it's created, such as whether it has a URL, has cards, has quotes, etc. These features are produced by the Index Ingester as the tweets are generated and stored in the index.\n",
"\n",
"2. Real-time Features: These per-tweet features can change after the tweet has been indexed. They mostly consist of social engagements like retweet count, favorite count, reply count, and some spam signals that are computed with later activities. The Signal Ingester, which is part of a Heron topology, processes multiple event streams to collect and compute these real-time features.\n",
"\n",
"3. User Table Features: These per-user features are obtained from the User Table Updater that processes a stream written by the user service. This input is used to store sparse real-time user information, which is later propagated to the tweet being scored by looking up the author of the tweet.\n",
"\n",
"4. Search Context Features: These features represent the context of the current searcher, like their UI language, their content consumption, and the current time (implied). They are combined with Tweet Data to compute some of the features used in scoring.\n",
"\n",
"These inputs are then processed by the Heavy Ranker to score and rank candidates based on their relevance and likelihood of engagement by the user. \n",
"\n",
"-> **Question**: How can one influence Heavy ranker?\n",
"\n",
"**Answer**: To influence the Heavy Ranker's output or ranking of content, consider the following actions:\n",
"\n",
"1. Improve content quality: Create high-quality and engaging content that is relevant, informative, and valuable to users. High-quality content is more likely to receive positive user engagement, which the Heavy Ranker considers when ranking content.\n",
"\n",
"2. Increase user engagement: Encourage users to interact with content through likes, retweets, replies, and comments. Higher engagement levels can lead to better ranking in the Heavy Ranker's output.\n",
"\n",
"3. Optimize your user profile: A user's reputation, based on factors such as their follower count and follower-to-following ratio, may impact the ranking of their content. Maintain a good reputation by following relevant users, keeping a reasonable follower-to-following ratio and engaging with your followers.\n",
"\n",
"4. Enhance content discoverability: Use relevant keywords, hashtags, and mentions in your tweets, making it easier for users to find and engage with your content. This increased discoverability may help improve the ranking of your content by the Heavy Ranker.\n",
"\n",
"5. Leverage multimedia content: Experiment with different content formats, such as videos, images, and GIFs, which may capture users' attention and increase engagement, resulting in better ranking by the Heavy Ranker.\n",
"\n",
"6. User feedback: Monitor and respond to feedback for your content. Positive feedback may improve your ranking, while negative feedback provides an opportunity to learn and improve.\n",
"\n",
"Note that the Heavy Ranker uses a combination of machine learning models and various features to rank the content. While the above actions may help influence the ranking, there are no guarantees as the ranking process is determined by a complex algorithm, which evolves over time. \n",
"\n",
"-> **Question**: why threads and long tweets do so well on the platform?\n",
"\n",
"**Answer**: Threads and long tweets perform well on the platform for several reasons:\n",
"\n",
"1. **More content and context**: Threads and long tweets provide more information and context about a topic, which can make the content more engaging and informative for users. People tend to appreciate a well-structured and detailed explanation of a subject or a story, and threads and long tweets can do that effectively.\n",
"\n",
"2. **Increased user engagement**: As threads and long tweets provide more content, they also encourage users to engage with the tweets through replies, retweets, and likes. This increased engagement can lead to better visibility of the content, as the Twitter algorithm considers user engagement when ranking and surfacing tweets.\n",
"\n",
"3. **Narrative structure**: Threads enable users to tell stories or present arguments in a step-by-step manner, making the information more accessible and easier to follow. This narrative structure can capture users' attention and encourage them to read through the entire thread and interact with the content.\n",
"\n",
"4. **Expanded reach**: When users engage with a thread, their interactions can bring the content to the attention of their followers, helping to expand the reach of the thread. This increased visibility can lead to more interactions and higher performance for the threaded tweets.\n",
"\n",
"5. **Higher content quality**: Generally, threads and long tweets require more thought and effort to create, which may lead to higher quality content. Users are more likely to appreciate and interact with high-quality, well-reasoned content, further improving the performance of these tweets within the platform.\n",
"\n",
"Overall, threads and long tweets perform well on Twitter because they encourage user engagement and provide a richer, more informative experience that users find valuable. \n",
"\n",
"-> **Question**: Are thread and long tweet creators building a following that reacts to only threads?\n",
"\n",
"**Answer**: Based on the provided code and context, there isn't enough information to conclude if the creators of threads and long tweets primarily build a following that engages with only thread-based content. The code provided is focused on Twitter's recommendation and ranking algorithms, as well as infrastructure components like Kafka, partitions, and the Follow Recommendations Service (FRS). To answer your question, data analysis of user engagement and results of specific edge cases would be required. \n",
"\n",
"-> **Question**: Do you need to follow different strategies to get most followers vs to get most likes and bookmarks per tweet?\n",
"\n",
"**Answer**: Yes, different strategies need to be followed to maximize the number of followers compared to maximizing likes and bookmarks per tweet. While there may be some overlap in the approaches, they target different aspects of user engagement.\n",
"\n",
"Maximizing followers: The primary focus is on growing your audience on the platform. Strategies include:\n",
"\n",
"1. Consistently sharing high-quality content related to your niche or industry.\n",
"2. Engaging with others on the platform by replying, retweeting, and mentioning other users.\n",
"3. Using relevant hashtags and participating in trending conversations.\n",
"4. Collaborating with influencers and other users with a large following.\n",
"5. Posting at optimal times when your target audience is most active.\n",
"6. Optimizing your profile by using a clear profile picture, catchy bio, and relevant links.\n",
"\n",
"Maximizing likes and bookmarks per tweet: The focus is on creating content that resonates with your existing audience and encourages engagement. Strategies include:\n",
"\n",
"1. Crafting engaging and well-written tweets that encourage users to like or save them.\n",
"2. Incorporating visually appealing elements, such as images, GIFs, or videos, that capture attention.\n",
"3. Asking questions, sharing opinions, or sparking conversations that encourage users to engage with your tweets.\n",
"4. Using analytics to understand the type of content that resonates with your audience and tailoring your tweets accordingly.\n",
"5. Posting a mix of educational, entertaining, and promotional content to maintain variety and interest.\n",
"6. Timing your tweets strategically to maximize engagement, likes, and bookmarks per tweet.\n",
"\n",
"Both strategies can overlap, and you may need to adapt your approach by understanding your target audience's preferences and analyzing your account's performance. However, it's essential to recognize that maximizing followers and maximizing likes and bookmarks per tweet have different focuses and require specific strategies. \n",
"\n",
"-> **Question**: Content meta data and how it impacts virality (e.g. ALT in images).\n",
"\n",
"**Answer**: There is no direct information in the provided context about how content metadata, such as ALT text in images, impacts the virality of a tweet or post. However, it's worth noting that including ALT text can improve the accessibility of your content for users who rely on screen readers, which may lead to increased engagement for a broader audience. Additionally, metadata can be used in search engine optimization, which might improve the visibility of the content, but the context provided does not mention any specific correlation with virality. \n",
"\n",
"-> **Question**: What are some unexpected fingerprints for spam factors?\n",
"\n",
"**Answer**: In the provided context, an unusual indicator of spam factors is when a tweet contains a non-media, non-news link. If the tweet has a link but does not have an image URL, video URL, or news URL, it is considered a potential spam vector, and a threshold for user reputation (tweepCredThreshold) is set to MIN_TWEEPCRED_WITH_LINK.\n",
"\n",
"While this rule may not cover all possible unusual spam indicators, it is derived from the specific codebase and logic shared in the context. \n",
"\n",
"-> **Question**: Is there any difference between company verified checkmarks and blue verified individual checkmarks?\n",
"\n",
"**Answer**: Yes, there is a distinction between the verified checkmarks for companies and blue verified checkmarks for individuals. The code snippet provided mentions \"Blue-verified account boost\" which indicates that there is a separate category for blue verified accounts. Typically, blue verified checkmarks are used to indicate notable individuals, while verified checkmarks are for companies or organizations. \n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.0"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}