Add ntbk, minor fix on embd

Lance Martin
2024-01-25 10:54:29 -08:00
parent 11a9359b7f
commit 34183251eb
2 changed files with 354 additions and 1 deletion

libs/partners/nomic/langchain_nomic/embeddings.py

@@ -2,6 +2,7 @@ import os
 from typing import List, Optional
 import nomic # type: ignore
+from nomic import embed
 from langchain_core.embeddings import Embeddings
@@ -28,7 +29,7 @@ class NomicEmbeddings(Embeddings):
     def embed_documents(self, texts: List[str]) -> List[List[float]]:
         """Embed search docs."""
-        output = nomic.embed.text(
+        output = embed.text(
             texts=texts,
             model=self.model,
         )
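
After this change, `embed_documents` calls the imported `embed.text` directly rather than going through the `nomic` module attribute. A minimal sketch of the fixed call path (assuming a valid NOMIC_API_KEY is set in the environment; the model name follows the notebook below):
```
from langchain_nomic.embeddings import NomicEmbeddings

# Hypothetical smoke test; assumes NOMIC_API_KEY is set in the environment
embeddings = NomicEmbeddings(model="nomic-embed-text-v1")
vectors = embeddings.embed_documents(["hello world"])
print(len(vectors), len(vectors[0]))  # one embedding; dimension depends on the model
```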

View File

@@ -0,0 +1,352 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "d8da6094-30c7-43f3-a608-c91717b673db",
"metadata": {},
"source": [
"## Init\n",
"\n",
"Get your API token, then run:\n",
"```\n",
"! nomic login\n",
"```\n",
"\n",
"Then run with your generated API token \n",
"```\n",
"! nomic login < token > \n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8ab7434a-2930-42b5-9164-dc2c03abe232",
"metadata": {},
"outputs": [],
"source": [
"! nomic login\n",
"! nomic login token"
]
},
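{
"cell_type": "markdown",
"id": "b3f1a2c0-1111-4111-8111-000000000001",
"metadata": {},
"source": [
"Alternatively, authenticate programmatically. A minimal sketch, assuming the token is stored in the `NOMIC_API_KEY` environment variable (`nomic.login` is the same call the CLI makes):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b3f1a2c0-1111-4111-8111-000000000002",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import nomic\n",
"\n",
"# Log in with the token from the environment (assumes NOMIC_API_KEY is set)\n",
"nomic.login(os.environ[\"NOMIC_API_KEY\"])"
]
},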
{
"cell_type": "markdown",
"id": "134475f2-f256-4c13-9712-c55783e6a4e2",
"metadata": {},
"source": [
"## Document Loading\n",
"\n",
"Let's test 3 interesting blog posts."
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "01c4d270-171e-45c2-a1b6-e350faa74117",
"metadata": {},
"outputs": [],
"source": [
"from langchain_community.document_loaders import WebBaseLoader\n",
"\n",
"urls =[\"https://lilianweng.github.io/posts/2023-06-23-agent/\",\n",
" \"https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/\",\n",
" \"https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm/\"]\n",
"\n",
"docs = [WebBaseLoader(url).load() for url in urls]\n",
"docs_list = [item for sublist in docs for item in sublist]"
]
},
{
"cell_type": "markdown",
"id": "75ab7f74-873c-4d84-af5a-5cf19c61239d",
"metadata": {},
"source": [
"## Splitting \n",
"\n",
"### Larger Context Models\n",
"\n",
"There's a lot of interesting considerations to think about on [text splitting](https://www.youtube.com/watch?v=8OJC21T2SL4). \n",
"\n",
"Many approaches to date have focused on very granular splitting by semantic groups or sub-sections, which is a challenge.\n",
"\n",
"The intution was: retrieve just the minimal context needed to address the question driven by:\n",
"\n",
"(1) Embedding models with smaller context size\n",
"\n",
"(2) LLMs with smaller context size\n",
"\n",
"This means, we need high `precision` in retreival: \n",
"\n",
"> We reject as many irrelevant chunks (false positives) as possible.\n",
"\n",
"Thus, all chunks we send to the model are relevant, but:\n",
"\n",
"(1) We can suffer lower `recall` (leave our importaint details) \n",
"\n",
"(2) We incur higher splitting complexity\n",
"\n",
"--- \n",
"\n",
"Embeddings models are starting to support larger context as discussed [here](https://hazyresearch.stanford.edu/blog/2024-01-11-m2-bert-retrieval).\n",
"\n",
"Nomic's release supports > 8k token limit locally (GPU today, CPU soon) and via API (soon).\n",
"\n",
"And LLMs are seeing context window expansion, as seen with [GPT-4 128k](https://openai.com/blog/new-models-and-developer-products-announced-at-devday) or Yarn LLaMA2 [here](https://x.com/mattshumer_/status/1720115354884514042?s=20), [here](https://ollama.ai/library/yarn-mistral). \n",
"\n",
"Here, we can try a workflow that is less concerned with `precision`:\n",
"\n",
"(1) We use larger context chunks and embedds to promote `recall` \n",
"\n",
"(2) Use use larger context LLMs that can \"sift\" through less relevant information to get our answer\n",
"\n",
"Lets pick a few interesting blog posts and see how long each document is using [TikToken](https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb)."
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "f512e128-629e-4304-926f-94fe5c999527",
"metadata": {},
"outputs": [],
"source": [
"from langchain.text_splitter import CharacterTextSplitter\n",
"text_splitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=10000, \n",
" chunk_overlap=100)\n",
"doc_splits = text_splitter.split_documents(docs_list)"
]
},
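{
"cell_type": "markdown",
"id": "b3f1a2c0-1111-4111-8111-000000000003",
"metadata": {},
"source": [
"For contrast with the precision-oriented approach above, a granular split of the same documents (a sketch; the 500-token chunk size is illustrative) produces many more chunks to rank and retrieve:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b3f1a2c0-1111-4111-8111-000000000004",
"metadata": {},
"outputs": [],
"source": [
"# Illustrative granular split for comparison with the large chunks above\n",
"small_splitter = CharacterTextSplitter.from_tiktoken_encoder(\n",
"    chunk_size=500, chunk_overlap=50\n",
")\n",
"small_splits = small_splitter.split_documents(docs_list)\n",
"print(len(doc_splits), \"large chunks vs\", len(small_splits), \"small chunks\")"
]
},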
{
"cell_type": "code",
"execution_count": 22,
"id": "d2a69cf0-e3ab-4c92-a1d0-10da45c08b3b",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The document is 8759 tokens\n",
"The document is 811 tokens\n",
"The document is 7083 tokens\n",
"The document is 9029 tokens\n",
"The document is 3488 tokens\n"
]
}
],
"source": [
"import tiktoken\n",
"encoding = tiktoken.get_encoding(\"cl100k_base\")\n",
"encoding = tiktoken.encoding_for_model(\"gpt-3.5-turbo\")\n",
"for d in doc_splits:\n",
" print(\"The document is %s tokens\"%len(encoding.encode(d.page_content)))"
]
},
{
"cell_type": "markdown",
"id": "c58d1e9b-e98e-4bd9-b52f-4dfc2a4e69f4",
"metadata": {},
"source": [
"## Index \n",
"\n",
"Nomic embeddings [here](https://docs.nomic.ai/reference/endpoints/nomic-embed-text). "
]
},
{
"cell_type": "code",
"execution_count": 36,
"id": "76447866-bf8b-412b-93bc-d6ea8ec35952",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from langchain_community.vectorstores import Chroma\n",
"from langchain_nomic.embeddings import NomicEmbeddings\n",
"from langchain_core.output_parsers import StrOutputParser\n",
"from langchain_core.runnables import RunnableLambda, RunnablePassthrough"
]
},
{
"cell_type": "code",
"execution_count": 42,
"id": "15b3eab2-2689-49d4-8cb0-67ef2adcbc49",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-weight: bold\"> Authenticate with the Nomic API </span>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[1m \u001b[0m\u001b[1mAuthenticate with the Nomic API\u001b[0m\u001b[1m \u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-weight: bold\"> </span><span style=\"color: #0000ff; text-decoration-color: #0000ff; text-decoration: underline\">https://atlas.nomic.ai/cli-login</span><span style=\"font-weight: bold\"> </span>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[1m \u001b[0m\u001b[4;94mhttps://atlas.nomic.ai/cli-login\u001b[0m\u001b[1m \u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-weight: bold\"> Click the above link to retrieve your access token and then run `nomic login [token]` </span>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[1m \u001b[0m\u001b[1mClick the above link to retrieve your access token and then run `nomic login \u001b[0m\u001b[1m[\u001b[0m\u001b[1mtoken\u001b[0m\u001b[1m]\u001b[0m\u001b[1m`\u001b[0m\u001b[1m \u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"ename": "NameError",
"evalue": "name 'exit' is not defined",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[42], line 7\u001b[0m\n\u001b[1;32m 2\u001b[0m api_key \u001b[38;5;241m=\u001b[39m os\u001b[38;5;241m.\u001b[39mgetenv(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mNOMIC_API_KEY\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[1;32m 3\u001b[0m \u001b[38;5;66;03m# api_key = \"eTiGYQ2ep1EMxFZlyTopWHRpk7JqjSU89FdTuLvbD132c\"\u001b[39;00m\n\u001b[1;32m 4\u001b[0m vectorstore \u001b[38;5;241m=\u001b[39m Chroma\u001b[38;5;241m.\u001b[39mfrom_documents(\n\u001b[1;32m 5\u001b[0m documents\u001b[38;5;241m=\u001b[39mtexts,\n\u001b[1;32m 6\u001b[0m collection_name\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mrag-chroma\u001b[39m\u001b[38;5;124m\"\u001b[39m,\n\u001b[0;32m----> 7\u001b[0m embedding\u001b[38;5;241m=\u001b[39m\u001b[43mNomicEmbeddings\u001b[49m\u001b[43m(\u001b[49m\u001b[43mmodel\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[38;5;124;43mnomic-embed-text-v1\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[43m,\u001b[49m\n\u001b[1;32m 8\u001b[0m \u001b[43m \u001b[49m\u001b[43mnomic_api_key\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mapi_key\u001b[49m\u001b[43m)\u001b[49m,\n\u001b[1;32m 9\u001b[0m )\n\u001b[1;32m 10\u001b[0m retriever \u001b[38;5;241m=\u001b[39m vectorstore\u001b[38;5;241m.\u001b[39mas_retriever()\n",
"File \u001b[0;32m~/Desktop/Code/langchain-main/langchain/libs/partners/nomic/langchain_nomic/embeddings.py:27\u001b[0m, in \u001b[0;36mNomicEmbeddings.__init__\u001b[0;34m(self, model, nomic_api_key)\u001b[0m\n\u001b[1;32m 21\u001b[0m \u001b[38;5;250m\u001b[39m\u001b[38;5;124;03m\"\"\"Initialize NomicEmbeddings model.\u001b[39;00m\n\u001b[1;32m 22\u001b[0m \n\u001b[1;32m 23\u001b[0m \u001b[38;5;124;03mArgs:\u001b[39;00m\n\u001b[1;32m 24\u001b[0m \u001b[38;5;124;03m model: model name\u001b[39;00m\n\u001b[1;32m 25\u001b[0m \u001b[38;5;124;03m\"\"\"\u001b[39;00m\n\u001b[1;32m 26\u001b[0m _api_key \u001b[38;5;241m=\u001b[39m nomic_api_key \u001b[38;5;129;01mor\u001b[39;00m os\u001b[38;5;241m.\u001b[39menviron\u001b[38;5;241m.\u001b[39mget(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mNOMIC_API_KEY\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[0;32m---> 27\u001b[0m \u001b[43mnomic\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mlogin\u001b[49m\u001b[43m(\u001b[49m\u001b[43m_api_key\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 28\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mmodel \u001b[38;5;241m=\u001b[39m model\n",
"File \u001b[0;32m~/miniforge3/envs/llama2/lib/python3.9/site-packages/nomic/cli.py:60\u001b[0m, in \u001b[0;36mlogin\u001b[0;34m(token, tenant, domain)\u001b[0m\n\u001b[1;32m 54\u001b[0m console\u001b[38;5;241m.\u001b[39mprint(auth0_auth_endpoint, style\u001b[38;5;241m=\u001b[39mstyle, justify\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mcenter\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[1;32m 55\u001b[0m console\u001b[38;5;241m.\u001b[39mprint(\n\u001b[1;32m 56\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mClick the above link to retrieve your access token and then run `nomic login \u001b[39m\u001b[38;5;124m\\\u001b[39m\u001b[38;5;124m[token]`\u001b[39m\u001b[38;5;124m\"\u001b[39m,\n\u001b[1;32m 57\u001b[0m style\u001b[38;5;241m=\u001b[39mstyle,\n\u001b[1;32m 58\u001b[0m justify\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mcenter\u001b[39m\u001b[38;5;124m\"\u001b[39m,\n\u001b[1;32m 59\u001b[0m )\n\u001b[0;32m---> 60\u001b[0m \u001b[43mexit\u001b[49m()\n\u001b[1;32m 62\u001b[0m \u001b[38;5;66;03m# save credential\u001b[39;00m\n\u001b[1;32m 63\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m nomic_base_path\u001b[38;5;241m.\u001b[39mexists():\n",
"\u001b[0;31mNameError\u001b[0m: name 'exit' is not defined"
]
}
],
"source": [
"# Add to vectorDB\n",
"api_key = os.getenv(\"NOMIC_API_KEY\")\n",
"# api_key = \"xxx2\"\n",
"vectorstore = Chroma.from_documents(\n",
" documents=texts,\n",
" collection_name=\"rag-chroma\",\n",
" embedding=NomicEmbeddings(model='nomic-embed-text-v1',\n",
" nomic_api_key=api_key), # TO FIX \n",
")\n",
"retriever = vectorstore.as_retriever()"
]
},
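{
"cell_type": "markdown",
"id": "b3f1a2c0-1111-4111-8111-000000000005",
"metadata": {},
"source": [
"As a quick sanity check (a sketch; the question is illustrative), we can pull the chunks retrieved for a sample question and confirm they are the large splits created above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b3f1a2c0-1111-4111-8111-000000000006",
"metadata": {},
"outputs": [],
"source": [
"# Inspect the retrieved chunks and their token counts\n",
"retrieved = retriever.get_relevant_documents(\"What are the types of agent memory?\")\n",
"for d in retrieved:\n",
"    print(len(encoding.encode(d.page_content)), d.metadata.get(\"source\"))"
]
},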
{
"cell_type": "markdown",
"id": "41131122-3591-4566-aac1-ed19d496820a",
"metadata": {},
"source": [
"## RAG Chain\n",
"\n",
"To test locally, we can use Ollama [here](https://x.com/mattshumer_/status/1720115354884514042?s=20), [here](https://ollama.ai/library/yarn-mistral) - \n",
"```\n",
"ollama pull yarn-mistral\n",
"```\n",
"\n",
"Of course, we can also run [GPT-4 128k](https://openai.com/blog/new-models-and-developer-products-announced-at-devday). "
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "1397de64-5b4a-4001-adc5-570ff8d31ff6",
"metadata": {},
"outputs": [],
"source": [
"from langchain_openai import ChatOpenAI\n",
"from langchain_core.prompts import ChatPromptTemplate\n",
"from langchain_community.chat_models import ChatOllama\n",
"\n",
"# Prompt \n",
"template = \"\"\"Answer the question based only on the following context:\n",
"{context}\n",
"\n",
"Question: {question}\n",
"\"\"\"\n",
"prompt = ChatPromptTemplate.from_template(template)\n",
"\n",
"# Local LLM\n",
"ollama_llm = \"yarn-mistral\"\n",
"model = ChatOllama(model=ollama_llm)\n",
"\n",
"# LLM API\n",
"model = ChatOpenAI(temperature=0, model=\"gpt-4-1106-preview\")\n",
"\n",
"# Chain\n",
"chain = (\n",
" {\"context\": retriever, \"question\": RunnablePassthrough()}\n",
" | prompt\n",
" | model\n",
" | StrOutputParser()\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 34,
"id": "1548e00c-1ff6-4e88-aa13-69badf2088fb",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'In the context provided, the types of agent memory mentioned are:\\n\\n1. Sensory Memory: This is the earliest stage of memory, providing the ability to retain impressions of sensory information after the original stimuli have ended. It typically only lasts for a few seconds.\\n\\n2. Short-Term Memory (STM) or Working Memory: It stores information that is currently being used to carry out complex cognitive tasks such as learning and reasoning. Short-term memory has a limited capacity and duration.\\n\\n3. Long-Term Memory (LTM): This type of memory can store information for a remarkably long time, ranging from a few days to decades, with an essentially unlimited storage capacity. It is divided into two subtypes:\\n - Explicit / Declarative Memory: Memory of facts and events that can be consciously recalled, including episodic memory (events and experiences) and semantic memory (facts and concepts).\\n - Implicit / Procedural Memory: Unconscious memory involving skills and routines performed automatically, like riding a bike or typing on a keyboard.'"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Question\n",
"chain.invoke(\"What are the types of agent memory?\")"
]
},
{
"cell_type": "markdown",
"id": "81616653-be22-445a-9d90-34e0c2e788bc",
"metadata": {},
"source": [
"Some considerations are noted in the [needle in a haystack analysis](https://twitter.com/GregKamradt/status/1722386725635580292?lang=en):\n",
"\n",
"* LLMs may suffer with retrieval from large context depending on where the information is placed."
]
},
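{
"cell_type": "markdown",
"id": "b3f1a2c0-1111-4111-8111-000000000007",
"metadata": {},
"source": [
"A minimal sketch of such a spot-check: ask one question per source post (the questions are illustrative) and skim whether details from the long chunks surface in the answers."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b3f1a2c0-1111-4111-8111-000000000008",
"metadata": {},
"outputs": [],
"source": [
"# Informal recall spot-check: one question per source post (illustrative)\n",
"questions = [\n",
"    \"What are the types of agent memory?\",\n",
"    \"What is chain-of-thought prompting?\",\n",
"    \"What are adversarial attacks on LLMs?\",\n",
"]\n",
"for q in questions:\n",
"    print(q)\n",
"    print(chain.invoke(q)[:300], \"\\n\")"
]
},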
{
"cell_type": "code",
"execution_count": null,
"id": "d067e4ea-fd84-40db-8ecb-1aac94f55417",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
}
},
"nbformat": 4,
"nbformat_minor": 5
}