Add ntbk, minor fix on embd

Lance Martin
2024-01-25 10:54:29 -08:00
parent 11a9359b7f
commit 34183251eb
2 changed files with 354 additions and 1 deletion

libs/partners/nomic/langchain_nomic/embeddings.py

@@ -2,6 +2,7 @@ import os
 from typing import List, Optional
 import nomic # type: ignore
+from nomic import embed
 from langchain_core.embeddings import Embeddings
@@ -28,7 +29,7 @@ class NomicEmbeddings(Embeddings):
     def embed_documents(self, texts: List[str]) -> List[List[float]]:
         """Embed search docs."""
-        output = nomic.embed.text(
+        output = embed.text(
             texts=texts,
             model=self.model,
         )
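
After this change, `embed_documents` calls the imported `embed.text` directly rather than going through the `nomic` module attribute. A minimal sketch of the fixed call path (assuming a valid NOMIC_API_KEY is set in the environment; the model name follows the notebook below):
```
from langchain_nomic.embeddings import NomicEmbeddings

# Hypothetical smoke test; assumes NOMIC_API_KEY is set in the environment
embeddings = NomicEmbeddings(model="nomic-embed-text-v1")
vectors = embeddings.embed_documents(["hello world"])
print(len(vectors), len(vectors[0]))  # one embedding; dimension depends on the model
```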

View File

@@ -0,0 +1,352 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "d8da6094-30c7-43f3-a608-c91717b673db",
"metadata": {},
"source": [
"## Init\n",
"\n",
"Get your API token, then run:\n",
"```\n",
"! nomic login\n",
"```\n",
"\n",
"Then run with your generated API token \n",
"```\n",
"! nomic login < token > \n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8ab7434a-2930-42b5-9164-dc2c03abe232",
"metadata": {},
"outputs": [],
"source": [
"! nomic login\n",
"! nomic login token"
]
},
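{
"cell_type": "markdown",
"id": "b3f1a2c0-1111-4111-8111-000000000001",
"metadata": {},
"source": [
"Alternatively, authenticate programmatically. A minimal sketch, assuming the token is stored in the `NOMIC_API_KEY` environment variable (`nomic.login` is the same call the CLI makes):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b3f1a2c0-1111-4111-8111-000000000002",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import nomic\n",
"\n",
"# Log in with the token from the environment (assumes NOMIC_API_KEY is set)\n",
"nomic.login(os.environ[\"NOMIC_API_KEY\"])"
]
},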
{
"cell_type": "markdown",
"id": "134475f2-f256-4c13-9712-c55783e6a4e2",
"metadata": {},
"source": [
"## Document Loading\n",
"\n",
"Let's test 3 interesting blog posts."
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "01c4d270-171e-45c2-a1b6-e350faa74117",
"metadata": {},
"outputs": [],
"source": [
"from langchain_community.document_loaders import WebBaseLoader\n",
"\n",
"urls =[\"https://lilianweng.github.io/posts/2023-06-23-agent/\",\n",
" \"https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/\",\n",
" \"https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm/\"]\n",
"\n",
"docs = [WebBaseLoader(url).load() for url in urls]\n",
"docs_list = [item for sublist in docs for item in sublist]"
]
},
{
"cell_type": "markdown",
"id": "75ab7f74-873c-4d84-af5a-5cf19c61239d",
"metadata": {},
"source": [
"## Splitting \n",
"\n",
"### Larger Context Models\n",
"\n",
"There's a lot of interesting considerations to think about on [text splitting](https://www.youtube.com/watch?v=8OJC21T2SL4). \n",
"\n",
"Many approaches to date have focused on very granular splitting by semantic groups or sub-sections, which is a challenge.\n",
"\n",
"The intution was: retrieve just the minimal context needed to address the question driven by:\n",
"\n",
"(1) Embedding models with smaller context size\n",
"\n",
"(2) LLMs with smaller context size\n",
"\n",
"This means, we need high `precision` in retreival: \n",
"\n",
"> We reject as many irrelevant chunks (false positives) as possible.\n",
"\n",
"Thus, all chunks we send to the model are relevant, but:\n",
"\n",
"(1) We can suffer lower `recall` (leave our importaint details) \n",
"\n",
"(2) We incur higher splitting complexity\n",
"\n",
"--- \n",
"\n",
"Embeddings models are starting to support larger context as discussed [here](https://hazyresearch.stanford.edu/blog/2024-01-11-m2-bert-retrieval).\n",
"\n",
"Nomic's release supports > 8k token limit locally (GPU today, CPU soon) and via API (soon).\n",
"\n",
"And LLMs are seeing context window expansion, as seen with [GPT-4 128k](https://openai.com/blog/new-models-and-developer-products-announced-at-devday) or Yarn LLaMA2 [here](https://x.com/mattshumer_/status/1720115354884514042?s=20), [here](https://ollama.ai/library/yarn-mistral). \n",
"\n",
"Here, we can try a workflow that is less concerned with `precision`:\n",
"\n",
"(1) We use larger context chunks and embedds to promote `recall` \n",
"\n",
"(2) Use use larger context LLMs that can \"sift\" through less relevant information to get our answer\n",
"\n",
"Lets pick a few interesting blog posts and see how long each document is using [TikToken](https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb)."
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "f512e128-629e-4304-926f-94fe5c999527",
"metadata": {},
"outputs": [],
"source": [
"from langchain.text_splitter import CharacterTextSplitter\n",
"text_splitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=10000, \n",
" chunk_overlap=100)\n",
"doc_splits = text_splitter.split_documents(docs_list)"
]
},
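{
"cell_type": "markdown",
"id": "b3f1a2c0-1111-4111-8111-000000000003",
"metadata": {},
"source": [
"For contrast with the precision-oriented approach above, a granular split of the same documents (a sketch; the 500-token chunk size is illustrative) produces many more chunks to rank and retrieve:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b3f1a2c0-1111-4111-8111-000000000004",
"metadata": {},
"outputs": [],
"source": [
"# Illustrative granular split for comparison with the large chunks above\n",
"small_splitter = CharacterTextSplitter.from_tiktoken_encoder(\n",
"    chunk_size=500, chunk_overlap=50\n",
")\n",
"small_splits = small_splitter.split_documents(docs_list)\n",
"print(len(doc_splits), \"large chunks vs\", len(small_splits), \"small chunks\")"
]
},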
{
"cell_type": "code",
"execution_count": 22,
"id": "d2a69cf0-e3ab-4c92-a1d0-10da45c08b3b",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The document is 8759 tokens\n",
"The document is 811 tokens\n",
"The document is 7083 tokens\n",
"The document is 9029 tokens\n",
"The document is 3488 tokens\n"
]
}
],
"source": [
"import tiktoken\n",
"encoding = tiktoken.get_encoding(\"cl100k_base\")\n",
"encoding = tiktoken.encoding_for_model(\"gpt-3.5-turbo\")\n",
"for d in doc_splits:\n",
" print(\"The document is %s tokens\"%len(encoding.encode(d.page_content)))"
]
},
{
"cell_type": "markdown",
"id": "c58d1e9b-e98e-4bd9-b52f-4dfc2a4e69f4",
"metadata": {},
"source": [
"## Index \n",
"\n",
"Nomic embeddings [here](https://docs.nomic.ai/reference/endpoints/nomic-embed-text). "
]
},
{
"cell_type": "code",
"execution_count": 36,
"id": "76447866-bf8b-412b-93bc-d6ea8ec35952",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from langchain_community.vectorstores import Chroma\n",
"from langchain_nomic.embeddings import NomicEmbeddings\n",
"from langchain_core.output_parsers import StrOutputParser\n",
"from langchain_core.runnables import RunnableLambda, RunnablePassthrough"
]
},
{
"cell_type": "code",
"execution_count": 42,
"id": "15b3eab2-2689-49d4-8cb0-67ef2adcbc49",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-weight: bold\"> Authenticate with the Nomic API </span>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[1m \u001b[0m\u001b[1mAuthenticate with the Nomic API\u001b[0m\u001b[1m \u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-weight: bold\"> </span><span style=\"color: #0000ff; text-decoration-color: #0000ff; text-decoration: underline\">https://atlas.nomic.ai/cli-login</span><span style=\"font-weight: bold\"> </span>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[1m \u001b[0m\u001b[4;94mhttps://atlas.nomic.ai/cli-login\u001b[0m\u001b[1m \u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-weight: bold\"> Click the above link to retrieve your access token and then run `nomic login [token]` </span>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[1m \u001b[0m\u001b[1mClick the above link to retrieve your access token and then run `nomic login \u001b[0m\u001b[1m[\u001b[0m\u001b[1mtoken\u001b[0m\u001b[1m]\u001b[0m\u001b[1m`\u001b[0m\u001b[1m \u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"ename": "NameError",
"evalue": "name 'exit' is not defined",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[42], line 7\u001b[0m\n\u001b[1;32m 2\u001b[0m api_key \u001b[38;5;241m=\u001b[39m os\u001b[38;5;241m.\u001b[39mgetenv(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mNOMIC_API_KEY\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[1;32m 3\u001b[0m \u001b[38;5;66;03m# api_key = \"eTiGYQ2ep1EMxFZlyTopWHRpk7JqjSU89FdTuLvbD132c\"\u001b[39;00m\n\u001b[1;32m 4\u001b[0m vectorstore \u001b[38;5;241m=\u001b[39m Chroma\u001b[38;5;241m.\u001b[39mfrom_documents(\n\u001b[1;32m 5\u001b[0m documents\u001b[38;5;241m=\u001b[39mtexts,\n\u001b[1;32m 6\u001b[0m collection_name\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mrag-chroma\u001b[39m\u001b[38;5;124m\"\u001b[39m,\n\u001b[0;32m----> 7\u001b[0m embedding\u001b[38;5;241m=\u001b[39m\u001b[43mNomicEmbeddings\u001b[49m\u001b[43m(\u001b[49m\u001b[43mmodel\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[38;5;124;43mnomic-embed-text-v1\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[43m,\u001b[49m\n\u001b[1;32m 8\u001b[0m \u001b[43m \u001b[49m\u001b[43mnomic_api_key\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mapi_key\u001b[49m\u001b[43m)\u001b[49m,\n\u001b[1;32m 9\u001b[0m )\n\u001b[1;32m 10\u001b[0m retriever \u001b[38;5;241m=\u001b[39m vectorstore\u001b[38;5;241m.\u001b[39mas_retriever()\n",
"File \u001b[0;32m~/Desktop/Code/langchain-main/langchain/libs/partners/nomic/langchain_nomic/embeddings.py:27\u001b[0m, in \u001b[0;36mNomicEmbeddings.__init__\u001b[0;34m(self, model, nomic_api_key)\u001b[0m\n\u001b[1;32m 21\u001b[0m \u001b[38;5;250m\u001b[39m\u001b[38;5;124;03m\"\"\"Initialize NomicEmbeddings model.\u001b[39;00m\n\u001b[1;32m 22\u001b[0m \n\u001b[1;32m 23\u001b[0m \u001b[38;5;124;03mArgs:\u001b[39;00m\n\u001b[1;32m 24\u001b[0m \u001b[38;5;124;03m model: model name\u001b[39;00m\n\u001b[1;32m 25\u001b[0m \u001b[38;5;124;03m\"\"\"\u001b[39;00m\n\u001b[1;32m 26\u001b[0m _api_key \u001b[38;5;241m=\u001b[39m nomic_api_key \u001b[38;5;129;01mor\u001b[39;00m os\u001b[38;5;241m.\u001b[39menviron\u001b[38;5;241m.\u001b[39mget(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mNOMIC_API_KEY\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[0;32m---> 27\u001b[0m \u001b[43mnomic\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mlogin\u001b[49m\u001b[43m(\u001b[49m\u001b[43m_api_key\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 28\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mmodel \u001b[38;5;241m=\u001b[39m model\n",
"File \u001b[0;32m~/miniforge3/envs/llama2/lib/python3.9/site-packages/nomic/cli.py:60\u001b[0m, in \u001b[0;36mlogin\u001b[0;34m(token, tenant, domain)\u001b[0m\n\u001b[1;32m 54\u001b[0m console\u001b[38;5;241m.\u001b[39mprint(auth0_auth_endpoint, style\u001b[38;5;241m=\u001b[39mstyle, justify\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mcenter\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[1;32m 55\u001b[0m console\u001b[38;5;241m.\u001b[39mprint(\n\u001b[1;32m 56\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mClick the above link to retrieve your access token and then run `nomic login \u001b[39m\u001b[38;5;124m\\\u001b[39m\u001b[38;5;124m[token]`\u001b[39m\u001b[38;5;124m\"\u001b[39m,\n\u001b[1;32m 57\u001b[0m style\u001b[38;5;241m=\u001b[39mstyle,\n\u001b[1;32m 58\u001b[0m justify\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mcenter\u001b[39m\u001b[38;5;124m\"\u001b[39m,\n\u001b[1;32m 59\u001b[0m )\n\u001b[0;32m---> 60\u001b[0m \u001b[43mexit\u001b[49m()\n\u001b[1;32m 62\u001b[0m \u001b[38;5;66;03m# save credential\u001b[39;00m\n\u001b[1;32m 63\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m nomic_base_path\u001b[38;5;241m.\u001b[39mexists():\n",
"\u001b[0;31mNameError\u001b[0m: name 'exit' is not defined"
]
}
],
"source": [
"# Add to vectorDB\n",
"api_key = os.getenv(\"NOMIC_API_KEY\")\n",
"# api_key = \"xxx2\"\n",
"vectorstore = Chroma.from_documents(\n",
" documents=texts,\n",
" collection_name=\"rag-chroma\",\n",
" embedding=NomicEmbeddings(model='nomic-embed-text-v1',\n",
" nomic_api_key=api_key), # TO FIX \n",
")\n",
"retriever = vectorstore.as_retriever()"
]
},
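{
"cell_type": "markdown",
"id": "b3f1a2c0-1111-4111-8111-000000000005",
"metadata": {},
"source": [
"As a quick sanity check (a sketch; the question is illustrative), we can pull the chunks retrieved for a sample question and confirm they are the large splits created above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b3f1a2c0-1111-4111-8111-000000000006",
"metadata": {},
"outputs": [],
"source": [
"# Inspect the retrieved chunks and their token counts\n",
"retrieved = retriever.get_relevant_documents(\"What are the types of agent memory?\")\n",
"for d in retrieved:\n",
"    print(len(encoding.encode(d.page_content)), d.metadata.get(\"source\"))"
]
},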
{
"cell_type": "markdown",
"id": "41131122-3591-4566-aac1-ed19d496820a",
"metadata": {},
"source": [
"## RAG Chain\n",
"\n",
"To test locally, we can use Ollama [here](https://x.com/mattshumer_/status/1720115354884514042?s=20), [here](https://ollama.ai/library/yarn-mistral) - \n",
"```\n",
"ollama pull yarn-mistral\n",
"```\n",
"\n",
"Of course, we can also run [GPT-4 128k](https://openai.com/blog/new-models-and-developer-products-announced-at-devday). "
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "1397de64-5b4a-4001-adc5-570ff8d31ff6",
"metadata": {},
"outputs": [],
"source": [
"from langchain_openai import ChatOpenAI\n",
"from langchain_core.prompts import ChatPromptTemplate\n",
"from langchain_community.chat_models import ChatOllama\n",
"\n",
"# Prompt \n",
"template = \"\"\"Answer the question based only on the following context:\n",
"{context}\n",
"\n",
"Question: {question}\n",
"\"\"\"\n",
"prompt = ChatPromptTemplate.from_template(template)\n",
"\n",
"# Local LLM\n",
"ollama_llm = \"yarn-mistral\"\n",
"model = ChatOllama(model=ollama_llm)\n",
"\n",
"# LLM API\n",
"model = ChatOpenAI(temperature=0, model=\"gpt-4-1106-preview\")\n",
"\n",
"# Chain\n",
"chain = (\n",
" {\"context\": retriever, \"question\": RunnablePassthrough()}\n",
" | prompt\n",
" | model\n",
" | StrOutputParser()\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 34,
"id": "1548e00c-1ff6-4e88-aa13-69badf2088fb",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'In the context provided, the types of agent memory mentioned are:\\n\\n1. Sensory Memory: This is the earliest stage of memory, providing the ability to retain impressions of sensory information after the original stimuli have ended. It typically only lasts for a few seconds.\\n\\n2. Short-Term Memory (STM) or Working Memory: It stores information that is currently being used to carry out complex cognitive tasks such as learning and reasoning. Short-term memory has a limited capacity and duration.\\n\\n3. Long-Term Memory (LTM): This type of memory can store information for a remarkably long time, ranging from a few days to decades, with an essentially unlimited storage capacity. It is divided into two subtypes:\\n - Explicit / Declarative Memory: Memory of facts and events that can be consciously recalled, including episodic memory (events and experiences) and semantic memory (facts and concepts).\\n - Implicit / Procedural Memory: Unconscious memory involving skills and routines performed automatically, like riding a bike or typing on a keyboard.'"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Question\n",
"chain.invoke(\"What are the types of agent memory?\")"
]
},
{
"cell_type": "markdown",
"id": "81616653-be22-445a-9d90-34e0c2e788bc",
"metadata": {},
"source": [
"Some considerations are noted in the [needle in a haystack analysis](https://twitter.com/GregKamradt/status/1722386725635580292?lang=en):\n",
"\n",
"* LLMs may suffer with retrieval from large context depending on where the information is placed."
]
},
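{
"cell_type": "markdown",
"id": "b3f1a2c0-1111-4111-8111-000000000007",
"metadata": {},
"source": [
"A minimal sketch of such a spot-check: ask one question per source post (the questions are illustrative) and skim whether details from the long chunks surface in the answers."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b3f1a2c0-1111-4111-8111-000000000008",
"metadata": {},
"outputs": [],
"source": [
"# Informal recall spot-check: one question per source post (illustrative)\n",
"questions = [\n",
"    \"What are the types of agent memory?\",\n",
"    \"What is chain-of-thought prompting?\",\n",
"    \"What are adversarial attacks on LLMs?\",\n",
"]\n",
"for q in questions:\n",
"    print(q)\n",
"    print(chain.invoke(q)[:300], \"\\n\")"
]
},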
{
"cell_type": "code",
"execution_count": null,
"id": "d067e4ea-fd84-40db-8ecb-1aac94f55417",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
}
},
"nbformat": 4,
"nbformat_minor": 5
}