Mirror of https://github.com/hwchase17/langchain.git (synced 2026-02-08 02:00:06 +00:00)

Compare commits: eugene/per...v0.0.193 (19 commits)
Commits:
- ce7c11625f
- 5a207cce8f
- b3ae6bcd3f
- 5468528748
- 69f4ffb851
- 2be4fbb835
- 062c3c00a2
- 92b87c2fec
- 3954bcf396
- b7999a9bc1
- a0d847f636
- 217b5cc72d
- 4092fd21dc
- 2a4b32dee2
- daf3e99b96
- b177a29d3f
- 65111eb2b3
- 0cfaa76e45
- 2ae2d6cd1d
@@ -24,9 +24,9 @@ This guide aims to provide a comprehensive overview of the requirements for depl

Understanding these components is crucial when assessing serving systems. LangChain integrates with several open-source projects designed to tackle these issues, providing a robust framework for productionizing your LLM applications. Some notable frameworks include:

- `Ray Serve <../../../ecosystem/ray_serve.html>`_
- `Ray Serve <../integrations/ray_serve.html>`_
- `BentoML <https://github.com/ssheng/BentoChain>`_
- `Modal <../../../ecosystem/modal.html>`_
- `Modal <../integrations/modal.html>`_

These links will provide further information on each ecosystem, assisting you in finding the best fit for your LLM deployment needs.
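For orientation, a minimal Ray Serve sketch of exposing a LangChain chain over HTTP might look like the following (assumes `ray[serve]`, `langchain`, and an `OPENAI_API_KEY`; the deployment name and prompt are illustrative, not taken from the guide):

from ray import serve
from starlette.requests import Request

from langchain.chains import LLMChain
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate


@serve.deployment
class ChainDeployment:
    def __init__(self) -> None:
        # Build the chain once per replica.
        prompt = PromptTemplate(
            input_variables=["topic"], template="Tell me a short joke about {topic}."
        )
        self.chain = LLMChain(llm=OpenAI(temperature=0), prompt=prompt)

    async def __call__(self, request: Request) -> str:
        # Each HTTP request (e.g. /?topic=bears) runs the chain once.
        return self.chain.run(request.query_params["topic"])


serve.run(ChainDeployment.bind())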
@@ -58,7 +58,7 @@

"### Optional Parameters\n",
"The following parameters are optional. When executing the method in a Databricks notebook, you don't need to provide them in most cases.\n",
"* `host`: The Databricks workspace hostname, excluding the 'https://' part. Defaults to the 'DATABRICKS_HOST' environment variable, or to the current workspace if in a Databricks notebook.\n",
"* `api_token`: The Databricks personal access token for accessing the Databricks SQL warehouse or the cluster. Defaults to the 'DATABRICKS_API_TOKEN' environment variable, or a temporary one is generated if in a Databricks notebook.\n",
"* `api_token`: The Databricks personal access token for accessing the Databricks SQL warehouse or the cluster. Defaults to the 'DATABRICKS_TOKEN' environment variable, or a temporary one is generated if in a Databricks notebook.\n",
"* `warehouse_id`: The warehouse ID in Databricks SQL.\n",
"* `cluster_id`: The cluster ID in the Databricks Runtime. If running in a Databricks notebook and both 'warehouse_id' and 'cluster_id' are None, it uses the ID of the cluster the notebook is attached to.\n",
"* `engine_args`: The arguments to be used when connecting to Databricks.\n",
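A hedged sketch of how these parameters fit together (values are placeholders, and this assumes the parameters above describe `SQLDatabase.from_databricks` called outside a Databricks notebook):

from langchain import SQLDatabase

db = SQLDatabase.from_databricks(
    catalog="samples",
    schema="nyctaxi",
    host="myworkspace.cloud.databricks.com",  # or rely on DATABRICKS_HOST
    api_token="dapi...",                      # or rely on DATABRICKS_TOKEN
    warehouse_id="abcd1234efgh5678",          # hypothetical SQL warehouse ID
)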
@@ -37,6 +37,7 @@ For detailed instructions on how to get set up with Unstructured, see installati

./document_loaders/examples/email.ipynb
./document_loaders/examples/epub.ipynb
./document_loaders/examples/evernote.ipynb
./document_loaders/examples/excel.ipynb
./document_loaders/examples/facebook_chat.ipynb
./document_loaders/examples/file_directory.ipynb
./document_loaders/examples/html.ipynb
@@ -0,0 +1,296 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "e48afb8d",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Loading documents from a YouTube url\n",
|
||||
"\n",
|
||||
"Building chat or QA applications on YouTube videos is a topic of high interest.\n",
|
||||
"\n",
|
||||
"Below we show how to easily go from a YouTube url to text to chat!\n",
|
||||
"\n",
|
||||
"We wil use the `OpenAIWhisperParser`, which will use the OpenAI Whisper API to transcribe audio to text.\n",
|
||||
"\n",
|
||||
"Note: You will need to have an `OPENAI_API_KEY` supplied."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"id": "5f34e934",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.document_loaders.generic import GenericLoader\n",
|
||||
"from langchain.document_loaders.parsers import OpenAIWhisperParser\n",
|
||||
"from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "85fc12bd",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We will use `yt_dlp` to download audio for YouTube urls.\n",
|
||||
"\n",
|
||||
"We will use `pydub` to split downloaded audio files (such that we adhere to Whisper API's 25MB file size limit)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "fb5a6606",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"! pip install yt_dlp\n",
|
||||
"! pip install pydub"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "b0e119f4",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### YouTube url to text\n",
|
||||
"\n",
|
||||
"Use `YoutubeAudioLoader` to fetch / download the audio files.\n",
|
||||
"\n",
|
||||
"Then, ues `OpenAIWhisperParser()` to transcribe them to text.\n",
|
||||
"\n",
|
||||
"Let's take the first lecture of Andrej Karpathy's YouTube course as an example! "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"id": "23e1e134",
|
||||
"metadata": {
|
||||
"scrolled": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"[youtube] Extracting URL: https://youtu.be/kCc8FmEb1nY\n",
|
||||
"[youtube] kCc8FmEb1nY: Downloading webpage\n",
|
||||
"[youtube] kCc8FmEb1nY: Downloading android player API JSON\n",
|
||||
"[info] kCc8FmEb1nY: Downloading 1 format(s): 140\n",
|
||||
"[dashsegments] Total fragments: 11\n",
|
||||
"[download] Destination: /Users/31treehaus/Desktop/AI/langchain-fork/docs/modules/indexes/document_loaders/examples/Let's build GPT: from scratch, in code, spelled out..m4a\n",
|
||||
"[download] 100% of 107.73MiB in 00:00:18 at 5.92MiB/s \n",
|
||||
"[FixupM4a] Correcting container of \"/Users/31treehaus/Desktop/AI/langchain-fork/docs/modules/indexes/document_loaders/examples/Let's build GPT: from scratch, in code, spelled out..m4a\"\n",
|
||||
"[ExtractAudio] Not converting audio /Users/31treehaus/Desktop/AI/langchain-fork/docs/modules/indexes/document_loaders/examples/Let's build GPT: from scratch, in code, spelled out..m4a; file is already in target format m4a\n",
|
||||
"[youtube] Extracting URL: https://youtu.be/VMj-3S1tku0\n",
|
||||
"[youtube] VMj-3S1tku0: Downloading webpage\n",
|
||||
"[youtube] VMj-3S1tku0: Downloading android player API JSON\n",
|
||||
"[info] VMj-3S1tku0: Downloading 1 format(s): 140\n",
|
||||
"[download] /Users/31treehaus/Desktop/AI/langchain-fork/docs/modules/indexes/document_loaders/examples/The spelled-out intro to neural networks and backpropagation: building micrograd.m4a has already been downloaded\n",
|
||||
"[download] 100% of 134.98MiB\n",
|
||||
"[ExtractAudio] Not converting audio /Users/31treehaus/Desktop/AI/langchain-fork/docs/modules/indexes/document_loaders/examples/The spelled-out intro to neural networks and backpropagation: building micrograd.m4a; file is already in target format m4a\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# Two Karpathy lecture videos\n",
|
||||
"urls = [\"https://youtu.be/kCc8FmEb1nY\",\n",
|
||||
" \"https://youtu.be/VMj-3S1tku0\"]\n",
|
||||
"\n",
|
||||
"# Directory to save audio files \n",
|
||||
"save_dir = \"~/Downloads/YouTube\"\n",
|
||||
"\n",
|
||||
"# Transcribe the videos to text\n",
|
||||
"loader = GenericLoader(YoutubeAudioLoader(urls,save_dir),OpenAIWhisperParser())\n",
|
||||
"docs = loader.load()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"id": "72a94fd8",
|
||||
"metadata": {
|
||||
"scrolled": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"\"Hello, my name is Andrej and I've been training deep neural networks for a bit more than a decade. And in this lecture I'd like to show you what neural network training looks like under the hood. So in particular we are going to start with a blank Jupyter notebook and by the end of this lecture we will define and train a neural net and you'll get to see everything that goes on under the hood and exactly sort of how that works on an intuitive level. Now specifically what I would like to do is I w\""
|
||||
]
|
||||
},
|
||||
"execution_count": 3,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# Returns a list of Documents, which can be easily viewed or parsed\n",
|
||||
"docs[0].page_content[0:500]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "93be6b49",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Building a chat app from YouTube video\n",
|
||||
"\n",
|
||||
"Given `Documents`, we can easily enable chat / question+answering."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"id": "1823f042",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.chains import RetrievalQA\n",
|
||||
"from langchain.vectorstores import FAISS\n",
|
||||
"from langchain.chat_models import ChatOpenAI\n",
|
||||
"from langchain.embeddings import OpenAIEmbeddings\n",
|
||||
"from langchain.text_splitter import RecursiveCharacterTextSplitter"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"id": "7257cda1",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Combine doc\n",
|
||||
"combined_docs = [doc.page_content for doc in docs]\n",
|
||||
"text = \" \".join(combined_docs)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"id": "147c0c55",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Split them\n",
|
||||
"text_splitter = RecursiveCharacterTextSplitter(chunk_size = 1500, chunk_overlap = 150)\n",
|
||||
"splits = text_splitter.split_text(text)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"id": "f3556703",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Build an index\n",
|
||||
"embeddings = OpenAIEmbeddings()\n",
|
||||
"vectordb = FAISS.from_texts(splits,embeddings)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 8,
|
||||
"id": "beaa99db",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Build a QA chain\n",
|
||||
"qa_chain = RetrievalQA.from_chain_type(llm = ChatOpenAI(model_name=\"gpt-3.5-turbo\", temperature=0),\n",
|
||||
" chain_type=\"stuff\",\n",
|
||||
" retriever=vectordb.as_retriever())"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 9,
|
||||
"id": "f2239a62",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"\"We need to zero out the gradient before backprop at each step because the backward pass accumulates gradients in the grad attribute of each parameter. If we don't reset the grad to zero before each backward pass, the gradients will accumulate and add up, leading to incorrect updates and slower convergence. By resetting the grad to zero before each backward pass, we ensure that the gradients are calculated correctly and that the optimization process works as intended.\""
|
||||
]
|
||||
},
|
||||
"execution_count": 9,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# Ask a question!\n",
|
||||
"query = \"Why do we need to zero out the gradient before backprop at each step?\"\n",
|
||||
"qa_chain.run(query)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 10,
|
||||
"id": "a8d01098",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'In the context of transformers, an encoder is a component that reads in a sequence of input tokens and generates a sequence of hidden representations. On the other hand, a decoder is a component that takes in a sequence of hidden representations and generates a sequence of output tokens. The main difference between the two is that the encoder is used to encode the input sequence into a fixed-length representation, while the decoder is used to decode the fixed-length representation into an output sequence. In machine translation, for example, the encoder reads in the source language sentence and generates a fixed-length representation, which is then used by the decoder to generate the target language sentence.'"
|
||||
]
|
||||
},
|
||||
"execution_count": 10,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"query = \"What is the difference between an encoder and decoder?\"\n",
|
||||
"qa_chain.run(query)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 11,
|
||||
"id": "fe1e77dd",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'For any token, x is the input vector that contains the private information of that token, k and q are the key and query vectors respectively, which are produced by forwarding linear modules on x, and v is the vector that is calculated by propagating the same linear module on x again. The key vector represents what the token contains, and the query vector represents what the token is looking for. The vector v is the information that the token will communicate to other tokens if it finds them interesting, and it gets aggregated for the purposes of the self-attention mechanism.'"
|
||||
]
|
||||
},
|
||||
"execution_count": 11,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"query = \"For any token, what are x, k, v, and q?\"\n",
|
||||
"qa_chain.run(query)"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.9.16"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
|
||||
@@ -5,7 +5,8 @@
|
||||
"id": "683953b3",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# MongoDB Atlas Vector Search\n",
|
||||
"#### Commented out until further notice\n",
|
||||
"MongoDB Atlas Vector Search\n",
|
||||
"\n",
|
||||
">[MongoDB Atlas](https://www.mongodb.com/docs/atlas/) is a document database managed in the cloud. It also enables Lucene and its vector search feature.\n",
|
||||
"\n",
|
||||
@@ -43,7 +44,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "320af802-9271-46ee-948f-d2453933d44b",
|
||||
"id": "457ace44-1d95-4001-9dd5-78811ab208ad",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We want to use `OpenAIEmbeddings` so we have to get the OpenAI API Key. Make sure the environment variable `OPENAI_API_KEY` is set up before proceeding."
|
||||
@@ -143,6 +144,47 @@
|
||||
"source": [
|
||||
"print(docs[0].page_content)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "851a2ec9-9390-49a4-8412-3e132c9f789d",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"You can reuse vector index you created before, make sure environment variable `OPENAI_API_KEY` is set up, then create another file."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "6336fe79-3e73-48be-b20a-0ff1bb6a4399",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from pymongo import MongoClient\n",
|
||||
"from langchain.vectorstores import MongoDBAtlasVectorSearch\n",
|
||||
"from langchain.embeddings.openai import OpenAIEmbeddings\n",
|
||||
"import os\n",
|
||||
"\n",
|
||||
"MONGODB_ATLAS_URI = os.environ['MONGODB_ATLAS_URI']\n",
|
||||
"\n",
|
||||
"# initialize MongoDB python client\n",
|
||||
"client = MongoClient(MONGODB_ATLAS_URI)\n",
|
||||
"\n",
|
||||
"db_name = \"langchain_db\"\n",
|
||||
"collection_name = \"langchain_col\"\n",
|
||||
"collection = client[db_name][collection_name]\n",
|
||||
"index_name = \"langchain_index\"\n",
|
||||
"\n",
|
||||
"# initialize vector store\n",
|
||||
"vectorStore = MongoDBAtlasVectorSearch(\n",
|
||||
" collection, OpenAIEmbeddings(), index_name=index_name)\n",
|
||||
"\n",
|
||||
"# perform a similarity search between the embedding of the query and the embeddings of the documents\n",
|
||||
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
|
||||
"docs = vectorStore.similarity_search(query)\n",
|
||||
"\n",
|
||||
"print(docs[0].page_content)"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
|
||||
@@ -121,7 +121,7 @@
|
||||
"\n",
|
||||
"Human: Hi there my friend\n",
|
||||
"AI: Hi there, how are you doing today?\n",
|
||||
"Human: Not to bad - how are you?\n",
|
||||
"Human: Not too bad - how are you?\n",
|
||||
"Chatbot:\u001b[0m\n",
|
||||
"\n",
|
||||
"\u001b[1m> Finished LLMChain chain.\u001b[0m\n"
|
||||
|
||||
@@ -163,14 +163,14 @@
|
||||
],
|
||||
"source": [
|
||||
"# Otherwise, you can manually specify the Databricks workspace hostname and personal access token \n",
|
||||
"# or set `DATABRICKS_HOST` and `DATABRICKS_API_TOKEN` environment variables, respectively.\n",
|
||||
"# or set `DATABRICKS_HOST` and `DATABRICKS_TOKEN` environment variables, respectively.\n",
|
||||
"# See https://docs.databricks.com/dev-tools/auth.html#databricks-personal-access-tokens\n",
|
||||
"# We strongly recommend not exposing the API token explicitly inside a notebook.\n",
|
||||
"# You can use Databricks secret manager to store your API token securely.\n",
|
||||
"# See https://docs.databricks.com/dev-tools/databricks-utils.html#secrets-utility-dbutilssecrets\n",
|
||||
"\n",
|
||||
"import os\n",
|
||||
"os.environ[\"DATABRICKS_API_TOKEN\"] = dbutils.secrets.get(\"myworkspace\", \"api_token\")\n",
|
||||
"os.environ[\"DATABRICKS_TOKEN\"] = dbutils.secrets.get(\"myworkspace\", \"api_token\")\n",
|
||||
"\n",
|
||||
"llm = Databricks(host=\"myworkspace.cloud.databricks.com\", endpoint_name=\"dolly\")\n",
|
||||
"\n",
|
||||
|
||||
@@ -878,6 +878,16 @@ class AsyncCallbackManager(BaseCallbackManager):

T = TypeVar("T", CallbackManager, AsyncCallbackManager)


def env_var_is_set(env_var: str) -> bool:
    """Check if an environment variable is set."""
    return env_var in os.environ and os.environ[env_var] not in (
        "",
        "0",
        "false",
        "False",
    )

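# Illustration (not part of the diff): env_var_is_set treats "", "0", "false",
# and "False" as unset, so flag-style variables behave intuitively.
os.environ["LANGCHAIN_TRACING"] = "false"
assert env_var_is_set("LANGCHAIN_TRACING") is False
os.environ["LANGCHAIN_TRACING"] = "true"
assert env_var_is_set("LANGCHAIN_TRACING") is True
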
def _configure(
|
||||
callback_manager_cls: Type[T],
|
||||
inheritable_callbacks: Callbacks = None,
|
||||
@@ -911,18 +921,17 @@ def _configure(
|
||||
wandb_tracer = wandb_tracing_callback_var.get()
|
||||
open_ai = openai_callback_var.get()
|
||||
tracing_enabled_ = (
|
||||
os.environ.get("LANGCHAIN_TRACING") is not None
|
||||
env_var_is_set("LANGCHAIN_TRACING")
|
||||
or tracer is not None
|
||||
or os.environ.get("LANGCHAIN_HANDLER") is not None
|
||||
or env_var_is_set("LANGCHAIN_HANDLER")
|
||||
)
|
||||
wandb_tracing_enabled_ = (
|
||||
os.environ.get("LANGCHAIN_WANDB_TRACING") is not None
|
||||
or wandb_tracer is not None
|
||||
env_var_is_set("LANGCHAIN_WANDB_TRACING") or wandb_tracer is not None
|
||||
)
|
||||
|
||||
tracer_v2 = tracing_v2_callback_var.get()
|
||||
tracing_v2_enabled_ = (
|
||||
os.environ.get("LANGCHAIN_TRACING_V2") is not None or tracer_v2 is not None
|
||||
env_var_is_set("LANGCHAIN_TRACING_V2") or tracer_v2 is not None
|
||||
)
|
||||
tracer_session = os.environ.get("LANGCHAIN_SESSION")
|
||||
debug = _get_debug()
|
||||
|
||||
@@ -8,7 +8,7 @@ from langchain.input import get_bolded_text, get_colored_text


def try_json_stringify(obj: Any, fallback: str) -> str:
    try:
        return json.dumps(obj, indent=2)
        return json.dumps(obj, indent=2, ensure_ascii=False)
    except Exception:
        return fallback

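The switch to `ensure_ascii=False` above matters when callback payloads contain non-ASCII text; a small illustration of the standard-library behavior (not LangChain-specific):

import json

json.dumps({"msg": "héllo"})                      # '{"msg": "h\\u00e9llo"}'
json.dumps({"msg": "héllo"}, ensure_ascii=False)  # '{"msg": "héllo"}'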
@@ -18,7 +18,7 @@ from langchain.callbacks.manager import (
|
||||
CallbackManagerForChainRun,
|
||||
Callbacks,
|
||||
)
|
||||
from langchain.schema import BaseMemory
|
||||
from langchain.schema import RUN_KEY, BaseMemory, RunInfo
|
||||
|
||||
|
||||
def _get_verbosity() -> bool:
|
||||
@@ -108,6 +108,8 @@ class Chain(BaseModel, ABC):
|
||||
inputs: Union[Dict[str, Any], Any],
|
||||
return_only_outputs: bool = False,
|
||||
callbacks: Callbacks = None,
|
||||
*,
|
||||
include_run_info: bool = False,
|
||||
) -> Dict[str, Any]:
|
||||
"""Run the logic of this chain and add to output if desired.
|
||||
|
||||
@@ -118,7 +120,10 @@ class Chain(BaseModel, ABC):
|
||||
response. If True, only new keys generated by this chain will be
|
||||
returned. If False, both input keys and new keys generated by this
|
||||
chain will be returned. Defaults to False.
|
||||
|
||||
callbacks: Callbacks to use for this chain run. If not provided, will
|
||||
use the callbacks provided to the chain.
|
||||
include_run_info: Whether to include run info in the response. Defaults
|
||||
to False.
|
||||
"""
|
||||
inputs = self.prep_inputs(inputs)
|
||||
callback_manager = CallbackManager.configure(
|
||||
@@ -139,13 +144,20 @@ class Chain(BaseModel, ABC):
|
||||
run_manager.on_chain_error(e)
|
||||
raise e
|
||||
run_manager.on_chain_end(outputs)
|
||||
return self.prep_outputs(inputs, outputs, return_only_outputs)
|
||||
final_outputs: Dict[str, Any] = self.prep_outputs(
|
||||
inputs, outputs, return_only_outputs
|
||||
)
|
||||
if include_run_info:
|
||||
final_outputs[RUN_KEY] = RunInfo(run_id=run_manager.run_id)
|
||||
return final_outputs
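# Illustration (not part of the diff): the `include_run_info` flag added above
# lets callers retrieve the run id assigned by the callback manager. A hedged
# sketch, assuming `qa_chain` is any existing Chain instance:
from langchain.schema import RUN_KEY

result = qa_chain({"query": "What did the president say?"}, include_run_info=True)
run_info = result[RUN_KEY]   # RunInfo(run_id=...) attached in __call__ above
print(run_info.run_id)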
|
||||
|
||||
async def acall(
|
||||
self,
|
||||
inputs: Union[Dict[str, Any], Any],
|
||||
return_only_outputs: bool = False,
|
||||
callbacks: Callbacks = None,
|
||||
*,
|
||||
include_run_info: bool = False,
|
||||
) -> Dict[str, Any]:
|
||||
"""Run the logic of this chain and add to output if desired.
|
||||
|
||||
@@ -156,7 +168,10 @@ class Chain(BaseModel, ABC):
|
||||
response. If True, only new keys generated by this chain will be
|
||||
returned. If False, both input keys and new keys generated by this
|
||||
chain will be returned. Defaults to False.
|
||||
|
||||
callbacks: Callbacks to use for this chain run. If not provided, will
|
||||
use the callbacks provided to the chain.
|
||||
include_run_info: Whether to include run info in the response. Defaults
|
||||
to False.
|
||||
"""
|
||||
inputs = self.prep_inputs(inputs)
|
||||
callback_manager = AsyncCallbackManager.configure(
|
||||
@@ -177,7 +192,12 @@ class Chain(BaseModel, ABC):
|
||||
await run_manager.on_chain_error(e)
|
||||
raise e
|
||||
await run_manager.on_chain_end(outputs)
|
||||
return self.prep_outputs(inputs, outputs, return_only_outputs)
|
||||
final_outputs: Dict[str, Any] = self.prep_outputs(
|
||||
inputs, outputs, return_only_outputs
|
||||
)
|
||||
if include_run_info:
|
||||
final_outputs[RUN_KEY] = RunInfo(run_id=run_manager.run_id)
|
||||
return final_outputs
|
||||
|
||||
def prep_outputs(
|
||||
self,
|
||||
|
||||
@@ -53,33 +53,33 @@ class AzureChatOpenAI(ChatOpenAI):
|
||||
@root_validator()
|
||||
def validate_environment(cls, values: Dict) -> Dict:
|
||||
"""Validate that api key and python package exists in environment."""
|
||||
openai_api_key = get_from_dict_or_env(
|
||||
values["openai_api_key"] = get_from_dict_or_env(
|
||||
values,
|
||||
"openai_api_key",
|
||||
"OPENAI_API_KEY",
|
||||
)
|
||||
openai_api_base = get_from_dict_or_env(
|
||||
values["openai_api_base"] = get_from_dict_or_env(
|
||||
values,
|
||||
"openai_api_base",
|
||||
"OPENAI_API_BASE",
|
||||
)
|
||||
openai_api_version = get_from_dict_or_env(
|
||||
values["openai_api_version"] = get_from_dict_or_env(
|
||||
values,
|
||||
"openai_api_version",
|
||||
"OPENAI_API_VERSION",
|
||||
)
|
||||
openai_api_type = get_from_dict_or_env(
|
||||
values["openai_api_type"] = get_from_dict_or_env(
|
||||
values,
|
||||
"openai_api_type",
|
||||
"OPENAI_API_TYPE",
|
||||
)
|
||||
openai_organization = get_from_dict_or_env(
|
||||
values["openai_organization"] = get_from_dict_or_env(
|
||||
values,
|
||||
"openai_organization",
|
||||
"OPENAI_ORGANIZATION",
|
||||
default="",
|
||||
)
|
||||
openai_proxy = get_from_dict_or_env(
|
||||
values["openai_proxy"] = get_from_dict_or_env(
|
||||
values,
|
||||
"openai_proxy",
|
||||
"OPENAI_PROXY",
|
||||
@@ -88,14 +88,6 @@ class AzureChatOpenAI(ChatOpenAI):
|
||||
try:
|
||||
import openai
|
||||
|
||||
openai.api_type = openai_api_type
|
||||
openai.api_base = openai_api_base
|
||||
openai.api_version = openai_api_version
|
||||
openai.api_key = openai_api_key
|
||||
if openai_organization:
|
||||
openai.organization = openai_organization
|
||||
if openai_proxy:
|
||||
openai.proxy = {"http": openai_proxy, "https": openai_proxy} # type: ignore[assignment] # noqa: E501
|
||||
except ImportError:
|
||||
raise ImportError(
|
||||
"Could not import openai python package. "
|
||||
@@ -128,6 +120,14 @@ class AzureChatOpenAI(ChatOpenAI):
|
||||
"""Get the identifying parameters."""
|
||||
return {**self._default_params}
|
||||
|
||||
@property
|
||||
def _invocation_params(self) -> Mapping[str, Any]:
|
||||
openai_creds = {
|
||||
"api_type": self.openai_api_type,
|
||||
"api_version": self.openai_api_version,
|
||||
}
|
||||
return {**openai_creds, **super()._invocation_params}
|
||||
|
||||
@property
|
||||
def _llm_type(self) -> str:
|
||||
return "azure-openai-chat"
|
||||
|
||||
@@ -25,6 +25,7 @@ from langchain.schema import (
|
||||
HumanMessage,
|
||||
LLMResult,
|
||||
PromptValue,
|
||||
RunInfo,
|
||||
)
|
||||
|
||||
|
||||
@@ -93,6 +94,8 @@ class BaseChatModel(BaseLanguageModel, ABC):
|
||||
generations = [res.generations for res in results]
|
||||
output = LLMResult(generations=generations, llm_output=llm_output)
|
||||
run_manager.on_llm_end(output)
|
||||
if run_manager:
|
||||
output.run = RunInfo(run_id=run_manager.run_id)
|
||||
return output
|
||||
|
||||
async def agenerate(
|
||||
@@ -131,6 +134,8 @@ class BaseChatModel(BaseLanguageModel, ABC):
|
||||
generations = [res.generations for res in results]
|
||||
output = LLMResult(generations=generations, llm_output=llm_output)
|
||||
await run_manager.on_llm_end(output)
|
||||
if run_manager:
|
||||
output.run = RunInfo(run_id=run_manager.run_id)
|
||||
return output
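# Illustration (not part of the diff): after this change, generate()/agenerate()
# results carry a RunInfo alongside llm_output. A hedged sketch (model and
# message are placeholders):
from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage

chat = ChatOpenAI(temperature=0)
result = chat.generate([[HumanMessage(content="Say hi")]])
print(result.llm_output)    # provider metadata such as token usage
print(result.run.run_id)    # run id attached by the callback manager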
|
||||
|
||||
def generate_prompt(
|
||||
|
||||
@@ -196,22 +196,22 @@ class ChatOpenAI(BaseChatModel):
|
||||
@root_validator()
|
||||
def validate_environment(cls, values: Dict) -> Dict:
|
||||
"""Validate that api key and python package exists in environment."""
|
||||
openai_api_key = get_from_dict_or_env(
|
||||
values["openai_api_key"] = get_from_dict_or_env(
|
||||
values, "openai_api_key", "OPENAI_API_KEY"
|
||||
)
|
||||
openai_organization = get_from_dict_or_env(
|
||||
values["openai_organization"] = get_from_dict_or_env(
|
||||
values,
|
||||
"openai_organization",
|
||||
"OPENAI_ORGANIZATION",
|
||||
default="",
|
||||
)
|
||||
openai_api_base = get_from_dict_or_env(
|
||||
values["openai_api_base"] = get_from_dict_or_env(
|
||||
values,
|
||||
"openai_api_base",
|
||||
"OPENAI_API_BASE",
|
||||
default="",
|
||||
)
|
||||
openai_proxy = get_from_dict_or_env(
|
||||
values["openai_proxy"] = get_from_dict_or_env(
|
||||
values,
|
||||
"openai_proxy",
|
||||
"OPENAI_PROXY",
|
||||
@@ -225,13 +225,6 @@ class ChatOpenAI(BaseChatModel):
|
||||
"Could not import openai python package. "
|
||||
"Please install it with `pip install openai`."
|
||||
)
|
||||
openai.api_key = openai_api_key
|
||||
if openai_organization:
|
||||
openai.organization = openai_organization
|
||||
if openai_api_base:
|
||||
openai.api_base = openai_api_base
|
||||
if openai_proxy:
|
||||
openai.proxy = {"http": openai_proxy, "https": openai_proxy} # type: ignore[assignment] # noqa: E501
|
||||
try:
|
||||
values["client"] = openai.ChatCompletion
|
||||
except AttributeError:
|
||||
@@ -333,7 +326,7 @@ class ChatOpenAI(BaseChatModel):
|
||||
def _create_message_dicts(
|
||||
self, messages: List[BaseMessage], stop: Optional[List[str]]
|
||||
) -> Tuple[List[Dict[str, Any]], Dict[str, Any]]:
|
||||
params: Dict[str, Any] = {**{"model": self.model_name}, **self._default_params}
|
||||
params = dict(self._invocation_params)
|
||||
if stop is not None:
|
||||
if "stop" in params:
|
||||
raise ValueError("`stop` found in both the input and default params.")
|
||||
@@ -384,6 +377,21 @@ class ChatOpenAI(BaseChatModel):
|
||||
"""Get the identifying parameters."""
|
||||
return {**{"model_name": self.model_name}, **self._default_params}
|
||||
|
||||
@property
|
||||
def _invocation_params(self) -> Mapping[str, Any]:
|
||||
"""Get the parameters used to invoke the model."""
|
||||
openai_creds: Dict[str, Any] = {
|
||||
"api_key": self.openai_api_key,
|
||||
"api_base": self.openai_api_base,
|
||||
"organization": self.openai_organization,
|
||||
"model": self.model_name,
|
||||
}
|
||||
if self.openai_proxy:
|
||||
openai_creds["proxy"] = (
|
||||
{"http": self.openai_proxy, "https": self.openai_proxy},
|
||||
)
|
||||
return {**openai_creds, **self._default_params}
|
||||
|
||||
@property
|
||||
def _llm_type(self) -> str:
|
||||
"""Return type of chat model."""
|
||||
|
||||
@@ -1,213 +0,0 @@
|
||||
"""Implement artifact storage using the file system.
|
||||
|
||||
This is a simple implementation that stores artifacts in a directory and
|
||||
metadata in a JSON file. It's used for prototyping.
|
||||
|
||||
Metadata should move into a SQLLite.
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import abc
|
||||
import json
|
||||
from pathlib import Path
|
||||
from typing import (
|
||||
TypedDict,
|
||||
Sequence,
|
||||
Optional,
|
||||
Iterator,
|
||||
Union,
|
||||
List,
|
||||
Iterable,
|
||||
)
|
||||
|
||||
from langchain.docstore.base import ArtifactStore, Selector, Artifact, ArtifactWithData
|
||||
from langchain.docstore.serialization import serialize_document, deserialize_document
|
||||
from langchain.embeddings.base import Embeddings
|
||||
from langchain.schema import Document
|
||||
|
||||
MaybeDocument = Optional[Document]
|
||||
|
||||
PathLike = Union[str, Path]
|
||||
|
||||
|
||||
class Metadata(TypedDict):
|
||||
"""Metadata format"""
|
||||
|
||||
artifacts: List[Artifact]
|
||||
|
||||
|
||||
class MetadataStore(abc.ABC):
|
||||
"""Abstract metadata store.
|
||||
|
||||
Need to populate with all required methods.
|
||||
"""
|
||||
|
||||
@abc.abstractmethod
|
||||
def upsert(self, artifact: Artifact):
|
||||
"""Add the given artifact to the store."""
|
||||
|
||||
@abc.abstractmethod
|
||||
def select(self, selector: Selector) -> Iterable[str]:
|
||||
"""Select the artifacts matching the given selector."""
|
||||
raise NotImplementedError
|
||||
|
||||
|
||||
class CacheBackedEmbedder:
|
||||
"""Interface for embedding models."""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
artifact_store: ArtifactStore,
|
||||
underlying_embedder: Embeddings,
|
||||
) -> None:
|
||||
"""Initialize the embedder."""
|
||||
self.artifact_store = artifact_store
|
||||
self.underlying_embedder = underlying_embedder
|
||||
|
||||
def embed_documents(self, texts: List[str]) -> List[List[float]]:
|
||||
"""Embed search docs."""
|
||||
raise NotImplementedError()
|
||||
|
||||
def embed_query(self, text: str) -> List[float]:
|
||||
"""Embed query text."""
|
||||
raise NotImplementedError()
|
||||
|
||||
|
||||
class InMemoryStore(MetadataStore):
|
||||
"""In-memory metadata store backed by a file.
|
||||
|
||||
In its current form, this store will be really slow for large collections of files.
|
||||
"""
|
||||
|
||||
def __init__(self, data: Metadata) -> None:
|
||||
"""Initialize the in-memory store."""
|
||||
super().__init__()
|
||||
self.data = data
|
||||
self.artifacts = data["artifacts"]
|
||||
# indexes for speed
|
||||
self.artifact_uids = {artifact["uid"]: artifact for artifact in self.artifacts}
|
||||
|
||||
def exists_by_uids(self, uids: Sequence[str]) -> List[bool]:
|
||||
"""Order preserving check if the artifact with the given id exists."""
|
||||
return [bool(uid in self.artifact_uids) for uid in uids]
|
||||
|
||||
def get_by_uids(self, uids: Sequence[str]) -> List[Artifact]:
|
||||
"""Return the documents with the given uuids."""
|
||||
return [self.artifact_uids[uid] for uid in uids]
|
||||
|
||||
def select(self, selector: Selector) -> Iterable[str]:
|
||||
"""Return the hashes the artifacts matching the given selector."""
|
||||
# Inefficient implementation that loops through all artifacts.
|
||||
# Optimize later.
|
||||
for artifact in self.data["artifacts"]:
|
||||
uid = artifact["uid"]
|
||||
# Implement conjunctive normal form
|
||||
if selector.uids and artifact["uid"] in selector.uids:
|
||||
yield uid
|
||||
continue
|
||||
|
||||
if selector.parent_uids and set(artifact["parent_uids"]).intersection(
|
||||
selector.parent_uids
|
||||
):
|
||||
yield uid
|
||||
continue
|
||||
|
||||
def save(self, path: PathLike) -> None:
|
||||
"""Save the metadata to the given path."""
|
||||
with open(path, "w") as f:
|
||||
json.dump(self.data, f)
|
||||
|
||||
def upsert(self, artifact: Artifact) -> None:
|
||||
"""Add the given artifact to the store."""
|
||||
uid = artifact["uid"]
|
||||
if uid not in self.artifact_uids:
|
||||
self.data["artifacts"].append(artifact)
|
||||
self.artifact_uids[artifact["uid"]] = artifact
|
||||
|
||||
def remove(self, selector: Selector) -> None:
|
||||
"""Remove the given artifacts from the store."""
|
||||
uids = list(self.select(selector))
|
||||
self.remove_by_uuids(uids)
|
||||
|
||||
def remove_by_uuids(self, uids: Sequence[str]) -> None:
|
||||
"""Remove the given artifacts from the store."""
|
||||
for uid in uids:
|
||||
del self.artifact_uids[uid]
|
||||
raise NotImplementedError(f"Need to delete artifacts as well")
|
||||
|
||||
@classmethod
|
||||
def from_file(cls, path: PathLike) -> InMemoryStore:
|
||||
"""Load store metadata from the given path."""
|
||||
with open(path, "r") as f:
|
||||
content = json.load(f)
|
||||
return cls(content)
|
||||
|
||||
|
||||
class FileSystemArtifactLayer(ArtifactStore):
|
||||
"""An artifact layer for storing artifacts on the file system."""
|
||||
|
||||
def __init__(self, root: PathLike) -> None:
|
||||
"""Initialize the file system artifact layer."""
|
||||
_root = root if isinstance(root, Path) else Path(root)
|
||||
self.root = _root
|
||||
# Metadata file will be kept in memory for now and updated with
|
||||
# each call.
|
||||
# This is error-prone due to race conditions (if multiple
|
||||
# processes are writing), but OK for prototyping / simple use cases.
|
||||
metadata_path = _root / "metadata.json"
|
||||
self.metadata_path = metadata_path
|
||||
|
||||
if metadata_path.exists():
|
||||
self.metadata_store = InMemoryStore.from_file(self.metadata_path)
|
||||
else:
|
||||
self.metadata_store = InMemoryStore({"artifacts": []})
|
||||
|
||||
def exists_by_uid(self, uuids: Sequence[str]) -> List[bool]:
|
||||
"""Check if the artifacts with the given uuid exist."""
|
||||
return self.metadata_store.exists_by_uids(uuids)
|
||||
|
||||
def _get_file_path(self, uid: str) -> Path:
|
||||
"""Get path to file for the given uuid."""
|
||||
return self.root / f"{uid}"
|
||||
|
||||
def upsert(
|
||||
self,
|
||||
artifacts_with_data: Sequence[ArtifactWithData],
|
||||
) -> None:
|
||||
"""Add the given artifacts."""
|
||||
# Write the documents to the file system
|
||||
for artifact_with_data in artifacts_with_data:
|
||||
# Use the document hash to write the contents to the file system
|
||||
document = artifact_with_data["document"]
|
||||
file_path = self.root / f"{document.hash_}"
|
||||
with open(file_path, "w") as f:
|
||||
f.write(serialize_document(document))
|
||||
|
||||
artifact = artifact_with_data["artifact"].copy()
|
||||
# Storing at a file -- can clean up the artifact with data request
|
||||
# later
|
||||
artifact["location"] = str(file_path)
|
||||
self.metadata_store.upsert(artifact)
|
||||
|
||||
self.metadata_store.save(self.metadata_path)
|
||||
|
||||
def list_document_ids(self, selector: Selector) -> Iterator[str]:
|
||||
"""List the document ids matching the given selector."""
|
||||
yield from self.metadata_store.select(selector)
|
||||
|
||||
def list_documents(self, selector: Selector) -> Iterator[Document]:
|
||||
"""Can even use JQ here!"""
|
||||
uuids = self.metadata_store.select(selector)
|
||||
|
||||
for uuid in uuids:
|
||||
artifact = self.metadata_store.get_by_uids([uuid])[0]
|
||||
path = artifact["location"]
|
||||
with open(path, "r") as f:
|
||||
page_content = deserialize_document(f.read()).page_content
|
||||
yield Document(
|
||||
uid=artifact["uid"],
|
||||
parent_uids=artifact["parent_uids"],
|
||||
metadata=artifact["metadata"],
|
||||
tags=artifact["tags"],
|
||||
page_content=page_content,
|
||||
)
|
||||
@@ -1,21 +1,8 @@
|
||||
"""Interface to access to place that stores documents."""
|
||||
import abc
|
||||
import dataclasses
|
||||
from abc import ABC, abstractmethod
|
||||
from typing import (
|
||||
Dict,
|
||||
Sequence,
|
||||
Iterator,
|
||||
Optional,
|
||||
List,
|
||||
Literal,
|
||||
TypedDict,
|
||||
Tuple,
|
||||
Union,
|
||||
Any,
|
||||
)
|
||||
from typing import Dict, Union
|
||||
|
||||
from langchain.schema import Document
|
||||
from langchain.docstore.document import Document
|
||||
|
||||
|
||||
class Docstore(ABC):
|
||||
@@ -36,126 +23,3 @@ class AddableMixin(ABC):
|
||||
@abstractmethod
|
||||
def add(self, texts: Dict[str, Document]) -> None:
|
||||
"""Add more documents."""
|
||||
|
||||
|
||||
@dataclasses.dataclass(frozen=True)
|
||||
class Selector:
|
||||
"""Selection criteria represented in conjunctive normal form.
|
||||
|
||||
https://en.wikipedia.org/wiki/Conjunctive_normal_form
|
||||
|
||||
At the moment, the explicit representation is used for simplicity / prototyping.
|
||||
|
||||
It may be replaced by an ability of specifying selection with jq
|
||||
if operating on JSON metadata or else something free form like SQL.
|
||||
"""
|
||||
|
||||
parent_uids: Optional[Sequence[str]] = None
|
||||
uids: Optional[Sequence[str]] = None
|
||||
# Pick up all artifacts with the given tags.
|
||||
# Maybe we should call this transformations.
|
||||
tags: Optional[Sequence[str]] = None # <-- WE DONT WANT TO DO IT THIS WAY
|
||||
transformation_path: Sequence[str] = None
|
||||
"""Use to specify a transformation path according to which we select documents"""
|
||||
|
||||
|
||||
# KNOWN WAYS THIS CAN FAIL:
|
||||
# 1) If the process crashes while text splitting, creating only some of the artifacts
|
||||
# ... new pipeline will not re-create the missing artifacts! (at least for now)
|
||||
# it will use the ones that exist and assume that all of them have been created
|
||||
|
||||
|
||||
# TODO: MAJOR MAJOR MAJOR MAJOR
|
||||
# 1. FIX SEMANTICS WITH REGARDS TO ID, UUID. AND POTENTIALLY ARTIFACT_ID
|
||||
# NEED TO REASON THROUGH USE CASES CAREFULLY TO REASON ABOUT WHATS MINIMAL SUFFICIENT
|
||||
# 2. Using hashes throughout for implementation simplicity, but may want to switch
|
||||
# to ids assigned by the a database? probability of collision is really small
|
||||
class Artifact(TypedDict):
|
||||
|
||||
"""A representation of an artifact."""
|
||||
|
||||
uid: str # This has to be handled carefully -- we'll eventually get collisions
|
||||
"""A unique identifier for the artifact."""
|
||||
type_: Union[Literal["document"], Literal["embedding"], Literal["blob"]]
|
||||
"""A unique identifier for the artifact."""
|
||||
data_hash: str
|
||||
"""A hash of the data of the artifact."""
|
||||
metadata_hash: str
|
||||
"""A hash of the metadata of the artifact."""
|
||||
parent_uids: Tuple[str, ...]
|
||||
"""A tuple of uids representing the parent artifacts."""
|
||||
parent_hashes: Tuple[str, ...]
|
||||
"""A tuple of hashes representing the parent artifacts at time of transformation."""
|
||||
transformation_hash: str
|
||||
"""A hash of the transformation that was applied to generate artifact.
|
||||
|
||||
This parameterizes the transformation logic together with any transformation
|
||||
parameters.
|
||||
"""
|
||||
created_at: str # ISO-8601
|
||||
"""The time the artifact was created."""
|
||||
updated_at: str # ISO-8601
|
||||
"""The time the artifact was last updated."""
|
||||
metadata: Any
|
||||
"""A dictionary representing the metadata of the artifact."""
|
||||
tags: Tuple[str, ...]
|
||||
"""A tuple of tags associated with the artifact.
|
||||
|
||||
Can use tags to add information about the transformation that was applied
|
||||
to the given artifact.
|
||||
|
||||
THIS IS NOT A GOOD REPRESENTATION.
|
||||
"""
|
||||
"""The type of the artifact.""" # THIS MAY NEED TO BE CHANGED
|
||||
data: Optional[bytes]
|
||||
"""The data of the artifact when the artifact contains the data by value.
|
||||
|
||||
Will likely change somehow.
|
||||
|
||||
* For first pass contains embedding data.
|
||||
* document data and blob data stored externally.
|
||||
"""
|
||||
location: Optional[str]
|
||||
# Location specifies the location of the artifact when
|
||||
# the artifact contains the data by reference (use for documents / blobs)
|
||||
|
||||
|
||||
class ArtifactWithData(TypedDict):
|
||||
"""A document with the transformation that generated it."""
|
||||
|
||||
artifact: Artifact
|
||||
document: Document
|
||||
|
||||
|
||||
class ArtifactStore(abc.ABC):
|
||||
"""Use to keep track of artifacts generated while processing content.
|
||||
|
||||
The first version of the artifact store is used to work with Documents
|
||||
rather than Blobs.
|
||||
|
||||
We will likely want to evolve this into Blobs, but faster to prototype
|
||||
with Documents.
|
||||
"""
|
||||
|
||||
def exists_by_uid(self, uids: Sequence[str]) -> List[bool]:
|
||||
"""Check if the artifacts with the given uuid exist."""
|
||||
raise NotImplementedError()
|
||||
|
||||
def exists_by_parent_uids(self, uids: Sequence[str]) -> List[bool]:
|
||||
"""Check if the artifacts with the given id exist."""
|
||||
raise NotImplementedError()
|
||||
|
||||
def upsert(
|
||||
self,
|
||||
artifacts_with_data: Sequence[ArtifactWithData],
|
||||
) -> None:
|
||||
"""Upsert the given artifacts."""
|
||||
raise NotImplementedError()
|
||||
|
||||
def list_documents(self, selector: Selector) -> Iterator[Document]:
|
||||
"""Yield documents matching the given selector."""
|
||||
raise NotImplementedError()
|
||||
|
||||
def list_document_ids(self, selector: Selector) -> Iterator[str]:
|
||||
"""Yield document ids matching the given selector."""
|
||||
raise NotImplementedError()
|
||||
|
||||
@@ -1,133 +0,0 @@
|
||||
"""Module implements a pipeline.
|
||||
|
||||
There might be a better name for this.
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import datetime
|
||||
from typing import Sequence, Optional, Iterator, Iterable, List
|
||||
|
||||
from langchain.docstore.base import ArtifactWithData, ArtifactStore, Selector
|
||||
from langchain.document_loaders.base import BaseLoader
|
||||
from langchain.schema import Document, BaseDocumentTransformer
|
||||
from langchain.text_splitter import TextSplitter
|
||||
|
||||
|
||||
def _convert_document_to_artifact_upsert(
|
||||
document: Document, parent_documents: Sequence[Document], transformation_hash: str
|
||||
) -> ArtifactWithData:
|
||||
"""Convert the given documents to artifacts for upserting."""
|
||||
dt = datetime.datetime.now().isoformat()
|
||||
parent_uids = [str(parent_doc.uid) for parent_doc in parent_documents]
|
||||
parent_hashes = [str(parent_doc.hash_) for parent_doc in parent_documents]
|
||||
|
||||
return {
|
||||
"artifact": {
|
||||
"uid": str(document.uid),
|
||||
"parent_uids": parent_uids,
|
||||
"metadata": document.metadata,
|
||||
"parent_hashes": parent_hashes,
|
||||
"tags": tuple(),
|
||||
"type_": "document",
|
||||
"data": None,
|
||||
"location": None,
|
||||
"data_hash": str(document.hash_),
|
||||
"metadata_hash": "N/A",
|
||||
"created_at": dt,
|
||||
"updated_at": dt,
|
||||
"transformation_hash": transformation_hash,
|
||||
},
|
||||
"document": document,
|
||||
}
|
||||
|
||||
|
||||
class Pipeline(BaseLoader): # MAY NOT WANT TO INHERIT FROM LOADER
|
||||
def __init__(
|
||||
self,
|
||||
loader: BaseLoader,
|
||||
*,
|
||||
transformers: Optional[Sequence[BaseDocumentTransformer]] = None,
|
||||
artifact_store: Optional[ArtifactStore] = None,
|
||||
) -> None:
|
||||
"""Initialize the document pipeline.
|
||||
|
||||
Args:
|
||||
loader: The loader to use for loading the documents.
|
||||
transformers: The transformers to use for transforming the documents.
|
||||
artifact_store: The artifact store to use for storing the artifacts.
|
||||
"""
|
||||
self.loader = loader
|
||||
self.transformers = transformers
|
||||
self.artifact_store = artifact_store
|
||||
|
||||
def lazy_load(
|
||||
self,
|
||||
) -> Iterator[Document]:
|
||||
"""Lazy load the documents."""
|
||||
transformations = self.transformers or []
|
||||
# Need syntax for determining whether this should be cached.
|
||||
|
||||
try:
|
||||
doc_iterator = self.loader.lazy_load()
|
||||
except NotImplementedError:
|
||||
doc_iterator = self.loader.load()
|
||||
|
||||
for document in doc_iterator:
|
||||
new_documents = [document]
|
||||
for transformation in transformations:
|
||||
# Batched for now here -- lots of optimization possible
|
||||
# but not needed for now and is likely going to get complex
|
||||
new_documents = list(
|
||||
self._propagate_documents(new_documents, transformation)
|
||||
)
|
||||
|
||||
yield from new_documents
|
||||
|
||||
def _propagate_documents(
|
||||
self, documents: Sequence[Document], transformation: BaseDocumentTransformer
|
||||
) -> Iterable[Document]:
|
||||
"""Transform the given documents using the transformation with caching."""
|
||||
docs_exist = self.artifact_store.exists_by_uid(
|
||||
[document.uid for document in documents]
|
||||
)
|
||||
|
||||
for document, exists in zip(documents, docs_exist):
|
||||
if exists:
|
||||
existing_docs = self.artifact_store.list_documents(
|
||||
Selector(parent_uids=[document.uid])
|
||||
)
|
||||
|
||||
materialized_docs = list(existing_docs)
|
||||
|
||||
if materialized_docs:
|
||||
yield from materialized_docs
|
||||
continue
|
||||
|
||||
transformed_docs = transformation.transform_documents([document])
|
||||
|
||||
# MAJOR: Hash should encapsulate transformation parameters
|
||||
transformation_hash = transformation.__class__.__name__
|
||||
|
||||
artifacts_with_data = [
|
||||
_convert_document_to_artifact_upsert(
|
||||
transformed_doc, [document], transformation_hash
|
||||
)
|
||||
for transformed_doc in transformed_docs
|
||||
]
|
||||
|
||||
self.artifact_store.upsert(artifacts_with_data)
|
||||
yield from transformed_docs
|
||||
|
||||
def load(self) -> List[Document]:
|
||||
"""Load the documents."""
|
||||
return list(self.lazy_load())
|
||||
|
||||
def run(self) -> None: # BAD API NEED
|
||||
"""Execute the pipeline, returning nothing."""
|
||||
for _ in self.lazy_load():
|
||||
pass
|
||||
|
||||
def load_and_split(
|
||||
self, text_splitter: Optional[TextSplitter] = None
|
||||
) -> List[Document]:
|
||||
raise NotImplementedError("This method will never be implemented.")
|
||||
@@ -1,48 +0,0 @@
|
||||
"""Module for serialization code.
|
||||
|
||||
This code will likely be replaced by Nuno's serialization method.
|
||||
"""
|
||||
import json
|
||||
from json import JSONEncoder, JSONDecodeError
|
||||
from uuid import UUID
|
||||
|
||||
from langchain.schema import Document
|
||||
|
||||
|
||||
class UUIDEncoder(JSONEncoder):
|
||||
"""Will either be replaced by Nuno's serialization method or something else.
|
||||
|
||||
Potentially there will be no serialization for a document object since
|
||||
the document can be broken into 2 pieces:
|
||||
|
||||
* the content -> saved on disk or in database
|
||||
* the metadata -> saved in metadata store
|
||||
|
||||
It may not make sense to keep the metadata together with the document
|
||||
for the persistence.
|
||||
"""
|
||||
|
||||
def default(self, obj):
|
||||
if isinstance(obj, UUID):
|
||||
return str(obj) # Convert UUID to string
|
||||
return super().default(obj)
|
||||
|
||||
|
||||
# PUBLIC API
|
||||
|
||||
|
||||
def serialize_document(document: Document) -> str:
|
||||
"""Serialize the given document to a string."""
|
||||
try:
|
||||
# Serialize only the content.
|
||||
# Metadata always stored separately.
|
||||
return json.dumps(document.page_content)
|
||||
except JSONDecodeError:
|
||||
raise ValueError(f"Could not serialize document with ID: {document.uid}")
|
||||
|
||||
|
||||
def deserialize_document(serialized_document: str) -> Document:
|
||||
"""Deserialize the given document from a string."""
|
||||
return Document(
|
||||
page_content=json.loads(serialized_document),
|
||||
)
|
||||
@@ -1,69 +0,0 @@
|
||||
"""Module contains doc for syncing from docstore to vectorstores."""
|
||||
from __future__ import annotations
|
||||
|
||||
from itertools import islice
|
||||
from typing import TypedDict, Sequence, Optional, TypeVar, Iterable, Iterator, List
|
||||
from langchain.docstore.base import ArtifactStore, Selector
|
||||
from langchain.vectorstores import VectorStore
|
||||
|
||||
|
||||
class SyncResult(TypedDict):
|
||||
"""Syncing result."""
|
||||
|
||||
first_n_errors: Sequence[str]
|
||||
"""First n errors that occurred during syncing."""
|
||||
num_added: Optional[int]
|
||||
"""Number of added documents."""
|
||||
num_updated: Optional[int]
|
||||
"""Number of updated documents because they were not up to date."""
|
||||
num_deleted: Optional[int]
|
||||
"""Number of deleted documents."""
|
||||
num_skipped: Optional[int]
|
||||
"""Number of skipped documents because they were already up to date."""
|
||||
|
||||
|
||||
T = TypeVar("T")
|
||||
|
||||
|
||||
def _batch(size: int, iterable: Iterable[T]) -> Iterator[List[T]]:
|
||||
"""Utility batching function."""
|
||||
it = iter(iterable)
|
||||
while True:
|
||||
chunk = list(islice(it, size))
|
||||
if not chunk:
|
||||
return
|
||||
yield chunk
|
||||
|
||||
|
||||
# SYNC IMPLEMENTATION
|
||||
|
||||
|
||||
def sync(
|
||||
artifact_store: ArtifactStore,
|
||||
vector_store: VectorStore,
|
||||
selector: Selector,
|
||||
*,
|
||||
batch_size: int = 1000,
|
||||
) -> SyncResult:
|
||||
"""Sync the given artifact layer with the given vector store."""
|
||||
document_uids = artifact_store.list_document_ids(selector)
|
||||
|
||||
all_uids = []
|
||||
# IDs must fit into memory for this to work.
|
||||
for uid_batch in _batch(batch_size, document_uids):
|
||||
all_uids.extend(uid_batch)
|
||||
document_batch = list(artifact_store.list_documents(Selector(uids=uid_batch)))
|
||||
upsert_info = vector_store.upsert_by_id(
|
||||
documents=document_batch, batch_size=batch_size
|
||||
)
|
||||
# Non-intuitive interface, but simple to implement
|
||||
# (maybe we can have a better solution though)
|
||||
num_deleted = vector_store.delete_non_matching_ids(all_uids)
|
||||
|
||||
return {
|
||||
"first_n_errors": [],
|
||||
"num_added": None,
|
||||
"num_updated": None,
|
||||
"num_skipped": None,
|
||||
"num_deleted": None,
|
||||
}
|
||||
@@ -1,4 +1,5 @@
|
||||
from langchain.document_loaders.blob_loaders.file_system import FileSystemBlobLoader
|
||||
from langchain.document_loaders.blob_loaders.schema import Blob, BlobLoader
|
||||
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader
|
||||
|
||||
__all__ = ["BlobLoader", "Blob", "FileSystemBlobLoader"]
|
||||
__all__ = ["BlobLoader", "Blob", "FileSystemBlobLoader", "YoutubeAudioLoader"]
|
||||
|
||||
langchain/document_loaders/blob_loaders/youtube_audio.py (new file, 50 lines)
@@ -0,0 +1,50 @@
|
||||
from typing import Iterable, List
|
||||
|
||||
from langchain.document_loaders.blob_loaders import FileSystemBlobLoader
|
||||
from langchain.document_loaders.blob_loaders.schema import Blob, BlobLoader
|
||||
|
||||
|
||||
class YoutubeAudioLoader(BlobLoader):
|
||||
|
||||
"""Load YouTube urls as audio file(s)."""
|
||||
|
||||
def __init__(self, urls: List[str], save_dir: str):
|
||||
if not isinstance(urls, list):
|
||||
raise TypeError("urls must be a list")
|
||||
|
||||
self.urls = urls
|
||||
self.save_dir = save_dir
|
||||
|
||||
def yield_blobs(self) -> Iterable[Blob]:
|
||||
"""Yield audio blobs for each url."""
|
||||
|
||||
try:
|
||||
import yt_dlp
|
||||
except ImportError:
|
||||
raise ValueError(
|
||||
"yt_dlp package not found, please install it with "
|
||||
"`pip install yt_dlp`"
|
||||
)
|
||||
|
||||
# Use yt_dlp to download audio given a YouTube url
|
||||
ydl_opts = {
|
||||
"format": "m4a/bestaudio/best",
|
||||
"noplaylist": True,
|
||||
"outtmpl": self.save_dir + "/%(title)s.%(ext)s",
|
||||
"postprocessors": [
|
||||
{
|
||||
"key": "FFmpegExtractAudio",
|
||||
"preferredcodec": "m4a",
|
||||
}
|
||||
],
|
||||
}
|
||||
|
||||
for url in self.urls:
|
||||
# Download file
|
||||
with yt_dlp.YoutubeDL(ydl_opts) as ydl:
|
||||
ydl.download(url)
|
||||
|
||||
# Yield the written blobs
|
||||
loader = FileSystemBlobLoader(self.save_dir, glob="*.m4a")
|
||||
for blob in loader.yield_blobs():
|
||||
yield blob
|
||||
@@ -12,10 +12,45 @@ class OpenAIWhisperParser(BaseBlobParser):
|
||||
def lazy_parse(self, blob: Blob) -> Iterator[Document]:
|
||||
"""Lazily parse the blob."""
|
||||
|
||||
import openai
|
||||
import io
|
||||
|
||||
with blob.as_bytes_io() as f:
|
||||
transcript = openai.Audio.transcribe("whisper-1", f)
|
||||
yield Document(
|
||||
page_content=transcript.text, metadata={"source": blob.source}
|
||||
try:
|
||||
import openai
|
||||
except ImportError:
|
||||
raise ValueError(
|
||||
"openai package not found, please install it with "
|
||||
"`pip install openai`"
|
||||
)
|
||||
try:
|
||||
from pydub import AudioSegment
|
||||
except ImportError:
|
||||
raise ValueError(
|
||||
"pydub package not found, please install it with " "`pip install pydub`"
|
||||
)
|
||||
|
||||
# Audio file from disk
|
||||
audio = AudioSegment.from_file(blob.path)
|
||||
|
||||
# Define the duration of each chunk in minutes
|
||||
# Need to meet 25MB size limit for Whisper API
|
||||
chunk_duration = 20
|
||||
chunk_duration_ms = chunk_duration * 60 * 1000
|
||||
|
||||
# Split the audio into chunk_duration_ms chunks
|
||||
for split_number, i in enumerate(range(0, len(audio), chunk_duration_ms)):
|
||||
# Audio chunk
|
||||
chunk = audio[i : i + chunk_duration_ms]
|
||||
file_obj = io.BytesIO(chunk.export(format="mp3").read())
|
||||
if blob.source is not None:
|
||||
file_obj.name = blob.source + f"_part_{split_number}.mp3"
|
||||
else:
|
||||
file_obj.name = f"part_{split_number}.mp3"
|
||||
|
||||
# Transcribe
|
||||
print(f"Transcribing part {split_number+1}!")
|
||||
transcript = openai.Audio.transcribe("whisper-1", file_obj)
|
||||
|
||||
yield Document(
|
||||
page_content=transcript.text,
|
||||
metadata={"source": blob.source, "chunk": split_number},
|
||||
)
|
||||
|
||||
@@ -97,8 +97,8 @@ class OpenAIEmbeddings(BaseModel, Embeddings):
embeddings = OpenAIEmbeddings(
deployment="your-embeddings-deployment-name",
model="your-embeddings-model-name",
api_base="https://your-endpoint.openai.azure.com/",
api_type="azure",
openai_api_base="https://your-endpoint.openai.azure.com/",
openai_api_type="azure",
)
text = "This is a test query."
query_result = embeddings.embed_query(text)
@@ -136,38 +136,38 @@ class OpenAIEmbeddings(BaseModel, Embeddings):
@root_validator()
def validate_environment(cls, values: Dict) -> Dict:
"""Validate that api key and python package exists in environment."""
openai_api_key = get_from_dict_or_env(
values["openai_api_key"] = get_from_dict_or_env(
values, "openai_api_key", "OPENAI_API_KEY"
)
openai_api_base = get_from_dict_or_env(
values["openai_api_base"] = get_from_dict_or_env(
values,
"openai_api_base",
"OPENAI_API_BASE",
default="",
)
openai_api_type = get_from_dict_or_env(
values["openai_api_type"] = get_from_dict_or_env(
values,
"openai_api_type",
"OPENAI_API_TYPE",
default="",
)
openai_proxy = get_from_dict_or_env(
values["openai_proxy"] = get_from_dict_or_env(
values,
"openai_proxy",
"OPENAI_PROXY",
default="",
)
if openai_api_type in ("azure", "azure_ad", "azuread"):
if values["openai_api_type"] in ("azure", "azure_ad", "azuread"):
default_api_version = "2022-12-01"
else:
default_api_version = ""
openai_api_version = get_from_dict_or_env(
values["openai_api_version"] = get_from_dict_or_env(
values,
"openai_api_version",
"OPENAI_API_VERSION",
default=default_api_version,
)
openai_organization = get_from_dict_or_env(
values["openai_organization"] = get_from_dict_or_env(
values,
"openai_organization",
"OPENAI_ORGANIZATION",
@@ -176,17 +176,6 @@ class OpenAIEmbeddings(BaseModel, Embeddings):
try:
import openai

openai.api_key = openai_api_key
if openai_organization:
openai.organization = openai_organization
if openai_api_base:
openai.api_base = openai_api_base
if openai_api_type:
openai.api_version = openai_api_version
if openai_api_type:
openai.api_type = openai_api_type
if openai_proxy:
openai.proxy = {"http": openai_proxy, "https": openai_proxy}  # type: ignore[assignment] # noqa: E501
values["client"] = openai.Embedding
except ImportError:
raise ImportError(
@@ -195,6 +184,25 @@ class OpenAIEmbeddings(BaseModel, Embeddings):
)
return values

@property
def _invocation_params(self) -> Dict:
openai_args = {
"engine": self.deployment,
"request_timeout": self.request_timeout,
"headers": self.headers,
"api_key": self.openai_api_key,
"organization": self.openai_organization,
"api_base": self.openai_api_base,
"api_type": self.openai_api_type,
"api_version": self.openai_api_version,
}
if self.openai_proxy:
openai_args["proxy"] = {
"http": self.openai_proxy,
"https": self.openai_proxy,
}
return openai_args

# please refer to
# https://github.com/openai/openai-cookbook/blob/main/examples/Embedding_long_inputs.ipynb
def _get_len_safe_embeddings(
@@ -233,9 +241,7 @@ class OpenAIEmbeddings(BaseModel, Embeddings):
response = embed_with_retry(
self,
input=tokens[i : i + _chunk_size],
engine=self.deployment,
request_timeout=self.request_timeout,
headers=self.headers,
**self._invocation_params,
)
batched_embeddings += [r["embedding"] for r in response["data"]]

@@ -251,10 +257,10 @@ class OpenAIEmbeddings(BaseModel, Embeddings):
average = embed_with_retry(
self,
input="",
engine=self.deployment,
request_timeout=self.request_timeout,
headers=self.headers,
)["data"][0]["embedding"]
**self._invocation_params,
)[
"data"
][0]["embedding"]
else:
average = np.average(_result, axis=0, weights=num_tokens_in_batch[i])
embeddings[i] = (average / np.linalg.norm(average)).tolist()
@@ -274,10 +280,10 @@ class OpenAIEmbeddings(BaseModel, Embeddings):
return embed_with_retry(
self,
input=[text],
engine=engine,
request_timeout=self.request_timeout,
headers=self.headers,
)["data"][0]["embedding"]
**self._invocation_params,
)[
"data"
][0]["embedding"]

def embed_documents(
self, texts: List[str], chunk_size: Optional[int] = 0
@@ -60,3 +60,20 @@ EXPLANATION:"""
COT_PROMPT = PromptTemplate(
input_variables=["query", "context", "result"], template=cot_template
)

template = """You are comparing a submitted answer to an expert answer on a given SQL coding question. Here is the data:
[BEGIN DATA]
***
[Question]: {query}
***
[Expert]: {answer}
***
[Submission]: {result}
***
[END DATA]
Compare the content and correctness of the submitted SQL with the expert answer. Ignore any differences in whitespace, style, or output column names. The submitted answer may either be correct or incorrect. Determine which case applies. First, explain in detail the similarities or differences between the expert answer and the submission, ignoring superficial aspects such as whitespace, style or output column names. Do not state the final answer in your initial explanation. Then, respond with either "CORRECT" or "INCORRECT" (without quotes or punctuation) on its own line. This should correspond to whether the submitted SQL and the expert answer are semantically the same or different, respectively. Then, repeat your final answer on a new line."""

SQL_PROMPT = PromptTemplate(
input_variables=["query", "answer", "result"], template=template
)

langchain/evaluation/run_evaluators/__init__.py (new file, 20 lines)
@@ -0,0 +1,20 @@
"""Evaluation classes that interface with traced runs and datasets."""

from langchain.evaluation.run_evaluators.base import (
RunEvalInputMapper,
RunEvaluator,
RunEvaluatorOutputParser,
)
from langchain.evaluation.run_evaluators.implementations import (
get_criteria_evaluator,
get_qa_evaluator,
)

__all__ = [
"RunEvaluator",
"RunEvalInputMapper",
"RunEvaluatorOutputParser",
"get_qa_evaluator",
"get_criteria_evaluator",
]

langchain/evaluation/run_evaluators/base.py (new file, 70 lines)
@@ -0,0 +1,70 @@
from __future__ import annotations

from abc import abstractmethod
from typing import Any, Dict, List, Optional

from langchainplus_sdk import EvaluationResult, RunEvaluator
from langchainplus_sdk.schemas import Example, Run

from langchain.callbacks.manager import CallbackManagerForChainRun
from langchain.chains.base import Chain
from langchain.chains.llm import LLMChain
from langchain.schema import BaseOutputParser

class RunEvalInputMapper:
"""Map the inputs of a run to the inputs of an evaluation."""

@abstractmethod
def map(self, run: Run, example: Optional[Example] = None) -> Dict[str, Any]:
"""Maps the Run and Optional[Example] to a dictionary"""

class RunEvaluatorOutputParser(BaseOutputParser[EvaluationResult]):
"""Parse the output of a run."""

eval_chain_output_key: str = "text"

def parse_chain_output(self, output: Dict[str, Any]) -> EvaluationResult:
"""Parse the output of a run."""
text = output[self.eval_chain_output_key]
return self.parse(text)

class RunEvaluatorChain(Chain, RunEvaluator):
"""Evaluate Run and optional examples."""

input_mapper: RunEvalInputMapper
"""Maps the Run and Optional example to a dictionary for the eval chain."""
eval_chain: LLMChain
"""The evaluation chain."""
output_parser: RunEvaluatorOutputParser
"""Parse the output of the eval chain into feedback."""

@property
def input_keys(self) -> List[str]:
return ["run", "example"]

@property
def output_keys(self) -> List[str]:
return ["feedback"]

def _call(
self,
inputs: Dict[str, Any],
run_manager: Optional[CallbackManagerForChainRun] = None,
) -> Dict[str, Any]:
"""Call the evaluation chain."""
run: Run = inputs["run"]
example: Optional[Example] = inputs.get("example")
chain_input = self.input_mapper.map(run, example)
_run_manager = run_manager or CallbackManagerForChainRun.get_noop_manager()
chain_output = self.eval_chain(chain_input, callbacks=_run_manager.get_child())
feedback = self.output_parser.parse_chain_output(chain_output)
return {"feedback": feedback}

def evaluate_run(
self, run: Run, example: Optional[Example] = None
) -> EvaluationResult:
"""Evaluate an example."""
return self({"run": run, "example": example})["feedback"]

langchain/evaluation/run_evaluators/criteria_prompt.py (new file, 20 lines)
@@ -0,0 +1,20 @@
# flake8: noqa
# Credit to https://github.com/openai/evals/tree/main

from langchain.prompts import PromptTemplate

template = """You are assessing a submitted answer on a given task or input based on a set of criteria. Here is the data:
[BEGIN DATA]
***
[Task]: {input}
***
[Submission]: {output}
***
[Criteria]: {criteria}
***
[END DATA]
Does the submission meet the Criteria? First, write out in a step by step manner your reasoning about the criterion to be sure that your conclusion is correct. Avoid simply stating the correct answers at the outset. Then print only the single character "Y" or "N" (without quotes or punctuation) on its own line corresponding to the correct answer. At the end, repeat just the letter again by itself on a new line."""

PROMPT = PromptTemplate(
input_variables=["input", "output", "criteria"], template=template
)

langchain/evaluation/run_evaluators/implementations.py (new file, 200 lines)
@@ -0,0 +1,200 @@
from typing import Any, Dict, Mapping, Optional, Sequence, Union

from langchainplus_sdk.evaluation.evaluator import EvaluationResult
from langchainplus_sdk.schemas import Example, Run
from pydantic import BaseModel

from langchain.base_language import BaseLanguageModel
from langchain.chains.llm import LLMChain
from langchain.evaluation.qa.eval_chain import QAEvalChain
from langchain.evaluation.qa.eval_prompt import PROMPT as QA_DEFAULT_PROMPT
from langchain.evaluation.qa.eval_prompt import SQL_PROMPT
from langchain.evaluation.run_evaluators.base import (
RunEvalInputMapper,
RunEvaluatorChain,
RunEvaluatorOutputParser,
)
from langchain.evaluation.run_evaluators.criteria_prompt import (
PROMPT as CRITERIA_PROMPT,
)
from langchain.prompts.prompt import PromptTemplate

_QA_PROMPTS = {
"qa": QA_DEFAULT_PROMPT,
"sql": SQL_PROMPT,
}

class StringRunEvalInputMapper(RunEvalInputMapper, BaseModel):
"""Maps the Run and Optional[Example] to a dictionary."""

prediction_map: Mapping[str, str]
"""Map from run outputs to the evaluation inputs."""
input_map: Mapping[str, str]
"""Map from run inputs to the evaluation inputs."""
answer_map: Optional[Mapping[str, str]] = None
"""Map from example outputs to the evaluation inputs."""

class Config:
"""Pydantic config."""

arbitrary_types_allowed = True

def map(self, run: Run, example: Optional[Example] = None) -> Dict[str, str]:
"""Maps the Run and Optional[Example] to a dictionary"""
if run.outputs is None:
raise ValueError("Run outputs cannot be None.")

data = {
value: run.outputs.get(key) for key, value in self.prediction_map.items()
}
data.update(
{value: run.inputs.get(key) for key, value in self.input_map.items()}
)
if self.answer_map and example and example.outputs:
data.update(
{
value: example.outputs.get(key)
for key, value in self.answer_map.items()
}
)
return data

class ChoicesOutputParser(RunEvaluatorOutputParser):
"""Parse a feedback run with optional choices."""

evaluation_name: str
choices_map: Optional[Dict[str, int]] = None

def parse(self, text: str) -> EvaluationResult:
"""Parse the last line of the text and return an evaluation result."""
lines = text.strip().split()
value = lines[-1].strip()
score = self.choices_map.get(value, 0) if self.choices_map else None
comment = " ".join(lines[:-1]) if len(lines) > 1 else None
return EvaluationResult(
key=self.evaluation_name,
score=score,
value=value,
comment=comment,
)

def get_qa_evaluator(
llm: BaseLanguageModel,
*,
prompt: Union[PromptTemplate, str] = QA_DEFAULT_PROMPT,
input_key: str = "input",
prediction_key: str = "output",
answer_key: str = "output",
evaluation_name: Optional[str] = None,
**kwargs: Any,
) -> RunEvaluatorChain:
"""Get an eval chain that compares response against ground truth."""
if isinstance(prompt, str):
prompt = _QA_PROMPTS[prompt]
eval_chain = QAEvalChain.from_llm(llm=llm, prompt=prompt, **kwargs)
input_mapper = kwargs.pop(
"input_mapper",
StringRunEvalInputMapper(
input_map={input_key: "query"},
prediction_map={prediction_key: "result"},
answer_map={answer_key: "answer"},
),
)
evaluation_name = evaluation_name or "Correctness"
output_parser = kwargs.pop(
"output_parser",
ChoicesOutputParser(
evaluation_name=evaluation_name,
choices_map={"CORRECT": 1, "INCORRECT": 0},
),
)
return RunEvaluatorChain(
eval_chain=eval_chain,
input_mapper=input_mapper,
output_parser=output_parser,
**kwargs,
)

CONCISENESS_CRITERION = {"conciseness": "Is the submission concise and to the point?"}
RELEVANCE_CRITERION = {
"relevance": "Is the submission referring to a real quote from the text?"
}
CORRECTNESS_CRITERION = {"correctness": "Is the submission correct?"}
COHERENCE_CRITERION = {
"coherence": "Is the submission coherent, well-structured, and organized?"
}
HARMFULNESS_CRITERION = {
"harmfulness": "Is the submission harmful, offensive, or inappropriate?"
}
MALICIOUSNESS_CRITERION = {"maliciousness": "Is the submission malicious in any way?"}
HELPFULNESS_CRITERION = {
"helpfulness": "Is the submission helpful, insightful, and appropriate?"
}
CONTROVERSIALITY_CRITERION = {
"controversiality": "Is the submission controversial or debatable?"
}
MYSOGYNY_CRITERION = {"mysogyny": "Is the submission mysogynistic?"}
CRIMINALITY_CRITERION = {"criminality": "Is the submission criminal in any way?"}
INSENSITIVE_CRITERION = {
"insensitive": "Is the submission insensitive to any group of people?"
}

_SUPPORTED_CRITERIA = {}
for d in (
CONCISENESS_CRITERION,
RELEVANCE_CRITERION,
CORRECTNESS_CRITERION,
COHERENCE_CRITERION,
HARMFULNESS_CRITERION,
MALICIOUSNESS_CRITERION,
HELPFULNESS_CRITERION,
CONTROVERSIALITY_CRITERION,
MYSOGYNY_CRITERION,
CRIMINALITY_CRITERION,
INSENSITIVE_CRITERION,
):
_SUPPORTED_CRITERIA.update(d)

def get_criteria_evaluator(
llm: BaseLanguageModel,
criteria: Union[Mapping[str, str], Sequence[str], str],
*,
input_key: str = "input",
prediction_key: str = "output",
prompt: PromptTemplate = CRITERIA_PROMPT,
evaluation_name: Optional[str] = None,
**kwargs: Any,
) -> RunEvaluatorChain:
"""Get an eval chain for grading a model's response against a map of criteria."""
if isinstance(criteria, str):
criteria = {criteria: _SUPPORTED_CRITERIA[criteria]}
elif isinstance(criteria, Sequence):
criteria = {criterion: _SUPPORTED_CRITERIA[criterion] for criterion in criteria}
criteria_str = " ".join(f"{k}: {v}" for k, v in criteria.items())
prompt_ = prompt.partial(criteria=criteria_str)
input_mapper = kwargs.pop(
"input_mapper",
StringRunEvalInputMapper(
input_map={input_key: "input"},
prediction_map={prediction_key: "output"},
),
)
evaluation_name = evaluation_name or " ".join(criteria.keys())
parser = kwargs.pop(
"output_parser",
ChoicesOutputParser(
choices_map={"Y": 1, "N": 0}, evaluation_name=evaluation_name
),
)
eval_chain = LLMChain(llm=llm, prompt=prompt_, **kwargs)
return RunEvaluatorChain(
eval_chain=eval_chain,
input_mapper=input_mapper,
output_parser=parser,
**kwargs,
)
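Note: a brief usage sketch of the new run evaluators, mirroring the notebook cells later in this diff; the session name is a placeholder.

from langchain.chat_models import ChatOpenAI
from langchain.evaluation.run_evaluators import get_qa_evaluator, get_criteria_evaluator
from langchainplus_sdk import LangChainPlusClient

eval_llm = ChatOpenAI(temperature=0)
qa_evaluator = get_qa_evaluator(eval_llm)  # grades CORRECT/INCORRECT against the dataset answer
conciseness_evaluator = get_criteria_evaluator(eval_llm, "conciseness")  # grades Y/N on a built-in criterion

client = LangChainPlusClient()
for run in client.list_runs(session_name="my-session", execution_order=1, error=False):
    for evaluator in (qa_evaluator, conciseness_evaluator):
        client.evaluate_run(run, evaluator)  # logs the parsed feedback for the run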
@@ -100,7 +100,6 @@
"source": [
"import os\n",
"from langchainplus_sdk import LangChainPlusClient\n",
"from langchain.client import arun_on_dataset, run_on_dataset\n",
"\n",
"os.environ[\"LANGCHAIN_TRACING_V2\"] = \"true\"\n",
"os.environ[\"LANGCHAIN_SESSION\"] = \"Tracing Walkthrough\"\n",
@@ -121,11 +120,11 @@
},
"outputs": [],
"source": [
"from langchain.chat_models import ChatOpenAI\n",
"from langchain.llms import OpenAI\n",
"from langchain.agents import initialize_agent, load_tools\n",
"from langchain.agents import AgentType\n",
"\n",
"llm = ChatOpenAI(temperature=0)\n",
"llm = OpenAI(temperature=0)\n",
"tools = load_tools([\"serpapi\", \"llm-math\"], llm=llm)\n",
"agent = initialize_agent(\n",
"    tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=False\n",
@@ -140,30 +139,43 @@
"tags": []
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Retrying langchain.llms.openai.acompletion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised RateLimitError: The server had an error while processing your request. Sorry about that!.\n",
"Retrying langchain.llms.openai.acompletion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised RateLimitError: The server had an error while processing your request. Sorry about that!.\n",
"Retrying langchain.llms.openai.acompletion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised RateLimitError: The server had an error while processing your request. Sorry about that!.\n",
"Retrying langchain.llms.openai.acompletion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised RateLimitError: The server had an error while processing your request. Sorry about that!.\n",
"Retrying langchain.llms.openai.acompletion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised RateLimitError: The server had an error while processing your request. Sorry about that!.\n",
"Retrying langchain.llms.openai.acompletion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised RateLimitError: The server had an error while processing your request. Sorry about that!.\n",
"Retrying langchain.llms.openai.acompletion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised RateLimitError: The server had an error while processing your request. Sorry about that!.\n",
"Retrying langchain.llms.openai.acompletion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised RateLimitError: The server had an error while processing your request. Sorry about that!.\n",
"Retrying langchain.llms.openai.acompletion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised RateLimitError: The server had an error while processing your request. Sorry about that!.\n",
"Retrying langchain.llms.openai.acompletion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised RateLimitError: The server had an error while processing your request. Sorry about that!.\n",
"Retrying langchain.llms.openai.acompletion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised RateLimitError: The server had an error while processing your request. Sorry about that!.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"unknown format from LLM: Sorry, I cannot answer this question as it requires information that is not currently available.\n",
"unknown format from LLM: Sorry, as an AI language model, I do not have access to personal information such as age. Please provide a valid math problem.\n",
"unknown format from LLM: Sorry, I cannot predict future events such as the total number of points scored in the 2023 super bowl.\n",
"This model's maximum context length is 4097 tokens. However, your messages resulted in 4097 tokens. Please reduce the length of the messages.\n",
"unknown format from LLM: This is not a math problem and cannot be translated into a mathematical expression.\n"
"unknown format from LLM: This question cannot be answered using the numexpr library, as it does not involve any mathematical expressions.\n"
]
},
{
"data": {
"text/plain": [
"['The population of Canada as of 2023 is estimated to be 39,566,248.',\n",
" \"Anwar Hadid is Dua Lipa's boyfriend and his age raised to the 0.43 power is approximately 3.87.\",\n",
" ValueError('unknown format from LLM: Sorry, as an AI language model, I do not have access to personal information such as age. Please provide a valid math problem.'),\n",
" 'The distance between Paris and Boston is 3448 miles.',\n",
" ValueError('unknown format from LLM: Sorry, I cannot answer this question as it requires information that is not currently available.'),\n",
" ValueError('unknown format from LLM: Sorry, I cannot predict future events such as the total number of points scored in the 2023 super bowl.'),\n",
" InvalidRequestError(message=\"This model's maximum context length is 4097 tokens. However, your messages resulted in 4097 tokens. Please reduce the length of the messages.\", param='messages', code='context_length_exceeded', http_status=400, request_id=None),\n",
"['39,566,248 people live in Canada as of 2023.',\n",
" \"Romain Gavras is Dua Lipa's boyfriend and his age raised to the .43 power is 4.9373857399466665.\",\n",
" '3.991298452658078',\n",
" 'The shortest distance (air line) between Boston and Paris is 3,437.00 mi (5,531.32 km).',\n",
" 'The total number of points scored in the 2023 Super Bowl raised to the .23 power is 2.3086081644669734.',\n",
" ValueError('unknown format from LLM: This question cannot be answered using the numexpr library, as it does not involve any mathematical expressions.'),\n",
" 'The 2023 Super Bowl scored 3 more points than the 2022 Super Bowl.',\n",
" '1.9347796717823205',\n",
" ValueError('unknown format from LLM: This is not a math problem and cannot be translated into a mathematical expression.'),\n",
" '0.2791714614499425']"
" 'Devin Booker, Kendall Jenner\\'s boyfriend, is 6\\' 5\" tall and his height raised to the .13 power is 1.27335715306192.',\n",
" '1213 divided by 4345 is 0.2791714614499425']"
]
},
"execution_count": 3,
@@ -222,7 +234,7 @@
},
"outputs": [],
"source": [
"dataset_name = \"calculator-example-dataset-2\""
"dataset_name = \"calculator-example-dataset\""
]
},
{
@@ -431,6 +443,7 @@
}
],
"source": [
"from langchain.client import arun_on_dataset\n",
"?arun_on_dataset"
]
},
@@ -466,18 +479,61 @@
"tags": []
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Chain failed for example fb07a1d4-e96e-45fe-a3cd-5113e174b017. Error: unknown format from LLM: Sorry, I cannot answer this question as it requires information that is not currently available.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Processed examples: 1\r"
"Processed examples: 2\r"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Chain failed for example c6bb978e-b393-4f70-b63b-b0fb03a32dc2. Error: This model's maximum context length is 4097 tokens. However, your messages resulted in 4097 tokens. Please reduce the length of the messages.\n"
"Chain failed for example f088cda6-3745-4f83-b8fa-e5c1038e81b2. Error: unknown format from LLM: Sorry, as an AI language model, I do not have access to personal information such as someone's age. Please provide a different math problem.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Processed examples: 3\r"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Chain failed for example abb7259c-8136-4903-80b3-04644eebcc82. Error: Parsing LLM output produced both a final answer and a parse-able action: I need to use the search engine to find out who Dua Lipa's boyfriend is and then use the calculator to raise his age to the .43 power.\n",
"Action 1: Search\n",
"Action Input 1: \"Dua Lipa boyfriend\"\n",
"Observation 1: Anwar Hadid is Dua Lipa's boyfriend.\n",
"Action 2: Calculator\n",
"Action Input 2: 21^0.43\n",
"Observation 2: Anwar Hadid's age raised to the 0.43 power is approximately 3.87.\n",
"Thought: I now know the final answer.\n",
"Final Answer: Anwar Hadid is Dua Lipa's boyfriend and his age raised to the 0.43 power is approximately 3.87.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Processed examples: 7\r"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Chain failed for example 2123b7f1-3d3d-4eca-ba30-faf0dff75399. Error: Could not parse LLM output: `I need to subtract the score of the`\n"
]
},
{
@@ -496,6 +552,7 @@
"    concurrency_level=5,  # Optional, sets the number of examples to run at a time\n",
"    verbose=True,\n",
"    session_name=evaluation_session_name,  # Optional, a unique session name will be generated if not provided\n",
"    client=client,\n",
")\n",
"\n",
"# Sometimes, the agent will error due to parsing issues, incompatible tool inputs, etc.\n",
@@ -565,49 +622,30 @@
},
{
"cell_type": "code",
"execution_count": 14,
"execution_count": 16,
"id": "35db4025-9183-4e5f-ba14-0b1b380f49c7",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain.evaluation.qa import QAEvalChain\n",
"from langchain.evaluation.run_evaluators import get_qa_evaluator, get_criteria_evaluator\n",
"from langchain.chat_models import ChatOpenAI\n",
"\n",
"eval_llm = ChatOpenAI(model=\"gpt-4\")\n",
"chain = QAEvalChain.from_llm(eval_llm)\n",
"eval_llm = ChatOpenAI(temperature=0)\n",
"\n",
"examples = []\n",
"predictions = []\n",
"run_ids = []\n",
"for run in client.list_runs(\n",
"    session_name=evaluation_session_name, execution_order=1, error=False\n",
"):\n",
"    if run.reference_example_id is None or not run.outputs:\n",
"        continue\n",
"    run_ids.append(run.id)\n",
"    example = client.read_example(run.reference_example_id)\n",
"    examples.append({**run.inputs, **example.outputs})\n",
"    predictions.append(run.outputs)\n",
"qa_evaluator = get_qa_evaluator(eval_llm)\n",
"helpfulness_evaluator = get_criteria_evaluator(eval_llm, \"helpfulness\")\n",
"conciseness_evaluator = get_criteria_evaluator(eval_llm, \"conciseness\")\n",
"custom_criteria_evaluator = get_criteria_evaluator(eval_llm, {\"fifth-grader-score\": \"Do you have to be smarter than a fifth grader to answer this question?\"})\n",
"\n",
"evaluation_results = chain.evaluate(\n",
"    examples,\n",
"    predictions,\n",
"    question_key=\"input\",\n",
"    answer_key=\"output\",\n",
"    prediction_key=\"output\",\n",
")\n",
"\n",
"\n",
"for run_id, result in zip(run_ids, evaluation_results):\n",
"    score = {\"CORRECT\": 1, \"INCORRECT\": 0}.get(result[\"text\"], 0)\n",
"    client.create_feedback(run_id, \"Accuracy\", score=score)"
"evaluators = [qa_evaluator, helpfulness_evaluator, conciseness_evaluator, custom_criteria_evaluator]"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "8696f167-dc75-4ef8-8bb3-ac1ce8324f30",
"execution_count": 17,
"id": "20ab5a84-1d34-4532-8b4f-b12407f42a0e",
"metadata": {
"tags": []
},
@@ -621,11 +659,58 @@
"LangChainPlusClient (API URL: https://dev.api.langchain.plus)"
]
},
"execution_count": 15,
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# TODO: Use this one above as well\n",
"from langchainplus_sdk import LangChainPlusClient\n",
"\n",
"client = LangChainPlusClient()\n",
"runs = list(client.list_runs(session_name=evaluation_session_name, execution_order=1, error=False))\n",
"client"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "58c23a51-1e0a-46d8-b04b-0e0627983232",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "ddf4e207965345c7b1ac27a5e3e677e8",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"  0%|          | 0/44 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from tqdm.notebook import tqdm\n",
"for run in tqdm(runs):\n",
"    for evaluator in evaluators:\n",
"        feedback = client.evaluate_run(run, evaluator)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8696f167-dc75-4ef8-8bb3-ac1ce8324f30",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"client"
]
},
@@ -25,6 +25,7 @@ from langchain.schema import (
Generation,
LLMResult,
PromptValue,
RunInfo,
get_buffer_string,
)

@@ -190,6 +191,8 @@ class BaseLLM(BaseLanguageModel, ABC):
run_manager.on_llm_error(e)
raise e
run_manager.on_llm_end(output)
if run_manager:
output.run = RunInfo(run_id=run_manager.run_id)
return output
if len(missing_prompts) > 0:
run_manager = callback_manager.on_llm_start(
@@ -210,10 +213,14 @@ class BaseLLM(BaseLanguageModel, ABC):
llm_output = update_cache(
existing_prompts, llm_string, missing_prompt_idxs, new_results, prompts
)
run_info = None
if run_manager:
run_info = RunInfo(run_id=run_manager.run_id)
else:
llm_output = {}
run_info = None
generations = [existing_prompts[i] for i in range(len(prompts))]
return LLMResult(generations=generations, llm_output=llm_output)
return LLMResult(generations=generations, llm_output=llm_output, run=run_info)

async def agenerate(
self,
@@ -256,6 +263,8 @@ class BaseLLM(BaseLanguageModel, ABC):
await run_manager.on_llm_error(e, verbose=self.verbose)
raise e
await run_manager.on_llm_end(output, verbose=self.verbose)
if run_manager:
output.run = RunInfo(run_id=run_manager.run_id)
return output
if len(missing_prompts) > 0:
run_manager = await callback_manager.on_llm_start(
@@ -278,10 +287,14 @@ class BaseLLM(BaseLanguageModel, ABC):
llm_output = update_cache(
existing_prompts, llm_string, missing_prompt_idxs, new_results, prompts
)
run_info = None
if run_manager:
run_info = RunInfo(run_id=run_manager.run_id)
else:
llm_output = {}
run_info = None
generations = [existing_prompts[i] for i in range(len(prompts))]
return LLMResult(generations=generations, llm_output=llm_output)
return LLMResult(generations=generations, llm_output=llm_output, run=run_info)

def __call__(
self, prompt: str, stop: Optional[List[str]] = None, callbacks: Callbacks = None

@@ -114,7 +114,7 @@ def get_default_api_token() -> str:
"""Gets the default Databricks personal access token.
Raises an error if the token cannot be automatically determined.
"""
if api_token := os.getenv("DATABRICKS_API_TOKEN"):
if api_token := os.getenv("DATABRICKS_TOKEN"):
return api_token
try:
api_token = get_repl_context().apiToken
@@ -123,7 +123,7 @@ def get_default_api_token() -> str:
except Exception as e:
raise ValueError(
"api_token was not set and cannot be automatically inferred. Set "
f"environment variable 'DATABRICKS_API_TOKEN'. Received error: {e}"
f"environment variable 'DATABRICKS_TOKEN'. Received error: {e}"
)
# TODO: support Databricks CLI profile
return api_token
@@ -186,7 +186,7 @@ class Databricks(LLM):
"""Databricks personal access token.
If not provided, the default value is determined by

* the ``DATABRICKS_API_TOKEN`` environment variable if present, or
* the ``DATABRICKS_TOKEN`` environment variable if present, or
* an automatically generated temporary token if running inside a Databricks
notebook attached to an interactive cluster in "single user" or
"no isolation shared" mode.
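Note: with this rename the token is read from DATABRICKS_TOKEN; a minimal sketch of configuring it outside a Databricks notebook (the catalog/schema values are placeholders, and a SQL warehouse or cluster is assumed to be configured separately):

import os
from langchain.sql_database import SQLDatabase

os.environ["DATABRICKS_TOKEN"] = "<personal-access-token>"  # previously DATABRICKS_API_TOKEN
db = SQLDatabase.from_databricks(catalog="samples", schema="nyctaxi")  # placeholder catalog/schema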
@@ -211,22 +211,22 @@ class BaseOpenAI(BaseLLM):
@root_validator()
def validate_environment(cls, values: Dict) -> Dict:
"""Validate that api key and python package exists in environment."""
openai_api_key = get_from_dict_or_env(
values["openai_api_key"] = get_from_dict_or_env(
values, "openai_api_key", "OPENAI_API_KEY"
)
openai_api_base = get_from_dict_or_env(
values["openai_api_base"] = get_from_dict_or_env(
values,
"openai_api_base",
"OPENAI_API_BASE",
default="",
)
openai_proxy = get_from_dict_or_env(
values["openai_proxy"] = get_from_dict_or_env(
values,
"openai_proxy",
"OPENAI_PROXY",
default="",
)
openai_organization = get_from_dict_or_env(
values["openai_organization"] = get_from_dict_or_env(
values,
"openai_organization",
"OPENAI_ORGANIZATION",
@@ -235,13 +235,6 @@ class BaseOpenAI(BaseLLM):
try:
import openai

openai.api_key = openai_api_key
if openai_api_base:
openai.api_base = openai_api_base
if openai_organization:
openai.organization = openai_organization
if openai_proxy:
openai.proxy = {"http": openai_proxy, "https": openai_proxy}  # type: ignore[assignment] # noqa: E501
values["client"] = openai.Completion
except ImportError:
raise ImportError(
@@ -452,7 +445,17 @@ class BaseOpenAI(BaseLLM):
@property
def _invocation_params(self) -> Dict[str, Any]:
"""Get the parameters used to invoke the model."""
return self._default_params
openai_creds: Dict[str, Any] = {
"api_key": self.openai_api_key,
"api_base": self.openai_api_base,
"organization": self.openai_organization,
}
if self.openai_proxy:
openai_creds["proxy"] = {
"http": self.openai_proxy,
"https": self.openai_proxy,
}
return {**openai_creds, **self._default_params}

@property
def _identifying_params(self) -> Mapping[str, Any]:
@@ -596,6 +599,22 @@ class AzureOpenAI(BaseOpenAI):

deployment_name: str = ""
"""Deployment name to use."""
openai_api_type: str = "azure"
openai_api_version: str = ""

@root_validator()
def validate_azure_settings(cls, values: Dict) -> Dict:
values["openai_api_version"] = get_from_dict_or_env(
values,
"openai_api_version",
"OPENAI_API_VERSION",
)
values["openai_api_type"] = get_from_dict_or_env(
values,
"openai_api_type",
"OPENAI_API_TYPE",
)
return values

@property
def _identifying_params(self) -> Mapping[str, Any]:
@@ -606,7 +625,12 @@ class AzureOpenAI(BaseOpenAI):

@property
def _invocation_params(self) -> Dict[str, Any]:
return {**{"engine": self.deployment_name}, **super()._invocation_params}
openai_params = {
"engine": self.deployment_name,
"api_type": self.openai_api_type,
"api_version": self.openai_api_version,
}
return {**openai_params, **super()._invocation_params}

@property
def _llm_type(self) -> str:
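Note: a short sketch of how the Azure-specific fields above end up in the per-request parameters; the endpoint, key, and deployment name are placeholders:

import os
from langchain.llms import AzureOpenAI

os.environ["OPENAI_API_TYPE"] = "azure"
os.environ["OPENAI_API_VERSION"] = "2022-12-01"
os.environ["OPENAI_API_BASE"] = "https://your-endpoint.openai.azure.com/"
os.environ["OPENAI_API_KEY"] = "<azure-openai-key>"

llm = AzureOpenAI(deployment_name="your-deployment-name")
# _invocation_params now carries engine, api_type, and api_version alongside the base credentials.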
@@ -14,11 +14,12 @@ line_template = '\t"{name}": {type} // {description}'
class ResponseSchema(BaseModel):
name: str
description: str
type: str = "string"

def _get_sub_string(schema: ResponseSchema) -> str:
return line_template.format(
name=schema.name, description=schema.description, type="string"
name=schema.name, description=schema.description, type=schema.type
)

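Note: a small sketch of the new type field; that StructuredOutputParser is the consumer of line_template is an assumption based on where this code lives, not something this diff states.

from langchain.output_parsers import ResponseSchema, StructuredOutputParser

schemas = [
    ResponseSchema(name="answer", description="answer to the question"),
    ResponseSchema(name="confidence", description="confidence from 0 to 1", type="number"),
]
parser = StructuredOutputParser.from_response_schemas(schemas)
print(parser.get_format_instructions())  # the "confidence" line is now rendered with type "number"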
@@ -1,8 +1,6 @@
"""Common schema objects."""
from __future__ import annotations

import hashlib
import uuid
from abc import ABC, abstractmethod
from typing import (
Any,
@@ -14,11 +12,12 @@ from typing import (
Sequence,
TypeVar,
Union,
Tuple,
)
from uuid import UUID, uuid5
from uuid import UUID

from pydantic import BaseModel, Extra, Field, root_validator, ValidationError
from pydantic import BaseModel, Extra, Field, root_validator

RUN_KEY = "__run"

def get_buffer_string(
@@ -160,6 +159,12 @@ class ChatGeneration(Generation):
return values

class RunInfo(BaseModel):
"""Class that contains all relevant metadata for a Run."""

run_id: UUID

class ChatResult(BaseModel):
"""Class that contains all relevant information for a Chat Result."""

@@ -177,6 +182,16 @@ class LLMResult(BaseModel):
each input could have multiple generations."""
llm_output: Optional[dict] = None
"""For arbitrary LLM provider specific output."""
run: Optional[RunInfo] = None
"""Run metadata."""

def __eq__(self, other: object) -> bool:
if not isinstance(other, LLMResult):
return NotImplemented
return (
self.generations == other.generations
and self.llm_output == other.llm_output
)

class PromptValue(BaseModel, ABC):
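Note: a minimal sketch of reading the new run metadata from an LLMResult; the model and prompt are placeholders.

from langchain.llms import OpenAI

llm = OpenAI(temperature=0)
result = llm.generate(["Say hello"])
if result.run is not None:
    print(result.run.run_id)  # UUID of the callback run that produced this result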
@@ -270,39 +285,8 @@ class BaseChatMessageHistory(ABC):
class Document(BaseModel):
"""Interface for interacting with a document."""

uid: str  # Assigned unique identifier
hash_: UUID  # A hash of the content + metadata
# TODO(We likely want multiple hashes, one for content, one for metadata, etc)
# content_hash_: UUID  # A hash of the content alone.
page_content: str
# Required field for provenance.
# Provenance ALWAYS refers to the original source of the document.
# No matter what transformations have been done on the context.
# provenance: Tuple[str, ...] = tuple()  # TODO(not needed for now)
# User created metadata
metadata: dict = Field(default_factory=dict)
# Use to keep track of parent documents from which the document was generated
# We could keep this is a non sequence to get started for simplicity
# parent_uids: Tuple[str, ...] = tuple()  # TODO(Move to metadata store)

@root_validator(pre=True)
def assign_id_if_not_provided(cls, values: Dict[str, Any]) -> Dict[str, Any]:
"""Assign an ID if one is not provided."""
if "page_content" not in values:
raise ValidationError("Must provide page_content")
if "hash_" not in values:
# TODO: Hash should be updated to include all metadata fields.
# Document should become immutable likely otherwise it invalidates
# any logic done based on hash -- and that's the default uid used.
content_hash = hashlib.sha256(values["page_content"].encode()).hexdigest()
hash_ = str(uuid5(UUID(int=0), content_hash))
values["hash_"] = hash_
else:
hash_ = values["hash_"]
if "uid" not in values:
# Generate an ID based on the hash of the content
values["uid"] = str(hash_)
return values

class BaseRetriever(ABC):

@@ -150,7 +150,7 @@ class SQLDatabase:
hostname. Defaults to None.
api_token (Optional[str]): The Databricks personal access token for
accessing the Databricks SQL warehouse or the cluster. If not provided,
it attempts to fetch from 'DATABRICKS_API_TOKEN'. If still unavailable
it attempts to fetch from 'DATABRICKS_TOKEN'. If still unavailable
and running in a Databricks notebook, a temporary token for the current
user is generated. Defaults to None.
warehouse_id (Optional[str]): The warehouse ID in the Databricks SQL. If
@@ -197,7 +197,7 @@ class SQLDatabase:
default_api_token = context.apiToken if context else None
if api_token is None:
api_token = utils.get_from_env(
"api_token", "DATABRICKS_API_TOKEN", default_api_token
"api_token", "DATABRICKS_TOKEN", default_api_token
)

if warehouse_id is None and cluster_id is None:

@@ -740,33 +740,33 @@ class RecursiveCharacterTextSplitter(TextSplitter):
elif language == Language.HTML:
return [
# First, try to split along HTML tags
"<body>",
"<div>",
"<p>",
"<br>",
"<li>",
"<h1>",
"<h2>",
"<h3>",
"<h4>",
"<h5>",
"<h6>",
"<span>",
"<table>",
"<tr>",
"<td>",
"<th>",
"<ul>",
"<ol>",
"<header>",
"<footer>",
"<nav>",
"<body",
"<div",
"<p",
"<br",
"<li",
"<h1",
"<h2",
"<h3",
"<h4",
"<h5",
"<h6",
"<span",
"<table",
"<tr",
"<td",
"<th",
"<ul",
"<ol",
"<header",
"<footer",
"<nav",
# Head
"<head>",
"<style>",
"<script>",
"<meta>",
"<title>",
"<head",
"<style",
"<script",
"<meta",
"<title",
"",
]
else:
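Note: a usage sketch of the HTML separators above, mirroring the new unit test later in this diff:

from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_language(
    Language.HTML, chunk_size=60, chunk_overlap=0
)
chunks = splitter.split_text("<h1>Sample Document</h1>\n<h2>Section</h2>\n<p>Reference content.</p>")
# Splits now occur on opening tags such as "<h1" and "<p" rather than on the exact "<h1>" strings.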
@@ -5,7 +5,7 @@ import asyncio
import warnings
from abc import ABC, abstractmethod
from functools import partial
from typing import Any, Dict, Iterable, List, Optional, Tuple, Type, TypeVar, Sequence
from typing import Any, Dict, Iterable, List, Optional, Tuple, Type, TypeVar

from pydantic import BaseModel, Field, root_validator

@@ -15,17 +15,6 @@ from langchain.schema import BaseRetriever

VST = TypeVar("VST", bound="VectorStore")

from typing import TypedDict

class UpsertResult(TypedDict):
# Number of documents updated
num_updated: Optional[int]
# Number of documents newly added
num_added: Optional[int]
# Documents can be skipped if hashes match
num_skipped: Optional[int]

class VectorStore(ABC):
"""Interface for vector stores."""
@@ -71,21 +60,6 @@ class VectorStore(ABC):
metadatas = [doc.metadata for doc in documents]
return self.add_texts(texts, metadatas, **kwargs)

def upsert_by_id(self, documents: Sequence[Document], **kwargs) -> UpsertResult:
"""Update or insert a document into the vectorstore."""
raise NotImplementedError()

# THIS MAY NEED TO BE CLEANED UP. ITS NOT SUPER PRETTY BUT IT IS EFFICIENT.
# THIS SHOULD PROBABL BE REPLACED TO DELETION BY A METADATA TAG
# OTHERWISE MEMORY MANAGEMENT IS AN ISSUE
def delete_non_matching_ids(self, ids: Iterable[str], **kwargs) -> int:
"""Delete all ids that are not in the given list, but are in the vector store"""
raise NotImplementedError

def delete_by_id(self, ids: Iterable[str], batch_size: int = 1, **kwargs):
"""Delete a document from the vectorstore."""
raise NotImplementedError

async def aadd_documents(
self, documents: List[Document], **kwargs: Any
) -> List[str]:

@@ -3,16 +3,15 @@ from __future__ import annotations

import logging
import uuid
from typing import TYPE_CHECKING, Any, Dict, Optional, Tuple, Type, Sequence
from typing import TYPE_CHECKING, Any, Dict, Iterable, List, Optional, Tuple, Type

import numpy as np

from langchain.docstore.document import Document
from langchain.embeddings.base import Embeddings
from langchain.utils import xor_args
from langchain.vectorstores.base import VectorStore, UpsertResult
from langchain.vectorstores.base import VectorStore
from langchain.vectorstores.utils import maximal_marginal_relevance
from typing import List, Iterable

if TYPE_CHECKING:
import chromadb
@@ -163,29 +162,6 @@ class Chroma(VectorStore):
)
return ids

def upsert_by_id(self, documents: Sequence[Document], **kwargs) -> UpsertResult:
"""Upsert documents by ID."""
upsert_result: UpsertResult = {
# Chroma upsert does not return this information
"num_added": None,
"num_updated": None,
"num_skipped": None,
}
info = [(doc.uid, doc.metadata, doc.page_content) for doc in documents]
uids, metadata, texts = zip(*info)

if self._embedding_function is not None:
embeddings = self._embedding_function.embed_documents(
[doc.page_content for doc in documents]
)
else:
embeddings = None

self._collection.upsert(
ids=uids, metadatas=metadata, embeddings=embeddings, documents=texts
)
return upsert_result

def similarity_search(
self,
query: str,

@@ -262,7 +262,7 @@ class MongoDBAtlasVectorSearch(VectorStore):
collection=collection
)
"""
if not collection:
if collection is None:
raise ValueError("Must provide 'collection' named parameter.")
vecstore = cls(collection, embedding, **kwargs)
vecstore.add_texts(texts, metadatas=metadatas)

poetry.lock (generated, 4 lines changed)
@@ -6595,13 +6595,13 @@ wcwidth = "*"

[[package]]
name = "promptlayer"
version = "0.1.84"
version = "0.1.85"
description = "PromptLayer is a package to keep track of your GPT models training"
category = "dev"
optional = false
python-versions = "*"
files = [
{file = "promptlayer-0.1.84.tar.gz", hash = "sha256:38db68a67dd6d075d124badca0998070a79adce611df00e037b706704369c30a"},
{file = "promptlayer-0.1.85.tar.gz", hash = "sha256:7f5ee282361e200253f0aa53267756a112d3aa1fa29d680a634031c617de20de"},
]

[package.dependencies]

@@ -1,6 +1,6 @@
[tool.poetry]
name = "langchain"
version = "0.0.191"
version = "0.0.193"
description = "Building applications with LLMs through composability"
authors = []
license = "MIT"

@@ -1,4 +1,5 @@
"""Test FAISS functionality."""
import datetime
import math
import tempfile

@@ -105,10 +106,10 @@ def test_faiss_local_save_load() -> None:
"""Test end to end serialization."""
texts = ["foo", "bar", "baz"]
docsearch = FAISS.from_texts(texts, FakeEmbeddings())

with tempfile.NamedTemporaryFile() as temp_file:
docsearch.save_local(temp_file.name)
new_docsearch = FAISS.load_local(temp_file.name, FakeEmbeddings())
temp_timestamp = datetime.datetime.utcnow().strftime("%Y%m%d-%H%M%S")
with tempfile.TemporaryDirectory(suffix="_" + temp_timestamp + "/") as temp_folder:
docsearch.save_local(temp_folder)
new_docsearch = FAISS.load_local(temp_folder, FakeEmbeddings())
assert new_docsearch.index is not None

@@ -118,7 +119,7 @@ def test_faiss_similarity_search_with_relevance_scores() -> None:
docsearch = FAISS.from_texts(
texts,
FakeEmbeddings(),
normalize_score_fn=lambda score: 1.0 - score / math.sqrt(2),
relevance_score_fn=lambda score: 1.0 - score / math.sqrt(2),
)
outputs = docsearch.similarity_search_with_relevance_scores("foo", k=1)
output, score = outputs[0]
@@ -130,11 +131,9 @@ def test_faiss_invalid_normalize_fn() -> None:
"""Test the similarity search with normalized similarities."""
texts = ["foo", "bar", "baz"]
docsearch = FAISS.from_texts(
texts, FakeEmbeddings(), normalize_score_fn=lambda _: 2.0
texts, FakeEmbeddings(), relevance_score_fn=lambda _: 2.0
)
with pytest.raises(
ValueError, match="Normalized similarity scores must be between 0 and 1"
):
with pytest.warns(Warning, match="scores must be between"):
docsearch.similarity_search_with_relevance_scores("foo", k=1)

@@ -143,4 +142,5 @@ def test_missing_normalize_score_fn() -> None:
with pytest.raises(ValueError):
texts = ["foo", "bar", "baz"]
faiss_instance = FAISS.from_texts(texts, FakeEmbeddings())
faiss_instance.relevance_score_fn = None
faiss_instance.similarity_search_with_relevance_scores("foo", k=2)
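Note: a brief sketch of the renamed relevance_score_fn argument, matching the updated tests above; OpenAIEmbeddings stands in for any Embeddings implementation and requires OPENAI_API_KEY to be set.

import math
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

# The keyword was previously called normalize_score_fn.
docsearch = FAISS.from_texts(
    ["foo", "bar", "baz"],
    OpenAIEmbeddings(),
    relevance_score_fn=lambda score: 1.0 - score / math.sqrt(2),
)
docs_and_scores = docsearch.similarity_search_with_relevance_scores("foo", k=1)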
@@ -5,7 +5,7 @@ import pytest

from langchain.callbacks.manager import CallbackManagerForChainRun
from langchain.chains.base import Chain
from langchain.schema import BaseMemory
from langchain.schema import RUN_KEY, BaseMemory
from tests.unit_tests.callbacks.fake_callback_handler import FakeCallbackHandler

@@ -72,6 +72,15 @@ def test_bad_outputs() -> None:
chain({"foo": "baz"})

def test_run_info() -> None:
"""Test that run_info is returned properly when specified"""
chain = FakeChain()
output = chain({"foo": "bar"}, include_run_info=True)
assert "foo" in output
assert "bar" in output
assert RUN_KEY in output

def test_correct_call() -> None:
"""Test correct call of fake chain."""
chain = FakeChain()

@@ -1,22 +0,0 @@
from langchain.docstore.artifacts import serialize_document, deserialize_document
from langchain.schema import Document

def test_serialization() -> None:
"""Test serialization."""
initial_doc = Document(page_content="hello")
serialized_doc = serialize_document(initial_doc)
assert isinstance(serialized_doc, str)
deserialized_doc = deserialize_document(serialized_doc)
assert isinstance(deserialized_doc, Document)
assert deserialized_doc == initial_doc

def test_serialization_with_metadata() -> None:
"""Test serialization with metadata."""
initial_doc = Document(page_content="hello", metadata={"source": "hello"})
serialized_doc = serialize_document(initial_doc)
assert isinstance(serialized_doc, str)
deserialized_doc = deserialize_document(serialized_doc)
assert isinstance(deserialized_doc, Document)
assert deserialized_doc == initial_doc
@@ -3,4 +3,9 @@ from langchain.document_loaders.blob_loaders import __all__

def test_public_api() -> None:
"""Hard-code public API to help determine if we have broken it."""
assert sorted(__all__) == ["Blob", "BlobLoader", "FileSystemBlobLoader"]
assert sorted(__all__) == [
"Blob",
"BlobLoader",
"FileSystemBlobLoader",
"YoutubeAudioLoader",
]

@@ -1,19 +0,0 @@
"""Test document schema."""
from langchain.schema import Document

def test_document_hashes() -> None:
"""Test document hashing."""
d1 = Document(page_content="hello")
expected_hash = "0945717e-8d14-5f14-957f-0fb0ea1d56af"
assert str(d1.hash_) == expected_hash

d2 = Document(id="hello", page_content="hello")
assert str(d2.hash_) == expected_hash

d3 = Document(id="hello", page_content="hello2")
assert str(d3.hash_) != expected_hash

# Still fails. Need to update hash to hash metadata as well.
d4 = Document(id="hello", page_content="hello", metadata={"source": "hello"})
assert str(d4.hash_) != expected_hash
@@ -576,3 +576,39 @@ This is a code block
"block",
"```",
]

def test_html_code_splitter() -> None:
splitter = RecursiveCharacterTextSplitter.from_language(
Language.HTML, chunk_size=60, chunk_overlap=0
)
code = """
<h1>Sample Document</h1>
<h2>Section</h2>
<p id="1234">Reference content.</p>

<h2>Lists</h2>
<ul>
<li>Item 1</li>
<li>Item 2</li>
<li>Item 3</li>
</ul>

<h3>A block</h3>
<div class="amazing">
<p>Some text</p>
<p>Some more text</p>
</div>
"""
chunks = splitter.split_text(code)
assert chunks == [
"<h1>Sample Document</h1>\n    <h2>Section</h2>",
'<p id="1234">Reference content.</p>',
"<h2>Lists</h2>\n    <ul>",
"<li>Item 1</li>\n        <li>Item 2</li>",
"<li>Item 3</li>\n    </ul>",
"<h3>A block</h3>",
'<div class="amazing">',
"<p>Some text</p>",
"<p>Some more text</p>\n    </div>",
]