community[minor]: Add indexing via locality sensitive hashing to the Yellowbrick vector store (#20856)
- **Description:** Add LSH-based indexing to the Yellowbrick vector store module
- **Twitter handle:** @markcusack

---------

Co-authored-by: markcusack <markcusack@markcusacksmac.lan>
Co-authored-by: markcusack <markcusack@Mark-Cusack-sMac.local>
Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
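For reviewers, a minimal sketch of the new indexing API, assembled from the calls that appear in the notebook diff below; the connection string and table name are placeholders:

```python
from langchain_community.vectorstores import Yellowbrick
from langchain_openai import OpenAIEmbeddings

# Placeholders -- substitute your own Yellowbrick connection string and table name.
connection_string = "postgres://user:password@host:5432/db"
embedding_table = "my_embeddings"

# Open the vector store over an existing embeddings table.
vector_store = Yellowbrick(OpenAIEmbeddings(), connection_string, embedding_table)

# Build an LSH index over the stored embeddings ...
lsh_params = Yellowbrick.IndexParams(
    Yellowbrick.IndexType.LSH, {"num_hyperplanes": 8, "hamming_distance": 2}
)
vector_store.create_index(lsh_params)

# ... and have the retriever use it at query time.
retriever = vector_store.as_retriever(k=5, search_kwargs={"index_params": lsh_params})
```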
@@ -98,7 +98,7 @@
"import psycopg2\n",
"from IPython.display import Markdown, display\n",
"from langchain.chains import LLMChain, RetrievalQAWithSourcesChain\n",
"from langchain_community.docstore.document import Document\n",
"from langchain.schema import Document\n",
"from langchain_community.vectorstores import Yellowbrick\n",
"from langchain_openai import ChatOpenAI, OpenAIEmbeddings\n",
"from langchain_text_splitters import RecursiveCharacterTextSplitter\n",
@@ -209,14 +209,12 @@
"\n",
"# Define the SQL statement to create a table\n",
"create_table_query = f\"\"\"\n",
"CREATE TABLE if not exists {embedding_table} (\n",
" id uuid,\n",
" embedding_id integer,\n",
" text character varying(60000),\n",
" metadata character varying(1024),\n",
" embedding double precision\n",
"CREATE TABLE IF NOT EXISTS {embedding_table} (\n",
" doc_id uuid NOT NULL,\n",
" embedding_id smallint NOT NULL,\n",
" embedding double precision NOT NULL\n",
")\n",
"DISTRIBUTE ON (id);\n",
"DISTRIBUTE ON (doc_id);\n",
"truncate table {embedding_table};\n",
"\"\"\"\n",
"\n",
@@ -257,6 +255,8 @@
" f\"postgres://{urlparse.quote(YBUSER)}:{YBPASSWORD}@{YBHOST}:5432/{YB_DOC_DATABASE}\"\n",
")\n",
"\n",
"print(yellowbrick_doc_connection_string)\n",
"\n",
"# Establish a connection to the Yellowbrick database\n",
"conn = psycopg2.connect(yellowbrick_doc_connection_string)\n",
"\n",
@@ -324,7 +324,7 @@
"vector_store = Yellowbrick.from_documents(\n",
" documents=split_docs,\n",
" embedding=embeddings,\n",
" connection_string=yellowbrick_connection_string,\n",
" connection_info=yellowbrick_connection_string,\n",
" table=embedding_table,\n",
")\n",
"\n",
@@ -403,6 +403,88 @@
"print_result_sources(\"What's an easy way to add users in bulk to Yellowbrick?\")"
]
},
{
"cell_type": "markdown",
"id": "1f39fd30",
"metadata": {},
"source": [
"## Part 6: Introducing an Index to Increase Performance\n",
|
||||
"\n",
|
||||
"Yellowbrick also supports indexing using the Locality-Sensitive Hashing approach. This is an approximate nearest-neighbor search technique, and allows one to trade off similarity search time at the expense of accuracy. The index introduces two new tunable parameters:\n",
|
||||
"\n",
|
||||
"- The number of hyperplanes, which is provided as an argument to `create_lsh_index(num_hyperplanes)`. The more documents, the more hyperplanes are needed. LSH is a form of dimensionality reduction. The original embeddings are transformed into lower dimensional vectors where the number of components is the same as the number of hyperplanes.\n",
|
||||
"- The Hamming distance, an integer representing the breadth of the search. Smaller Hamming distances result in faster retreival but lower accuracy.\n",
|
||||
"\n",
|
||||
"Here's how you can create an index on the embeddings we loaded into Yellowbrick. We'll also re-run the previous chat session, but this time the retrieval will use the index. Note that for such a small number of documents, you won't see the benefit of indexing in terms of performance."
|
||||
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "02ba61c4",
"metadata": {},
"outputs": [],
"source": [
"system_template = \"\"\"Use the following pieces of context to answer the users question.\n",
|
||||
"Take note of the sources and include them in the answer in the format: \"SOURCES: source1 source2\", use \"SOURCES\" in capital letters regardless of the number of sources.\n",
|
||||
"If you don't know the answer, just say that \"I don't know\", don't try to make up an answer.\n",
|
||||
"----------------\n",
|
||||
"{summaries}\"\"\"\n",
|
||||
"messages = [\n",
|
||||
" SystemMessagePromptTemplate.from_template(system_template),\n",
|
||||
" HumanMessagePromptTemplate.from_template(\"{question}\"),\n",
|
||||
"]\n",
|
||||
"prompt = ChatPromptTemplate.from_messages(messages)\n",
|
||||
"\n",
|
||||
"vector_store = Yellowbrick(\n",
|
||||
" OpenAIEmbeddings(),\n",
|
||||
" yellowbrick_connection_string,\n",
|
||||
" embedding_table, # Change the table name to reflect your embeddings\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"lsh_params = Yellowbrick.IndexParams(\n",
|
||||
" Yellowbrick.IndexType.LSH, {\"num_hyperplanes\": 8, \"hamming_distance\": 2}\n",
|
||||
")\n",
|
||||
"vector_store.create_index(lsh_params)\n",
|
||||
"\n",
|
||||
"chain_type_kwargs = {\"prompt\": prompt}\n",
|
||||
"llm = ChatOpenAI(\n",
|
||||
" model_name=\"gpt-3.5-turbo\", # Modify model_name if you have access to GPT-4\n",
|
||||
" temperature=0,\n",
|
||||
" max_tokens=256,\n",
|
||||
")\n",
|
||||
"chain = RetrievalQAWithSourcesChain.from_chain_type(\n",
|
||||
" llm=llm,\n",
|
||||
" chain_type=\"stuff\",\n",
|
||||
" retriever=vector_store.as_retriever(\n",
|
||||
" k=5, search_kwargs={\"index_params\": lsh_params}\n",
|
||||
" ),\n",
|
||||
" return_source_documents=True,\n",
|
||||
" chain_type_kwargs=chain_type_kwargs,\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"def print_result_sources(query):\n",
|
||||
" result = chain(query)\n",
|
||||
" output_text = f\"\"\"### Question: \n",
|
||||
" {query}\n",
|
||||
" ### Answer: \n",
|
||||
" {result['answer']}\n",
|
||||
" ### Sources: \n",
|
||||
" {result['sources']}\n",
|
||||
" ### All relevant sources:\n",
|
||||
" {', '.join(list(set([doc.metadata['source'] for doc in result['source_documents']])))}\n",
|
||||
" \"\"\"\n",
|
||||
" display(Markdown(output_text))\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"# Use the chain to query\n",
|
||||
"\n",
|
||||
"print_result_sources(\"How many databases can be in a Yellowbrick Instance?\")\n",
|
||||
"\n",
|
||||
"print_result_sources(\"Whats an easy way to add users in bulk to Yellowbrick?\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "697c8a38",
|
||||
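As an aside on the two parameters introduced in the cell above: they map onto the standard random-hyperplane LSH construction. The sketch below is illustrative only (plain NumPy, not the Yellowbrick implementation); the embedding dimension and corpus are made up, and only the names `num_hyperplanes` and `hamming_distance` are taken from the notebook.

```python
import numpy as np

# Illustrative toy LSH: random hyperplanes -> sign bits -> Hamming-distance candidate filter.
rng = np.random.default_rng(0)

num_hyperplanes = 8   # more hyperplanes -> longer signatures and smaller buckets
hamming_distance = 2  # search breadth over the signature bits: larger -> more candidates, slower

dim = 1536            # assumed embedding size for this toy example
hyperplanes = rng.normal(size=(num_hyperplanes, dim))

def lsh_hash(vec: np.ndarray) -> np.ndarray:
    """Project onto the hyperplanes and keep only the sign bits (the reduced representation)."""
    return (hyperplanes @ vec > 0).astype(np.uint8)

# "Index": the bit signature of every stored embedding.
embeddings = rng.normal(size=(1000, dim))
signatures = (embeddings @ hyperplanes.T > 0).astype(np.uint8)

def candidates(query: np.ndarray) -> np.ndarray:
    """Indices whose signature is within `hamming_distance` bits of the query's signature."""
    q_sig = lsh_hash(query)
    dists = (signatures != q_sig).sum(axis=1)  # Hamming distance per stored vector
    return np.flatnonzero(dists <= hamming_distance)

# Only the candidate set is scored exactly, which is where the speed/accuracy trade-off comes from.
query = rng.normal(size=dim)
cand = candidates(query)
scores = embeddings[cand] @ query / (
    np.linalg.norm(embeddings[cand], axis=1) * np.linalg.norm(query)
)
top = cand[np.argsort(-scores)[:5]]
print(f"{len(cand)} candidates out of {len(embeddings)}; top-5 ids: {top}")
```

Raising `num_hyperplanes` splits the corpus into finer buckets (faster but easier to miss true neighbors); raising `hamming_distance` widens the candidate set (slower but more accurate), which is the trade-off the markdown cell describes.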
@@ -418,9 +500,9 @@
],
"metadata": {
"kernelspec": {
"display_name": "langchain_venv",
"display_name": "Python 3",
"language": "python",
"name": "langchain_venv"
"name": "python3"
},
"language_info": {
"codemirror_mode": {
@@ -60,7 +60,7 @@
" * document addition by id (`add_documents` method with `ids` argument)\n",
|
||||
" * delete by id (`delete` method with `ids` argument)\n",
|
||||
"\n",
|
||||
"Compatible Vectorstores: `AnalyticDB`, `AstraDB`, `AzureCosmosDBVectorSearch`, `AzureSearch`, `AwaDB`, `Bagel`, `Cassandra`, `Chroma`, `CouchbaseVectorStore`, `DashVector`, `DatabricksVectorSearch`, `DeepLake`, `Dingo`, `ElasticVectorSearch`, `ElasticsearchStore`, `FAISS`, `HanaDB`, `LanceDB`, `Milvus`, `MyScale`, `OpenSearchVectorSearch`, `PGVector`, `Pinecone`, `Qdrant`, `Redis`, `Rockset`, `ScaNN`, `SupabaseVectorStore`, `SurrealDBStore`, `TimescaleVector`, `UpstashVectorStore`, `Vald`, `VDMS`, `Vearch`, `VespaStore`, `Weaviate`, `ZepVectorStore`, `TencentVectorDB`, `OpenSearchVectorSearch`.\n",
|
||||
"Compatible Vectorstores: `AnalyticDB`, `AstraDB`, `AzureCosmosDBVectorSearch`, `AzureSearch`, `AwaDB`, `Bagel`, `Cassandra`, `Chroma`, `CouchbaseVectorStore`, `DashVector`, `DatabricksVectorSearch`, `DeepLake`, `Dingo`, `ElasticVectorSearch`, `ElasticsearchStore`, `FAISS`, `HanaDB`, `LanceDB`, `Milvus`, `MyScale`, `OpenSearchVectorSearch`, `PGVector`, `Pinecone`, `Qdrant`, `Redis`, `Rockset`, `ScaNN`, `SupabaseVectorStore`, `SurrealDBStore`, `TimescaleVector`, `UpstashVectorStore`, `Vald`, `VDMS`, `Vearch`, `VespaStore`, `Weaviate`, `ZepVectorStore`, `TencentVectorDB`, `OpenSearchVectorSearch`, `Yellowbrick`.\n",
|
||||
" \n",
|
||||
"## Caution\n",
|
||||
"\n",
|
||||
|
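The second file in the diff adds `Yellowbrick` to the vectorstores compatible with the LangChain indexing API, since it now supports `add_documents`/`delete` with `ids`. A hedged sketch of combining the two follows; the connection string, table name, record-manager namespace, and sample document are placeholders.

```python
from langchain.indexes import SQLRecordManager, index
from langchain_community.vectorstores import Yellowbrick
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

# Placeholder connection string and table name.
vector_store = Yellowbrick(
    OpenAIEmbeddings(),
    "postgres://user:password@host:5432/db",
    "my_embeddings",
)

# Record manager tracks which document hashes have already been written (placeholder namespace).
record_manager = SQLRecordManager(
    "yellowbrick/my_embeddings", db_url="sqlite:///record_manager_cache.sql"
)
record_manager.create_schema()

docs = [
    Document(
        page_content="Yellowbrick is an elastic, massively parallel SQL database.",
        metadata={"source": "intro.md"},  # source_id_key used for cleanup
    ),
]

# Incremental cleanup skips unchanged documents and prunes stale chunks per source.
result = index(docs, record_manager, vector_store, cleanup="incremental", source_id_key="source")
print(result)  # counts of documents added / updated / skipped / deleted
```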