LanceDB integration update (#22869)

Added:

- [x] relevance search (with and without scores)
- [x] maximal marginal relevance search
- [x] image ingestion
- [x] filtering support
- [x] hybrid search with reranking

Checked with make test, lint_diff, and format.
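
For readers unfamiliar with maximal marginal relevance (MMR), the following is a minimal, library-free sketch of the general technique. It is not the code added in this commit and does not call LanceDB; the function names `cosine` and `mmr`, the `lambda_mult` parameter name, and the toy vectors are assumptions chosen for illustration only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return indices of up to k documents picked by maximal marginal relevance.

    lambda_mult (hypothetical name) trades relevance to the query (1.0)
    against diversity among the already selected documents (0.0).
    """
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:

        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy is the highest similarity to anything already chosen.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy data: documents 0 and 1 are near-duplicates, document 2 is orthogonal.
query = [1.0, 0.0]
docs = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
print(mmr(query, docs, k=2, lambda_mult=0.3))
```

With a low `lambda_mult` the second pick skips the near-duplicate in favor of the dissimilar document; with `lambda_mult=1.0` the selection reduces to plain relevance ranking.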
Raghav Dixit
2024-06-18 04:54:26 +01:00
committed by GitHub
parent 62c8a67f56
commit 55705c0f5e
3 changed files with 793 additions and 142 deletions


@@ -12,6 +12,16 @@
"This notebook shows how to use functionality related to the `LanceDB` vector database based on the Lance data format."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b1051ba9",
"metadata": {},
"outputs": [],
"source": [
"! pip install tantivy"
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -29,7 +39,7 @@
"metadata": {},
"outputs": [],
"source": [
"! pip install lancedb"
"! pip install lancedb==0.6.13"
]
},
{
@@ -42,7 +52,7 @@
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": 1,
"id": "a0361f5c-e6f4-45f4-b829-11680cf03cec",
"metadata": {
"tags": []
@@ -57,7 +67,17 @@
},
{
"cell_type": "code",
"execution_count": 13,
"execution_count": 2,
"id": "d114ed78",
"metadata": {},
"outputs": [],
"source": [
"! rm -rf /tmp/lancedb"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "a3c3999a",
"metadata": {},
"outputs": [],
@@ -94,19 +114,130 @@
" embedding=embeddings,\n",
" table_name='langchain_test'\n",
" )\n",
"```\n"
"```\n",
"\n",
"You can also pass `region`, `api_key`, and `uri` to the `from_documents()` classmethod.\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"execution_count": 4,
"id": "6e104aee",
"metadata": {},
"outputs": [],
"source": [
"docsearch = LanceDB.from_documents(documents, embeddings)\n",
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"docs = docsearch.similarity_search(query)"
"from lancedb.rerankers import LinearCombinationReranker\n",
"\n",
"reranker = LinearCombinationReranker(weight=0.3)\n",
"\n",
"docsearch = LanceDB.from_documents(documents, embeddings, reranker=reranker)\n",
"query = \"What did the president say about Ketanji Brown Jackson\""
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "259c7988",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"relevance score - 0.7066475030191711\n",
"text- They were responding to a 9-1-1 call when a man shot and killed them with a stolen gun. \n",
"\n",
"Officer Mora was 27 years old. \n",
"\n",
"Officer Rivera was 22. \n",
"\n",
"Both Dominican Americans whod grown up on the same streets they later chose to patrol as police officers. \n",
"\n",
"I spoke with their families and told them that we are forever in debt for their sacrifice, and we will carry on their mission to restore the trust and safety every community deserves. \n",
"\n",
"Ive worked on these issues a long time. \n",
"\n",
"I know what works: Investing in crime prevention and community police officers wholl walk the beat, wholl know the neighborhood, and who can restore trust and safety. \n",
"\n",
"So lets not abandon our streets. Or choose between safety and equal justice. \n",
"\n",
"Lets come together to protect our communities, restore trust, and hold law enforcement accountable. \n",
"\n",
"Thats why the Justice Department required body cameras, banned chokeholds, and restricted no-knock warrants for its officers. \n",
"\n",
"Thats why the American Rescue \n"
]
}
],
"source": [
"docs = docsearch.similarity_search_with_relevance_scores(query)\n",
"print(\"relevance score - \", docs[0][1])\n",
"print(\"text- \", docs[0][0].page_content[:1000])"
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "9fa29dae",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"distance - 0.30000001192092896\n",
"text- My administration is providing assistance with job training and housing, and now helping lower-income veterans get VA care debt-free. \n",
"\n",
"Our troops in Iraq and Afghanistan faced many dangers. \n",
"\n",
"One was stationed at bases and breathing in toxic smoke from “burn pits” that incinerated wastes of war—medical and hazard material, jet fuel, and more. \n",
"\n",
"When they came home, many of the worlds fittest and best trained warriors were never the same. \n",
"\n",
"Headaches. Numbness. Dizziness. \n",
"\n",
"A cancer that would put them in a flag-draped coffin. \n",
"\n",
"I know. \n",
"\n",
"One of those soldiers was my son Major Beau Biden. \n",
"\n",
"We dont know for sure if a burn pit was the cause of his brain cancer, or the diseases of so many of our troops. \n",
"\n",
"But Im committed to finding out everything we can. \n",
"\n",
"Committed to military families like Danielle Robinson from Ohio. \n",
"\n",
"The widow of Sergeant First Class Heath Robinson. \n",
"\n",
"He was born a soldier. Army National Guard. Combat medic in Kosovo and Iraq. \n",
"\n",
"Stationed near Baghdad, just ya\n"
]
}
],
"source": [
"docs = docsearch.similarity_search_with_score(query=\"Headaches\", query_type=\"hybrid\")\n",
"print(\"distance - \", docs[0][1])\n",
"print(\"text- \", docs[0][0].page_content[:1000])"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "e70ad201",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"reranker : <lancedb.rerankers.linear_combination.LinearCombinationReranker object at 0x107ef1130>\n"
]
}
],
"source": [
"print(\"reranker : \", docsearch._reranker)"
]
},
{
@@ -128,7 +259,7 @@
},
{
"cell_type": "code",
"execution_count": 9,
"execution_count": 15,
"id": "9c608226",
"metadata": {},
"outputs": [
@@ -136,6 +267,10 @@
"name": "stdout",
"output_type": "stream",
"text": [
"metadata : {'source': '../../how_to/state_of_the_union.txt'}\n",
"\n",
"SQL filtering :\n",
"\n",
"They were responding to a 9-1-1 call when a man shot and killed them with a stolen gun. \n",
"\n",
"Officer Mora was 27 years old. \n",
@@ -199,93 +334,211 @@
}
],
"source": [
"docs = docsearch.similarity_search(\n",
" query=query, filter={\"metadata.source\": \"../../how_to/state_of_the_union.txt\"}\n",
")\n",
"\n",
"print(\"metadata :\", docs[0].metadata)\n",
"\n",
"# Alternatively, supply a SQL filter string directly:\n",
"\n",
"print(\"\\nSQL filtering :\\n\")\n",
"docs = docsearch.similarity_search(query=query, filter=\"text LIKE '%Officer Rivera%'\")\n",
"print(docs[0].page_content)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "a359ed74",
"cell_type": "markdown",
"id": "9a173c94",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"They were responding to a 9-1-1 call when a man shot and killed them with a stolen gun. \n",
"\n",
"Officer Mora was 27 years old. \n",
"\n",
"Officer Rivera was 22. \n",
"\n",
"Both Dominican Americans whod grown up on the same streets they later chose to patrol as police officers. \n",
"\n",
"I spoke with their families and told them that we are forever in debt for their sacrifice, and we will carry on their mission to restore the trust and safety every community deserves. \n",
"\n",
"Ive worked on these issues a long time. \n",
"\n",
"I know what works: Investing in crime prevention and community police officers wholl walk the beat, wholl know the neighborhood, and who can restore trust and safety. \n",
"\n",
"So lets not abandon our streets. Or choose between safety and equal justice. \n",
"\n",
"Lets come together to protect our communities, restore trust, and hold law enforcement accountable. \n",
"\n",
"Thats why the Justice Department required body cameras, banned chokeholds, and restricted no-knock warrants for its officers. \n",
"\n",
"Thats why the American Rescue Plan provided $350 Billion that cities, states, and counties can use to hire more police and invest in proven strategies like community violence interruption—trusted messengers breaking the cycle of violence and trauma and giving young people hope. \n",
"\n",
"We should all agree: The answer is not to Defund the police. The answer is to FUND the police with the resources and training they need to protect our communities. \n",
"\n",
"I ask Democrats and Republicans alike: Pass my budget and keep our neighborhoods safe. \n",
"\n",
"And I will keep doing everything in my power to crack down on gun trafficking and ghost guns you can buy online and make at home—they have no serial numbers and cant be traced. \n",
"\n",
"And I ask Congress to pass proven measures to reduce gun violence. Pass universal background checks. Why should anyone on a terrorist list be able to purchase a weapon? \n",
"\n",
"Ban assault weapons and high-capacity magazines. \n",
"\n",
"Repeal the liability shield that makes gun manufacturers the only industry in America that cant be sued. \n",
"\n",
"These laws dont infringe on the Second Amendment. They save lives. \n",
"\n",
"The most fundamental right in America is the right to vote and to have it counted. And its under assault. \n",
"\n",
"In state after state, new laws have been passed, not only to suppress the vote, but to subvert entire elections. \n",
"\n",
"We cannot let this happen. \n",
"\n",
"Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
"\n",
"Tonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n",
"\n",
"One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
"\n",
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence. \n",
"\n",
"A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since shes been nominated, shes received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. \n",
"\n",
"And if we are to advance liberty and justice, we need to secure the Border and fix the immigration system. \n",
"\n",
"We can do both. At our border, weve installed new technology like cutting-edge scanners to better detect drug smuggling. \n",
"\n",
"Weve set up joint patrols with Mexico and Guatemala to catch more human traffickers. \n",
"\n",
"Were putting in place dedicated immigration judges so families fleeing persecution and violence can have their cases heard faster.\n"
]
}
],
"source": [
"print(docs[0].page_content)"
"## Adding images"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9d749a3f-df17-4a8a-b256-08a3bbc74cb6",
"id": "05f669d7",
"metadata": {},
"outputs": [],
"source": [
"print(docs[0].metadata)"
"! pip install -U langchain-experimental"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3ed69810",
"metadata": {},
"outputs": [],
"source": [
"! pip install open_clip_torch torch"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "2cacb5ee",
"metadata": {},
"outputs": [],
"source": [
"! rm -rf '/tmp/multimodal_lance'"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "b3456e2c",
"metadata": {},
"outputs": [],
"source": [
"from langchain_experimental.open_clip import OpenCLIPEmbeddings"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "3848eba2",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"import requests\n",
"\n",
"# List of image URLs to download\n",
"image_urls = [\n",
" \"https://github.com/raghavdixit99/assets/assets/34462078/abf47cc4-d979-4aaa-83be-53a2115bf318\",\n",
" \"https://github.com/raghavdixit99/assets/assets/34462078/93be928e-522b-4e37-889d-d4efd54b2112\",\n",
"]\n",
"\n",
"texts = [\"bird\", \"dragon\"]\n",
"\n",
"# Directory to save images\n",
"dir_name = \"./photos/\"\n",
"\n",
"# Create directory if it doesn't exist\n",
"os.makedirs(dir_name, exist_ok=True)\n",
"\n",
"image_uris = []\n",
"# Download and save each image\n",
"for i, url in enumerate(image_urls, start=1):\n",
" response = requests.get(url)\n",
" path = os.path.join(dir_name, f\"image{i}.jpg\")\n",
" image_uris.append(path)\n",
" with open(path, \"wb\") as f:\n",
" f.write(response.content)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "3d62c2a0",
"metadata": {},
"outputs": [],
"source": [
"from langchain_community.vectorstores import LanceDB\n",
"\n",
"vec_store = LanceDB(\n",
" table_name=\"multimodal_test\",\n",
" embedding=OpenCLIPEmbeddings(),\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "ebbb4881",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['b673620b-01f0-42ca-a92e-d033bb92c0a6',\n",
" '99c3a5b0-b577-417a-8177-92f4a655dbfb']"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vec_store.add_images(uris=image_uris)"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "3c29dea3",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['f7adde5d-a4a3-402b-9e73-088b230722c3',\n",
" 'cbed59da-0aec-4bff-8820-9e59d81a2140']"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vec_store.add_texts(texts)"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "8b2f25ce",
"metadata": {},
"outputs": [],
"source": [
"img_embed = vec_store._embedding.embed_query(\"bird\")"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "87a24079",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Document(page_content='bird', metadata={'id': 'f7adde5d-a4a3-402b-9e73-088b230722c3'})"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vec_store.similarity_search_by_vector(img_embed)[0]"
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "78557867",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"LanceTable(connection=LanceDBConnection(/tmp/lancedb), name=\"multimodal_test\")"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vec_store._table"
]
}
],
@@ -305,7 +558,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
"version": "3.12.2"
}
},
"nbformat": 4,