Add "Astra DB" vector store integration (#12966)

# Astra DB Vector store integration

- **Description:** This PR adds a `VectorStore` implementation for
DataStax Astra DB using its HTTP API
  - **Issue:** (no related issue)
- **Dependencies:** A new dependency, `astrapy` (`>=0.5.3`), is required
by this integration; it was added to pyproject.toml as optional, as per guidelines
- **Tag maintainer:** I recently mentioned to @baskaryan this
integration was coming
  - **Twitter handle:** `@rsprrs` if you want to mention me

This PR introduces the `AstraDB` vector store class, extensive
integration test coverage, a reworking of the documentation which
conflates Cassandra and Astra DB on a single "provider" page and a new,
completely reworked vector-store example notebook (common to the
Cassandra store, since parts of the flow are shared by the two APIs). I
also took care to ensure that the docs (and redirects therein) behave
correctly.

All style, linting, typechecks and tests pass as far as the `AstraDB`
integration is concerned.

I could build the documentation and check it all right (but ran into
trouble with the `api_docs_build` makefile target which I could not
verify: `Error: Unable to import module
'plan_and_execute.agent_executor' with error: No module named
'langchain_experimental'` was the first of many similar errors)

Thank you for a review!
Stefano

---------

Co-authored-by: Erick Friis <erick@langchain.dev>
Stefano Lottini
2023-11-07 23:45:33 +01:00
committed by GitHub
parent 13bd83bd61
commit 4f4b020582
21 changed files with 4376 additions and 376 deletions


@@ -1,7 +1,7 @@
# cassandra-entomology-rag
-This template will perform RAG using Astra DB and Apache Cassandra®.
+This template will perform RAG using Apache Cassandra® or Astra DB through CQL (`Cassandra` vector store class)
## Environment Setup
@@ -53,16 +53,6 @@ export LANGCHAIN_API_KEY=<your-api-key>
export LANGCHAIN_PROJECT=<your-project> # if not specified, defaults to "default"
```
-To populate the vector store, ensure that you have set all the environment variables, then from this directory, execute the following just once:
-```shell
-poetry run bash -c "cd [...]/cassandra_entomology_rag; python setup.py"
-```
-The output will be something like `Done (29 lines inserted).`.
-Note: In a full application, the vector store might be populated in other ways. This step is to pre-populate the vector store with some rows for the demo RAG chains to work sensibly.
If you are inside this directory, then you can spin up a LangServe instance directly by:
```shell


@@ -1,7 +1,7 @@
# cassandra-synonym-caching
-This template provides a simple chain template showcasing the usage of LLM Caching backed by Astra DB / Apache Cassandra®.
+This template provides a simple chain template showcasing the usage of LLM Caching backed by Apache Cassandra® or Astra DB through CQL.
## Environment Setup


@@ -0,0 +1,5 @@
export OPENAI_API_KEY="..."
export ASTRA_DB_API_ENDPOINT="https://...-....apps.astra.datastax.com"
export ASTRA_DB_APPLICATION_TOKEN="AstraCS:..."
export ASTRA_DB_KEYSPACE="..." # Optional - falls back to default if not provided
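The chain module of this template reads these variables through `os.environ`, treating the token and endpoint as required and the keyspace as optional. A minimal sketch of that lookup pattern follows; `read_astra_settings` is a hypothetical helper introduced here for illustration and is not part of the template, and the demo values are placeholders.

```python
import os


def read_astra_settings(env=None):
    """Collect Astra DB connection settings from an environment mapping.

    Hypothetical helper for illustration only (not part of the template).
    """
    env = os.environ if env is None else env
    return {
        # Required: a missing key raises KeyError, so misconfiguration fails fast.
        "token": env["ASTRA_DB_APPLICATION_TOKEN"],
        "api_endpoint": env["ASTRA_DB_API_ENDPOINT"],
        # Optional: None when unset, leaving the choice of keyspace to the default.
        "namespace": env.get("ASTRA_DB_KEYSPACE"),
    }


if __name__ == "__main__":
    # Placeholder values, not real credentials.
    demo_env = {
        "ASTRA_DB_APPLICATION_TOKEN": "AstraCS:placeholder",
        "ASTRA_DB_API_ENDPOINT": "https://example.invalid",
    }
    print(read_astra_settings(demo_env))
```

Passing a plain dictionary instead of `os.environ` keeps the sketch easy to try without exporting anything in the shell.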


@@ -0,0 +1,78 @@
# rag-astradb
This template will perform RAG using Astra DB (`AstraDB` vector store class)
## Environment Setup
An [Astra DB](https://astra.datastax.com) database is required; free tier is fine.
- You need the database **API endpoint** (such as `https://0123...-us-east1.apps.astra.datastax.com`) ...
- ... and a **token** (`AstraCS:...`).
Also, an **OpenAI API Key** is required. _Note that out-of-the-box this demo supports OpenAI only, unless you tinker with the code._
Provide the connection parameters and secrets through environment variables. Please refer to `.env.template` for the variable names.
## Usage
To use this package, you should first have the LangChain CLI installed:
```shell
pip install -U "langchain-cli[serve]"
```
To create a new LangChain project and install this as the only package, you can do:
```shell
langchain app new my-app --package rag-astradb
```
If you want to add this to an existing project, you can just run:
```shell
langchain app add rag-astradb
```
And add the following code to your `server.py` file:
```python
from astradb_entomology_rag import chain as astradb_entomology_rag_chain
add_routes(app, astradb_entomology_rag_chain, path="/rag-astradb")
```
(Optional) Let's now configure LangSmith.
LangSmith will help us trace, monitor and debug LangChain applications.
LangSmith is currently in private beta; you can sign up [here](https://smith.langchain.com/).
If you don't have access, you can skip this section.
```shell
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=<your-api-key>
export LANGCHAIN_PROJECT=<your-project> # if not specified, defaults to "default"
```
If you are inside this directory, then you can spin up a LangServe instance directly by:
```shell
langchain serve
```
This will start the FastAPI app with a server running locally at
[http://localhost:8000](http://localhost:8000)
We can see all templates at [http://127.0.0.1:8000/docs](http://127.0.0.1:8000/docs)
We can access the playground at [http://127.0.0.1:8000/rag-astradb/playground](http://127.0.0.1:8000/rag-astradb/playground)
We can access the template from code with:
```python
from langserve.client import RemoteRunnable
runnable = RemoteRunnable("http://localhost:8000/rag-astradb")
```
## Reference
Stand-alone repo with LangServe chain: [here](https://github.com/hemidactylus/langserve_astradb_entomology_rag).


@@ -0,0 +1,53 @@
import os
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough
from langchain.vectorstores import AstraDB
from .populate_vector_store import populate
# inits
llm = ChatOpenAI()
embeddings = OpenAIEmbeddings()
vector_store = AstraDB(
    embedding=embeddings,
    collection_name="langserve_rag_demo",
    token=os.environ["ASTRA_DB_APPLICATION_TOKEN"],
    api_endpoint=os.environ["ASTRA_DB_API_ENDPOINT"],
    namespace=os.environ.get("ASTRA_DB_KEYSPACE"),
)
retriever = vector_store.as_retriever(search_kwargs={"k": 3})
# For demo reasons, let's ensure there are rows on the vector store.
# Please remove this and/or adapt to your use case!
inserted_lines = populate(vector_store)
if inserted_lines:
    print(f"Done ({inserted_lines} lines inserted).")
entomology_template = """
You are an expert entomologist, tasked with answering enthusiast biologists' questions.
You must answer based only on the provided context, do not make up any fact.
Your answers must be concise and to the point, but strive to provide scientific details
(such as family, order, Latin names, and so on when appropriate).
You MUST refuse to answer questions on other topics than entomology,
as well as questions whose answer is not found in the provided context.
CONTEXT:
{context}
QUESTION: {question}
YOUR ANSWER:"""
entomology_prompt = ChatPromptTemplate.from_template(entomology_template)
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | entomology_prompt
    | llm
    | StrOutputParser()
)
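The first step of the chain feeds the same input question to two branches: the retriever, which produces the context, and a passthrough, which keeps the question unchanged. Both values then fill the prompt placeholders. The following plain-Python sketch imitates only that data flow. It does not use LangChain; `fake_retriever`, `fan_out`, `build_prompt` and the shortened `PROMPT` are hypothetical stand-ins, and joining the snippets with newlines is a simplification made here for readability.

```python
def fake_retriever(question):
    # Hypothetical stand-in for the retriever: always returns one fixed snippet.
    return ["Order Coleoptera: the largest order in the insect world."]


def fan_out(question):
    # Both keys are computed from the same input; "question" is passed on as-is.
    return {
        "context": "\n".join(fake_retriever(question)),
        "question": question,
    }


# Shortened stand-in for the prompt template, with the same two placeholders.
PROMPT = "CONTEXT:\n{context}\n\nQUESTION: {question}\n\nYOUR ANSWER:"


def build_prompt(question):
    # Fill the template placeholders with the fanned-out values.
    return PROMPT.format(**fan_out(question))


if __name__ == "__main__":
    print(build_prompt("Are there more coleoptera or bugs?"))
```

In the actual chain the resulting prompt is then sent to the chat model and the reply is parsed to a string; those two stages are left out of the sketch.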


@@ -0,0 +1,29 @@
import os

BASE_DIR = os.path.abspath(os.path.dirname(__file__))


def populate(vector_store):
    # is the store empty? find out with a probe search
    hits = vector_store.similarity_search_by_vector(
        embedding=[0.001] * 1536,
        k=1,
    )
    #
    if len(hits) == 0:
        # this seems a first run:
        # must populate the vector store
        src_file_name = os.path.join(BASE_DIR, "..", "sources.txt")
        lines = [
            line.strip()
            for line in open(src_file_name).readlines()
            if line.strip()
            if line[0] != "#"
        ]
        # deterministic IDs to prevent duplicates on multiple runs
        ids = ["_".join(line.split(" ")[:2]).lower().replace(":", "") for line in lines]
        #
        vector_store.add_texts(texts=lines, ids=ids)
        return len(lines)
    else:
        return 0
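The line filtering and the deterministic ID expression used in `populate` can be tried on their own, without a vector store. The sketch below isolates the two steps; `select_lines` and `make_id` are hypothetical helper names introduced for illustration, while the expressions inside them are copied from the function above.

```python
def select_lines(raw_lines):
    # Keep non-empty lines that do not start with "#" (comment lines).
    return [
        line.strip()
        for line in raw_lines
        if line.strip()
        if line[0] != "#"
    ]


def make_id(line):
    # First two space-separated words, joined by "_", lowercased, colons removed.
    return "_".join(line.split(" ")[:2]).lower().replace(":", "")


if __name__ == "__main__":
    raw = [
        "# source: some comment line\n",
        "\n",
        "Order Thysanura: The silverfish and firebrats are found in the order Thysanura.\n",
    ]
    for kept_line in select_lines(raw):
        print(make_id(kept_line))  # prints "order_thysanura"
```

Because the ID depends only on the text of the line, inserting the same line again maps to the same ID, which is what the comment in `populate` relies on to prevent duplicates on repeated runs.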


@@ -0,0 +1,5 @@
from astradb_entomology_rag import chain
if __name__ == "__main__":
    response = chain.invoke("Are there more coleoptera or bugs?")
    print(response)

2070
templates/rag-astradb/poetry.lock generated Normal file

File diff suppressed because it is too large


@@ -0,0 +1,28 @@
[tool.poetry]
name = "astradb_entomology_rag"
version = "0.0.1"
description = ""
authors = [
"Stefano Lottini <stefano.lottini@datastax.com>",
]
readme = "README.md"
[tool.poetry.dependencies]
python = ">=3.8.1,<4.0"
langchain = ">=0.0.325"
openai = "^0.28.1"
tiktoken = "^0.5.1"
astrapy = "^0.5.3"
[tool.poetry.group.dev.dependencies]
langchain-cli = ">=0.0.15"
[tool.langserve]
export_module = "astradb_entomology_rag"
export_attr = "chain"
[build-system]
requires = [
"poetry-core",
]
build-backend = "poetry.core.masonry.api"


@@ -0,0 +1,31 @@
# source: https://www.thoughtco.com/a-guide-to-the-twenty-nine-insect-orders-1968419
Order Thysanura: The silverfish and firebrats are found in the order Thysanura. They are wingless insects often found in people's attics, and have a lifespan of several years. There are about 600 species worldwide.
Order Diplura: Diplurans are the most primitive insect species, with no eyes or wings. They have the unusual ability among insects to regenerate body parts. There are over 400 members of the order Diplura in the world.
Order Protura: Another very primitive group, the proturans have no eyes, no antennae, and no wings. They are uncommon, with perhaps less than 100 species known.
Order Collembola: The order Collembola includes the springtails, primitive insects without wings. There are approximately 2,000 species of Collembola worldwide.
Order Ephemeroptera: The mayflies of order Ephemeroptera are short-lived, and undergo incomplete metamorphosis. The larvae are aquatic, feeding on algae and other plant life. Entomologists have described about 2,100 species worldwide.
Order Odonata: The order Odonata includes dragonflies and damselflies, which undergo incomplete metamorphosis. They are predators of other insects, even in their immature stage. There are about 5,000 species in the order Odonata.
Order Plecoptera: The stoneflies of order Plecoptera are aquatic and undergo incomplete metamorphosis. The nymphs live under rocks in well flowing streams. Adults are usually seen on the ground along stream and river banks. There are roughly 3,000 species in this group.
Order Grylloblatodea: Sometimes referred to as "living fossils," the insects of the order Grylloblatodea have changed little from their ancient ancestors. This order is the smallest of all the insect orders, with perhaps only 25 known species living today. Grylloblatodea live at elevations above 1500 ft., and are commonly named ice bugs or rock crawlers.
Order Orthoptera: These are familiar insects (grasshoppers, locusts, katydids, and crickets) and one of the largest orders of herbivorous insects. Many species in the order Orthoptera can produce and detect sounds. Approximately 20,000 species exist in this group.
Order Phasmida: The order Phasmida are masters of camouflage, the stick and leaf insects. They undergo incomplete metamorphosis and feed on leaves. There are some 3,000 insects in this group, but only a small fraction of this number is leaf insects. Stick insects are the longest insects in the world.
Order Dermaptera: This order contains the earwigs, an easily recognized insect that often has pincers at the end of the abdomen. Many earwigs are scavengers, eating both plant and animal matter. The order Dermaptera includes less than 2,000 species.
Order Embiidina: The order Embioptera is another ancient order with few species, perhaps only 200 worldwide. The web spinners have silk glands in their front legs and weave nests under leaf litter and in tunnels where they live. Webspinners live in tropical or subtropical climates.
Order Dictyoptera: The order Dictyoptera includes roaches and mantids. Both groups have long, segmented antennae and leathery forewings held tightly against their backs. They undergo incomplete metamorphosis. Worldwide, there are approximately 6,000 species in this order, most living in tropical regions.
Order Isoptera: Termites feed on wood and are important decomposers in forest ecosystems. They also feed on wood products and are thought of as pests for the destruction they cause to man-made structures. There are between 2,000 and 3,000 species in this order.
Order Zoraptera: Little is known about the angel insects, which belong to the order Zoraptera. Though they are grouped with winged insects, many are actually wingless. Members of this group are blind, small, and often found in decaying wood. There are only about 30 described species worldwide.
Order Psocoptera: Bark lice forage on algae, lichen, and fungus in moist, dark places. Booklice frequent human dwellings, where they feed on book paste and grains. They undergo incomplete metamorphosis. Entomologists have named about 3,200 species in the order Psocoptera.
Order Mallophaga: Biting lice are ectoparasites that feed on birds and some mammals. There are an estimated 3,000 species in the order Mallophaga, all of which undergo incomplete metamorphosis.
Order Siphunculata: The order Siphunculata are the sucking lice, which feed on the fresh blood of mammals. Their mouthparts are adapted for sucking or siphoning blood. There are only about 500 species of sucking lice.
Order Hemiptera: Most people use the term "bugs" to mean insects; an entomologist uses the term to refer to the order Hemiptera. The Hemiptera are the true bugs, and include cicadas, aphids, and spittlebugs, and others. This is a large group of over 70,000 species worldwide.
Order Thysanoptera: The thrips of order Thysanoptera are small insects that feed on plant tissue. Many are considered agricultural pests for this reason. Some thrips prey on other small insects as well. This order contains about 5,000 species.
Order Neuroptera: Commonly called the order of lacewings, this group actually includes a variety of other insects, too: dobsonflies, owlflies, mantidflies, antlions, snakeflies, and alderflies. Insects in the order Neuroptera undergo complete metamorphosis. Worldwide, there are over 5,500 species in this group.
Order Mecoptera: This order includes the scorpionflies, which live in moist, wooded habitats. Scorpionflies are omnivorous in both their larval and adult forms. The larvae are caterpillar-like. There are less than 500 described species in the order Mecoptera.
Order Siphonaptera: Pet lovers fear insects in the order Siphonaptera - the fleas. Fleas are blood-sucking ectoparasites that feed on mammals, and rarely, birds. There are well over 2,000 species of fleas in the world.
Order Coleoptera: This group, the beetles and weevils, is the largest order in the insect world, with over 300,000 distinct species known. The order Coleoptera includes well-known families: june beetles, lady beetles, click beetles, and fireflies. All have hardened forewings that fold over the abdomen to protect the delicate hindwings used for flight.
Order Strepsiptera: Insects in this group are parasites of other insects, particularly bees, grasshoppers, and the true bugs. The immature Strepsiptera lies in wait on a flower and quickly burrows into any host insect that comes along. Strepsiptera undergo complete metamorphosis and pupate within the host insect's body.
Order Diptera: Diptera is one of the largest orders, with nearly 100,000 insects named to the order. These are the true flies, mosquitoes, and gnats. Insects in this group have modified hindwings which are used for balance during flight. The forewings function as the propellers for flying.
Order Lepidoptera: The butterflies and moths of the order Lepidoptera comprise the second largest group in the class Insecta. These well-known insects have scaly wings with interesting colors and patterns. You can often identify an insect in this order just by the wing shape and color.
Order Trichoptera: Caddisflies are nocturnal as adults and aquatic when immature. The caddisfly adults have silky hairs on their wings and body, which is key to identifying a Trichoptera member. The larvae spin traps for prey with silk. They also make cases from the silk and other materials that they carry and use for protection.
Order Hymenoptera: The order Hymenoptera includes many of the most common insects - ants, bees, and wasps. The larvae of some wasps cause trees to form galls, which then provides food for the immature wasps. Other wasps are parasitic, living in caterpillars, beetles, or even aphids. This is the third-largest insect order with just over 100,000 species.