Harrison/deeplake (#1316)

Co-authored-by: Davit Buniatyan <d@activeloop.ai>
This commit is contained in:
Harrison Chase
2023-02-26 22:35:04 -08:00
committed by GitHub
parent aaad6cc954
commit 166cda2cc6
9 changed files with 811 additions and 1 deletions

View File

@@ -0,0 +1,25 @@
# Deep Lake
This page covers how to use the Deep Lake ecosystem within LangChain.
It is broken into two parts: installation and setup, and then references to specific Deep Lake wrappers. For more information.
1. Here is [whitepaper](https://www.deeplake.ai/whitepaper) and [academic paper](https://arxiv.org/pdf/2209.10785.pdf) for Deep Lake
2. Here is a set of additional resources available for review: [Deep Lake](https://github.com/activeloopai/deeplake), [Getting Started](https://docs.activeloop.ai/getting-started) and [Tutorials](https://docs.activeloop.ai/hub-tutorials)
## Installation and Setup
- Install the Python package with `pip install deeplake`
## Wrappers
### VectorStore
There exists a wrapper around Deep Lake, a data lake for Deep Learning applications, allowing you to use it as a vectorstore (for now), whether for semantic search or example selection.
To import this vectorstore:
```python
from langchain.vectorstores import DeepLake
```
For a more detailed walkthrough of the Deep Lake wrapper, see [this notebook](../modules/indexes/vectorstore_examples/deeplake.ipynb)

View File

@@ -36,6 +36,8 @@ In the below guides, we cover different types of vectorstores and how to use the
`Chroma <./vectorstore_examples/chroma.html>`_: A walkthrough of how to use the Chroma vectorstore wrapper.
`DeepLake <./vectorstore_examples/deeplake.html>`_: A walkthrough of how to use the Deep Lake, data lake, wrapper.
`FAISS <./vectorstore_examples/faiss.html>`_: A walkthrough of how to use the FAISS vectorstore wrapper.
`Elastic Search <./vectorstore_examples/elasticsearch.html>`_: A walkthrough of how to use the ElasticSearch wrapper.

View File

@@ -0,0 +1,234 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Deep Lake\n",
"\n",
"This notebook showcases basic functionality related to Deep Lake. While Deep Lake can store embeddings, it is capable of storing any type of data. It is a fully fledged serverless data lake with version control, query engine and streaming dataloader to deep learning frameworks. \n",
"\n",
"For more information, please see the Deep Lake [documentation](docs.activeloop.ai) or [api reference](docs.deeplake.ai)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"from langchain.embeddings.openai import OpenAIEmbeddings\n",
"from langchain.text_splitter import CharacterTextSplitter\n",
"from langchain.vectorstores import DeepLake\n",
"from langchain.document_loaders import TextLoader"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders import TextLoader\n",
"loader = TextLoader('../../state_of_the_union.txt')\n",
"documents = loader.load()\n",
"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
"docs = text_splitter.split_documents(documents)\n",
"\n",
"embeddings = OpenAIEmbeddings()"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Evaluating ingest: 100%|██████████| 41/41 [00:00<00:00\n"
]
}
],
"source": [
"db = DeepLake.from_documents(docs, embeddings)\n",
"\n",
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"docs = db.similarity_search(query)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"In state after state, new laws have been passed, not only to suppress the vote, but to subvert entire elections. \n",
"\n",
"We cannot let this happen. \n",
"\n",
"Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
"\n",
"Tonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n",
"\n",
"One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
"\n",
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.\n"
]
}
],
"source": [
"print(docs[0].page_content)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Deep Lake datasets on cloud or local\n",
"By default deep lake datasets are stored in memory, in case you want to persist locally or to any object storage you can simply provide path to the dataset. You can retrieve token from [app.activeloop.ai](https://app.activeloop.ai/)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"/bin/bash: -c: line 0: syntax error near unexpected token `newline'\n",
"/bin/bash: -c: line 0: `activeloop login -t <token>'\n"
]
}
],
"source": [
"!activeloop login -t <token>"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Evaluating ingest: 100%|██████████| 4/4 [00:00<00:00\n"
]
}
],
"source": [
"# Embed and store the texts\n",
"dataset_path = \"hub://{username}/{dataset_name}\" # could be also ./local/path (much faster locally), s3://bucket/path/to/dataset, gcs://, etc.\n",
"\n",
"embedding = OpenAIEmbeddings()\n",
"vectordb = DeepLake.from_documents(documents=docs, embedding=embedding, dataset_path=dataset_path)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"In state after state, new laws have been passed, not only to suppress the vote, but to subvert entire elections. \n",
"\n",
"We cannot let this happen. \n",
"\n",
"Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
"\n",
"Tonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n",
"\n",
"One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
"\n",
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.\n"
]
}
],
"source": [
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"docs = db.similarity_search(query)\n",
"print(docs[0].page_content)"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dataset(path='./local/path', tensors=['embedding', 'ids', 'metadata', 'text'])\n",
"\n",
" tensor htype shape dtype compression\n",
" ------- ------- ------- ------- ------- \n",
" embedding generic (4, 1536) None None \n",
" ids text (4, 1) str None \n",
" metadata json (4, 1) str None \n",
" text text (4, 1) str None \n"
]
}
],
"source": [
"vectordb.ds.summary()"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"embeddings = vectordb.ds.embedding.numpy()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "base",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
},
"orig_nbformat": 4,
"vscode": {
"interpreter": {
"hash": "7b14174bb6f9d4680b62ac2a6390e1ce94fbfabf172a10844870451d539c58d6"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}

View File

@@ -50,6 +50,8 @@ The following use cases require specific installs and api keys:
- _OpenSearch_:
- Install requirements with `pip install opensearch-py`
- If you want to set up OpenSearch on your local, [here](https://opensearch.org/docs/latest/)
- _DeepLake_:
- Install requirements with `pip install deeplake`
If you are using the `NLTKTextSplitter` or the `SpacyTextSplitter`, you will also need to install the appropriate models. For example, if you want to use the `SpacyTextSplitter`, you will need to install the `en_core_web_sm` model with `python -m spacy download en_core_web_sm`. Similarly, if you want to use the `NLTKTextSplitter`, you will need to install the `punkt` model with `python -m nltk.downloader punkt`.