# Deep Lake

This notebook showcases basic functionality of the Deep Lake vector store. Deep Lake can store embeddings, but it can also store any other type of data. It is a fully fledged serverless data lake with version control, a query engine, and a streaming dataloader for deep learning frameworks.

For more information, please see the Deep Lake [documentation](https://docs.activeloop.ai) or [API reference](https://docs.deeplake.ai).

In [None]:
!python3 -m pip install openai deeplake

In [None]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import DeepLake
from langchain.document_loaders import TextLoader

In [None]:
import os
os.environ['OPENAI_API_KEY'] = 'sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'

In [None]:
loader = TextLoader('../../../state_of_the_union.txt')
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()
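
To make the `chunk_size` and `chunk_overlap` parameters used above more concrete, here is a simplified sketch of fixed-size character chunking. It is not how `CharacterTextSplitter` is implemented (the real splitter works on separator-delimited pieces); the helper `chunk_text` is a hypothetical name used only for illustration.

```python
from typing import List


def chunk_text(text: str, chunk_size: int, chunk_overlap: int = 0) -> List[str]:
    """Split text into windows of at most chunk_size characters.

    Hypothetical helper for illustration only. Consecutive windows
    share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


no_overlap = chunk_text("abcdefghij", chunk_size=4)
with_overlap = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)
print(no_overlap)
print(with_overlap)
```

With a non-zero overlap, the end of one chunk is repeated at the start of the next, which can help preserve context that would otherwise be cut at a chunk boundary.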

In [None]:
db = DeepLake.from_documents(docs, embeddings)

query = "What did the president say about Ketanji Brown Jackson"
# Use a separate name so the full list of split documents in `docs` is not overwritten
results = db.similarity_search(query)

In [None]:
print(results[0].page_content)

### Retrieval Question/Answering

In [None]:
from langchain.chains import RetrievalQA
from langchain.llms import OpenAIChat

qa = RetrievalQA.from_chain_type(llm=OpenAIChat(model='gpt-3.5-turbo'), chain_type='stuff', retriever=db.as_retriever())

In [None]:
query = 'What did the president say about Ketanji Brown Jackson'
qa.run(query)
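
The `'stuff'` chain type places all retrieved documents into a single prompt for the language model. The sketch below illustrates that idea only; `build_stuff_prompt` and the prompt wording are hypothetical and are not LangChain's actual template.

```python
from typing import List


def build_stuff_prompt(question: str, contexts: List[str]) -> str:
    """Concatenate every retrieved chunk into one prompt (illustrative only)."""
    context_block = "\n\n".join(contexts)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context_block}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt(
    "What did the president say about Ketanji Brown Jackson",
    ["first retrieved chunk", "second retrieved chunk"],
)
print(prompt)
```

Because every chunk goes into one prompt, this approach only works when the retrieved text fits within the model's context window.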

### Attribute-based filtering in metadata

In [None]:
import random

for d in docs:
    d.metadata['year'] = random.randint(2012, 2014)

db = DeepLake.from_documents(docs, embeddings)

In [None]:
db.similarity_search('What did the president say about Ketanji Brown Jackson', filter={'year': 2013})
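
The `filter` argument restricts the search to documents whose metadata matches the given key/value pairs. A minimal plain-Python sketch of that idea follows, assuming exact-match semantics on every key; `matches_filter` and the sample records are hypothetical and do not reflect Deep Lake's internal implementation.

```python
def matches_filter(metadata: dict, flt: dict) -> bool:
    """Return True when every key in flt has the same value in metadata."""
    return all(metadata.get(key) == value for key, value in flt.items())


# Hypothetical records standing in for stored documents
records = [
    {"text": "chunk a", "metadata": {"year": 2012}},
    {"text": "chunk b", "metadata": {"year": 2013}},
    {"text": "chunk c", "metadata": {"year": 2013}},
    {"text": "chunk d", "metadata": {}},
]

selected = [r["text"] for r in records if matches_filter(r["metadata"], {"year": 2013})]
print(selected)
```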

### Choosing a distance function
The `distance_metric` argument selects the distance function: `L2` for Euclidean, `L1` for Manhattan, `Max` for L-infinity distance, `cos` for cosine similarity, and `dot` for dot product.

In [None]:
db.similarity_search('What did the president say about Ketanji Brown Jackson?', distance_metric='cos')
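
To show what the metric names above compute, here is a plain-Python sketch of the standard formulas behind them. The function names are hypothetical and this is not Deep Lake's implementation, only the textbook definitions.

```python
import math
from typing import Sequence


def l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def l1(a: Sequence[float], b: Sequence[float]) -> float:
    """L1 distance: sum of absolute differences (often called Manhattan distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def linf(a: Sequence[float], b: Sequence[float]) -> float:
    """L-infinity distance: largest absolute difference in any coordinate."""
    return max(abs(x - y) for x, y in zip(a, b))


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product: sum of coordinate-wise products."""
    return sum(x * y for x, y in zip(a, b))


def cos(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity: dot product of the two vectors divided by their norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


origin, point = [0.0, 0.0], [3.0, 4.0]
print(l2(origin, point), l1(origin, point), linf(origin, point))
print(dot([1.0, 2.0], [3.0, 4.0]), cos([1.0, 0.0], [0.0, 1.0]))
```

For the three distances, smaller values mean closer vectors; for cosine similarity and dot product, larger values mean more similar vectors.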

### Maximal marginal relevance
Maximal marginal relevance (MMR) search returns results that are relevant to the query while limiting redundancy among them.

In [None]:
db.max_marginal_relevance_search('What did the president say about Ketanji Brown Jackson?')
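
The idea behind maximal marginal relevance can be sketched in a few lines of plain Python: greedily pick the candidate that is most similar to the query, penalized by its similarity to anything already picked. This is a minimal illustration under stated assumptions (cosine similarity, a trade-off weight called `lambda_mult`, and a hypothetical helper `mmr_select`), not the code the vector store runs.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr_select(
    query: Sequence[float],
    candidates: List[Sequence[float]],
    k: int = 2,
    lambda_mult: float = 0.5,
) -> List[int]:
    """Return indices of up to k candidates chosen greedily by the MMR score."""
    selected: List[int] = []
    remaining = list(range(len(candidates)))

    def score(i: int) -> float:
        relevance = cosine(query, candidates[i])
        # Redundancy is the highest similarity to any already selected item
        redundancy = max(
            (cosine(candidates[i], candidates[j]) for j in selected), default=0.0
        )
        return lambda_mult * relevance - (1 - lambda_mult) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


def unit(degrees: float) -> List[float]:
    """2-D unit vector at the given angle, used to build toy embeddings."""
    radians = math.radians(degrees)
    return [math.cos(radians), math.sin(radians)]


query = unit(45)
candidates = [unit(40), unit(38), unit(70)]  # index 1 nearly duplicates index 0

diverse = mmr_select(query, candidates, k=2, lambda_mult=0.5)
relevance_only = mmr_select(query, candidates, k=2, lambda_mult=1.0)
print(diverse, relevance_only)
```

In this toy example, setting `lambda_mult=1.0` ranks purely by relevance and returns the two near-duplicates, while `lambda_mult=0.5` swaps the second one for the less similar candidate.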

## Deep Lake datasets in the cloud (Activeloop, AWS, GCS, etc.) or locally
By default, Deep Lake datasets are stored in memory. To persist a dataset locally or to any object storage, provide a path to the dataset. You can retrieve a token from [app.activeloop.ai](https://app.activeloop.ai/).

In [None]:
!activeloop login -t <token>

In [None]:
# Embed and store the texts.
# Replace {username} and {dataset_name} with your own values. The path can also be
# ./local/path (much faster locally), s3://bucket/path/to/dataset, gcs://path/to/dataset, etc.
dataset_path = "hub://{username}/{dataset_name}"

embedding = OpenAIEmbeddings()
vectordb = DeepLake.from_documents(documents=docs, embedding=embedding, dataset_path=dataset_path)

In [None]:
query = "What did the president say about Ketanji Brown Jackson"
# Query the persisted dataset created above rather than the earlier in-memory `db`
results = vectordb.similarity_search(query)
print(results[0].page_content)

In [None]:
vectordb.ds.summary()

In [None]:
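# Load the stored embedding vectors; a new name avoids overwriting the `embeddings` model object
embedding_vectors = vectordb.ds.embedding.numpy()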