Compare commits


24 Commits

Author SHA1 Message Date
Harrison Chase
82c080c6e6 bump version to 0078 (#908) 2023-02-06 00:32:44 -08:00
Harrison Chase
71e662e88d update docs (#905) 2023-02-06 00:26:20 -08:00
Harrison Chase
53d56d7650 Harrison/unstructured support (#903) 2023-02-05 23:02:07 -08:00
Harrison Chase
2a68be3e8d chat vector db chain (#902) 2023-02-05 21:38:47 -08:00
James Briggs
8217a2f26c Update pinecone init details in docs (#898)
PR to fix outdated environment details in the docs, see issue #897 

I added code comments as pointers to where users go to get API keys, and
where they can find the relevant environment variable.
2023-02-05 15:21:56 -08:00
Bagatur
7658263bfb Check type of LLM.generate prompts arg (#886)
I was passing the prompt in directly as a string and getting nonsense outputs.
I had to inspect the source code to realize that the first arg should be a
list. It would be nice if there were an explicit error or warning, since this
seems like a common mistake.
2023-02-04 22:49:17 -08:00
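For context, a minimal sketch of the call shape this check enforces (the model and key setup are illustrative):

```python
from langchain.llms import OpenAI

llm = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
# llm.generate("Tell me a joke")           # a bare string now raises a ValueError
result = llm.generate(["Tell me a joke"])  # correct: a list of prompt strings
print(result.generations[0][0].text)
```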
Samantha Whitmore
32b11101d3 Get elements of ActionInput on newlines (#889)
The re.DOTALL flag in Python's re (regular expression) module makes the
. (dot) metacharacter match newline characters as well as any other
character.

Without re.DOTALL, the . metacharacter only matches any character except
for a newline character. With re.DOTALL, the . metacharacter matches any
character, including newline characters.
2023-02-04 20:42:25 -08:00
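To illustrate the behavior described above with only the standard library (the sample LLM output string is made up):

```python
import re

llm_output = "Action: Search\nAction Input: first line\nsecond line"
regex = r"Action: (.*?)\nAction Input: (.*)"

plain = re.search(regex, llm_output)
dotall = re.search(regex, llm_output, re.DOTALL)
print(plain.group(2))   # 'first line' -- '.' stops at the newline
print(dotall.group(2))  # 'first line\nsecond line' -- '.' matches newlines too
```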
Harrison Chase
1614c5f5fd fix flaky tests (#892) 2023-02-04 20:41:33 -08:00
Harrison Chase
a2b699dcd2 prompt template from string (#884) 2023-02-04 17:04:58 -08:00
Alex
7cc44b3bdb Add to gallery (#882) 2023-02-04 09:45:20 -08:00
Harrison Chase
0b9f086d36 Harrison/docs splitter (#879) 2023-02-03 15:09:13 -08:00
Harrison Chase
bcfbc7a818 version 0077 (#878) 2023-02-03 14:49:52 -08:00
Ryan Walker
1dd0733515 Fix small typo in getting started docs (#876)
Just noticed this little typo while reading the docs, thought I'd open a
PR!
2023-02-03 14:22:12 -08:00
Zach Schillaci
4c79100b15 Correct prompt typo + update example for SQLDatabaseChain (#868)
See https://github.com/hwchase17/langchain/issues/821
2023-02-03 08:34:41 -08:00
Harrison Chase
777aaff841 fix routing to tiktoken encoder (#866) 2023-02-02 22:08:14 -08:00
Harrison Chase
e9ef08862d validate template (#865) 2023-02-02 22:08:01 -08:00
Harrison Chase
364b771743 sql return direct (#864) 2023-02-02 22:07:41 -08:00
Harrison Chase
483441d305 pass kwargs through to loading (#863) 2023-02-02 22:07:26 -08:00
Harrison Chase
8df6b68093 fix length based example selector (#862) 2023-02-02 22:06:56 -08:00
Harrison Chase
3f48eed5bd Harrison/milvus (#856)
Signed-off-by: Filip Haltmayer <filip.haltmayer@zilliz.com>
Signed-off-by: Frank Liu <frank.liu@zilliz.com>
Co-authored-by: Filip Haltmayer <81822489+filip-halt@users.noreply.github.com>
Co-authored-by: Frank Liu <frank@frankzliu.com>
2023-02-02 22:05:47 -08:00
Ankush Gola
933441cc52 Add retry to OpenAI llm (#849)
add ability to retry when certain exceptions are raised by
`openai.Completions.create`

Test plan: ran all OpenAI integration tests.
2023-02-02 19:56:26 -08:00
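A standalone sketch of the retry pattern this adds, using the same tenacity primitives as the diff below; the flaky function and exception type are placeholders:

```python
import logging

from tenacity import (
    after_log,
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential,
)

logger = logging.getLogger(__name__)

@retry(
    reraise=True,
    stop=stop_after_attempt(6),                          # matches the max_retries default
    wait=wait_exponential(multiplier=1, min=4, max=10),  # 4s, 8s, then capped at 10s
    retry=retry_if_exception_type(ConnectionError),      # placeholder for openai's error types
    after=after_log(logger, logging.DEBUG),
)
def flaky_call() -> str:
    """Placeholder for openai.Completion.create(...)."""
    return "ok"

print(flaky_call())
```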
kahkeng
4a8f5cdf4b Add alternative token-based text splitter (#816)
This does not involve a separator, and will naively chunk input text at
the appropriate boundaries in token space.

This is helpful when we have strict token length limits and need the
specified chunk size to be followed exactly, and we can't use aggressive
separators like spaces to guarantee the absence of long strings.

CharacterTextSplitter will let these strings through without splitting
them, which could cause overflow errors downstream.

Splitting at arbitrary token boundaries is not ideal but is hopefully
mitigated by having a decent overlap quantity. It also produces chunks
with exactly the desired number of tokens, instead of sometimes
overcounting when shorter strings are concatenated.

Potentially also helps with #528.
2023-02-02 19:55:13 -08:00
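A rough sketch of the chunking arithmetic, mirroring the TokenTextSplitter added below (the encoding name and sizes are just examples):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("gpt2")
chunk_size, chunk_overlap = 10, 2
input_ids = enc.encode("some long document text " * 20)

chunks, start = [], 0
while start < len(input_ids):
    chunks.append(enc.decode(input_ids[start : start + chunk_size]))
    start += chunk_size - chunk_overlap  # stride leaves a 2-token overlap between chunks
print(len(chunks), repr(chunks[0]))
```

Every chunk except possibly the last has exactly `chunk_size` tokens, which is the guarantee character-based splitting can't make.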
Harrison Chase
523ad2e6bd vercel deployments (#850) 2023-02-02 19:54:09 -08:00
Harrison Chase
fc0cfd7d1f docs (#848) 2023-02-02 11:35:36 -08:00
50 changed files with 1754 additions and 86 deletions

View File

@@ -27,4 +27,8 @@ This is heavily influenced by James Weaver's [excellent examples](https://huggin
This repo serves as a template for how to deploy a LangChain app with [Beam](https://beam.cloud).
It implements a Question Answering app and contains instructions for deploying the app as a serverless REST API.
## [Vercel](https://github.com/homanp/vercel-langchain)
A minimal example on how to run LangChain on Vercel using Flask.

View File

@@ -199,6 +199,17 @@ Open Source
+++
This repo is a simple demonstration of using LangChain to do fact-checking with prompt chaining.
---
.. link-button:: https://github.com/arc53/docsgpt
:type: url
:text: DocsGPT
:classes: stretched-link btn-lg
+++
Answer questions about the documentation of any project
Misc. Colab Notebooks
~~~~~~~~~~~~~~~

View File

@@ -162,7 +162,7 @@ This is one of the simpler types of chains, but understanding how it works will
`````{dropdown} Agents: Dynamically call chains based on user input
- So for the chains we've looked at run in a predetermined order.
+ So far the chains we've looked at run in a predetermined order.
Agents no longer do: they use an LLM to determine which actions to take and in what order. An action can either be using a tool and observing its output, or returning to the user.

View File

@@ -51,6 +51,8 @@ These modules are, in increasing order of complexity:
- `LLMs <./modules/llms.html>`_: This includes a generic interface for all LLMs, and common utilities for working with LLMs.
- `Document Loaders <./modules/document_loaders.html>`_: This includes a standard interface for loading documents, as well as specific integrations to all types of text data sources.
- `Utils <./modules/utils.html>`_: Language models are often more powerful when interacting with other sources of knowledge or computation. This can include Python REPLs, embeddings, search engines, and more. LangChain provides a large collection of common utils to use in your application.
- `Chains <./modules/chains.html>`_: Chains go beyond just a single LLM call, and are sequences of calls (whether to an LLM or a different utility). LangChain provides a standard interface for chains, lots of integrations with other tools, and end-to-end chains for common applications.
@@ -68,6 +70,7 @@ These modules are, in increasing order of complexity:
./modules/prompts.md
./modules/llms.md
./modules/document_loaders.md
./modules/utils.md
./modules/chains.md
./modules/agents.md

View File

@@ -0,0 +1,165 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "134a0785",
"metadata": {},
"source": [
"# Chat Vector DB\n",
"\n",
"This notebook goes over how to set up a chain to chat with a vector database. The only difference because this chain and the [VectorDBQAChain](./vector_db_qa.ipynb) is that this allows for passing in of a chat history which can be used to allow for follow up questions."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "70c4e529",
"metadata": {},
"outputs": [],
"source": [
"from langchain.embeddings.openai import OpenAIEmbeddings\n",
"from langchain.vectorstores.faiss import FAISS\n",
"from langchain.text_splitter import CharacterTextSplitter\n",
"from langchain.llms import OpenAI\n",
"from langchain.chains import ChatVectorDBChain"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "a8930cf7",
"metadata": {},
"outputs": [],
"source": [
"with open('../../state_of_the_union.txt') as f:\n",
" state_of_the_union = f.read()\n",
"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
"texts = text_splitter.split_text(state_of_the_union)\n",
"\n",
"embeddings = OpenAIEmbeddings()\n",
"vectorstore = FAISS.from_texts(texts, embeddings)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "7b4110f3",
"metadata": {},
"outputs": [],
"source": [
"qa = ChatVectorDBChain.from_llm(OpenAI(temperature=0), vectorstore)"
]
},
{
"cell_type": "markdown",
"id": "3872432d",
"metadata": {},
"source": [
"Here's an example of asking a question with no chat history"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "7fe3e730",
"metadata": {},
"outputs": [],
"source": [
"chat_history = []\n",
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"result = qa({\"question\": query, \"chat_history\": chat_history})"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "bfff9cc8",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\" The president said that Ketanji Brown Jackson is one of the nation's top legal minds, a former top litigator in private practice, a former federal public defender, and from a family of public school educators and police officers. He also said that she is a consensus builder and has received a broad range of support from the Fraternal Order of Police to former judges appointed by Democrats and Republicans.\""
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"result[\"answer\"]"
]
},
{
"cell_type": "markdown",
"id": "9e46edf7",
"metadata": {},
"source": [
"Here's an example of asking a question with some chat history"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "00b4cf00",
"metadata": {},
"outputs": [],
"source": [
"chat_history = [(query, result[\"answer\"])]\n",
"query = \"Did he mention who she suceeded\"\n",
"result = qa({\"question\": query, \"chat_history\": chat_history})"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "f01828d1",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"' Justice Stephen Breyer'"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"result['answer']"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d0f869c6",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@@ -56,6 +56,15 @@
"llm = OpenAI(temperature=0)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "3d1e692e",
"metadata": {},
"source": [
"**NOTE:** For data-sensitive projects, you can specify `return_direct=True` in the `SQLDatabaseChain` initialization to directly return the output of the SQL query without any additional formatting. This prevents the LLM from seeing any contents within the database. Note, however, the LLM still has access to the database scheme (i.e. dialect, table and key names) by default."
]
},
{
"cell_type": "code",
"execution_count": 3,

View File

@@ -121,10 +121,51 @@
"llm_chain.predict(adjective=\"sad\", subject=\"ducks\")"
]
},
{
"cell_type": "markdown",
"id": "672f59d4",
"metadata": {},
"source": [
"## From string\n",
"You can also construct an LLMChain from a string template directly."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "f8bc262e",
"metadata": {},
"outputs": [],
"source": [
"template = \"\"\"Write a {adjective} poem about {subject}.\"\"\"\n",
"llm_chain = LLMChain.from_string(llm=OpenAI(temperature=0), template=template)\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "cb164a76",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\"\\n\\nThe ducks swim in the pond,\\nTheir feathers so soft and warm,\\nBut they can't help but feel so forlorn.\\n\\nTheir quacks echo in the air,\\nBut no one is there to hear,\\nFor they have no one to share.\\n\\nThe ducks paddle around in circles,\\nTheir heads hung low in despair,\\nFor they have no one to care.\\n\\nThe ducks look up to the sky,\\nBut no one is there to see,\\nFor they have no one to be.\\n\\nThe ducks drift away in the night,\\nTheir hearts filled with sorrow and pain,\\nFor they have no one to gain.\""
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"llm_chain.predict(adjective=\"sad\", subject=\"ducks\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8310cdaa",
"id": "9f0adbc7",
"metadata": {},
"outputs": [],
"source": []

View File

@@ -0,0 +1,29 @@
Document Loaders
==========================
Combining language models with your own text data is a powerful way to differentiate them.
The first step in doing this is to load the data into "documents" - a fancy way of saying some pieces of text.
This module is aimed at making this easy.
A primary driver of a lot of this is the `Unstructured <https://github.com/Unstructured-IO/unstructured>`_ python package.
This package is a great way to transform all types of files - text, powerpoint, images, html, pdf, etc - into text data.
For detailed instructions on how to get set up with Unstructured, see installation guidelines `here <https://github.com/Unstructured-IO/unstructured#coffee-getting-started>`_.
The following sections of documentation are provided:
- `Key Concepts <./document_loaders/key_concepts.html>`_: A conceptual guide going over the various concepts related to loading documents.
- `How-To Guides <./document_loaders/how_to_guides.html>`_: A collection of how-to guides. These highlight different types of loaders.
.. toctree::
:maxdepth: 1
:caption: Document Loaders
:name: Document Loaders
:hidden:
./document_loaders/key_concepts.md
./document_loaders/how_to_guides.rst

View File

@@ -0,0 +1,101 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "79f24a6b",
"metadata": {},
"source": [
"# Directory Loader\n",
"This covers how to use the DirectoryLoader to load all documents in a directory. Under the hood, this uses the [UnstructuredLoader](./unstructured_file.ipynb)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "019d8520",
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders import DirectoryLoader"
]
},
{
"cell_type": "markdown",
"id": "0c76cdc5",
"metadata": {},
"source": [
"We can use the `glob` parameter to control which files to load. Note that here it doesn't load the `.rst` file or the `.ipynb` files."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "891fe56f",
"metadata": {},
"outputs": [],
"source": [
"loader = DirectoryLoader('../', glob=\"**/*.md\")"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "addfe9cf",
"metadata": {},
"outputs": [],
"source": [
"docs = loader.load()"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "b042086d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(docs)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cbc8256b",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@@ -0,0 +1,82 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "1dc7df1d",
"metadata": {},
"source": [
"# Notion\n",
"This notebook covers how to load documents from a Notion database dump.\n",
"\n",
"In order to get this notion dump, follow these instructions:\n",
"\n",
"## 🧑 Instructions for ingesting your own dataset\n",
"\n",
"Export your dataset from Notion. You can do this by clicking on the three dots in the upper right hand corner and then clicking `Export`.\n",
"\n",
"When exporting, make sure to select the `Markdown & CSV` format option.\n",
"\n",
"This will produce a `.zip` file in your Downloads folder. Move the `.zip` file into this repository.\n",
"\n",
"Run the following command to unzip the zip file (replace the `Export...` with your own file name as needed).\n",
"\n",
"```shell\n",
"unzip Export-d3adfe0f-3131-4bf3-8987-a52017fc1bae.zip -d Notion_DB\n",
"```\n",
"\n",
"Run the following command to ingest the data."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "007c5cbf",
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders import NotionDirectoryLoader"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a1caec59",
"metadata": {},
"outputs": [],
"source": [
"loader = NotionDirectoryLoader(\"Notion_DB\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b1c30ff7",
"metadata": {},
"outputs": [],
"source": [
"docs = loader.load()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@@ -0,0 +1,78 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "17812129",
"metadata": {},
"source": [
"# ReadTheDocs Documentation\n",
"This notebook covers how to load content from html that was generated as part of a Read-The-Docs build.\n",
"\n",
"For an example of this in the wild, see [here](https://github.com/hwchase17/chat-langchain).\n",
"\n",
"This assumes that the html has already been scraped into a folder. This can be done by uncommenting and running the following command"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "84696e27",
"metadata": {},
"outputs": [],
"source": [
"#!wget -r -A.html -P rtdocs https://langchain.readthedocs.io/en/latest/"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "92dd950b",
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders import ReadTheDocsLoader"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "494567c3",
"metadata": {},
"outputs": [],
"source": [
"loader = ReadTheDocsLoader(\"rtdocs\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e2e6d6f0",
"metadata": {},
"outputs": [],
"source": [
"docs = loader.load()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@@ -0,0 +1,72 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "20deed05",
"metadata": {},
"source": [
"# Unstructured File Loader\n",
"This notebook covers how to use Unstructured to load files of many types. Unstructured currently supports loading of text files, powerpoints, html, pdfs, images, and more."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "79d3e549",
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders import UnstructuredFileLoader"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "2593d1dc",
"metadata": {},
"outputs": [],
"source": [
"loader = UnstructuredFileLoader(\"../../state_of_the_union.txt\")"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "fe34e941",
"metadata": {},
"outputs": [],
"source": [
"docs = loader.load()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "24e577e5",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@@ -0,0 +1,19 @@
How To Guides
====================================
There are a lot of different document loaders that LangChain supports. Below are how-to guides for working with them.
`File Loader <./examples/unstructured_file.html>`_: A walkthrough of how to use Unstructured to load files of arbitrary types (pdfs, txt, html, etc).
`Directory Loader <./examples/directory_loader.html>`_: A walkthrough of how to use Unstructured to load files from a given directory.
`Notion <./examples/notion.html>`_: A walkthrough of how to load data for an arbitrary Notion DB.
`ReadTheDocs <./examples/readthedocs_documentation.html>`_: A walkthrough of how to load data for documentation generated by ReadTheDocs.
.. toctree::
:maxdepth: 1
:glob:
:hidden:
examples/*

View File

@@ -0,0 +1,12 @@
# Key Concepts
## Document
This class is a container for document information. This contains two parts:
- `page_content`: The content of the actual page itself.
- `metadata`: The metadata associated with the document. This can be things like the file path, the url, etc.
## Loader
This base class is a way to load documents. It exposes a `load` method that returns `Document` objects.
## [Unstructured](https://github.com/Unstructured-IO/unstructured)
Unstructured is a python package specifically focused on transformations from raw documents to text.
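A minimal illustration of the two concepts together (the file path is hypothetical):

```python
from langchain.document_loaders import UnstructuredFileLoader

loader = UnstructuredFileLoader("example.txt")  # hypothetical path
docs = loader.load()                            # returns List[Document]
print(docs[0].page_content[:100])
print(docs[0].metadata)                         # e.g. {'source': 'example.txt'}
```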

View File

@@ -151,6 +151,47 @@
"multiple_input_prompt.format(adjective=\"funny\", content=\"chickens\")"
]
},
{
"cell_type": "markdown",
"id": "72f32ff2",
"metadata": {},
"source": [
"## From Template\n",
"You can also easily load a prompt template by just specifying the template, and not worrying about the input variables."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "2a81f2f8",
"metadata": {},
"outputs": [],
"source": [
"template = \"Tell me a {adjective} joke about {content}.\"\n",
"multiple_input_prompt = PromptTemplate.from_template(template)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "d365b144",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"PromptTemplate(input_variables=['adjective', 'content'], output_parser=None, template='Tell me a {adjective} joke about {content}.', template_format='f-string', validate_template=True)"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"multiple_input_prompt"
]
},
{
"cell_type": "markdown",
"id": "b2dd6154",

View File

@@ -1,7 +1,6 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "b118c9dc",
"metadata": {},
@@ -476,10 +475,59 @@
"print(texts[0])"
]
},
{
"cell_type": "markdown",
"id": "53049ff5",
"metadata": {},
"source": [
"## Token Text Splitter"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "a1a118b1",
"metadata": {},
"outputs": [],
"source": [
"from langchain.text_splitter import TokenTextSplitter"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "ef37c5d3",
"metadata": {},
"outputs": [],
"source": [
"text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "5750228a",
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Madam Speaker, Madam Vice President, our\n"
]
}
],
"source": [
"texts = text_splitter.split_text(state_of_the_union)\n",
"print(texts[0])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a1a118b1",
"id": "0905c1de",
"metadata": {},
"outputs": [],
"source": []
@@ -487,7 +535,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
@@ -501,7 +549,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.12 (main, Mar 26 2022, 15:51:15) \n[Clang 13.1.6 (clang-1316.0.21.2)]"
"version": "3.10.9"
},
"vscode": {
"interpreter": {

View File

@@ -16,7 +16,7 @@
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": 3,
"id": "965eecee",
"metadata": {
"pycharm": {
@@ -32,7 +32,7 @@
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": 4,
"id": "68481687",
"metadata": {
"pycharm": {
@@ -51,7 +51,7 @@
},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": 5,
"id": "015f4ff5",
"metadata": {
"pycharm": {
@@ -483,7 +483,10 @@
"import pinecone \n",
"\n",
"# initialize pinecone\n",
"pinecone.init(api_key=\"\", environment=\"us-west1-gcp\")\n",
"pinecone.init(\n",
" api_key=\"YOUR_API_KEY\", # find at app.pinecone.io\n",
" environment=\"YOUR_ENV\" # next to api key in console\n",
")\n",
"\n",
"index_name = \"langchain-demo\"\n",
"\n",
@@ -566,10 +569,74 @@
"docs[0]"
]
},
{
"cell_type": "markdown",
"id": "6c3ec797",
"metadata": {},
"source": [
"## Milvus\n",
"To run, you should have a Milvus instance up and running: https://milvus.io/docs/install_standalone-docker.md"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "be347313",
"metadata": {},
"outputs": [],
"source": [
"from langchain.vectorstores import Milvus"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "f2eee23f",
"metadata": {},
"outputs": [],
"source": [
"vector_db = Milvus.from_texts(\n",
" texts,\n",
" embeddings,\n",
" connection_args={\"host\": \"127.0.0.1\", \"port\": \"19530\"},\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "06bdb701",
"metadata": {},
"outputs": [],
"source": [
"docs = vector_db.similarity_search(query)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "7b3e94aa",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Document(page_content='In state after state, new laws have been passed, not only to suppress the vote, but to subvert entire elections. \\n\\nWe cannot let this happen. \\n\\nTonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \\n\\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \\n\\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \\n\\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', lookup_str='', metadata={}, lookup_index=0)"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs[0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8ffd66e2",
"id": "4af5a071",
"metadata": {},
"outputs": [],
"source": []

View File

@@ -40,7 +40,7 @@ def get_action_and_input(llm_output: str) -> Tuple[str, str]:
if FINAL_ANSWER_ACTION in llm_output:
return "Final Answer", llm_output.split(FINAL_ANSWER_ACTION)[-1].strip()
regex = r"Action: (.*?)\nAction Input: (.*)"
- match = re.search(regex, llm_output)
+ match = re.search(regex, llm_output, re.DOTALL)
if not match:
raise ValueError(f"Could not parse LLM output: `{llm_output}`")
action = match.group(1).strip()

View File

@@ -1,5 +1,6 @@
"""Chains are easily reusable components which can be linked together."""
from langchain.chains.api.base import APIChain
from langchain.chains.chat_vector_db.base import ChatVectorDBChain
from langchain.chains.conversation.base import ConversationChain
from langchain.chains.hyde.base import HypotheticalDocumentEmbedder
from langchain.chains.llm import LLMChain
@@ -42,4 +43,5 @@ __all__ = [
"SQLDatabaseSequentialChain",
"load_chain",
"HypotheticalDocumentEmbedder",
"ChatVectorDBChain",
]

View File

@@ -0,0 +1 @@
"""Chain for chatting with a vector database."""

View File

@@ -0,0 +1,85 @@
"""Chain for chatting with a vector database."""
from __future__ import annotations
from typing import Any, Dict, List, Tuple
from pydantic import BaseModel
from langchain.chains.base import Chain
from langchain.chains.chat_vector_db.prompts import CONDENSE_QUESTION_PROMPT, QA_PROMPT
from langchain.chains.combine_documents.base import BaseCombineDocumentsChain
from langchain.chains.llm import LLMChain
from langchain.chains.question_answering import load_qa_chain
from langchain.llms.base import BaseLLM
from langchain.prompts.base import BasePromptTemplate
from langchain.vectorstores.base import VectorStore
def _get_chat_history(chat_history: List[Tuple[str, str]]) -> str:
buffer = ""
for human_s, ai_s in chat_history:
human = "Human: " + human_s
ai = "Assistant: " + ai_s
buffer += "\n" + "\n".join([human, ai])
return buffer
class ChatVectorDBChain(Chain, BaseModel):
"""Chain for chatting with a vector database."""
vectorstore: VectorStore
combine_docs_chain: BaseCombineDocumentsChain
question_generator: LLMChain
output_key: str = "answer"
@property
def _chain_type(self) -> str:
return "chat-vector-db"
@property
def input_keys(self) -> List[str]:
"""Input keys."""
return ["question", "chat_history"]
@property
def output_keys(self) -> List[str]:
"""Output keys."""
return [self.output_key]
@classmethod
def from_llm(
cls,
llm: BaseLLM,
vectorstore: VectorStore,
condense_question_prompt: BasePromptTemplate = CONDENSE_QUESTION_PROMPT,
qa_prompt: BasePromptTemplate = QA_PROMPT,
chain_type: str = "stuff",
) -> ChatVectorDBChain:
"""Load chain from LLM."""
doc_chain = load_qa_chain(
llm,
chain_type=chain_type,
prompt=qa_prompt,
)
condense_question_chain = LLMChain(llm=llm, prompt=condense_question_prompt)
return cls(
vectorstore=vectorstore,
combine_docs_chain=doc_chain,
question_generator=condense_question_chain,
)
def _call(self, inputs: Dict[str, Any]) -> Dict[str, str]:
question = inputs["question"]
chat_history_str = _get_chat_history(inputs["chat_history"])
if chat_history_str:
new_question = self.question_generator.run(
question=question, chat_history=chat_history_str
)
else:
new_question = question
docs = self.vectorstore.similarity_search(new_question, k=4)
new_inputs = inputs.copy()
new_inputs["question"] = new_question
new_inputs["chat_history"] = chat_history_str
answer, _ = self.combine_docs_chain.combine_docs(docs, **new_inputs)
return {self.output_key: answer}

View File

@@ -0,0 +1,20 @@
# flake8: noqa
from langchain.prompts.prompt import PromptTemplate
_template = """Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question.
Chat History:
{chat_history}
Follow Up Input: {question}
Standalone question:"""
CONDENSE_QUESTION_PROMPT = PromptTemplate.from_template(_template)
prompt_template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
{context}
Question: {question}
Helpful Answer:"""
QA_PROMPT = PromptTemplate(
template=prompt_template, input_variables=["context", "question"]
)
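Illustrative only: rendering the condense-question prompt by hand to see what the question generator receives (the history strings are made up):

```python
print(CONDENSE_QUESTION_PROMPT.format(
    chat_history="Human: What did the president say about Ketanji Brown Jackson?\n"
                 "Assistant: He called her one of the nation's top legal minds.",
    question="Did he mention who she succeeded?",
))
```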

View File

@@ -1,5 +1,4 @@
"""Chain that just formats a prompt and calls an LLM."""
from string import Formatter
from typing import Any, Dict, List, Sequence, Union
from pydantic import BaseModel, Extra
@@ -132,10 +131,5 @@ class LLMChain(Chain, BaseModel):
@classmethod
def from_string(cls, llm: BaseLLM, template: str) -> Chain:
"""Create LLMChain from LLM and template."""
input_variables = {
v for _, v, _, _ in Formatter().parse(template) if v is not None
}
prompt_template = PromptTemplate(
input_variables=list(input_variables), template=template
)
prompt_template = PromptTemplate.from_template(template)
return cls(llm=llm, prompt=prompt_template)

View File

@@ -440,7 +440,7 @@ def load_chain_from_config(config: dict, **kwargs: Any) -> Chain:
def load_chain(path: Union[str, Path], **kwargs: Any) -> Chain:
"""Unified method for loading a chain from LangChainHub or local fs."""
if hub_result := try_load_from_hub(
- path, _load_chain_from_file, "chains", {"json", "yaml"}
+ path, _load_chain_from_file, "chains", {"json", "yaml"}, **kwargs
):
return hub_result
else:

View File

@@ -35,6 +35,9 @@ class SQLDatabaseChain(Chain, BaseModel):
input_key: str = "query" #: :meta private:
output_key: str = "result" #: :meta private:
return_intermediate_steps: bool = False
"""Whether or not to return the intermediate steps along with the final answer."""
return_direct: bool = False
"""Whether or not to return the result of querying the SQL table directly."""
class Config:
"""Configuration for this pydantic object."""
@@ -83,11 +86,17 @@ class SQLDatabaseChain(Chain, BaseModel):
intermediate_steps.append(result)
self.callback_manager.on_text("\nSQLResult: ", verbose=self.verbose)
self.callback_manager.on_text(result, color="yellow", verbose=self.verbose)
self.callback_manager.on_text("\nAnswer:", verbose=self.verbose)
input_text += f"{sql_cmd}\nSQLResult: {result}\nAnswer:"
llm_inputs["input"] = input_text
final_result = llm_chain.predict(**llm_inputs)
self.callback_manager.on_text(final_result, color="green", verbose=self.verbose)
# If return direct, we just set the final result equal to the sql query
if self.return_direct:
final_result = result
else:
self.callback_manager.on_text("\nAnswer:", verbose=self.verbose)
input_text += f"{sql_cmd}\nSQLResult: {result}\nAnswer:"
llm_inputs["input"] = input_text
final_result = llm_chain.predict(**llm_inputs)
self.callback_manager.on_text(
final_result, color="green", verbose=self.verbose
)
chain_result: Dict[str, Any] = {self.output_key: final_result}
if self.return_intermediate_steps:
chain_result["intermediate_steps"] = intermediate_steps

View File

@@ -26,7 +26,7 @@ PROMPT = PromptTemplate(
template=_DEFAULT_TEMPLATE,
)
_DECIDER_TEMPLATE = """Given the below input question and list of potential tables, output a comma separated list of the table names that may be neccessary to answer this question.
_DECIDER_TEMPLATE = """Given the below input question and list of potential tables, output a comma separated list of the table names that may be necessary to answer this question.
Question: {query}

View File

@@ -0,0 +1,13 @@
"""All different types of document loaders."""
from langchain.document_loaders.directory import DirectoryLoader
from langchain.document_loaders.notion import NotionDirectoryLoader
from langchain.document_loaders.readthedocs import ReadTheDocsLoader
from langchain.document_loaders.unstructured import UnstructuredFileLoader
__all__ = [
"UnstructuredFileLoader",
"DirectoryLoader",
"NotionDirectoryLoader",
"ReadTheDocsLoader",
]

View File

@@ -0,0 +1,26 @@
"""Base loader class."""
from abc import ABC, abstractmethod
from typing import List, Optional
from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter, TextSplitter
class BaseLoader(ABC):
"""Base loader class."""
@abstractmethod
def load(self) -> List[Document]:
"""Load data into document objects."""
def load_and_split(
self, text_splitter: Optional[TextSplitter] = None
) -> List[Document]:
"""Load documents and split into chunks."""
if text_splitter is None:
_text_splitter: TextSplitter = RecursiveCharacterTextSplitter()
else:
_text_splitter = text_splitter
docs = self.load()
return _text_splitter.split_documents(docs)
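As a sketch of how this base class is meant to be extended, a toy loader (illustrative only, not part of this diff):

```python
from typing import List

from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader


class StringLoader(BaseLoader):
    """Toy loader that wraps an in-memory string."""

    def __init__(self, text: str):
        self.text = text

    def load(self) -> List[Document]:
        return [Document(page_content=self.text, metadata={"source": "memory"})]
```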

View File

@@ -0,0 +1,26 @@
"""Loading logic for loading documents from a directory."""
from pathlib import Path
from typing import List
from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader
from langchain.document_loaders.unstructured import UnstructuredFileLoader
class DirectoryLoader(BaseLoader):
"""Loading logic for loading documents from a directory."""
def __init__(self, path: str, glob: str = "**/*"):
"""Initialize with path to directory and how to glob over it."""
self.path = path
self.glob = glob
def load(self) -> List[Document]:
"""Load documents."""
p = Path(self.path)
docs = []
for i in p.glob(self.glob):
if i.is_file():
sub_docs = UnstructuredFileLoader(str(i)).load()
docs.extend(sub_docs)
return docs

View File

@@ -0,0 +1,25 @@
"""Loader that loads Notion directory dump."""
from pathlib import Path
from typing import List
from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader
class NotionDirectoryLoader(BaseLoader):
"""Loader that loads Notion directory dump."""
def __init__(self, path: str):
"""Initialize with path."""
self.file_path = path
def load(self) -> List[Document]:
"""Load documents."""
ps = list(Path(self.file_path).glob("**/*.md"))
docs = []
for p in ps:
with open(p) as f:
text = f.read()
metadata = {"source": str(p)}
docs.append(Document(page_content=text, metadata=metadata))
return docs

View File

@@ -0,0 +1,33 @@
"""Loader that loads ReadTheDocs documentation directory dump."""
from pathlib import Path
from typing import List
from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader
class ReadTheDocsLoader(BaseLoader):
"""Loader that loads ReadTheDocs documentation directory dump."""
def __init__(self, path: str):
"""Initialize path."""
self.file_path = path
def load(self) -> List[Document]:
"""Load documents."""
from bs4 import BeautifulSoup
def _clean_data(data: str) -> str:
soup = BeautifulSoup(data)
text = soup.find_all("main", {"id": "main-content"})[0].get_text()
return "\n".join([t for t in text.split("\n") if t])
docs = []
for p in Path(self.file_path).rglob("*"):
if p.is_dir():
continue
with open(p) as f:
text = _clean_data(f.read())
metadata = {"source": str(p)}
docs.append(Document(page_content=text, metadata=metadata))
return docs

View File

@@ -0,0 +1,29 @@
"""Loader that uses unstructured to load files."""
from typing import List
from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader
class UnstructuredFileLoader(BaseLoader):
"""Loader that uses unstructured to load files."""
def __init__(self, file_path: str):
"""Initialize with file path."""
try:
import unstructured # noqa:F401
except ImportError:
raise ValueError(
"unstructured package not found, please install it with "
"`pip install unstructured`"
)
self.file_path = file_path
def load(self) -> List[Document]:
"""Load file."""
from unstructured.partition.auto import partition
elements = partition(filename=self.file_path)
text = "\n\n".join([str(el) for el in elements])
metadata = {"source": self.file_path}
return [Document(page_content=text, metadata=metadata)]

View File

@@ -62,6 +62,13 @@ class BaseLLM(BaseModel, ABC):
self, prompts: List[str], stop: Optional[List[str]] = None
) -> LLMResult:
"""Run the LLM on the given prompt and input."""
# If a string is passed in directly, no errors will be raised but the
# outputs will not make sense.
if not isinstance(prompts, list):
raise ValueError(
"Argument 'prompts' is expected to be of type List[str], received"
f" argument of type {type(prompts)}."
)
disregard_cache = self.cache is not None and not self.cache
if langchain.llm_cache is None or disregard_cache:
# This happens when langchain.cache is None, but self.cache is True

View File

@@ -4,6 +4,13 @@ import sys
from typing import Any, Dict, Generator, List, Mapping, Optional, Tuple, Union
from pydantic import BaseModel, Extra, Field, root_validator
from tenacity import (
after_log,
retry,
retry_if_exception_type,
stop_after_attempt,
wait_exponential,
)
from langchain.llms.base import BaseLLM
from langchain.schema import Generation, LLMResult
@@ -56,6 +63,8 @@ class BaseOpenAI(BaseLLM, BaseModel):
"""Timeout for requests to OpenAI completion API. Default is 600 seconds."""
logit_bias: Optional[Dict[str, float]] = Field(default_factory=dict)
"""Adjust the probability of specific tokens being generated."""
max_retries: int = 6
"""Maximum number of retries to make when generating."""
class Config:
"""Configuration for this pydantic object."""
@@ -115,6 +124,32 @@ class BaseOpenAI(BaseLLM, BaseModel):
}
return {**normal_params, **self.model_kwargs}
def completion_with_retry(self, **kwargs: Any) -> Any:
"""Use tenacity to retry the completion call."""
import openai
min_seconds = 4
max_seconds = 10
# Wait 2^x * 1 second between each retry starting with
# 4 seconds, then up to 10 seconds, then 10 seconds afterwards
@retry(
reraise=True,
stop=stop_after_attempt(self.max_retries),
wait=wait_exponential(multiplier=1, min=min_seconds, max=max_seconds),
retry=(
retry_if_exception_type(openai.error.Timeout)
| retry_if_exception_type(openai.error.APIError)
| retry_if_exception_type(openai.error.APIConnectionError)
| retry_if_exception_type(openai.error.RateLimitError)
),
after=after_log(logger, logging.DEBUG),
)
def _completion_with_retry(**kwargs: Any) -> Any:
return self.client.create(**kwargs)
return _completion_with_retry(**kwargs)
def _generate(
self, prompts: List[str], stop: Optional[List[str]] = None
) -> LLMResult:
@@ -155,7 +190,7 @@ class BaseOpenAI(BaseLLM, BaseModel):
# Includes prompt, completion, and total tokens used.
_keys = {"completion_tokens", "prompt_tokens", "total_tokens"}
for _prompts in sub_prompts:
response = self.client.create(prompt=_prompts, **params)
response = self.completion_with_retry(prompt=_prompts, **params)
choices.extend(response["choices"])
_keys_to_use = _keys.intersection(response["usage"])
for _key in _keys_to_use:
@@ -242,8 +277,13 @@ class BaseOpenAI(BaseLLM, BaseModel):
"This is needed in order to calculate get_num_tokens. "
"Please it install it with `pip install tiktoken`."
)
encoder = "gpt2"
if self.model_name in ("text-davinci-003", "text-davinci-002"):
encoder = "p50k_base"
if self.model_name.startswith("code"):
encoder = "p50k_base"
# create a GPT-3 encoder instance
enc = tiktoken.get_encoding("gpt2")
enc = tiktoken.get_encoding(encoder)
# encode the text using the GPT-3 encoder
tokenized_text = enc.encode(text)

View File

@@ -51,7 +51,7 @@ class LengthBasedExampleSelector(BaseExampleSelector, BaseModel):
examples = []
while remaining_length > 0 and i < len(self.examples):
new_length = remaining_length - self.example_text_lengths[i]
if i < 0:
if new_length < 0:
break
else:
examples.append(self.examples[i])

View File

@@ -41,6 +41,9 @@ class FewShotPromptTemplate(BasePromptTemplate, BaseModel):
template_format: str = "f-string"
"""The format of the prompt template. Options are: 'f-string', 'jinja2'."""
validate_template: bool = True
"""Whether or not to try validating the template."""
@root_validator(pre=True)
def check_examples_and_selector(cls, values: Dict) -> Dict:
"""Check that one and only one of examples/example_selector are provided."""
@@ -61,11 +64,12 @@ class FewShotPromptTemplate(BasePromptTemplate, BaseModel):
@root_validator()
def template_is_valid(cls, values: Dict) -> Dict:
"""Check that prefix, suffix and input variables are consistent."""
- check_valid_template(
-     values["prefix"] + values["suffix"],
-     values["template_format"],
-     values["input_variables"],
- )
+ if values["validate_template"]:
+     check_valid_template(
+         values["prefix"] + values["suffix"],
+         values["template_format"],
+         values["input_variables"],
+     )
return values
class Config:
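Both this change and the matching one in PromptTemplate (next file) add the same escape hatch. A hypothetical use with PromptTemplate:

```python
from langchain.prompts import PromptTemplate

prompt = PromptTemplate(
    input_variables=[],       # 'name' is deliberately not declared
    template="Hello {name}",
    validate_template=False,  # without this, check_valid_template would raise
)
```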

View File

@@ -1,6 +1,7 @@
"""Prompt schema definition."""
from __future__ import annotations
from string import Formatter
from typing import Any, Dict, List
from pydantic import BaseModel, Extra, root_validator
@@ -31,6 +32,9 @@ class PromptTemplate(BasePromptTemplate, BaseModel):
template_format: str = "f-string"
"""The format of the prompt template. Options are: 'f-string', 'jinja2'."""
validate_template: bool = True
"""Whether or not to try validating the template."""
@property
def _prompt_type(self) -> str:
"""Return the prompt type key."""
@@ -61,9 +65,10 @@ class PromptTemplate(BasePromptTemplate, BaseModel):
@root_validator()
def template_is_valid(cls, values: Dict) -> Dict:
"""Check that template and input variables are consistent."""
- check_valid_template(
-     values["template"], values["template_format"], values["input_variables"]
- )
+ if values["validate_template"]:
+     check_valid_template(
+         values["template"], values["template_format"], values["input_variables"]
+     )
return values
@classmethod
@@ -113,6 +118,14 @@ class PromptTemplate(BasePromptTemplate, BaseModel):
template = f.read()
return cls(input_variables=input_variables, template=template)
@classmethod
def from_template(cls, template: str) -> PromptTemplate:
"""Load a prompt template from a template."""
input_variables = {
v for _, v, _, _ in Formatter().parse(template) if v is not None
}
return cls(input_variables=list(sorted(input_variables)), template=template)
# For backwards compatibility.
Prompt = PromptTemplate
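For reference, the variable inference above relies only on the standard library; a quick check of what `Formatter().parse` yields:

```python
from string import Formatter

template = "Tell me a {adjective} joke about {content}."
variables = {v for _, v, _, _ in Formatter().parse(template) if v is not None}
print(sorted(variables))  # ['adjective', 'content']
```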

View File

@@ -44,6 +44,12 @@ class TextSplitter(ABC):
documents.append(Document(page_content=chunk, metadata=_metadatas[i]))
return documents
def split_documents(self, documents: List[Document]) -> List[Document]:
"""Split documents."""
texts = [doc.page_content for doc in documents]
metadatas = [doc.metadata for doc in documents]
return self.create_documents(texts, metadatas)
def _join_docs(self, docs: List[str], separator: str) -> Optional[str]:
text = separator.join(docs)
text = text.strip()
@@ -146,6 +152,38 @@ class CharacterTextSplitter(TextSplitter):
return self._merge_splits(splits, self._separator)
class TokenTextSplitter(TextSplitter):
"""Implementation of splitting text that looks at tokens."""
def __init__(self, encoding_name: str = "gpt2", **kwargs: Any):
"""Create a new TextSplitter."""
super().__init__(**kwargs)
try:
import tiktoken
except ImportError:
raise ValueError(
"Could not import tiktoken python package. "
"This is needed in order to for TokenTextSplitter. "
"Please it install it with `pip install tiktoken`."
)
# create a GPT-3 encoder instance
self._tokenizer = tiktoken.get_encoding(encoding_name)
def split_text(self, text: str) -> List[str]:
"""Split incoming text and return chunks."""
splits = []
input_ids = self._tokenizer.encode(text)
start_idx = 0
cur_idx = min(start_idx + self._chunk_size, len(input_ids))
chunk_ids = input_ids[start_idx:cur_idx]
while start_idx < len(input_ids):
splits.append(self._tokenizer.decode(chunk_ids))
start_idx += self._chunk_size - self._chunk_overlap
cur_idx = min(start_idx + self._chunk_size, len(input_ids))
chunk_ids = input_ids[start_idx:cur_idx]
return splits
class RecursiveCharacterTextSplitter(TextSplitter):
"""Implementation of splitting text that looks at characters.

View File

@@ -2,6 +2,7 @@
from langchain.vectorstores.base import VectorStore
from langchain.vectorstores.elastic_vector_search import ElasticVectorSearch
from langchain.vectorstores.faiss import FAISS
from langchain.vectorstores.milvus import Milvus
from langchain.vectorstores.pinecone import Pinecone
from langchain.vectorstores.qdrant import Qdrant
from langchain.vectorstores.weaviate import Weaviate
@@ -13,4 +14,5 @@ __all__ = [
"Pinecone",
"Weaviate",
"Qdrant",
"Milvus",
]

View File

@@ -0,0 +1,428 @@
"""Wrapper around the Milvus vector database."""
from __future__ import annotations
import uuid
from typing import Any, Iterable, List, Optional, Tuple
import numpy as np
from langchain.docstore.document import Document
from langchain.embeddings.base import Embeddings
from langchain.vectorstores.base import VectorStore
from langchain.vectorstores.utils import maximal_marginal_relevance
class Milvus(VectorStore):
"""Wrapper around the Milvus vector database."""
def __init__(
self,
embedding_function: Embeddings,
connection_args: dict,
collection_name: str,
text_field: str,
):
"""Initialize wrapper around the milvus vector database.
In order to use this you need to have `pymilvus` installed and a
running Milvus instance.
See the following documentation for how to run a Milvus instance:
https://milvus.io/docs/install_standalone-docker.md
Args:
embedding_function (Embeddings): Function used to embed the text
connection_args (dict): Arguments for pymilvus connections.connect()
collection_name (str): The name of the collection to search.
text_field (str): The field in Milvus schema where the
original text is stored.
"""
try:
from pymilvus import Collection, DataType, connections
except ImportError:
raise ValueError(
"Could not import pymilvus python package. "
"Please it install it with `pip install pymilvus`."
)
# Connecting to Milvus instance
if not connections.has_connection("default"):
connections.connect(**connection_args)
self.embedding_func = embedding_function
self.collection_name = collection_name
self.text_field = text_field
self.auto_id = False
self.primary_field = None
self.vector_field = None
self.fields = []
self.col = Collection(self.collection_name)
schema = self.col.schema
# Grabbing the fields for the existing collection.
for x in schema.fields:
self.fields.append(x.name)
if x.auto_id:
self.fields.remove(x.name)
if x.is_primary:
self.primary_field = x.name
if x.dtype == DataType.FLOAT_VECTOR or x.dtype == DataType.BINARY_VECTOR:
self.vector_field = x.name
# Default search params when one is not provided.
self.index_params = {
"IVF_FLAT": {"params": {"nprobe": 10}},
"IVF_SQ8": {"params": {"nprobe": 10}},
"IVF_PQ": {"params": {"nprobe": 10}},
"HNSW": {"params": {"ef": 10}},
"RHNSW_FLAT": {"params": {"ef": 10}},
"RHNSW_SQ": {"params": {"ef": 10}},
"RHNSW_PQ": {"params": {"ef": 10}},
"IVF_HNSW": {"params": {"nprobe": 10, "ef": 10}},
"ANNOY": {"params": {"search_k": 10}},
}
def add_texts(
self,
texts: Iterable[str],
metadatas: Optional[List[dict]] = None,
partition_name: Optional[str] = None,
timeout: Optional[int] = None,
) -> List[str]:
"""Insert text data into Milvus.
When using add_texts() it is assumed that a collection has already
been made and indexed. If metadata is included, it is assumed that
it is ordered correctly to match the schema provided to the Collection
and that the embedding vector is the first schema field.
Args:
texts (Iterable[str]): The text being embedded and inserted.
metadatas (Optional[List[dict]], optional): The metadata that
corresponds to each insert. Defaults to None.
partition_name (str, optional): The partition of the collection
to insert data into. Defaults to None.
timeout: specified timeout.
Returns:
List[str]: The resulting keys for each inserted element.
"""
insert_dict: Any = {self.text_field: list(texts)}
try:
insert_dict[self.vector_field] = self.embedding_func.embed_documents(
list(texts)
)
except NotImplementedError:
insert_dict[self.vector_field] = [
self.embedding_func.embed_query(x) for x in texts
]
# Collect the metadata into the insert dict.
if len(self.fields) > 2 and metadatas is not None:
for d in metadatas:
for key, value in d.items():
if key in self.fields:
insert_dict.setdefault(key, []).append(value)
# Convert dict to list of lists for insertion
insert_list = [insert_dict[x] for x in self.fields]
# Insert into the collection.
res = self.col.insert(
insert_list, partition_name=partition_name, timeout=timeout
)
# Flush to make sure newly inserted is immediately searchable.
self.col.flush()
return res.primary_keys
def _worker_search(
self,
query: str,
k: int = 4,
param: Optional[dict] = None,
expr: Optional[str] = None,
partition_names: Optional[List[str]] = None,
round_decimal: int = -1,
timeout: Optional[int] = None,
**kwargs: Any,
) -> Tuple[List[float], List[Tuple[Document, Any, Any]]]:
# Load the collection into memory for searching.
self.col.load()
# Decide to use default params if not passed in.
if param is None:
index_type = self.col.indexes[0].params["index_type"]
param = self.index_params[index_type]
# Embed the query text.
data = [self.embedding_func.embed_query(query)]
# Determine result metadata fields.
output_fields = self.fields[:]
output_fields.remove(self.vector_field)
# Perform the search.
res = self.col.search(
data,
self.vector_field,
param,
k,
expr=expr,
output_fields=output_fields,
partition_names=partition_names,
round_decimal=round_decimal,
timeout=timeout,
**kwargs,
)
# Organize results.
ret = []
for result in res[0]:
meta = {x: result.entity.get(x) for x in output_fields}
ret.append(
(
Document(page_content=meta.pop(self.text_field), metadata=meta),
result.distance,
result.id,
)
)
return data[0], ret
def similarity_search_with_score(
self,
query: str,
k: int = 4,
param: Optional[dict] = None,
expr: Optional[str] = None,
partition_names: Optional[List[str]] = None,
round_decimal: int = -1,
timeout: Optional[int] = None,
**kwargs: Any,
) -> List[Tuple[Document, float]]:
"""Perform a search on a query string and return results.
Args:
query (str): The text being searched.
k (int, optional): The number of results to return. Defaults to 4.
param (dict, optional): The search params for the specified index.
Defaults to None.
expr (str, optional): Filtering expression. Defaults to None.
partition_names (List[str], optional): Partitions to search through.
Defaults to None.
round_decimal (int, optional): Round the resulting distance. Defaults
to -1.
timeout (int, optional): Amount to wait before timeout error. Defaults
to None.
kwargs: Collection.search() keyword arguments.
Returns:
List[Tuple[Document, float]]: A list of (Document, distance) result pairs.
"""
_, result = self._worker_search(
query, k, param, expr, partition_names, round_decimal, timeout, **kwargs
)
return [(x, y) for x, y, _ in result]
def max_marginal_relevance_search(
self,
query: str,
k: int = 4,
fetch_k: int = 20,
param: Optional[dict] = None,
expr: Optional[str] = None,
partition_names: Optional[List[str]] = None,
round_decimal: int = -1,
timeout: Optional[int] = None,
**kwargs: Any,
) -> List[Document]:
"""Perform a search and return results that are reordered by MMR.
Args:
query (str): The text being searched.
k (int, optional): How many results to give. Defaults to 4.
fetch_k (int, optional): Total results to select k from.
Defaults to 20.
param (dict, optional): The search params for the specified index.
Defaults to None.
expr (str, optional): Filtering expression. Defaults to None.
partition_names (List[str], optional): What partitions to search.
Defaults to None.
round_decimal (int, optional): Round the resulting distance. Defaults
to -1.
timeout (int, optional): Amount to wait before timeout error. Defaults
to None.
Returns:
List[Document]: Document results for search.
"""
data, res = self._worker_search(
query,
fetch_k,
param,
expr,
partition_names,
round_decimal,
timeout,
**kwargs,
)
# Extract result IDs.
ids = [x for _, _, x in res]
# Get the raw vectors from Milvus.
vectors = self.col.query(
expr=f"{self.primary_field} in {ids}",
output_fields=[self.primary_field, self.vector_field],
)
# Reorganize the results from query to match result order.
vectors = {x[self.primary_field]: x[self.vector_field] for x in vectors}
search_embedding = data
ordered_result_embeddings = [vectors[x] for x in ids]
# Get the new order of results.
new_ordering = maximal_marginal_relevance(
np.array(search_embedding), ordered_result_embeddings, k=k
)
# Reorder the values and return.
ret = []
for x in new_ordering:
if x == -1:
break
else:
ret.append(res[x][0])
return ret
def similarity_search(
self,
query: str,
k: int = 4,
param: Optional[dict] = None,
expr: Optional[str] = None,
partition_names: Optional[List[str]] = None,
round_decimal: int = -1,
timeout: Optional[int] = None,
**kwargs: Any,
) -> List[Document]:
"""Perform a similarity search against the query string.
Args:
query (str): The text to search.
k (int, optional): How many results to return. Defaults to 4.
param (dict, optional): The search params for the index type.
Defaults to None.
expr (str, optional): Filtering expression. Defaults to None.
partition_names (List[str], optional): What partitions to search.
Defaults to None.
round_decimal (int, optional): What decimal point to round to.
Defaults to -1.
timeout (int, optional): How long to wait before timeout error.
Defaults to None.
Returns:
List[Document]: Document results for search.
"""
_, docs_and_scores = self._worker_search(
query, k, param, expr, partition_names, round_decimal, timeout, **kwargs
)
return [doc for doc, _, _ in docs_and_scores]
@classmethod
def from_texts(
cls,
texts: List[str],
embedding: Embeddings,
metadatas: Optional[List[dict]] = None,
**kwargs: Any,
) -> Milvus:
"""Create a Milvus collection, indexes it with HNSW, and insert data.
Args:
texts (List[str]): Text to insert.
embedding (Embeddings): Embedding function to use.
metadatas (Optional[List[dict]], optional): Dict metadata.
Defaults to None.
Returns:
VectorStore: The Milvus vector store.
"""
try:
from pymilvus import (
Collection,
CollectionSchema,
DataType,
FieldSchema,
connections,
)
from pymilvus.orm.types import infer_dtype_bydata
except ImportError:
raise ValueError(
"Could not import pymilvus python package. "
"Please it install it with `pip install pymilvus`."
)
# Connect to Milvus instance
if not connections.has_connection("default"):
connections.connect(**kwargs.get("connection_args", {"port": 19530}))
# Determine embedding dim
embeddings = embedding.embed_query(texts[0])
dim = len(embeddings)
# Generate unique names
primary_field = "c" + str(uuid.uuid4().hex)
vector_field = "c" + str(uuid.uuid4().hex)
text_field = "c" + str(uuid.uuid4().hex)
collection_name = "c" + str(uuid.uuid4().hex)
fields = []
# Determine metadata schema
if metadatas:
# Check if all metadata keys line up
key = metadatas[0].keys()
for x in metadatas:
if key != x.keys():
raise ValueError(
"Mismatched metadata. "
"Make sure all metadata has the same keys and datatype."
)
# Create a FieldSchema for each entry in the first metadata dict.
for key, value in metadatas[0].items():
# Infer the corresponding datatype of the metadata
dtype = infer_dtype_bydata(value)
if dtype == DataType.UNKNOWN:
raise ValueError(f"Unrecognized datatype for {key}.")
elif dtype == DataType.VARCHAR:
# Find the max length of the text-based metadata
max_length = 0
for subvalues in metadatas:
max_length = max(max_length, len(subvalues[key]))
fields.append(
FieldSchema(key, DataType.VARCHAR, max_length=max_length + 1)
)
else:
fields.append(FieldSchema(key, dtype))
# Find the max length of the texts
max_length = 0
for y in texts:
max_length = max(max_length, len(y))
# Create the text field
fields.append(
FieldSchema(text_field, DataType.VARCHAR, max_length=max_length + 1)
)
# Create the primary key field
fields.append(
FieldSchema(primary_field, DataType.INT64, is_primary=True, auto_id=True)
)
# Create the vector field
fields.append(FieldSchema(vector_field, DataType.FLOAT_VECTOR, dim=dim))
# Create the schema for the collection
schema = CollectionSchema(fields)
# Create the collection
collection = Collection(collection_name, schema)
# Index parameters for the collection
index = {
"index_type": "HNSW",
"metric_type": "L2",
"params": {"M": 8, "efConstruction": 64},
}
# Create the index
collection.create_index(vector_field, index)
# Create the VectorStore
milvus = cls(
embedding,
kwargs.get("connection_args", {"port": 19530}),
collection_name,
text_field,
)
# Add the texts.
milvus.add_texts(texts, metadatas)
return milvus
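
Taken together, a minimal end-to-end sketch of the new Milvus store (assuming a local Milvus instance on 127.0.0.1:19530 and an OPENAI_API_KEY in the environment; the variable names are illustrative):

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Milvus

# Build a fresh collection from raw texts; connection_args defaults to port 19530.
vector_db = Milvus.from_texts(
    ["foo", "bar", "baz"],
    OpenAIEmbeddings(),
    metadatas=[{"page": i} for i in range(3)],
    connection_args={"host": "127.0.0.1", "port": "19530"},
)

# Plain similarity search returns the k closest documents.
docs = vector_db.similarity_search("foo", k=2)

# MMR search fetches fetch_k candidates, then reorders them for diversity.
diverse = vector_db.max_marginal_relevance_search("foo", k=2, fetch_k=3)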

poetry.lock generated
View File

@@ -851,7 +851,6 @@ files = [
{file = "debugpy-1.6.6-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:9b5d1b13d7c7bf5d7cf700e33c0b8ddb7baf030fcf502f76fc061ddd9405d16c"},
{file = "debugpy-1.6.6-cp38-cp38-win32.whl", hash = "sha256:70ab53918fd907a3ade01909b3ed783287ede362c80c75f41e79596d5ccacd32"},
{file = "debugpy-1.6.6-cp38-cp38-win_amd64.whl", hash = "sha256:c05349890804d846eca32ce0623ab66c06f8800db881af7a876dc073ac1c2225"},
{file = "debugpy-1.6.6-cp39-cp39-macosx_11_0_x86_64.whl", hash = "sha256:11a0f3a106f69901e4a9a5683ce943a7a5605696024134b522aa1bfda25b5fec"},
{file = "debugpy-1.6.6-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:a771739902b1ae22a120dbbb6bd91b2cae6696c0e318b5007c5348519a4211c6"},
{file = "debugpy-1.6.6-cp39-cp39-win32.whl", hash = "sha256:549ae0cb2d34fc09d1675f9b01942499751d174381b6082279cf19cdb3c47cbe"},
{file = "debugpy-1.6.6-cp39-cp39-win_amd64.whl", hash = "sha256:de4a045fbf388e120bb6ec66501458d3134f4729faed26ff95de52a754abddb1"},
@@ -2411,6 +2410,7 @@ files = [
{file = "lxml-4.9.2-cp35-cp35m-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:ca989b91cf3a3ba28930a9fc1e9aeafc2a395448641df1f387a2d394638943b0"},
{file = "lxml-4.9.2-cp35-cp35m-manylinux_2_5_x86_64.manylinux1_x86_64.whl", hash = "sha256:822068f85e12a6e292803e112ab876bc03ed1f03dddb80154c395f891ca6b31e"},
{file = "lxml-4.9.2-cp35-cp35m-win32.whl", hash = "sha256:be7292c55101e22f2a3d4d8913944cbea71eea90792bf914add27454a13905df"},
{file = "lxml-4.9.2-cp35-cp35m-win_amd64.whl", hash = "sha256:998c7c41910666d2976928c38ea96a70d1aa43be6fe502f21a651e17483a43c5"},
{file = "lxml-4.9.2-cp36-cp36m-macosx_10_15_x86_64.whl", hash = "sha256:b26a29f0b7fc6f0897f043ca366142d2b609dc60756ee6e4e90b5f762c6adc53"},
{file = "lxml-4.9.2-cp36-cp36m-manylinux_2_12_i686.manylinux2010_i686.manylinux_2_24_i686.whl", hash = "sha256:ab323679b8b3030000f2be63e22cdeea5b47ee0abd2d6a1dc0c8103ddaa56cd7"},
{file = "lxml-4.9.2-cp36-cp36m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:689bb688a1db722485e4610a503e3e9210dcc20c520b45ac8f7533c837be76fe"},
@@ -2420,6 +2420,7 @@ files = [
{file = "lxml-4.9.2-cp36-cp36m-musllinux_1_1_aarch64.whl", hash = "sha256:58bfa3aa19ca4c0f28c5dde0ff56c520fbac6f0daf4fac66ed4c8d2fb7f22e74"},
{file = "lxml-4.9.2-cp36-cp36m-musllinux_1_1_x86_64.whl", hash = "sha256:bc718cd47b765e790eecb74d044cc8d37d58562f6c314ee9484df26276d36a38"},
{file = "lxml-4.9.2-cp36-cp36m-win32.whl", hash = "sha256:d5bf6545cd27aaa8a13033ce56354ed9e25ab0e4ac3b5392b763d8d04b08e0c5"},
{file = "lxml-4.9.2-cp36-cp36m-win_amd64.whl", hash = "sha256:3ab9fa9d6dc2a7f29d7affdf3edebf6ece6fb28a6d80b14c3b2fb9d39b9322c3"},
{file = "lxml-4.9.2-cp37-cp37m-macosx_10_15_x86_64.whl", hash = "sha256:05ca3f6abf5cf78fe053da9b1166e062ade3fa5d4f92b4ed688127ea7d7b1d03"},
{file = "lxml-4.9.2-cp37-cp37m-manylinux_2_12_i686.manylinux2010_i686.manylinux_2_24_i686.whl", hash = "sha256:a5da296eb617d18e497bcf0a5c528f5d3b18dadb3619fbdadf4ed2356ef8d941"},
{file = "lxml-4.9.2-cp37-cp37m-manylinux_2_17_aarch64.manylinux2014_aarch64.manylinux_2_24_aarch64.whl", hash = "sha256:04876580c050a8c5341d706dd464ff04fd597095cc8c023252566a8826505726"},
@@ -5092,6 +5093,21 @@ files = [
[package.extras]
widechars = ["wcwidth"]
[[package]]
name = "tenacity"
version = "8.1.0"
description = "Retry code until it succeeds"
category = "main"
optional = false
python-versions = ">=3.6"
files = [
{file = "tenacity-8.1.0-py3-none-any.whl", hash = "sha256:35525cd47f82830069f0d6b73f7eb83bc5b73ee2fff0437952cedf98b27653ac"},
{file = "tenacity-8.1.0.tar.gz", hash = "sha256:e48c437fdf9340f5666b92cd7990e96bc5fc955e1298baf4a907e3972067a445"},
]
[package.extras]
doc = ["reno", "sphinx", "tornado (>=4.5)"]
[[package]]
name = "tensorboard"
version = "2.11.2"
@@ -5505,11 +5521,14 @@ files = [
{file = "tokenizers-0.13.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:47ef745dbf9f49281e900e9e72915356d69de3a4e4d8a475bda26bfdb5047736"},
{file = "tokenizers-0.13.2-cp310-cp310-win32.whl", hash = "sha256:96cedf83864bcc15a3ffd088a6f81a8a8f55b8b188eabd7a7f2a4469477036df"},
{file = "tokenizers-0.13.2-cp310-cp310-win_amd64.whl", hash = "sha256:eda77de40a0262690c666134baf19ec5c4f5b8bde213055911d9f5a718c506e1"},
{file = "tokenizers-0.13.2-cp311-cp311-macosx_10_11_universal2.whl", hash = "sha256:9eee037bb5aa14daeb56b4c39956164b2bebbe6ab4ca7779d88aa16b79bd4e17"},
{file = "tokenizers-0.13.2-cp311-cp311-macosx_12_0_arm64.whl", hash = "sha256:d1b079c4c9332048fec4cb9c2055c2373c74fbb336716a5524c9a720206d787e"},
{file = "tokenizers-0.13.2-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:a689654fc745135cce4eea3b15e29c372c3e0b01717c6978b563de5c38af9811"},
{file = "tokenizers-0.13.2-cp311-cp311-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:3606528c07cda0566cff6cbfbda2b167f923661be595feac95701ffcdcbdbb21"},
{file = "tokenizers-0.13.2-cp311-cp311-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:41291d0160946084cbd53c8ec3d029df3dc2af2673d46b25ff1a7f31a9d55d51"},
{file = "tokenizers-0.13.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:7892325f9ca1cc5fca0333d5bfd96a19044ce9b092ce2df625652109a3de16b8"},
{file = "tokenizers-0.13.2-cp311-cp311-win32.whl", hash = "sha256:93714958d4ebe5362d3de7a6bd73dc86c36b5af5941ebef6c325ac900fa58865"},
{file = "tokenizers-0.13.2-cp311-cp311-win_amd64.whl", hash = "sha256:fa7ef7ee380b1f49211bbcfac8a006b1a3fa2fa4c7f4ee134ae384eb4ea5e453"},
{file = "tokenizers-0.13.2-cp37-cp37m-macosx_10_11_x86_64.whl", hash = "sha256:da521bfa94df6a08a6254bb8214ea04854bb9044d61063ae2529361688b5440a"},
{file = "tokenizers-0.13.2-cp37-cp37m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:a739d4d973d422e1073989769723f3b6ad8b11e59e635a63de99aea4b2208188"},
{file = "tokenizers-0.13.2-cp37-cp37m-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:cac01fc0b868e4d0a3aa7c5c53396da0a0a63136e81475d32fcf5c348fcb2866"},
@@ -5518,6 +5537,7 @@ files = [
{file = "tokenizers-0.13.2-cp37-cp37m-win32.whl", hash = "sha256:a537061ee18ba104b7f3daa735060c39db3a22c8a9595845c55b6c01d36c5e87"},
{file = "tokenizers-0.13.2-cp37-cp37m-win_amd64.whl", hash = "sha256:c82fb87b1cbfa984d8f05b2b3c3c73e428b216c1d4f0e286d0a3b27f521b32eb"},
{file = "tokenizers-0.13.2-cp38-cp38-macosx_10_11_x86_64.whl", hash = "sha256:ce298605a833ac7f81b8062d3102a42dcd9fa890493e8f756112c346339fe5c5"},
{file = "tokenizers-0.13.2-cp38-cp38-macosx_12_0_arm64.whl", hash = "sha256:f44d59bafe3d61e8a56b9e0a963075187c0f0091023120b13fbe37a87936f171"},
{file = "tokenizers-0.13.2-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:a51b93932daba12ed07060935978a6779593a59709deab04a0d10e6fd5c29e60"},
{file = "tokenizers-0.13.2-cp38-cp38-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:6969e5ea7ccb909ce7d6d4dfd009115dc72799b0362a2ea353267168667408c4"},
{file = "tokenizers-0.13.2-cp38-cp38-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:92f040c4d938ea64683526b45dfc81c580e3b35aaebe847e7eec374961231734"},
@@ -6254,4 +6274,4 @@ llms = ["manifest-ml", "torch", "transformers"]
[metadata]
lock-version = "2.0"
python-versions = ">=3.8.1,<4.0"
content-hash = "fdf51cf2f138653a8e32f1c43a2dcbbcca76f74534afc9dc4a7879c58282100a"
content-hash = "6066217c63ce7620ee615d5db269327e288991f48304526767c88a1870f0c7a0"

View File

@@ -1,6 +1,6 @@
[tool.poetry]
name = "langchain"
version = "0.0.76"
version = "0.0.78"
description = "Building applications with LLMs through composability"
authors = []
license = "MIT"
@@ -36,6 +36,7 @@ wolframalpha = {version = "5.0.0", optional = true}
qdrant-client = {version = "^0.11.7", optional = true}
dataclasses-json = "^0.5.7"
tensorflow-text = {version = "^2.11.0", optional = true, python = "^3.10, <3.12"}
tenacity = "^8.1.0"
[tool.poetry.group.docs.dependencies]
autodoc_pydantic = "^1.8.0"
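
The new tenacity pin backs the OpenAI retry logic added in #849. A minimal sketch of the retry pattern it enables (the decorated function and its exception are illustrative, not langchain's actual wrapper):

import random

from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=1, max=10))
def flaky_call() -> str:
    # Stand-in for a request that can fail transiently, e.g. a rate-limited API call.
    if random.random() < 0.5:
        raise ConnectionError("transient failure")
    return "ok"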

View File

@@ -2,7 +2,7 @@
import pytest
-from langchain.text_splitter import CharacterTextSplitter
+from langchain.text_splitter import CharacterTextSplitter, TokenTextSplitter
def test_huggingface_type_check() -> None:
@@ -21,3 +21,21 @@ def test_huggingface_tokenizer() -> None:
)
output = text_splitter.split_text("foo bar")
assert output == ["foo", "bar"]
class TestTokenTextSplitter:
"""Test token text splitter."""
def test_basic(self) -> None:
"""Test no overlap."""
splitter = TokenTextSplitter(chunk_size=5, chunk_overlap=0)
output = splitter.split_text("abcdef" * 5) # 10 token string
expected_output = ["abcdefabcdefabc", "defabcdefabcdef"]
assert output == expected_output
def test_overlap(self) -> None:
"""Test with overlap."""
splitter = TokenTextSplitter(chunk_size=5, chunk_overlap=1)
output = splitter.split_text("abcdef" * 5) # 10 token string
expected_output = ["abcdefabcdefabc", "abcdefabcdefabc", "abcdef"]
assert output == expected_output
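
Beyond the unit-sized chunks above, the same splitter applies directly to real documents. A minimal sketch using only the constructor arguments exercised by these tests (and assuming tiktoken is installed for tokenization):

from langchain.text_splitter import TokenTextSplitter

long_document_text = "some very long passage of text " * 2000

# ~500-token chunks with 50 tokens of overlap; splitting in token space keeps
# hard token limits even when the text has no convenient separators.
splitter = TokenTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_text(long_document_text)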

View File

@@ -0,0 +1,18 @@
"""Fake Embedding class for testing purposes."""
from typing import List
from langchain.embeddings.base import Embeddings
fake_texts = ["foo", "bar", "baz"]
class FakeEmbeddings(Embeddings):
"""Fake embeddings functionality for testing."""
def embed_documents(self, texts: List[str]) -> List[List[float]]:
"""Return simple embeddings."""
return [[1.0] * 9 + [i] for i in range(len(texts))]
def embed_query(self, text: str) -> List[float]:
"""Return simple embeddings."""
return [1.0] * 9 + [0.0]
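
These fake embeddings make the geometry of the integration tests below predictable: the query vector differs from the i-th document vector only in the last coordinate, so the L2 distance to document i is exactly i. A quick check (numpy used only for illustration):

import numpy as np

query = np.array([1.0] * 9 + [0.0])
docs = [np.array([1.0] * 9 + [float(i)]) for i in range(3)]

# Prints [0.0, 1.0, 2.0] -- which is why the Milvus test below can assert
# strictly increasing scores for "foo", "bar", "baz".
print([float(np.linalg.norm(query - d)) for d in docs])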

View File

@@ -1,21 +1,8 @@
"""Test ElasticSearch functionality."""
from typing import List
from langchain.docstore.document import Document
from langchain.embeddings.base import Embeddings
from langchain.vectorstores.elastic_vector_search import ElasticVectorSearch
class FakeEmbeddings(Embeddings):
"""Fake embeddings functionality for testing."""
def embed_documents(self, texts: List[str]) -> List[List[float]]:
"""Return simple embeddings."""
return [[1.0] * 9 + [i] for i in range(len(texts))]
def embed_query(self, text: str) -> List[float]:
"""Return simple embeddings."""
return [1.0] * 9 + [0.0]
from tests.integration_tests.vectorstores.fake_embeddings import FakeEmbeddings
def test_elasticsearch() -> None:

View File

@@ -1,26 +1,13 @@
"""Test FAISS functionality."""
import tempfile
from typing import List
import pytest
from langchain.docstore.document import Document
from langchain.docstore.in_memory import InMemoryDocstore
from langchain.docstore.wikipedia import Wikipedia
from langchain.embeddings.base import Embeddings
from langchain.vectorstores.faiss import FAISS
class FakeEmbeddings(Embeddings):
"""Fake embeddings functionality for testing."""
def embed_documents(self, texts: List[str]) -> List[List[float]]:
"""Return simple embeddings."""
return [[i] * 10 for i in range(len(texts))]
def embed_query(self, text: str) -> List[float]:
"""Return simple embeddings."""
return [0] * 10
from tests.integration_tests.vectorstores.fake_embeddings import FakeEmbeddings
def test_faiss() -> None:

View File

@@ -0,0 +1,53 @@
"""Test Milvus functionality."""
from typing import List, Optional
from langchain.docstore.document import Document
from langchain.vectorstores import Milvus
from tests.integration_tests.vectorstores.fake_embeddings import (
FakeEmbeddings,
fake_texts,
)
def _milvus_from_texts(metadatas: Optional[List[dict]] = None) -> Milvus:
return Milvus.from_texts(
fake_texts,
FakeEmbeddings(),
metadatas=metadatas,
connection_args={"host": "127.0.0.1", "port": "19530"},
)
def test_milvus() -> None:
"""Test end to end construction and search."""
docsearch = _milvus_from_texts()
output = docsearch.similarity_search("foo", k=1)
assert output == [Document(page_content="foo")]
def test_milvus_with_score() -> None:
"""Test end to end construction and search with scores and IDs."""
texts = ["foo", "bar", "baz"]
metadatas = [{"page": i} for i in range(len(texts))]
docsearch = _milvus_from_texts(metadatas=metadatas)
output = docsearch.similarity_search_with_score("foo", k=3)
docs = [o[0] for o in output]
scores = [o[1] for o in output]
assert docs == [
Document(page_content="foo", metadata={"page": 0}),
Document(page_content="bar", metadata={"page": 1}),
Document(page_content="baz", metadata={"page": 2}),
]
assert scores[0] < scores[1] < scores[2]
def test_milvus_max_marginal_relevance_search() -> None:
"""Test end to end construction and MRR search."""
texts = ["foo", "bar", "baz"]
metadatas = [{"page": i} for i in range(len(texts))]
docsearch = _milvus_from_texts(metadatas=metadatas)
output = docsearch.max_marginal_relevance_search("foo", k=2, fetch_k=3)
assert output == [
Document(page_content="foo", metadata={"page": 0}),
Document(page_content="baz", metadata={"page": 2}),
]

View File

@@ -1,21 +1,7 @@
"""Test Qdrant functionality."""
from typing import List
from langchain.docstore.document import Document
from langchain.embeddings.base import Embeddings
from langchain.vectorstores import Qdrant
class FakeEmbeddings(Embeddings):
"""Fake embeddings functionality for testing."""
def embed_documents(self, texts: List[str]) -> List[List[float]]:
"""Return simple embeddings."""
return [[1.0] * 9 + [float(i)] for i in range(len(texts))]
def embed_query(self, text: str) -> List[float]:
"""Return simple embeddings."""
return [1.0] * 9 + [0.0]
from tests.integration_tests.vectorstores.fake_embeddings import FakeEmbeddings
def test_qdrant() -> None:

View File

@@ -17,7 +17,7 @@ def selector() -> LengthBasedExampleSelector:
selector = LengthBasedExampleSelector(
examples=EXAMPLES,
example_prompt=prompts,
-max_length=25,
+max_length=30,
)
return selector
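
The bump from 25 to 30 tracks the length-based example selector fix (#862) in this release: example length is measured by a word-count function, and counting across newlines as well as spaces changes the totals. A sketch of that measure (an assumption inferred from the fix, not quoted from it):

import re

def get_text_length(text: str) -> int:
    # Split on newlines as well as spaces so multi-line examples are
    # counted word by word rather than as one long token.
    return len(re.split("\n| ", text))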

View File

@@ -13,6 +13,27 @@ def test_prompt_valid() -> None:
assert prompt.input_variables == input_variables
def test_prompt_from_template() -> None:
"""Test prompts can be constructed from a template."""
# Single input variable.
template = "This is a {foo} test."
prompt = PromptTemplate.from_template(template)
expected_prompt = PromptTemplate(template=template, input_variables=["foo"])
assert prompt == expected_prompt
# Multiple input variables.
template = "This {bar} is a {foo} test."
prompt = PromptTemplate.from_template(template)
expected_prompt = PromptTemplate(template=template, input_variables=["bar", "foo"])
assert prompt == expected_prompt
# Multiple input variables with repeats.
template = "This {bar} is a {foo} test {foo}."
prompt = PromptTemplate.from_template(template)
expected_prompt = PromptTemplate(template=template, input_variables=["bar", "foo"])
assert prompt == expected_prompt
def test_prompt_missing_input_variables() -> None:
"""Test error is raised when input variables are not provided."""
template = "This is a {foo} test."