mirror of https://github.com/hwchase17/langchain.git
synced 2026-04-20 13:28:53 +00:00

Compare commits
19 commits: harrison/f ... eugene/per
5ae6eb1429
cd80d041f6
ab0b73a415
9b2e39cb24
faf92ae133
abbc4948c1
a5c1a6fae0
5b179daa47
9e1bd1e90e
6968fc79be
41d61c9e8f
b0a2ebb271
b6340bb94d
7cc1549a07
a0f65462a5
4b868c3214
6097f3efe8
6104e4ef4d
2893f5afd4
.github/ISSUE_TEMPLATE/bug-report.yml (vendored, 2 changes)
@@ -46,7 +46,7 @@ body:
  - @agola11

  Tools / Toolkits
- - ...
+ - @vowelparrot

  placeholder: "@Username ..."
.github/PULL_REQUEST_TEMPLATE.md (vendored, 2 changes)
@@ -48,7 +48,7 @@ Tag maintainers/contributors who might be interested:
  - @agola11

  Agents / Tools / Toolkits
- - @hwchase17
+ - @vowelparrot

  VectorStores / Retrievers / Memory
  - @dev2049
docs/_static/js/mendablesearch.js (vendored, 1 change)
@@ -37,7 +37,6 @@ document.addEventListener('DOMContentLoaded', () => {
        style: { darkMode: false, accentColor: '#010810' },
        floatingButtonStyle: { color: '#ffffff', backgroundColor: '#010810' },
        anon_key: '82842b36-3ea6-49b2-9fb8-52cfc4bde6bf', // Mendable Search Public ANON key, ok to be public
-       cmdShortcutKey:'j',
        messageSettings: {
          openSourcesInNewTab: false,
          prettySources: true // Prettify the sources displayed now
@@ -24,9 +24,9 @@ This guide aims to provide a comprehensive overview of the requirements for depl

Understanding these components is crucial when assessing serving systems. LangChain integrates with several open-source projects designed to tackle these issues, providing a robust framework for productionizing your LLM applications. Some notable frameworks include:

- - `Ray Serve <../integrations/ray_serve.html>`_
+ - `Ray Serve <../../../ecosystem/ray_serve.html>`_
  - `BentoML <https://github.com/ssheng/BentoChain>`_
- - `Modal <../integrations/modal.html>`_
+ - `Modal <../../../ecosystem/modal.html>`_

These links will provide further information on each ecosystem, assisting you in finding the best fit for your LLM deployment needs.
@@ -1,25 +0,0 @@

# Baseten

Learn how to use LangChain with models deployed on Baseten.

## Installation and setup

- Create a [Baseten](https://baseten.co) account and [API key](https://docs.baseten.co/settings/api-keys).
- Install the Baseten Python client with `pip install baseten`
- Use your API key to authenticate with `baseten login`

## Invoking a model

Baseten integrates with LangChain through the LLM module, which provides a standardized and interoperable interface for models that are deployed on your Baseten workspace.

You can deploy foundation models like WizardLM and Alpaca with one click from the [Baseten model library](https://app.baseten.co/explore/) or, if you have your own model, [deploy it with this tutorial](https://docs.baseten.co/deploying-models/deploy).

In this example, we'll work with WizardLM. [Deploy WizardLM here](https://app.baseten.co/explore/wizardlm) and follow along with the deployed [model's version ID](https://docs.baseten.co/managing-models/manage).

```python
from langchain.llms import Baseten

wizardlm = Baseten(model="MODEL_VERSION_ID", verbose=True)

wizardlm("What is the difference between a Wizard and a Sorcerer?")
```
@@ -1,21 +0,0 @@

# AwaDB

>[AwaDB](https://github.com/awa-ai/awadb) is an AI-native database for the search and storage of embedding vectors used by LLM applications.

## Installation and Setup

```bash
pip install awadb
```

## VectorStore

There exists a wrapper around AwaDB vector databases, allowing you to use it as a vectorstore, whether for semantic search or example selection.

```python
from langchain.vectorstores import AwaDB
```

For a more detailed walkthrough of the AwaDB wrapper, see [this notebook](../modules/indexes/vectorstores/examples/awadb.ipynb)
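As a rough sketch of how the wrapper might be used (assuming the standard `VectorStore` interface; the embedding model and sample texts below are arbitrary choices for illustration, not taken from the AwaDB docs):

```python
from langchain.embeddings import HuggingFaceEmbeddings  # any Embeddings implementation should work
from langchain.vectorstores import AwaDB

# Index a few sample documents via the generic VectorStore constructor.
db = AwaDB.from_texts(
    texts=["AwaDB stores embedding vectors", "LangChain wraps many vector stores"],
    embedding=HuggingFaceEmbeddings(),
)

# Run a semantic search over the indexed texts.
docs = db.similarity_search("Where are embedding vectors stored?", k=1)
print(docs[0].page_content)
```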
@@ -58,7 +58,7 @@
    "### Optional Parameters\n",
    "The following parameters are optional. When executing the method in a Databricks notebook, you don't need to provide them in most of the cases.\n",
    "* `host`: The Databricks workspace hostname, excluding 'https://' part. Defaults to 'DATABRICKS_HOST' environment variable or current workspace if in a Databricks notebook.\n",
-    "* `api_token`: The Databricks personal access token for accessing the Databricks SQL warehouse or the cluster. Defaults to 'DATABRICKS_TOKEN' environment variable or a temporary one is generated if in a Databricks notebook.\n",
+    "* `api_token`: The Databricks personal access token for accessing the Databricks SQL warehouse or the cluster. Defaults to 'DATABRICKS_API_TOKEN' environment variable or a temporary one is generated if in a Databricks notebook.\n",
    "* `warehouse_id`: The warehouse ID in the Databricks SQL.\n",
    "* `cluster_id`: The cluster ID in the Databricks Runtime. If running in a Databricks notebook and both 'warehouse_id' and 'cluster_id' are None, it uses the ID of the cluster the notebook is attached to.\n",
    "* `engine_args`: The arguments to be used when connecting Databricks.\n",
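To make the parameter list concrete, here is a hedged sketch of passing these options explicitly; the hostname, token, and warehouse ID are placeholders, and `samples`/`nyctaxi` are an assumed catalog and schema:

```python
from langchain import SQLDatabase

# All optional parameters supplied explicitly; values are placeholders.
db = SQLDatabase.from_databricks(
    catalog="samples",
    schema="nyctaxi",
    host="my-workspace.cloud.databricks.com",  # excluding the 'https://' part
    api_token="dapi...",                       # personal access token
    warehouse_id="1234567890abcdef",           # or cluster_id=... instead
)
```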
@@ -1,36 +0,0 @@

Databricks
==========

The [Databricks](https://www.databricks.com/) Lakehouse Platform unifies data, analytics, and AI on one platform.

Databricks embraces the LangChain ecosystem in various ways:

1. Databricks connector for the SQLDatabase Chain: SQLDatabase.from_databricks() provides an easy way to query your data on Databricks through LangChain
2. Databricks-managed MLflow integrates with LangChain: tracking and serving LangChain applications with fewer steps
3. Databricks as an LLM provider: deploy your fine-tuned LLMs on Databricks via serving endpoints or cluster driver proxy apps, and query them as langchain.llms.Databricks
4. Databricks Dolly: Databricks open-sourced Dolly, which allows for commercial use and can be accessed through the Hugging Face Hub

Databricks connector for the SQLDatabase Chain
----------------------------------------------
You can connect to [Databricks runtimes](https://docs.databricks.com/runtime/index.html) and [Databricks SQL](https://www.databricks.com/product/databricks-sql) using the SQLDatabase wrapper of LangChain. See the notebook [Connect to Databricks](./databricks/databricks.html) for details.
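For illustration, a minimal sketch of querying through this connector (the LLM choice and the question are arbitrary here, and `SQLDatabaseChain` is the standard LangChain SQL chain rather than anything Databricks-specific):

```python
from langchain import SQLDatabase, SQLDatabaseChain
from langchain.chat_models import ChatOpenAI

# Inside a Databricks notebook, credentials and the cluster are picked up automatically.
db = SQLDatabase.from_databricks(catalog="samples", schema="nyctaxi")
chain = SQLDatabaseChain.from_llm(ChatOpenAI(temperature=0), db, verbose=True)

chain.run("What is the average trip distance?")
```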
Databricks-managed MLflow integrates with LangChain
---------------------------------------------------

MLflow is an open source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry. See the notebook [MLflow Callback Handler](./mlflow_tracking.ipynb) for details about MLflow's integration with LangChain.

Databricks provides a fully managed and hosted version of MLflow integrated with enterprise security features, high availability, and other Databricks workspace features such as experiment and run management and notebook revision capture. MLflow on Databricks offers an integrated experience for tracking and securing machine learning model training runs and running machine learning projects. See the [MLflow guide](https://docs.databricks.com/mlflow/index.html) for more details.

Databricks-managed MLflow makes it more convenient to develop LangChain applications on Databricks. For MLflow tracking, you don't need to set the tracking URI. For MLflow Model Serving, you can save LangChain Chains in the MLflow langchain flavor, and then register and serve the Chain with a few clicks on Databricks, with credentials securely managed by MLflow Model Serving.

Databricks as an LLM provider
-----------------------------

The notebook [Wrap Databricks endpoints as LLMs](../modules/models/llms/integrations/databricks.html) illustrates how to wrap Databricks endpoints as LLMs in LangChain. It supports two types of endpoints: the serving endpoint, which is recommended for both production and development, and the cluster driver proxy app, which is recommended for interactive development.

Databricks endpoints support Dolly, but are also great for hosting models like MPT-7B or any other models from the Hugging Face ecosystem. Databricks endpoints can also be used with proprietary models like OpenAI to provide a governance layer for enterprises.
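A minimal sketch of querying a serving endpoint through `langchain.llms.Databricks` (the endpoint name "dolly" is a placeholder for a model you have deployed):

```python
from langchain.llms import Databricks

# Wrap an existing Databricks serving endpoint as a LangChain LLM.
llm = Databricks(endpoint_name="dolly")

llm("How are you?")
```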
Databricks Dolly
----------------

Databricks’ Dolly is an instruction-following large language model trained on the Databricks machine learning platform that is licensed for commercial use. The model is available on Hugging Face Hub as databricks/dolly-v2-12b. See the notebook [Hugging Face Hub](../modules/models/llms/integrations/huggingface_hub.html) for instructions to access it through the Hugging Face Hub integration with LangChain.
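For instance, a hedged sketch of loading Dolly via the Hugging Face Hub integration (the `repo_id` is the model named above; the temperature setting is an arbitrary choice):

```python
from langchain import HuggingFaceHub

# Requires HUGGINGFACEHUB_API_TOKEN to be set in the environment.
llm = HuggingFaceHub(repo_id="databricks/dolly-v2-12b", model_kwargs={"temperature": 0.1})

llm("What is Databricks Dolly?")
```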
@@ -1,368 +0,0 @@

# LangChain Decorators ✨

LangChain Decorators is a layer on top of LangChain that provides syntactic sugar 🍭 for writing custom langchain prompts and chains

For feedback, issues, or contributions - please raise an issue here:
[ju-bezdek/langchain-decorators](https://github.com/ju-bezdek/langchain-decorators)

Main principles and benefits:

- a more `pythonic` way of writing code
- write multiline prompts that won't break your code flow with indentation
- making use of IDE in-built support for **hinting**, **type checking** and **popup with docs** to quickly peek in the function to see the prompt, parameters it consumes etc.
- leverage all the power of 🦜🔗 LangChain ecosystem
- adding support for **optional parameters**
- easily share parameters between the prompts by binding them to one class

Here is a simple example of code written with **LangChain Decorators ✨**:
``` python
from langchain_decorators import llm_prompt

@llm_prompt
def write_me_short_post(topic:str, platform:str="twitter", audience:str = "developers")->str:
    """
    Write me a short header for my post about {topic} for {platform} platform.
    It should be for {audience} audience.
    (Max 15 words)
    """
    return

# run it naturally
write_me_short_post(topic="starwars")
# or
write_me_short_post(topic="starwars", platform="reddit")
```
# Quick start
## Installation
```bash
pip install langchain_decorators
```

## Examples

A good way to start is to review the examples here:
- [jupyter notebook](https://github.com/ju-bezdek/langchain-decorators/blob/main/example_notebook.ipynb)
- [colab notebook](https://colab.research.google.com/drive/1no-8WfeP6JaLD9yUtkPgym6x0G9ZYZOG#scrollTo=N4cf__D0E2Yk)
# Defining other parameters

Here we are just marking a function as a prompt with the `llm_prompt` decorator, effectively turning it into an LLMChain, instead of running it directly.

A standard LLMChain takes many more init parameters than just input_variables and a prompt... here this implementation detail is hidden by the decorator.
Here is how it works:

1. Using **Global settings**:
``` python
# define global settings for all prompts (if not set - chatGPT is the current default)
from langchain_decorators import GlobalSettings
from langchain.chat_models import ChatOpenAI

GlobalSettings.define_settings(
    default_llm=ChatOpenAI(temperature=0.0),  # this is the default... can change it here globally
    default_streaming_llm=ChatOpenAI(temperature=0.0, streaming=True),  # this is the default... will be used for streaming
)
```
2. Using predefined **prompt types**

``` python
# You can change the default prompt types
from langchain_decorators import PromptTypes, PromptTypeSettings, llm_prompt
from langchain.chat_models import ChatOpenAI

PromptTypes.AGENT_REASONING.llm = ChatOpenAI()

# Or you can just define your own ones:
class MyCustomPromptTypes(PromptTypes):
    GPT4 = PromptTypeSettings(llm=ChatOpenAI(model="gpt-4"))

@llm_prompt(prompt_type=MyCustomPromptTypes.GPT4)
def write_a_complicated_code(app_idea:str)->str:
    ...
```

3. Define the settings **directly in the decorator**
``` python
from langchain.llms import OpenAI
from langchain_decorators import llm_prompt

@llm_prompt(
    llm=OpenAI(temperature=0.7),
    stop_tokens=["\nObservation"],
    ...
)
def creative_writer(book_title:str)->str:
    ...
```
## Passing a memory and/or callbacks

To pass any of these, just declare them in the function (or use kwargs to pass anything)
```python
from langchain.memory import SimpleMemory
from langchain_decorators import llm_prompt

@llm_prompt()
async def write_me_short_post(topic:str, platform:str="twitter", memory:SimpleMemory = None):
    """
    {history_key}
    Write me a short header for my post about {topic} for {platform} platform.
    It should be for {audience} audience.
    (Max 15 words)
    """
    pass

await write_me_short_post(topic="old movies")
```
# Simplified streaming

If we want to leverage streaming:
- we need to define the prompt as an async function
- turn on streaming on the decorator, or define a PromptType with streaming on
- capture the stream using StreamingContext

This way we just mark which prompt should be streamed, without needing to tinker with which LLM to use or with passing around and distributing a streaming handler into a particular part of our chain... just turn streaming on/off on the prompt/prompt type...

The streaming will happen only if we call it in a streaming context... there we can define a simple function to handle the stream
``` python
# this code example is complete and should run as it is

from langchain_decorators import StreamingContext, llm_prompt

# this will mark the prompt for streaming (useful if we want to stream just some prompts in our app... but don't want to pass around and distribute the callback handlers)
# note that only async functions can be streamed (you will get an error if it's not)
@llm_prompt(capture_stream=True)
async def write_me_short_post(topic:str, platform:str="twitter", audience:str = "developers"):
    """
    Write me a short header for my post about {topic} for {platform} platform.
    It should be for {audience} audience.
    (Max 15 words)
    """
    pass


# just an arbitrary function to demonstrate the streaming... will be some websockets code in the real world
tokens=[]
def capture_stream_func(new_token:str):
    tokens.append(new_token)

# if we want to capture the stream, we need to wrap the execution into StreamingContext...
# this will allow us to capture the stream even if the prompt call is hidden inside a higher level method
# only the prompts marked with capture_stream will be captured here
with StreamingContext(stream_to_stdout=True, callback=capture_stream_func):
    result = await write_me_short_post(topic="old movies")
    print("Stream finished ... we can distinguish tokens thanks to alternating colors")


print("\nWe've captured", len(tokens), "tokens🎉\n")
print("Here is the result:")
print(result)
```
# Prompt declarations

By default the prompt is the whole function docstring, unless you mark your prompt.

## Documenting your prompt

We can specify which part of our docs is the prompt definition, by specifying a code block with the **<prompt>** language tag
``` python
@llm_prompt
def write_me_short_post(topic:str, platform:str="twitter", audience:str = "developers"):
    """
    Here is a good way to write a prompt as part of a function docstring, with additional documentation for devs.

    It needs to be a code block, marked as a `<prompt>` language
    ```<prompt>
    Write me a short header for my post about {topic} for {platform} platform.
    It should be for {audience} audience.
    (Max 15 words)
    ```

    Now only the code block above will be used as a prompt, and the rest of the docstring will be used as a description for developers.
    (It also has the nice benefit that the IDE (like VS Code) will display the prompt properly, not trying to parse it as markdown and thus mangling the new lines.)
    """
    return
```
## Chat messages prompt

For chat models it is very useful to define the prompt as a set of message templates... here is how to do it:
``` python
@llm_prompt
def simulate_conversation(human_input:str, agent_role:str="a pirate"):
    """
    ## System message
    - note the `:system` suffix inside the <prompt:_role_> tag

    ```<prompt:system>
    You are a {agent_role} hacker. You must act like one.
    You always reply in code, using python or javascript code blocks...
    for example:

    ... do not reply with anything else.. just with code - respecting your role.
    ```

    # human message
    (we are using the real roles that are enforced by the LLM - GPT supports system, assistant, user)
    ```<prompt:user>
    Hello, who are you?
    ```
    a reply:

    ```<prompt:assistant>
    \``` python <<- escaping the inner code block with \ that should be part of the prompt
    def hello():
        print("Argh... hello you pesky pirate")
    \```
    ```

    we can also add some history using a placeholder
    ```<prompt:placeholder>
    {history}
    ```
    ```<prompt:user>
    {human_input}
    ```

    Now only the code blocks above will be used as a prompt, and the rest of the docstring will be used as a description for developers.
    (It also has the nice benefit that the IDE (like VS Code) will display the prompt properly, not trying to parse it as markdown and thus mangling the new lines.)
    """
    pass
```

the roles here are model-native roles (assistant, user, system for chatGPT)
# Optional sections
- you can define whole sections of your prompt that should be optional
- if any input in the section is missing, the whole section won't be rendered

The syntax for this is as follows:
``` python
@llm_prompt
def prompt_with_optional_partials():
    """
    this text will be rendered always, but

    {? anything inside this block will be rendered only if all the {value}s parameters are not empty (None | "") ?}

    you can also place it in between the words
    this too will be rendered{? , but
    this block will be rendered only if {this_value} and {this_value}
    is not empty?} !
    """
```
# Output parsers

- the llm_prompt decorator natively tries to detect the best output parser based on the output type (if not set, it returns the raw string)
- list, dict and pydantic outputs are also supported natively (automatically)
``` python
# this code example is complete and should run as it is

from langchain_decorators import llm_prompt

@llm_prompt
def write_name_suggestions(company_business:str, count:int)->list:
    """ Write me {count} good name suggestions for company that {company_business}
    """
    pass

write_name_suggestions(company_business="sells cookies", count=5)
```
## More complex structures

For dict / pydantic you need to specify the formatting instructions...
This can be tedious, which is why you can let the output parser generate the instructions for you based on the model (pydantic)
``` python
from langchain_decorators import llm_prompt
from pydantic import BaseModel, Field


class TheOutputStructureWeExpect(BaseModel):
    name: str = Field(description="The name of the company")
    headline: str = Field(description="The description of the company (for landing page)")
    employees: list[str] = Field(description="5-8 fake employee names with their positions")

@llm_prompt()
def fake_company_generator(company_business:str)->TheOutputStructureWeExpect:
    """ Generate a fake company that {company_business}
    {FORMAT_INSTRUCTIONS}
    """
    return

company = fake_company_generator(company_business="sells cookies")

# print the result nicely formatted
print("Company name: ", company.name)
print("company headline: ", company.headline)
print("company employees: ", company.employees)
```
# Binding the prompt to an object
``` python
from pydantic import BaseModel
from langchain_decorators import llm_prompt

class AssistantPersonality(BaseModel):
    assistant_name: str
    assistant_role: str
    field: str

    @property
    def a_property(self):
        return "whatever"

    def hello_world(self, function_kwarg:str=None):
        """
        We can reference any {field} or {a_property} inside our prompt... and combine it with {function_kwarg} in the method
        """

    @llm_prompt
    def introduce_your_self(self)->str:
        """
        ```<prompt:system>
        You are an assistant named {assistant_name}.
        Your role is to act as {assistant_role}
        ```
        ```<prompt:user>
        Introduce yourself (in less than 20 words)
        ```
        """


personality = AssistantPersonality(assistant_name="John", assistant_role="a pirate")

print(personality.introduce_your_self(personality))
```
# More examples:

- these and a few more examples are also available in the [colab notebook here](https://colab.research.google.com/drive/1no-8WfeP6JaLD9yUtkPgym6x0G9ZYZOG#scrollTo=N4cf__D0E2Yk)
- including the [ReAct Agent re-implementation](https://colab.research.google.com/drive/1no-8WfeP6JaLD9yUtkPgym6x0G9ZYZOG#scrollTo=3bID5fryE2Yp) using purely langchain decorators
@@ -1,43 +0,0 @@

# Shale Protocol

[Shale Protocol](https://shaleprotocol.com) provides production-ready inference APIs for open LLMs. It's a Plug & Play API as it's hosted on a highly scalable GPU cloud infrastructure.

Our free tier supports up to 1K daily requests per key, as we want to eliminate the barrier for anyone to start building genAI apps with LLMs.

With Shale Protocol, developers/researchers can create apps and explore the capabilities of open LLMs at no cost.

This page covers how the Shale-Serve API can be incorporated with LangChain.

As of June 2023, the API supports Vicuna-13B by default. We are going to support more LLMs such as Falcon-40B in future releases.

## How to

### 1. Find the link to our Discord on https://shaleprotocol.com. Generate an API key through the "Shale Bot" on our Discord. No credit card is required and there are no free trials; it's a forever-free tier with a 1K-requests-per-day limit per API key.

### 2. Use https://shale.live/v1 as an OpenAI API drop-in replacement

For example:
```python
from langchain.llms import OpenAI
from langchain import PromptTemplate, LLMChain

import os
os.environ['OPENAI_API_BASE'] = "https://shale.live/v1"
os.environ['OPENAI_API_KEY'] = "ENTER YOUR API KEY"

llm = OpenAI()

template = """Question: {question}

Answer: Let's think step by step."""

prompt = PromptTemplate(template=template, input_variables=["question"])

llm_chain = LLMChain(prompt=prompt, llm=llm)

question = "What NFL team won the Super Bowl in the year Justin Bieber was born?"

llm_chain.run(question)
```
@@ -4,7 +4,7 @@
What is Vectara?

**Vectara Overview:**
-- Vectara is a developer-first API platform for building GenAI applications
+- Vectara is a developer-first API platform for building conversational search applications
- To use Vectara - first [sign up](https://console.vectara.com/signup) and create an account. Then create a corpus and an API key for indexing and searching.
- You can use Vectara's [indexing API](https://docs.vectara.com/docs/indexing-apis/indexing) to add documents into Vectara's index
- You can use Vectara's [Search API](https://docs.vectara.com/docs/search-apis/search) to query Vectara's index (which also supports Hybrid search implicitly).
@@ -13,13 +13,6 @@ What is Vectara?

## Installation and Setup
To use Vectara with LangChain no special installation steps are required. You just have to provide your customer_id, corpus ID, and an API key created within the Vectara console to enable indexing and searching.

Alternatively these can be provided as environment variables
- export `VECTARA_CUSTOMER_ID`="your_customer_id"
- export `VECTARA_CORPUS_ID`="your_corpus_id"
- export `VECTARA_API_KEY`="your-vectara-api-key"

## Usage

### VectorStore

There exists a wrapper around the Vectara platform, allowing you to use it as a vectorstore, whether for semantic search or example selection.
@@ -39,21 +32,8 @@ vectara = Vectara(
```

The customer_id, corpus_id and api_key are optional, and if they are not supplied will be read from the environment variables `VECTARA_CUSTOMER_ID`, `VECTARA_CORPUS_ID` and `VECTARA_API_KEY`, respectively.

To query the vectorstore, you can use the `similarity_search` method (or `similarity_search_with_score`), which takes a query string and returns a list of results:
```python
results = vectara.similarity_search("what is LangChain?")
```

`similarity_search_with_score` also supports the following additional arguments:
- `k`: number of results to return (defaults to 5)
- `lambda_val`: the [lexical matching](https://docs.vectara.com/docs/api-reference/search-apis/lexical-matching) factor for hybrid search (defaults to 0.025)
- `filter`: a [filter](https://docs.vectara.com/docs/common-use-cases/filtering-by-metadata/filter-overview) to apply to the results (default None)
- `n_sentence_context`: number of sentences to include before/after the actual matching segment when returning results. This defaults to 0 so as to return the exact text segment that matches, but can be used with other values e.g. 2 or 3 to return adjacent text segments.

The results are returned as a list of relevant documents, and a relevance score for each document.
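Putting those arguments together, a sketch of a scored query might look like this (the values are simply the documented defaults plus a wider sentence context):

```python
results = vectara.similarity_search_with_score(
    "what is LangChain?",
    k=5,                   # number of results to return
    lambda_val=0.025,      # lexical matching factor for hybrid search
    filter=None,           # optional metadata filter
    n_sentence_context=2,  # sentences of context around each match
)
for doc, score in results:
    print(score, doc.page_content[:80])
```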
-For a more detailed examples of using the Vectara wrapper, see one of these two sample notebooks:
+For a more detailed walkthrough of the Vectara wrapper, see one of the two example notebooks:
* [Chat Over Documents with Vectara](./vectara/vectara_chat.html)
* [Vectara Text Generation](./vectara/vectara_text_generation.html)
@@ -102,11 +102,21 @@
   "metadata": {
    "tags": []
   },
-  "outputs": [],
+  "outputs": [
+   {
+    "name": "stdout",
+    "output_type": "stream",
+    "text": [
+     "<class 'langchain.vectorstores.vectara.Vectara'>\n"
+    ]
+   }
+  ],
   "source": [
    "openai_api_key = os.environ['OPENAI_API_KEY']\n",
    "llm = OpenAI(openai_api_key=openai_api_key, temperature=0)\n",
-   "retriever = vectorstore.as_retriever(lambda_val=0.025, k=5, filter=None)\n",
+   "retriever = VectaraRetriever(vectorstore, alpha=0.025, k=5, filter=None)\n",
    "\n",
    "print(type(vectorstore))\n",
    "d = retriever.get_relevant_documents('What did the president say about Ketanji Brown Jackson')\n",
    "\n",
    "qa = ConversationalRetrievalChain.from_llm(llm, retriever, memory=memory)"
@@ -132,7 +142,7 @@
   {
    "data": {
     "text/plain": [
-     "\" The president said that Ketanji Brown Jackson is one of the nation's top legal minds and that she will continue Justice Breyer's legacy of excellence.\""
+     "\" The president said that Ketanji Brown Jackson is one of the nation's top legal minds, a former top litigator in private practice, and a former federal public defender.\""
     ]
    },
    "execution_count": 7,
@@ -164,7 +174,7 @@
   {
    "data": {
     "text/plain": [
-     "' Justice Stephen Breyer'"
+     "' Justice Stephen Breyer.'"
     ]
    },
    "execution_count": 9,
@@ -231,7 +241,7 @@
   {
    "data": {
     "text/plain": [
-     "\" The president said that Ketanji Brown Jackson is one of the nation's top legal minds and that she will continue Justice Breyer's legacy of excellence.\""
+     "\" The president said that Ketanji Brown Jackson is one of the nation's top legal minds, a former top litigator in private practice, and a former federal public defender.\""
     ]
    },
    "execution_count": 12,
@@ -276,7 +286,7 @@
   {
    "data": {
     "text/plain": [
-     "' Justice Stephen Breyer'"
+     "' Justice Stephen Breyer.'"
     ]
    },
    "execution_count": 14,
@@ -334,7 +344,7 @@
   {
    "data": {
     "text/plain": [
-     "Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \\n\\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \\n\\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \\n\\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': '../../../state_of_the_union.txt'})"
+     "Document(page_content='Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence. A former top litigator in private practice. A former federal public defender.', metadata={'source': '../../modules/state_of_the_union.txt'})"
     ]
    },
    "execution_count": 17,
@@ -382,24 +392,6 @@
    "result = qa({\"question\": query, \"chat_history\": chat_history, \"vectordbkwargs\": vectordbkwargs})"
   ]
  },
- {
-  "cell_type": "code",
-  "execution_count": 35,
-  "id": "24ebdaec",
-  "metadata": {},
-  "outputs": [
-   {
-    "name": "stdout",
-    "output_type": "stream",
-    "text": [
-     " The president said that Ketanji Brown Jackson is one of the nation's top legal minds and that she will continue Justice Breyer's legacy of excellence.\n"
-    ]
-   }
-  ],
-  "source": [
-   "print(result['answer'])"
-  ]
- },
  {
   "cell_type": "markdown",
   "id": "99b96dae",
@@ -467,7 +459,7 @@
   {
    "data": {
     "text/plain": [
-     "\" The president said that he nominated Circuit Court of Appeals Judge Ketanji Brown Jackson, who he described as one of the nation's top legal minds, to continue Justice Breyer's legacy of excellence.\""
+     "' The president did not mention Ketanji Brown Jackson.'"
     ]
    },
    "execution_count": 23,
@@ -546,7 +538,7 @@
   {
    "data": {
     "text/plain": [
-     "\" The president said that he nominated Circuit Court of Appeals Judge Ketanji Brown Jackson, who he described as one of the nation's top legal minds, and that she will continue Justice Breyer's legacy of excellence.\\nSOURCES: ../../../state_of_the_union.txt\""
+     "' The president did not mention Ketanji Brown Jackson.\\nSOURCES: ../../modules/state_of_the_union.txt'"
     ]
    },
    "execution_count": 27,
@@ -606,7 +598,7 @@
    "name": "stdout",
    "output_type": "stream",
    "text": [
-    " The president said that Ketanji Brown Jackson is one of the nation's top legal minds and that she will continue Justice Breyer's legacy of excellence."
+    " The president said that Ketanji Brown Jackson is one of the nation's top legal minds, a former top litigator in private practice, and a former federal public defender."
    ]
   }
  ],
@@ -628,7 +620,7 @@
    "name": "stdout",
    "output_type": "stream",
    "text": [
-    " Justice Stephen Breyer"
+    " Justice Stephen Breyer."
    ]
   }
  ],
@@ -689,7 +681,7 @@
   {
    "data": {
     "text/plain": [
-     "\" The president said that Ketanji Brown Jackson is one of the nation's top legal minds and that she will continue Justice Breyer's legacy of excellence.\""
+     "\" The president said that Ketanji Brown Jackson is one of the nation's top legal minds, a former top litigator in private practice, and a former federal public defender.\""
     ]
    },
    "execution_count": 33,
@@ -6,7 +6,7 @@
   "source": [
    "# Vectara Text Generation\n",
    "\n",
-    "This notebook is based on [text generation](https://github.com/hwchase17/langchain/blob/master/docs/modules/chains/index_examples/vector_db_text_generation.ipynb) notebook and adapted to Vectara."
+    "This notebook is based on [chat_vector_db](https://github.com/hwchase17/langchain/blob/master/docs/modules/chains/index_examples/question_answering.ipynb) and adapted to Vectara."
   ]
  },
 {
@@ -24,7 +24,6 @@
   "metadata": {},
   "outputs": [],
   "source": [
-    "import os\n",
    "from langchain.llms import OpenAI\n",
    "from langchain.docstore.document import Document\n",
    "import requests\n",
@@ -160,7 +159,7 @@
    "name": "stdout",
    "output_type": "stream",
    "text": [
"[{'text': '\\n\\nEnvironment variables are a powerful tool for managing configuration settings in your applications. They allow you to store and access values from anywhere in your code, making it easier to keep your codebase organized and maintainable.\\n\\nHowever, there are times when you may want to use environment variables specifically for a single command. This is where shell variables come in. Shell variables are similar to environment variables, but they won\\'t be exported to spawned commands. They are defined with the following syntax:\\n\\n```sh\\nVAR_NAME=value\\n```\\n\\nFor example, if you wanted to use a shell variable instead of an environment variable in a command, you could do something like this:\\n\\n```sh\\nVAR=hello && echo $VAR && deno eval \"console.log(\\'Deno: \\' + Deno.env.get(\\'VAR\\'))\"\\n```\\n\\nThis would output the following:\\n\\n```\\nhello\\nDeno: undefined\\n```\\n\\nShell variables can be useful when you want to re-use a value, but don\\'t want it available in any spawned processes.\\n\\nAnother way to use environment variables is through pipelines. Pipelines provide a way to pipe the'}, {'text': '\\n\\nEnvironment variables are a great way to store and access sensitive information in your applications. They are also useful for configuring applications and managing different environments. In Deno, there are two ways to use environment variables: the built-in `Deno.env` and the `.env` file.\\n\\nThe `Deno.env` is a built-in feature of the Deno runtime that allows you to set and get environment variables. It has getter and setter methods that you can use to access and set environment variables. For example, you can set the `FIREBASE_API_KEY` and `FIREBASE_AUTH_DOMAIN` environment variables like this:\\n\\n```ts\\nDeno.env.set(\"FIREBASE_API_KEY\", \"examplekey123\");\\nDeno.env.set(\"FIREBASE_AUTH_DOMAIN\", \"firebasedomain.com\");\\n\\nconsole.log(Deno.env.get(\"FIREBASE_API_KEY\")); // examplekey123\\nconsole.log(Deno.env.get(\"FIREBASE_AUTH_DOMAIN\")); // firebasedomain'}, {'text': \"\\n\\nEnvironment variables are a powerful tool for managing configuration and settings in your applications. They allow you to store and access values that can be used in your code, and they can be set and changed without having to modify your code.\\n\\nIn Deno, environment variables are defined using the `export` command. For example, to set a variable called `VAR_NAME` to the value `value`, you would use the following command:\\n\\n```sh\\nexport VAR_NAME=value\\n```\\n\\nYou can then access the value of the environment variable in your code using the `Deno.env.get()` method. For example, if you wanted to log the value of the `VAR_NAME` variable, you could use the following code:\\n\\n```js\\nconsole.log(Deno.env.get('VAR_NAME'));\\n```\\n\\nYou can also set environment variables for a single command. To do this, you can list the environment variables before the command, like so:\\n\\n```\\nVAR=hello VAR2=bye deno run main.ts\\n```\\n\\nThis will set the environment variables `VAR` and `V\"}, {'text': \"\\n\\nEnvironment variables are a powerful tool for managing settings and configuration in your applications. They can be used to store information such as user preferences, application settings, and even passwords. In this blog post, we'll discuss how to make Deno scripts executable with a hashbang (shebang).\\n\\nA hashbang is a line of code that is placed at the beginning of a script. It tells the system which interpreter to use when running the script. 
In the case of Deno, the hashbang should be `#!/usr/bin/env -S deno run --allow-env`. This tells the system to use the Deno interpreter and to allow the script to access environment variables.\\n\\nOnce the hashbang is in place, you may need to give the script execution permissions. On Linux, this can be done with the command `sudo chmod +x hashbang.ts`. After that, you can execute the script by calling it like any other command: `./hashbang.ts`.\\n\\nIn the example program, we give the context permission to access the environment variables and print the Deno installation path. This is done by using the `Deno.env.get()` function, which returns the value of the specified environment\"}]\n"
"[{'text': '\\n\\nEnvironment variables are an essential part of any development workflow. They provide a way to store and access information that is specific to the environment in which the code is running. This can be especially useful when working with different versions of a language or framework, or when running code on different machines.\\n\\nThe Deno CLI tasks extension provides a way to easily manage environment variables when running Deno commands. This extension provides a task definition for allowing you to create tasks that execute the `deno` CLI from within the editor. The template for the Deno CLI tasks has the following interface, which can be configured in a `tasks.json` within your workspace:\\n\\nThe task definition includes the `type` field, which should be set to `deno`, and the `command` field, which is the `deno` command to run (e.g. `run`, `test`, `cache`, etc.). Additionally, you can specify additional arguments to pass on the command line, the current working directory to execute the command, and any environment variables.\\n\\nUsing environment variables with the Deno CLI tasks extension is a great way to ensure that your code is running in the correct environment. For example, if you are running a test suite,'}, {'text': '\\n\\nEnvironment variables are an important part of any programming language, and they can be used to store and access data in a variety of ways. In this blog post, we\\'ll be taking a look at environment variables specifically for the shell.\\n\\nShell variables are similar to environment variables, but they won\\'t be exported to spawned commands. They are defined with the following syntax:\\n\\n```sh\\nVAR_NAME=value\\n```\\n\\nShell variables can be used to store and access data in a variety of ways. For example, you can use them to store values that you want to re-use, but don\\'t want to be available in any spawned processes.\\n\\nFor example, if you wanted to store a value and then use it in a command, you could do something like this:\\n\\n```sh\\nVAR=hello && echo $VAR && deno eval \"console.log(\\'Deno: \\' + Deno.env.get(\\'VAR\\'))\"\\n```\\n\\nThis would output the following:\\n\\n```\\nhello\\nDeno: undefined\\n```\\n\\nAs you can see, the value stored in the shell variable is not available in the spawned process.\\n\\n'}, {'text': '\\n\\nWhen it comes to developing applications, environment variables are an essential part of the process. Environment variables are used to store information that can be used by applications and scripts to customize their behavior. This is especially important when it comes to developing applications with Deno, as there are several environment variables that can impact the behavior of Deno.\\n\\nThe most important environment variable for Deno is `DENO_AUTH_TOKENS`. This environment variable is used to store authentication tokens that are used to access remote resources. This is especially important when it comes to accessing remote APIs or databases. Without the proper authentication tokens, Deno will not be able to access the remote resources.\\n\\nAnother important environment variable for Deno is `DENO_DIR`. This environment variable is used to store the directory where Deno will store its files. This includes the Deno executable, the Deno cache, and the Deno configuration files. By setting this environment variable, you can ensure that Deno will always be able to find the files it needs.\\n\\nFinally, there is the `DENO_PLUGINS` environment variable. 
This environment variable is used to store the list of plugins that Deno will use. This is important for customizing the'}, {'text': '\\n\\nEnvironment variables are a great way to store and access sensitive information in your Deno applications. Deno offers built-in support for environment variables with `Deno.env`, and you can also use a `.env` file to store and access environment variables. In this blog post, we\\'ll explore both of these options and how to use them in your Deno applications.\\n\\n## Built-in `Deno.env`\\n\\nThe Deno runtime offers built-in support for environment variables with [`Deno.env`](https://deno.land/api@v1.25.3?s=Deno.env). `Deno.env` has getter and setter methods. Here is example usage:\\n\\n```ts\\nDeno.env.set(\"FIREBASE_API_KEY\", \"examplekey123\");\\nDeno.env.set(\"FIREBASE_AUTH_DOMAIN\", \"firebasedomain.com\");\\n\\nconsole.log(Deno.env.get(\"FIREBASE_API_KEY\")); // examplekey123\\nconsole.log(Deno.env.get(\"FIREBASE_AUTH_'}]\n"
    ]
   }
  ],
@@ -14,7 +14,6 @@
   ]
  },
  {
-  "attachments": {},
   "cell_type": "markdown",
   "id": "9b22020a",
   "metadata": {},
@@ -140,7 +139,6 @@
   "source": []
  },
  {
-  "attachments": {},
   "cell_type": "markdown",
   "id": "c0a6c031",
   "metadata": {},
@@ -231,7 +229,7 @@
   }
  ],
  "source": [
-   "agent.run(\"What did biden say about ketanji brown jackson in the state of the union address?\")"
+   "agent.run(\"What did biden say about ketanji brown jackson is the state of the union address?\")"
  ]
 },
 {
@@ -273,7 +271,6 @@
   ]
  },
  {
-  "attachments": {},
   "cell_type": "markdown",
   "id": "787a9b5e",
   "metadata": {},
@@ -282,7 +279,6 @@
   ]
  },
  {
-  "attachments": {},
   "cell_type": "markdown",
   "id": "9161ba91",
   "metadata": {},
@@ -400,7 +396,6 @@
   ]
  },
  {
-  "attachments": {},
   "cell_type": "markdown",
   "id": "49a0cbbe",
   "metadata": {},
@@ -1,7 +1,6 @@
 {
  "cells": [
   {
-   "attachments": {},
    "cell_type": "markdown",
    "id": "18ada398-dce6-4049-9b56-fc0ede63da9c",
    "metadata": {},
@@ -12,7 +11,6 @@
   ]
  },
  {
-  "attachments": {},
   "cell_type": "markdown",
   "id": "eecb683b-3a46-4b9d-81a3-7caefbfec1a1",
   "metadata": {},
@@ -90,7 +88,6 @@
   ]
  },
  {
-  "attachments": {},
   "cell_type": "markdown",
   "id": "f4814175-964d-42f1-aa9d-22801ce1e912",
   "metadata": {},
@@ -126,7 +123,6 @@
   ]
  },
  {
-  "attachments": {},
   "cell_type": "markdown",
   "id": "8a38ad10",
   "metadata": {},
@@ -169,7 +165,7 @@
   }
  ],
  "source": [
-   "agent_executor.run(\"What did biden say about ketanji brown jackson in the state of the union address?\")"
+   "agent_executor.run(\"What did biden say about ketanji brown jackson is the state of the union address?\")"
  ]
 },
 {
@@ -207,11 +203,10 @@
   }
  ],
  "source": [
-   "agent_executor.run(\"What did biden say about ketanji brown jackson in the state of the union address? List the source.\")"
+   "agent_executor.run(\"What did biden say about ketanji brown jackson is the state of the union address? List the source.\")"
  ]
 },
 {
-  "attachments": {},
  "cell_type": "markdown",
  "id": "7ca07707",
  "metadata": {},
@@ -260,7 +255,6 @@
   ]
  },
  {
-  "attachments": {},
   "cell_type": "markdown",
   "id": "71680984-edaf-4a63-90f5-94edbd263550",
   "metadata": {},
@@ -305,7 +299,7 @@
   }
  ],
  "source": [
-   "agent_executor.run(\"What did biden say about ketanji brown jackson in the state of the union address?\")"
+   "agent_executor.run(\"What did biden say about ketanji brown jackson is the state of the union address?\")"
  ]
 },
 {
@@ -160,9 +160,3 @@ Below is a list of all supported tools and relevant information:
 - Notes: A connection to the OpenWeatherMap API (https://api.openweathermap.org), specifically the `/data/2.5/weather` endpoint.
 - Requires LLM: No
 - Extra Parameters: `openweathermap_api_key` (your API key to access this endpoint)
-
-**sleep**
-
-- Tool Name: Sleep
-- Tool Description: Make agent sleep for some time.
-- Requires LLM: No
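The OpenWeatherMap entry above takes an extra parameter when loading the tool; as a hedged sketch (assuming "openweathermap-api" is the registry name used by `load_tools`, and the key value is a placeholder):

```python
from langchain.agents import load_tools

# Extra parameters like the API key are passed straight through load_tools.
tools = load_tools(["openweathermap-api"], openweathermap_api_key="YOUR_API_KEY")
```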
@@ -177,7 +177,7 @@
    "\u001b[32;1m\u001b[1;3mMATCH (a:Actor)-[:ACTED_IN]->(m:Movie {name: 'Top Gun'})\n",
    "RETURN a.name\u001b[0m\n",
    "Full Context:\n",
-    "\u001b[32;1m\u001b[1;3m[{'a.name': 'Val Kilmer'}, {'a.name': 'Anthony Edwards'}, {'a.name': 'Meg Ryan'}, {'a.name': 'Tom Cruise'}]\u001b[0m\n",
+    "\u001b[32;1m\u001b[1;3m[{'a.name': 'Tom Cruise'}, {'a.name': 'Val Kilmer'}, {'a.name': 'Anthony Edwards'}, {'a.name': 'Meg Ryan'}]\u001b[0m\n",
    "\n",
    "\u001b[1m> Finished chain.\u001b[0m\n"
   ]
@@ -185,7 +185,7 @@
   {
    "data": {
     "text/plain": [
-     "'Val Kilmer, Anthony Edwards, Meg Ryan, and Tom Cruise played in Top Gun.'"
+     "'Tom Cruise, Val Kilmer, Anthony Edwards, and Meg Ryan played in Top Gun.'"
     ]
    },
    "execution_count": 7,
@@ -197,180 +197,10 @@
    "chain.run(\"Who played in Top Gun?\")"
   ]
  },
- {
-  "cell_type": "markdown",
-  "id": "2d28c4df",
-  "metadata": {},
-  "source": [
-   "## Limit the number of results\n",
-   "You can limit the number of results from the Cypher QA Chain using the `top_k` parameter.\n",
-   "The default is 10."
-  ]
- },
- {
-  "cell_type": "code",
-  "execution_count": 8,
-  "id": "df230946",
-  "metadata": {},
-  "outputs": [],
-  "source": [
-   "chain = GraphCypherQAChain.from_llm(\n",
-   "    ChatOpenAI(temperature=0), graph=graph, verbose=True, top_k=2\n",
-   ")"
-  ]
- },
- {
-  "cell_type": "code",
-  "execution_count": 9,
-  "id": "3f1600ee",
-  "metadata": {},
-  "outputs": [
-   {
-    "name": "stdout",
-    "output_type": "stream",
-    "text": [
-     "\n",
-     "\n",
-     "\u001b[1m> Entering new GraphCypherQAChain chain...\u001b[0m\n",
-     "Generated Cypher:\n",
-     "\u001b[32;1m\u001b[1;3mMATCH (a:Actor)-[:ACTED_IN]->(m:Movie {name: 'Top Gun'})\n",
-     "RETURN a.name\u001b[0m\n",
-     "Full Context:\n",
-     "\u001b[32;1m\u001b[1;3m[{'a.name': 'Val Kilmer'}, {'a.name': 'Anthony Edwards'}]\u001b[0m\n",
-     "\n",
-     "\u001b[1m> Finished chain.\u001b[0m\n"
-    ]
-   },
-   {
-    "data": {
-     "text/plain": [
-      "'Val Kilmer and Anthony Edwards played in Top Gun.'"
-     ]
-    },
-    "execution_count": 9,
-    "metadata": {},
-    "output_type": "execute_result"
-   }
-  ],
-  "source": [
-   "chain.run(\"Who played in Top Gun?\")"
-  ]
- },
- {
-  "cell_type": "markdown",
-  "id": "88c16206",
-  "metadata": {},
-  "source": [
-   "## Return intermediate results\n",
-   "You can return intermediate steps from the Cypher QA Chain using the `return_intermediate_steps` parameter"
-  ]
- },
- {
-  "cell_type": "code",
-  "execution_count": 10,
-  "id": "e412f36b",
-  "metadata": {},
-  "outputs": [],
-  "source": [
-   "chain = GraphCypherQAChain.from_llm(\n",
-   "    ChatOpenAI(temperature=0), graph=graph, verbose=True, return_intermediate_steps=True\n",
-   ")"
-  ]
- },
- {
-  "cell_type": "code",
-  "execution_count": 11,
-  "id": "4f4699dc",
-  "metadata": {},
-  "outputs": [
-   {
-    "name": "stdout",
-    "output_type": "stream",
-    "text": [
-     "\n",
-     "\n",
-     "\u001b[1m> Entering new GraphCypherQAChain chain...\u001b[0m\n",
-     "Generated Cypher:\n",
-     "\u001b[32;1m\u001b[1;3mMATCH (a:Actor)-[:ACTED_IN]->(m:Movie {name: 'Top Gun'})\n",
-     "RETURN a.name\u001b[0m\n",
-     "Full Context:\n",
-     "\u001b[32;1m\u001b[1;3m[{'a.name': 'Val Kilmer'}, {'a.name': 'Anthony Edwards'}, {'a.name': 'Meg Ryan'}, {'a.name': 'Tom Cruise'}]\u001b[0m\n",
-     "\n",
-     "\u001b[1m> Finished chain.\u001b[0m\n",
-     "Intermediate steps: [{'query': \"MATCH (a:Actor)-[:ACTED_IN]->(m:Movie {name: 'Top Gun'})\\nRETURN a.name\"}, {'context': [{'a.name': 'Val Kilmer'}, {'a.name': 'Anthony Edwards'}, {'a.name': 'Meg Ryan'}, {'a.name': 'Tom Cruise'}]}]\n",
-     "Final answer: Val Kilmer, Anthony Edwards, Meg Ryan, and Tom Cruise played in Top Gun.\n"
-    ]
-   }
-  ],
-  "source": [
-   "result = chain(\"Who played in Top Gun?\")\n",
-   "print(f\"Intermediate steps: {result['intermediate_steps']}\")\n",
-   "print(f\"Final answer: {result['result']}\")"
-  ]
- },
- {
-  "cell_type": "markdown",
-  "id": "d6e1b054",
-  "metadata": {},
-  "source": [
-   "## Return direct results\n",
-   "You can return direct results from the Cypher QA Chain using the `return_direct` parameter"
-  ]
- },
- {
-  "cell_type": "code",
-  "execution_count": 12,
-  "id": "2d3acf10",
-  "metadata": {},
-  "outputs": [],
-  "source": [
-   "chain = GraphCypherQAChain.from_llm(\n",
-   "    ChatOpenAI(temperature=0), graph=graph, verbose=True, return_direct=True\n",
-   ")"
-  ]
- },
- {
-  "cell_type": "code",
-  "execution_count": 13,
-  "id": "b0a9d143",
-  "metadata": {},
-  "outputs": [
-   {
-    "name": "stdout",
-    "output_type": "stream",
-    "text": [
-     "\n",
-     "\n",
-     "\u001b[1m> Entering new GraphCypherQAChain chain...\u001b[0m\n",
-     "Generated Cypher:\n",
-     "\u001b[32;1m\u001b[1;3mMATCH (a:Actor)-[:ACTED_IN]->(m:Movie {name: 'Top Gun'})\n",
-     "RETURN a.name\u001b[0m\n",
-     "\n",
-     "\u001b[1m> Finished chain.\u001b[0m\n"
-    ]
-   },
-   {
-    "data": {
-     "text/plain": [
-      "[{'a.name': 'Val Kilmer'},\n",
-      " {'a.name': 'Anthony Edwards'},\n",
-      " {'a.name': 'Meg Ryan'},\n",
-      " {'a.name': 'Tom Cruise'}]"
-     ]
-    },
-    "execution_count": 13,
-    "metadata": {},
-    "output_type": "execute_result"
-   }
-  ],
-  "source": [
-   "chain.run(\"Who played in Top Gun?\")"
-  ]
- },
  {
   "cell_type": "code",
   "execution_count": null,
-  "id": "74d0a36f",
+  "id": "b4825316",
   "metadata": {},
   "outputs": [],
   "source": []
@@ -392,7 +222,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-    "version": "3.8.8"
+    "version": "3.9.1"
   }
  },
 "nbformat": 4,
@@ -1,270 +0,0 @@
{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "c94240f5",
   "metadata": {},
   "source": [
    "# NebulaGraphQAChain\n",
    "\n",
    "This notebook shows how to use LLMs to provide a natural language interface to the NebulaGraph database."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "dbc0ee68",
   "metadata": {},
   "source": [
    "You will need to have a running NebulaGraph cluster, for which you can run a containerized cluster by running the following script:\n",
    "\n",
    "```bash\n",
    "curl -fsSL nebula-up.siwei.io/install.sh | bash\n",
    "```\n",
    "\n",
    "Other options are:\n",
    "- Install as a [Docker Desktop Extension](https://www.docker.com/blog/distributed-cloud-native-graph-database-nebulagraph-docker-extension/). See [here](https://docs.nebula-graph.io/3.5.0/2.quick-start/1.quick-start-workflow/)\n",
    "- NebulaGraph Cloud Service. See [here](https://www.nebula-graph.io/cloud)\n",
    "- Deploy from package, source code, or via Kubernetes. See [here](https://docs.nebula-graph.io/)\n",
    "\n",
    "Once the cluster is running, we can create the SPACE and SCHEMA for the database."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c82f4141",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install ipython-ngql\n",
    "%load_ext ngql\n",
    "\n",
    "# connect ngql jupyter extension to nebulagraph\n",
    "%ngql --address 127.0.0.1 --port 9669 --user root --password nebula\n",
    "# create a new space\n",
    "%ngql CREATE SPACE IF NOT EXISTS langchain(partition_num=1, replica_factor=1, vid_type=fixed_string(128));\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "eda0809a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Wait for a few seconds for the space to be created.\n",
    "%ngql USE langchain;"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "119fe35c",
   "metadata": {},
   "source": [
    "Create the schema; for the full dataset, refer [here](https://www.siwei.io/en/nebulagraph-etl-dbt/)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5aa796ee",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%ngql\n",
    "CREATE TAG IF NOT EXISTS movie(name string);\n",
    "CREATE TAG IF NOT EXISTS person(name string, birthdate string);\n",
    "CREATE EDGE IF NOT EXISTS acted_in();\n",
    "CREATE TAG INDEX IF NOT EXISTS person_index ON person(name(128));\n",
    "CREATE TAG INDEX IF NOT EXISTS movie_index ON movie(name(128));"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "66e4799a",
   "metadata": {},
   "source": [
    "Wait for schema creation to complete, then we can insert some data."
   ]
  },
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"id": "d8eea530",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"UsageError: Cell magic `%%ngql` not found.\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"%%ngql\n",
|
||||
"INSERT VERTEX person(name, birthdate) VALUES \"Al Pacino\":(\"Al Pacino\", \"1940-04-25\");\n",
|
||||
"INSERT VERTEX movie(name) VALUES \"The Godfather II\":(\"The Godfather II\");\n",
|
||||
"INSERT VERTEX movie(name) VALUES \"The Godfather Coda: The Death of Michael Corleone\":(\"The Godfather Coda: The Death of Michael Corleone\");\n",
|
||||
"INSERT EDGE acted_in() VALUES \"Al Pacino\"->\"The Godfather II\":();\n",
|
||||
"INSERT EDGE acted_in() VALUES \"Al Pacino\"->\"The Godfather Coda: The Death of Michael Corleone\":();"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"id": "62812aad",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.chat_models import ChatOpenAI\n",
|
||||
"from langchain.chains import NebulaGraphQAChain\n",
|
||||
"from langchain.graphs import NebulaGraph"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"id": "0928915d",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"graph = NebulaGraph(\n",
|
||||
" space=\"langchain\",\n",
|
||||
" username=\"root\",\n",
|
||||
" password=\"nebula\",\n",
|
||||
" address=\"127.0.0.1\",\n",
|
||||
" port=9669,\n",
|
||||
" session_pool_size=30,\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"id": "58c1a8ea",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Refresh graph schema information\n",
|
||||
"\n",
|
||||
"If the schema of database changes, you can refresh the schema information needed to generate nGQL statements."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "4e3de44f",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# graph.refresh_schema()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"id": "1fe76ccd",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Node properties: [{'tag': 'movie', 'properties': [('name', 'string')]}, {'tag': 'person', 'properties': [('name', 'string'), ('birthdate', 'string')]}]\n",
|
||||
"Edge properties: [{'edge': 'acted_in', 'properties': []}]\n",
|
||||
"Relationships: ['(:person)-[:acted_in]->(:movie)']\n",
|
||||
"\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"print(graph.get_schema)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"id": "68a3c677",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Querying the graph\n",
|
||||
"\n",
|
||||
"We can now use the graph cypher QA chain to ask question of the graph"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"id": "7476ce98",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"chain = NebulaGraphQAChain.from_llm(\n",
|
||||
" ChatOpenAI(temperature=0), graph=graph, verbose=True\n",
|
||||
")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"id": "ef8ee27b",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"\n",
|
||||
"\n",
|
||||
"\u001b[1m> Entering new NebulaGraphQAChain chain...\u001b[0m\n",
|
||||
"Generated nGQL:\n",
|
||||
"\u001b[32;1m\u001b[1;3mMATCH (p:`person`)-[:acted_in]->(m:`movie`) WHERE m.`movie`.`name` == 'The Godfather II'\n",
|
||||
"RETURN p.`person`.`name`\u001b[0m\n",
|
||||
"Full Context:\n",
|
||||
"\u001b[32;1m\u001b[1;3m{'p.person.name': ['Al Pacino']}\u001b[0m\n",
|
||||
"\n",
|
||||
"\u001b[1m> Finished chain.\u001b[0m\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'Al Pacino played in The Godfather II.'"
|
||||
]
|
||||
},
|
||||
"execution_count": 6,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"chain.run(\"Who played in The Godfather II?\")"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.11.3"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
|
||||
@@ -30,7 +30,6 @@ For detailed instructions on how to get set up with Unstructured, see installati
|
||||
:maxdepth: 1
|
||||
:glob:
|
||||
|
||||
./document_loaders/examples/airtable.ipynb
|
||||
./document_loaders/examples/audio.ipynb
|
||||
./document_loaders/examples/conll-u.ipynb
|
||||
./document_loaders/examples/copypaste.ipynb
|
||||
@@ -38,7 +37,6 @@ For detailed instructions on how to get set up with Unstructured, see installati
|
||||
./document_loaders/examples/email.ipynb
|
||||
./document_loaders/examples/epub.ipynb
|
||||
./document_loaders/examples/evernote.ipynb
|
||||
./document_loaders/examples/excel.ipynb
|
||||
./document_loaders/examples/facebook_chat.ipynb
|
||||
./document_loaders/examples/file_directory.ipynb
|
||||
./document_loaders/examples/html.ipynb
|
||||
@@ -117,7 +115,6 @@ We need access tokens and sometime other parameters to get access to these datas
|
||||
./document_loaders/examples/discord_loader.ipynb
|
||||
./document_loaders/examples/docugami.ipynb
|
||||
./document_loaders/examples/duckdb.ipynb
|
||||
./document_loaders/examples/fauna.ipynb
|
||||
./document_loaders/examples/figma.ipynb
|
||||
./document_loaders/examples/gitbook.ipynb
|
||||
./document_loaders/examples/git.ipynb
|
||||
@@ -139,7 +136,6 @@ We need access tokens and sometime other parameters to get access to these datas
|
||||
./document_loaders/examples/reddit.ipynb
|
||||
./document_loaders/examples/roam.ipynb
|
||||
./document_loaders/examples/slack.ipynb
|
||||
./document_loaders/examples/snowflake.ipynb
|
||||
./document_loaders/examples/spreedly.ipynb
|
||||
./document_loaders/examples/stripe.ipynb
|
||||
./document_loaders/examples/tomarkdown.ipynb
|
||||
|
||||
@@ -1,142 +0,0 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "7ae421e6",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Airtable"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "98aea00d",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"! pip install pyairtable"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"id": "592483eb",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.document_loaders import AirtableLoader"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "637e1205",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"* Get your API key [here](https://support.airtable.com/docs/creating-and-using-api-keys-and-access-tokens).\n",
|
||||
"* Get ID of your base [here](https://airtable.com/developers/web/api/introduction).\n",
|
||||
"* Get your table ID from the table url as shown [here](https://www.highviewapps.com/kb/where-can-i-find-the-airtable-base-id-and-table-id/#:~:text=Both%20the%20Airtable%20Base%20ID,URL%20that%20begins%20with%20tbl)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "c12a7aff",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"api_key=\"xxx\"\n",
|
||||
"base_id=\"xxx\"\n",
|
||||
"table_id=\"xxx\""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 9,
|
||||
"id": "ccddd5a6",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"loader = AirtableLoader(api_key,table_id,base_id)\n",
|
||||
"docs = loader.load()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "ae76c25c",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Returns each table row as `dict`."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 10,
|
||||
"id": "7abec7ce",
|
||||
"metadata": {
|
||||
"scrolled": true
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"3"
|
||||
]
|
||||
},
|
||||
"execution_count": 10,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"len(docs)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 11,
|
||||
"id": "403c95da",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"{'id': 'recF3GbGZCuh9sXIQ',\n",
|
||||
" 'createdTime': '2023-06-09T04:47:21.000Z',\n",
|
||||
" 'fields': {'Priority': 'High',\n",
|
||||
" 'Status': 'In progress',\n",
|
||||
" 'Name': 'Document Splitters'}}"
|
||||
]
|
||||
},
|
||||
"execution_count": 11,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"eval(docs[0].page_content)"
|
||||
]
|
||||
}
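Note that `eval` will execute arbitrary expressions, which is risky if the table data isn't trusted; `ast.literal_eval` from the standard library parses the same plain literals safely:

```python
import ast

record = ast.literal_eval(docs[0].page_content)  # safe: literals only, no code execution
record["fields"]["Name"]  # 'Document Splitters'
```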
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.9.16"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
|
||||
@@ -29,6 +29,7 @@
|
||||
"cell_type": "code",
|
||||
"execution_count": 26,
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"jupyter": {
|
||||
"outputs_hidden": false
|
||||
}
|
||||
@@ -44,6 +45,7 @@
|
||||
"cell_type": "code",
|
||||
"execution_count": 27,
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"jupyter": {
|
||||
"outputs_hidden": false
|
||||
}
|
||||
@@ -74,6 +76,7 @@
|
||||
"cell_type": "code",
|
||||
"execution_count": 28,
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"jupyter": {
|
||||
"outputs_hidden": false
|
||||
}
|
||||
@@ -93,6 +96,7 @@
|
||||
"cell_type": "code",
|
||||
"execution_count": 29,
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"jupyter": {
|
||||
"outputs_hidden": false
|
||||
}
|
||||
@@ -148,211 +152,6 @@
|
||||
"source": [
|
||||
"print(data)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## `UnstructuredCSVLoader`\n",
|
||||
"\n",
|
||||
"You can also load the table using the `UnstructuredCSVLoader`. One advantage of using `UnstructuredCSVLoader` is that if you use it in `\"elements\"` mode, an HTML representation of the table will be available in the metadata."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.document_loaders.csv_loader import UnstructuredCSVLoader"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"loader = UnstructuredCSVLoader(file_path='example_data/mlb_teams_2012.csv', mode=\"elements\")\n",
|
||||
"docs = loader.load()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"<table border=\"1\" class=\"dataframe\">\n",
|
||||
" <tbody>\n",
|
||||
" <tr>\n",
|
||||
" <td>Nationals</td>\n",
|
||||
" <td>81.34</td>\n",
|
||||
" <td>98</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <td>Reds</td>\n",
|
||||
" <td>82.20</td>\n",
|
||||
" <td>97</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <td>Yankees</td>\n",
|
||||
" <td>197.96</td>\n",
|
||||
" <td>95</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <td>Giants</td>\n",
|
||||
" <td>117.62</td>\n",
|
||||
" <td>94</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <td>Braves</td>\n",
|
||||
" <td>83.31</td>\n",
|
||||
" <td>94</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <td>Athletics</td>\n",
|
||||
" <td>55.37</td>\n",
|
||||
" <td>94</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <td>Rangers</td>\n",
|
||||
" <td>120.51</td>\n",
|
||||
" <td>93</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <td>Orioles</td>\n",
|
||||
" <td>81.43</td>\n",
|
||||
" <td>93</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <td>Rays</td>\n",
|
||||
" <td>64.17</td>\n",
|
||||
" <td>90</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <td>Angels</td>\n",
|
||||
" <td>154.49</td>\n",
|
||||
" <td>89</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <td>Tigers</td>\n",
|
||||
" <td>132.30</td>\n",
|
||||
" <td>88</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <td>Cardinals</td>\n",
|
||||
" <td>110.30</td>\n",
|
||||
" <td>88</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <td>Dodgers</td>\n",
|
||||
" <td>95.14</td>\n",
|
||||
" <td>86</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <td>White Sox</td>\n",
|
||||
" <td>96.92</td>\n",
|
||||
" <td>85</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <td>Brewers</td>\n",
|
||||
" <td>97.65</td>\n",
|
||||
" <td>83</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <td>Phillies</td>\n",
|
||||
" <td>174.54</td>\n",
|
||||
" <td>81</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <td>Diamondbacks</td>\n",
|
||||
" <td>74.28</td>\n",
|
||||
" <td>81</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <td>Pirates</td>\n",
|
||||
" <td>63.43</td>\n",
|
||||
" <td>79</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <td>Padres</td>\n",
|
||||
" <td>55.24</td>\n",
|
||||
" <td>76</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <td>Mariners</td>\n",
|
||||
" <td>81.97</td>\n",
|
||||
" <td>75</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <td>Mets</td>\n",
|
||||
" <td>93.35</td>\n",
|
||||
" <td>74</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <td>Blue Jays</td>\n",
|
||||
" <td>75.48</td>\n",
|
||||
" <td>73</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <td>Royals</td>\n",
|
||||
" <td>60.91</td>\n",
|
||||
" <td>72</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <td>Marlins</td>\n",
|
||||
" <td>118.07</td>\n",
|
||||
" <td>69</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <td>Red Sox</td>\n",
|
||||
" <td>173.18</td>\n",
|
||||
" <td>69</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <td>Indians</td>\n",
|
||||
" <td>78.43</td>\n",
|
||||
" <td>68</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <td>Twins</td>\n",
|
||||
" <td>94.08</td>\n",
|
||||
" <td>66</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <td>Rockies</td>\n",
|
||||
" <td>78.06</td>\n",
|
||||
" <td>64</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <td>Cubs</td>\n",
|
||||
" <td>88.19</td>\n",
|
||||
" <td>61</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <td>Astros</td>\n",
|
||||
" <td>60.65</td>\n",
|
||||
" <td>55</td>\n",
|
||||
" </tr>\n",
|
||||
" </tbody>\n",
|
||||
"</table>\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"print(docs[0].metadata[\"text_as_html\"])"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
@@ -371,7 +170,7 @@
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.8.13"
|
||||
"version": "3.10.6"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
|
||||
@@ -1,167 +0,0 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"# Embaas\n",
|
||||
"[embaas](https://embaas.io) is a fully managed NLP API service that offers features like embedding generation, document text extraction, document to embeddings and more. You can choose a [variety of pre-trained models](https://embaas.io/docs/models/embeddings).\n",
|
||||
"\n",
|
||||
"### Prerequisites\n",
|
||||
"Create a free embaas account at [https://embaas.io/register](https://embaas.io/register) and generate an [API key](https://embaas.io/dashboard/api-keys)\n",
|
||||
"\n",
|
||||
"### Document Text Extraction API\n",
|
||||
"The document text extraction API allows you to extract the text from a given document. The API supports a variety of document formats, including PDF, mp3, mp4 and more. For a full list of supported formats, check out the API docs (link below)."
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Set API key\n",
|
||||
"embaas_api_key = \"YOUR_API_KEY\"\n",
|
||||
"# or set environment variable\n",
|
||||
"os.environ[\"EMBAAS_API_KEY\"] = \"YOUR_API_KEY\""
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"#### Using a blob (bytes)"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.document_loaders.embaas import EmbaasBlobLoader\n",
|
||||
"from langchain.document_loaders.blob_loaders import Blob"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"blob_loader = EmbaasBlobLoader()\n",
|
||||
"blob = Blob.from_path(\"example.pdf\")\n",
|
||||
"documents = blob_loader.load(blob)"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# You can also directly create embeddings with your preferred embeddings model\n",
|
||||
"blob_loader = EmbaasBlobLoader(params={\"model\": \"e5-large-v2\", \"should_embed\": True})\n",
|
||||
"blob = Blob.from_path(\"example.pdf\")\n",
|
||||
"documents = blob_loader.load(blob)\n",
|
||||
"\n",
|
||||
"print(documents[0][\"metadata\"][\"embedding\"])"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"ExecuteTime": {
|
||||
"start_time": "2023-06-12T22:19:48.366886Z",
|
||||
"end_time": "2023-06-12T22:19:48.380467Z"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"#### Using a file"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.document_loaders.embaas import EmbaasLoader"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"file_loader = EmbaasLoader(file_path=\"example.pdf\")\n",
|
||||
"documents = file_loader.load()"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 15,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Disable automatic text splitting\n",
|
||||
"file_loader = EmbaasLoader(file_path=\"example.mp3\", params={\"should_chunk\": False})\n",
|
||||
"documents = file_loader.load()"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"ExecuteTime": {
|
||||
"start_time": "2023-06-12T22:24:31.880857Z",
|
||||
"end_time": "2023-06-12T22:24:31.894665Z"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"For more detailed information about the embaas document text extraction API, please refer to [the official embaas API documentation](https://embaas.io/api-reference)."
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
}
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 2
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython2",
|
||||
"version": "2.7.6"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 0
|
||||
}
|
||||
@@ -1,27 +0,0 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<factbook>
|
||||
<country>
|
||||
<name>United States</name>
|
||||
<capital>Washington, DC</capital>
|
||||
<leader>Joe Biden</leader>
|
||||
<sport>Baseball</sport>
|
||||
</country>
|
||||
<country>
|
||||
<name>Canada</name>
|
||||
<capital>Ottawa</capital>
|
||||
<leader>Justin Trudeau</leader>
|
||||
<sport>Hockey</sport>
|
||||
</country>
|
||||
<country>
|
||||
<name>France</name>
|
||||
<capital>Paris</capital>
|
||||
<leader>Emmanuel Macron</leader>
|
||||
<sport>Soccer</sport>
|
||||
</country>
|
||||
<country>
|
||||
<name>Trinidad & Tobago</name>
|
||||
<capital>Port of Spain</capital>
|
||||
<leader>Keith Rowley</leader>
|
||||
<sport>Track & Field</sport>
|
||||
</country>
|
||||
</factbook>
|
||||
@@ -1,84 +0,0 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Fauna\n",
|
||||
"\n",
|
||||
">[Fauna](https://fauna.com/) is a Document Database.\n",
|
||||
"\n",
|
||||
"Query `Fauna` documents"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"#!pip install fauna"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Query data example"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.document_loaders.fauna import FaunaLoader\n",
|
||||
"\n",
|
||||
"secret = \"<enter-valid-fauna-secret>\"\n",
|
||||
"query = \"Item.all()\" # Fauna query. Assumes that the collection is called \"Item\"\n",
|
||||
"field = \"text\" # The field that contains the page content. Assumes that the field is called \"text\"\n",
|
||||
"\n",
|
||||
"loader = FaunaLoader(query, field, secret)\n",
|
||||
"docs = loader.lazy_load()\n",
|
||||
"\n",
|
||||
"for value in docs:\n",
|
||||
" print(value)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Query with Pagination\n",
|
||||
"You get a `after` value if there are more data. You can get values after the curcor by passing in the `after` string in query. \n",
|
||||
"\n",
|
||||
"To learn more following [this link](https://fqlx-beta--fauna-docs.netlify.app/fqlx/beta/reference/schema_entities/set/static-paginate)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"query = \"\"\"\n",
|
||||
"Item.paginate(\"hs+DzoPOg ... aY1hOohozrV7A\")\n",
|
||||
"Item.all()\n",
|
||||
"\"\"\"\n",
|
||||
"loader = FaunaLoader(query, field, secret)"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"language_info": {
|
||||
"name": "python"
|
||||
},
|
||||
"orig_nbformat": 4
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
|
||||
@@ -22,16 +22,6 @@
|
||||
"Load .docx using `Docx2txt` into a document."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"id": "7b80ea891",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"!pip install docx2txt "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
|
||||
@@ -146,73 +146,6 @@
|
||||
"documents[0]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Add custom scraping rules\n",
|
||||
"\n",
|
||||
"The `SitemapLoader` uses `beautifulsoup4` for the scraping process, and it scrapes every element on the page by default. The `SitemapLoader` constructor accepts a custom scraping function. This feature can be helpful to tailor the scraping process to your specific needs; for example, you might want to avoid scraping headers or navigation elements.\n",
|
||||
"\n",
|
||||
" The following example shows how to develop and use a custom function to avoid navigation and header elements."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Import the `beautifulsoup4` library and define the custom function."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"pip install beautifulsoup4"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from bs4 import BeautifulSoup\n",
|
||||
"\n",
|
||||
"def remove_nav_and_header_elements(content: BeautifulSoup) -> str:\n",
|
||||
" # Find all 'nav' and 'header' elements in the BeautifulSoup object\n",
|
||||
" nav_elements = content.find_all('nav')\n",
|
||||
" header_elements = content.find_all('header')\n",
|
||||
"\n",
|
||||
" # Remove each 'nav' and 'header' element from the BeautifulSoup object\n",
|
||||
" for element in nav_elements + header_elements:\n",
|
||||
" element.decompose()\n",
|
||||
"\n",
|
||||
" return str(content.get_text())"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Add your custom function to the `SitemapLoader` object."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"loader = SitemapLoader(\n",
|
||||
" \"https://langchain.readthedocs.io/sitemap.xml\",\n",
|
||||
" filter_urls=[\"https://python.langchain.com/en/latest/\"],\n",
|
||||
" parsing_function=remove_nav_and_header_elements\n",
|
||||
")"
|
||||
]
|
||||
},
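Loading then proceeds exactly as before; only the parsing of each page changes:

```python
docs = loader.load()
docs[0].page_content[:200]  # nav and header text should now be absent
```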
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
|
||||
@@ -1,98 +0,0 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Snowflake\n",
|
||||
"\n",
|
||||
"This notebooks goes over how to load documents from Snowflake"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"! pip install snowflake-connector-python"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import settings as s\n",
|
||||
"from langchain.document_loaders import SnowflakeLoader"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"QUERY = \"select text, survey_id from CLOUD_DATA_SOLUTIONS.HAPPY_OR_NOT.OPEN_FEEDBACK limit 10\"\n",
|
||||
"snowflake_loader = SnowflakeLoader(\n",
|
||||
" query=QUERY,\n",
|
||||
" user=s.SNOWFLAKE_USER,\n",
|
||||
" password=s.SNOWFLAKE_PASS,\n",
|
||||
" account=s.SNOWFLAKE_ACCOUNT,\n",
|
||||
" warehouse=s.SNOWFLAKE_WAREHOUSE,\n",
|
||||
" role=s.SNOWFLAKE_ROLE,\n",
|
||||
" database=s.SNOWFLAKE_DATABASE,\n",
|
||||
" schema=s.SNOWFLAKE_SCHEMA\n",
|
||||
")\n",
|
||||
"snowflake_documents = snowflake_loader.load()\n",
|
||||
"print(snowflake_documents)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from snowflakeLoader import SnowflakeLoader\n",
|
||||
"import settings as s\n",
|
||||
"QUERY = \"select text, survey_id as source from CLOUD_DATA_SOLUTIONS.HAPPY_OR_NOT.OPEN_FEEDBACK limit 10\"\n",
|
||||
"snowflake_loader = SnowflakeLoader(\n",
|
||||
" query=QUERY,\n",
|
||||
" user=s.SNOWFLAKE_USER,\n",
|
||||
" password=s.SNOWFLAKE_PASS,\n",
|
||||
" account=s.SNOWFLAKE_ACCOUNT,\n",
|
||||
" warehouse=s.SNOWFLAKE_WAREHOUSE,\n",
|
||||
" role=s.SNOWFLAKE_ROLE,\n",
|
||||
" database=s.SNOWFLAKE_DATABASE,\n",
|
||||
" schema=s.SNOWFLAKE_SCHEMA,\n",
|
||||
" metadata_columns=['source']\n",
|
||||
")\n",
|
||||
"snowflake_documents = snowflake_loader.load()\n",
|
||||
"print(snowflake_documents)"
|
||||
]
|
||||
}
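Because `survey_id` was aliased to `source` and listed in `metadata_columns`, it ends up in each document's metadata rather than its page content; a quick sketch of inspecting it:

```python
# Each Document carries the survey_id as its "source" metadata field
for doc in snowflake_documents:
    print(doc.metadata["source"], doc.page_content[:80])
```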
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.9.1"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
|
||||
@@ -1,78 +0,0 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "22a849cc",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# XML\n",
|
||||
"\n",
|
||||
"The `UnstructuredXMLLoader` is used to load `XML` files. The loader works with `.xml` files. The page content will be the text extracted from the XML tags."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"id": "e6616e3a",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.document_loaders import UnstructuredXMLLoader"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"id": "a654e4d9",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"Document(page_content='United States\\n\\nWashington, DC\\n\\nJoe Biden\\n\\nBaseball\\n\\nCanada\\n\\nOttawa\\n\\nJustin Trudeau\\n\\nHockey\\n\\nFrance\\n\\nParis\\n\\nEmmanuel Macron\\n\\nSoccer\\n\\nTrinidad & Tobado\\n\\nPort of Spain\\n\\nKeith Rowley\\n\\nTrack & Field', metadata={'source': 'example_data/factbook.xml'})"
|
||||
]
|
||||
},
|
||||
"execution_count": 2,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"loader = UnstructuredXMLLoader(\n",
|
||||
" \"example_data/factbook.xml\",\n",
|
||||
")\n",
|
||||
"docs = loader.load()\n",
|
||||
"docs[0]"
|
||||
]
|
||||
},
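Like the other Unstructured-based loaders, `UnstructuredXMLLoader` accepts a `mode` argument; `mode="elements"` should return one document per extracted element instead of a single concatenated document. A sketch based on the shared Unstructured loader interface (this call is not shown in the notebook itself):

```python
# Assumes the common Unstructured loader interface: one Document per element
loader = UnstructuredXMLLoader("example_data/factbook.xml", mode="elements")
docs = loader.load()
docs[0].page_content  # e.g. 'United States'
```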
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "a54342bb",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.8.15"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
|
||||
@@ -1,296 +0,0 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "e48afb8d",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Loading documents from a YouTube url\n",
|
||||
"\n",
|
||||
"Building chat or QA applications on YouTube videos is a topic of high interest.\n",
|
||||
"\n",
|
||||
"Below we show how to easily go from a YouTube url to text to chat!\n",
|
||||
"\n",
|
||||
"We wil use the `OpenAIWhisperParser`, which will use the OpenAI Whisper API to transcribe audio to text.\n",
|
||||
"\n",
|
||||
"Note: You will need to have an `OPENAI_API_KEY` supplied."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"id": "5f34e934",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.document_loaders.generic import GenericLoader\n",
|
||||
"from langchain.document_loaders.parsers import OpenAIWhisperParser\n",
|
||||
"from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "85fc12bd",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We will use `yt_dlp` to download audio for YouTube urls.\n",
|
||||
"\n",
|
||||
"We will use `pydub` to split downloaded audio files (such that we adhere to Whisper API's 25MB file size limit)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "fb5a6606",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"! pip install yt_dlp\n",
|
||||
"! pip install pydub"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "b0e119f4",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### YouTube url to text\n",
|
||||
"\n",
|
||||
"Use `YoutubeAudioLoader` to fetch / download the audio files.\n",
|
||||
"\n",
|
||||
"Then, ues `OpenAIWhisperParser()` to transcribe them to text.\n",
|
||||
"\n",
|
||||
"Let's take the first lecture of Andrej Karpathy's YouTube course as an example! "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"id": "23e1e134",
|
||||
"metadata": {
|
||||
"scrolled": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"[youtube] Extracting URL: https://youtu.be/kCc8FmEb1nY\n",
|
||||
"[youtube] kCc8FmEb1nY: Downloading webpage\n",
|
||||
"[youtube] kCc8FmEb1nY: Downloading android player API JSON\n",
|
||||
"[info] kCc8FmEb1nY: Downloading 1 format(s): 140\n",
|
||||
"[dashsegments] Total fragments: 11\n",
|
||||
"[download] Destination: /Users/31treehaus/Desktop/AI/langchain-fork/docs/modules/indexes/document_loaders/examples/Let's build GPT: from scratch, in code, spelled out..m4a\n",
|
||||
"[download] 100% of 107.73MiB in 00:00:18 at 5.92MiB/s \n",
|
||||
"[FixupM4a] Correcting container of \"/Users/31treehaus/Desktop/AI/langchain-fork/docs/modules/indexes/document_loaders/examples/Let's build GPT: from scratch, in code, spelled out..m4a\"\n",
|
||||
"[ExtractAudio] Not converting audio /Users/31treehaus/Desktop/AI/langchain-fork/docs/modules/indexes/document_loaders/examples/Let's build GPT: from scratch, in code, spelled out..m4a; file is already in target format m4a\n",
|
||||
"[youtube] Extracting URL: https://youtu.be/VMj-3S1tku0\n",
|
||||
"[youtube] VMj-3S1tku0: Downloading webpage\n",
|
||||
"[youtube] VMj-3S1tku0: Downloading android player API JSON\n",
|
||||
"[info] VMj-3S1tku0: Downloading 1 format(s): 140\n",
|
||||
"[download] /Users/31treehaus/Desktop/AI/langchain-fork/docs/modules/indexes/document_loaders/examples/The spelled-out intro to neural networks and backpropagation: building micrograd.m4a has already been downloaded\n",
|
||||
"[download] 100% of 134.98MiB\n",
|
||||
"[ExtractAudio] Not converting audio /Users/31treehaus/Desktop/AI/langchain-fork/docs/modules/indexes/document_loaders/examples/The spelled-out intro to neural networks and backpropagation: building micrograd.m4a; file is already in target format m4a\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# Two Karpathy lecture videos\n",
|
||||
"urls = [\"https://youtu.be/kCc8FmEb1nY\",\n",
|
||||
" \"https://youtu.be/VMj-3S1tku0\"]\n",
|
||||
"\n",
|
||||
"# Directory to save audio files \n",
|
||||
"save_dir = \"~/Downloads/YouTube\"\n",
|
||||
"\n",
|
||||
"# Transcribe the videos to text\n",
|
||||
"loader = GenericLoader(YoutubeAudioLoader(urls,save_dir),OpenAIWhisperParser())\n",
|
||||
"docs = loader.load()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"id": "72a94fd8",
|
||||
"metadata": {
|
||||
"scrolled": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"\"Hello, my name is Andrej and I've been training deep neural networks for a bit more than a decade. And in this lecture I'd like to show you what neural network training looks like under the hood. So in particular we are going to start with a blank Jupyter notebook and by the end of this lecture we will define and train a neural net and you'll get to see everything that goes on under the hood and exactly sort of how that works on an intuitive level. Now specifically what I would like to do is I w\""
|
||||
]
|
||||
},
|
||||
"execution_count": 3,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# Returns a list of Documents, which can be easily viewed or parsed\n",
|
||||
"docs[0].page_content[0:500]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "93be6b49",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Building a chat app from YouTube video\n",
|
||||
"\n",
|
||||
"Given `Documents`, we can easily enable chat / question+answering."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"id": "1823f042",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.chains import RetrievalQA\n",
|
||||
"from langchain.vectorstores import FAISS\n",
|
||||
"from langchain.chat_models import ChatOpenAI\n",
|
||||
"from langchain.embeddings import OpenAIEmbeddings\n",
|
||||
"from langchain.text_splitter import RecursiveCharacterTextSplitter"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"id": "7257cda1",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Combine doc\n",
|
||||
"combined_docs = [doc.page_content for doc in docs]\n",
|
||||
"text = \" \".join(combined_docs)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"id": "147c0c55",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Split them\n",
|
||||
"text_splitter = RecursiveCharacterTextSplitter(chunk_size = 1500, chunk_overlap = 150)\n",
|
||||
"splits = text_splitter.split_text(text)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"id": "f3556703",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Build an index\n",
|
||||
"embeddings = OpenAIEmbeddings()\n",
|
||||
"vectordb = FAISS.from_texts(splits,embeddings)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 8,
|
||||
"id": "beaa99db",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Build a QA chain\n",
|
||||
"qa_chain = RetrievalQA.from_chain_type(llm = ChatOpenAI(model_name=\"gpt-3.5-turbo\", temperature=0),\n",
|
||||
" chain_type=\"stuff\",\n",
|
||||
" retriever=vectordb.as_retriever())"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 9,
|
||||
"id": "f2239a62",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"\"We need to zero out the gradient before backprop at each step because the backward pass accumulates gradients in the grad attribute of each parameter. If we don't reset the grad to zero before each backward pass, the gradients will accumulate and add up, leading to incorrect updates and slower convergence. By resetting the grad to zero before each backward pass, we ensure that the gradients are calculated correctly and that the optimization process works as intended.\""
|
||||
]
|
||||
},
|
||||
"execution_count": 9,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# Ask a question!\n",
|
||||
"query = \"Why do we need to zero out the gradient before backprop at each step?\"\n",
|
||||
"qa_chain.run(query)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 10,
|
||||
"id": "a8d01098",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'In the context of transformers, an encoder is a component that reads in a sequence of input tokens and generates a sequence of hidden representations. On the other hand, a decoder is a component that takes in a sequence of hidden representations and generates a sequence of output tokens. The main difference between the two is that the encoder is used to encode the input sequence into a fixed-length representation, while the decoder is used to decode the fixed-length representation into an output sequence. In machine translation, for example, the encoder reads in the source language sentence and generates a fixed-length representation, which is then used by the decoder to generate the target language sentence.'"
|
||||
]
|
||||
},
|
||||
"execution_count": 10,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"query = \"What is the difference between an encoder and decoder?\"\n",
|
||||
"qa_chain.run(query)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 11,
|
||||
"id": "fe1e77dd",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'For any token, x is the input vector that contains the private information of that token, k and q are the key and query vectors respectively, which are produced by forwarding linear modules on x, and v is the vector that is calculated by propagating the same linear module on x again. The key vector represents what the token contains, and the query vector represents what the token is looking for. The vector v is the information that the token will communicate to other tokens if it finds them interesting, and it gets aggregated for the purposes of the self-attention mechanism.'"
|
||||
]
|
||||
},
|
||||
"execution_count": 11,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"query = \"For any token, what are x, k, v, and q?\"\n",
|
||||
"qa_chain.run(query)"
|
||||
]
|
||||
}
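Embedding long transcripts isn't free, so it can be worth persisting the index between sessions; a minimal sketch using the FAISS wrapper's local save/load (the folder name is illustrative):

```python
# Persist the index, then reload it later without re-embedding the transcripts
vectordb.save_local("karpathy_lectures_index")
vectordb = FAISS.load_local("karpathy_lectures_index", embeddings)
```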
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.9.16"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
|
||||
@@ -1,90 +0,0 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# AWS Kendra\n",
|
||||
"\n",
|
||||
"> AWS Kendra is an intelligent search service provided by Amazon Web Services (AWS). It utilizes advanced natural language processing (NLP) and machine learning algorithms to enable powerful search capabilities across various data sources within an organization. Kendra is designed to help users find the information they need quickly and accurately, improving productivity and decision-making.\n",
|
||||
"\n",
|
||||
"> With Kendra, users can search across a wide range of content types, including documents, FAQs, knowledge bases, manuals, and websites. It supports multiple languages and can understand complex queries, synonyms, and contextual meanings to provide highly relevant search results."
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Using the AWS Kendra Index Retriever"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"#!pip install boto3"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import boto3\n",
|
||||
"from langchain.retrievers import AwsKendraIndexRetriever"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Create New Retriever"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"kclient = boto3.client('kendra', region_name=\"us-east-1\")\n",
|
||||
"\n",
|
||||
"retriever = AwsKendraIndexRetriever(\n",
|
||||
" kclient=kclient,\n",
|
||||
" kendraindex=\"kendraindex\",\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Now you can use retrieved documents from AWS Kendra Index"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"retriever.get_relevant_documents(\"what is langchain\")"
|
||||
]
|
||||
}
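The retriever drops into any retrieval-aware chain; a minimal sketch wiring it into `RetrievalQA` (assumes an OpenAI API key is configured):

```python
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

# Answer questions over documents retrieved from the Kendra index
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(temperature=0),
    chain_type="stuff",
    retriever=retriever,
)
qa.run("what is langchain")
```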
|
||||
],
|
||||
"metadata": {
|
||||
"language_info": {
|
||||
"name": "python"
|
||||
},
|
||||
"orig_nbformat": 4
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
|
||||
@@ -1,121 +0,0 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"id": "fc0db1bc",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# LOTR (Merger Retriever)\n",
|
||||
"\n",
|
||||
"Lord of the Retrievers, also known as MergerRetriever, takes a list of retrievers as input and merges the results of their get_relevant_documents() methods into a single list. The merged results will be a list of documents that are relevant to the query and that have been ranked by the different retrievers.\n",
|
||||
"\n",
|
||||
"The MergerRetriever class can be used to improve the accuracy of document retrieval in a number of ways. First, it can combine the results of multiple retrievers, which can help to reduce the risk of bias in the results. Second, it can rank the results of the different retrievers, which can help to ensure that the most relevant documents are returned first."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "9fbcc58f",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import os\n",
|
||||
"import chromadb\n",
|
||||
"from langchain.retrievers.merger_retriever import MergerRetriever\n",
|
||||
"from langchain.vectorstores import Chroma\n",
|
||||
"from langchain.embeddings import HuggingFaceEmbeddings\n",
|
||||
"from langchain.embeddings import OpenAIEmbeddings\n",
|
||||
"from langchain.document_transformers import EmbeddingsRedundantFilter\n",
|
||||
"from langchain.retrievers.document_compressors import DocumentCompressorPipeline\n",
|
||||
"from langchain.retrievers import ContextualCompressionRetriever\n",
|
||||
"\n",
|
||||
"# Get 3 diff embeddings.\n",
|
||||
"all_mini = HuggingFaceEmbeddings(model_name=\"all-MiniLM-L6-v2\")\n",
|
||||
"multi_qa_mini = HuggingFaceEmbeddings(model_name=\"multi-qa-MiniLM-L6-dot-v1\")\n",
|
||||
"filter_embeddings = OpenAIEmbeddings()\n",
|
||||
"\n",
|
||||
"ABS_PATH = os.path.dirname(os.path.abspath(__file__))\n",
|
||||
"DB_DIR = os.path.join(ABS_PATH, \"db\")\n",
|
||||
"\n",
|
||||
"# Instantiate 2 diff cromadb indexs, each one with a diff embedding.\n",
|
||||
"client_settings = chromadb.config.Settings(\n",
|
||||
" chroma_db_impl=\"duckdb+parquet\",\n",
|
||||
" persist_directory=DB_DIR,\n",
|
||||
" anonymized_telemetry=False,\n",
|
||||
")\n",
|
||||
"db_all = Chroma(\n",
|
||||
" collection_name=\"project_store_all\",\n",
|
||||
" persist_directory=DB_DIR,\n",
|
||||
" client_settings=client_settings,\n",
|
||||
" embedding_function=all_mini,\n",
|
||||
")\n",
|
||||
"db_multi_qa = Chroma(\n",
|
||||
" collection_name=\"project_store_multi\",\n",
|
||||
" persist_directory=DB_DIR,\n",
|
||||
" client_settings=client_settings,\n",
|
||||
" embedding_function=multi_qa_mini,\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"# Define 2 diff retrievers with 2 diff embeddings and diff search type.\n",
|
||||
"retriever_all = db_all.as_retriever(\n",
|
||||
" search_type=\"similarity\", search_kwargs={\"k\": 5, \"include_metadata\": True}\n",
|
||||
")\n",
|
||||
"retriever_multi_qa = db_multi_qa.as_retriever(\n",
|
||||
" search_type=\"mmr\", search_kwargs={\"k\": 5, \"include_metadata\": True}\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"# The Lord of the Retrievers will hold the ouput of boths retrievers and can be used as any other \n",
|
||||
"# retriever on different types of chains.\n",
|
||||
"lotr = MergerRetriever(retrievers=[retriever_all, retriever_multi_qa])\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"id": "c152339d",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Remove redundant results from the merged retrievers."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "039faea6",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"\n",
|
||||
"# We can remove redundant results from both retrievers using yet another embedding. \n",
|
||||
"# Using multiples embeddings in diff steps could help reduce biases.\n",
|
||||
"filter = EmbeddingsRedundantFilter(embeddings=filter_embeddings)\n",
|
||||
"pipeline = DocumentCompressorPipeline(transformers=[filter])\n",
|
||||
"compression_retriever = ContextualCompressionRetriever(\n",
|
||||
" base_compressor=pipeline, base_retriever=lotr\n",
|
||||
")"
|
||||
]
|
||||
}
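The compression retriever behaves like any other retriever; a one-line usage sketch (the query string is illustrative):

```python
docs = compression_retriever.get_relevant_documents("What does the project store contain?")
```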
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.10.6"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
|
||||
@@ -99,14 +99,13 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"id": "2d958271",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Similarity Score Threshold Retrieval\n",
|
||||
"\n",
|
||||
"You can also use a retrieval method that sets a similarity score threshold and only returns documents with a score above that threshold"
|
||||
"You can also a retrieval method that sets a similarity score threshold and only returns documents with a score above that threshold"
|
||||
]
|
||||
},
|
||||
{
|
||||
|
||||
@@ -1,145 +0,0 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "70e9b619",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# MarkdownHeaderTextSplitter\n",
|
||||
"\n",
|
||||
"This splits a markdown file by a specified set of headers. For example, if we want to split this markdown:\n",
|
||||
"```\n",
|
||||
"md = '# Foo\\n\\n ## Bar\\n\\nHi this is Jim \\nHi this is Joe\\n\\n ## Baz\\n\\n Hi this is Molly' \n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"Headers to split on:\n",
|
||||
"```\n",
|
||||
"[(\"#\", \"Header 1\"),(\"##\", \"Header 2\")]\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"Expected output:\n",
|
||||
"```\n",
|
||||
"{'content': 'Hi this is Jim \\nHi this is Joe', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar'}}\n",
|
||||
"{'content': 'Hi this is Molly', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Baz'}}\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"Optionally, this also includes `return_each_line` in case a user want to perform other types of aggregation. \n",
|
||||
"\n",
|
||||
"If `return_each_line=True`, each line and associated header metadata are simply returned. "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"id": "19c044f0",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.text_splitter import MarkdownHeaderTextSplitter"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 9,
|
||||
"id": "2ae3649b",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"{'content': 'Hi this is Jim \\nHi this is Joe', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar'}}\n",
|
||||
"{'content': 'Hi this is Lance', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}}\n",
|
||||
"{'content': 'Hi this is Molly', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Baz'}}\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"markdown_document = '# Foo\\n\\n ## Bar\\n\\nHi this is Jim\\n\\nHi this is Joe\\n\\n ### Boo \\n\\n Hi this is Lance \\n\\n ## Baz\\n\\n Hi this is Molly' \n",
|
||||
" \n",
|
||||
"headers_to_split_on = [\n",
|
||||
" (\"#\", \"Header 1\"),\n",
|
||||
" (\"##\", \"Header 2\"),\n",
|
||||
" (\"###\", \"Header 3\"),\n",
|
||||
"]\n",
|
||||
"\n",
|
||||
"markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)\n",
|
||||
"splits = markdown_splitter.split_text(markdown_document)\n",
|
||||
"for split in splits:\n",
|
||||
" print(split)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "2a32026a",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Here's an example on a larger file with `return_each_line=True` passed, allowing each line to be examined."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 10,
|
||||
"id": "8af8f9a2",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"{'content': 'Markdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.[9]', 'metadata': {'Header 1': 'Intro', 'Header 2': 'History'}}\n",
|
||||
"{'content': 'Markdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files.', 'metadata': {'Header 1': 'Intro', 'Header 2': 'History'}}\n",
|
||||
"{'content': 'As Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for', 'metadata': {'Header 1': 'Intro', 'Header 2': 'Rise and divergence'}}\n",
|
||||
"{'content': 'additional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks.', 'metadata': {'Header 1': 'Intro', 'Header 2': 'Rise and divergence'}}\n",
|
||||
"{'content': 'From 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort.', 'metadata': {'Header 1': 'Intro', 'Header 2': 'Rise and divergence', 'Header 4': 'Standardization'}}\n",
|
||||
"{'content': 'Implementations of Markdown are available for over a dozen programming languages.', 'metadata': {'Header 1': 'Intro', 'Header 2': 'Implementations'}}\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"markdown_document = '# Intro \\n\\n ## History \\n\\n Markdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.[9] \\n\\n Markdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files. \\n\\n ## Rise and divergence \\n\\n As Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for \\n\\n additional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks. \\n\\n #### Standardization \\n\\n From 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort. \\n\\n ## Implementations \\n\\n Implementations of Markdown are available for over a dozen programming languages.'\n",
|
||||
" \n",
|
||||
"headers_to_split_on = [\n",
|
||||
" (\"#\", \"Header 1\"),\n",
|
||||
" (\"##\", \"Header 2\"),\n",
|
||||
" (\"###\", \"Header 3\"),\n",
|
||||
" (\"####\", \"Header 4\"),\n",
|
||||
"]\n",
|
||||
" \n",
|
||||
"markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on,return_each_line=True)\n",
|
||||
"splits = markdown_splitter.split_text(markdown_document)\n",
|
||||
"for line in splits:\n",
|
||||
" print(line)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "987183f2",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.9.16"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
|
||||
@@ -12,8 +12,7 @@
|
||||
"\n",
|
||||
"- `length_function`: how the length of chunks is calculated. Defaults to just counting number of characters, but it's pretty common to pass a token counter here.\n",
|
||||
"- `chunk_size`: the maximum size of your chunks (as measured by the length function).\n",
|
||||
"- `chunk_overlap`: the maximum overlap between chunks. It can be nice to have some overlap to maintain some continuity between chunks (eg do a sliding window).\n",
|
||||
"- `add_start_index` : wether to include the starting position of each chunk within the original document in the metadata. "
|
||||
"- `chunk_overlap`: the maximum overlap between chunks. It can be nice to have some overlap to maintain some continuity between chunks (eg do a sliding window)."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -50,7 +49,6 @@
|
||||
" chunk_size = 100,\n",
|
||||
" chunk_overlap = 20,\n",
|
||||
" length_function = len,\n",
|
||||
" add_start_index = True,\n",
|
||||
")"
|
||||
]
|
||||
},
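For completeness, here is a self-contained sketch (not from the notebook) wiring the parameters above together. It assumes the `RecursiveCharacterTextSplitter` used elsewhere in this guide and a hypothetical local copy of the speech file:

```python
# Sketch: split a document and keep each chunk's offset via add_start_index.
from langchain.text_splitter import RecursiveCharacterTextSplitter

with open("state_of_the_union.txt") as f:  # hypothetical path
    state_of_the_union = f.read()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,        # max chunk size, as measured by length_function
    chunk_overlap=20,      # sliding-window overlap between adjacent chunks
    length_function=len,   # count plain characters
    add_start_index=True,  # store each chunk's start offset in metadata
)
texts = text_splitter.create_documents([state_of_the_union])
print(texts[0])  # metadata contains {'start_index': 0}
```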
|
||||
@@ -64,8 +62,8 @@
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and' metadata={'start_index': 0}\n",
|
||||
"page_content='of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.' metadata={'start_index': 82}\n"
|
||||
"page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and' lookup_str='' metadata={} lookup_index=0\n",
|
||||
"page_content='of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.' lookup_str='' metadata={} lookup_index=0\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
@@ -92,7 +90,7 @@
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.9.16"
|
||||
"version": "3.9.1"
|
||||
},
|
||||
"vscode": {
|
||||
"interpreter": {
|
||||
|
||||
@@ -1,194 +0,0 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "833c4789",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# AwaDB\n",
|
||||
"[AwaDB](https://github.com/awa-ai/awadb) is an AI Native database for the search and storage of embedding vectors used by LLM Applications.\n",
|
||||
"This notebook shows how to use functionality related to the AwaDB."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "252930ea",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"!pip install awadb"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "f2b71a47",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.text_splitter import CharacterTextSplitter\n",
|
||||
"from langchain.vectorstores import AwaDB\n",
|
||||
"from langchain.document_loaders import TextLoader"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "49be0bac",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"loader = TextLoader('../../../state_of_the_union.txt')\n",
|
||||
"documents = loader.load()\n",
|
||||
"text_splitter = CharacterTextSplitter(chunk_size= 100, chunk_overlap=0)\n",
|
||||
"docs = text_splitter.split_documents(documents)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "18714278",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"db = AwaDB.from_documents(docs)\n",
|
||||
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
|
||||
"docs = db.similarity_search(query)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "62b7a4c5",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"print(docs[0].page_content)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "a9b4be48",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "87fec6b5",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Similarity search with score"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "17231924",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The returned distance score is between 0-1. 0 is dissimilar, 1 is the most similar"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "f40ddae1",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"docs = db.similarity_search_with_score(query)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "f0045583",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"print(docs[0])"
|
||||
]
|
||||
},
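Because the score is a 0-1 similarity, a simple post-filter can drop weak matches. A minimal sketch (not in the original notebook), assuming the `db` and `query` defined above; the `0.5` cutoff is an arbitrary illustration:

```python
# Sketch: keep only results whose similarity score clears a threshold.
results = db.similarity_search_with_score(query)
threshold = 0.5  # hypothetical cutoff; tune for your data
for doc, score in results:
    if score >= threshold:
        print(f"{score:.3f}: {doc.page_content[:80]}")
```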
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "8c2da99d",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"(Document(page_content='And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': '../../../state_of_the_union.txt'}), 0.561813814013747)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "0b49fb59",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Restore the table created and added data before"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "1bfa6e25",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"AwaDB automatically persists added document data"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "2a0f3b35",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"If you can restore the table you created and added before, you can just do this as below:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "1fd4b5b0",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"awadb_client = awadb.Client()\n",
|
||||
"ret = awadb_client.Load('langchain_awadb')\n",
|
||||
"if ret : print('awadb load table success')\n",
|
||||
"else:\n",
|
||||
" print('awadb load table failed')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "5ae9a9dd",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"awadb load table success"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.11.1"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
|
||||
@@ -1,245 +0,0 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Azure Cognitive Search"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Install Azure Cognitive Search SDK"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"!pip install --index-url=https://pkgs.dev.azure.com/azure-sdk/public/_packaging/azure-sdk-for-python/pypi/simple/ azure-search-documents==11.4.0a20230509004\n",
|
||||
"!pip install azure-identity"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Import required libraries"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import os, json\n",
|
||||
"import openai\n",
|
||||
"from dotenv import load_dotenv\n",
|
||||
"from langchain.embeddings.openai import OpenAIEmbeddings\n",
|
||||
"from langchain.schema import BaseRetriever\n",
|
||||
"from langchain.vectorstores.azuresearch import AzureSearch"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Configure OpenAI settings\n",
|
||||
"Configure the OpenAI settings to use Azure OpenAI or OpenAI"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 8,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Load environment variables from a .env file using load_dotenv():\n",
|
||||
"load_dotenv()\n",
|
||||
"\n",
|
||||
"openai.api_type = \"azure\"\n",
|
||||
"openai.api_base = \"YOUR_OPENAI_ENDPOINT\"\n",
|
||||
"openai.api_version = \"2023-05-15\"\n",
|
||||
"openai.api_key = \"YOUR_OPENAI_API_KEY\"\n",
|
||||
"model: str = \"text-embedding-ada-002\""
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Configure vector store settings\n",
|
||||
" \n",
|
||||
"Set up the vector store settings using environment variables:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 10,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"vector_store_address: str = 'YOUR_AZURE_SEARCH_ENDPOINT'\n",
|
||||
"vector_store_password: str = 'YOUR_AZURE_SEARCH_ADMIN_KEY'\n",
|
||||
"index_name: str = \"langchain-vector-demo\""
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Create embeddings and vector store instances\n",
|
||||
" \n",
|
||||
"Create instances of the OpenAIEmbeddings and AzureSearch classes:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 11,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"embeddings: OpenAIEmbeddings = OpenAIEmbeddings(model=model, chunk_size=1) \n",
|
||||
"vector_store: AzureSearch = AzureSearch(azure_search_endpoint=vector_store_address, \n",
|
||||
" azure_search_key=vector_store_password, \n",
|
||||
" index_name=index_name, \n",
|
||||
" embedding_function=embeddings.embed_query) \n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Insert text and embeddings into vector store\n",
|
||||
" \n",
|
||||
"Add texts and metadata from the JSON data to the vector store:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 12,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.document_loaders import TextLoader\n",
|
||||
"from langchain.text_splitter import CharacterTextSplitter\n",
|
||||
"loader = TextLoader('../../../state_of_the_union.txt', encoding='utf-8')\n",
|
||||
"\n",
|
||||
"documents = loader.load()\n",
|
||||
"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
|
||||
"docs = text_splitter.split_documents(documents)\n",
|
||||
"\n",
|
||||
"vector_store.add_documents(documents=docs)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Perform a vector similarity search\n",
|
||||
" \n",
|
||||
"Execute a pure vector similarity search using the similarity_search() method:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 13,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
|
||||
"\n",
|
||||
"Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n",
|
||||
"\n",
|
||||
"One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
|
||||
"\n",
|
||||
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# Perform a similarity search\n",
|
||||
"docs = vector_store.similarity_search(query=\"What did the president say about Ketanji Brown Jackson\", k=3, search_type='similarity')\n",
|
||||
"print(docs[0].page_content)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Perform a Hybrid Search\n",
|
||||
"\n",
|
||||
"Execute hybrid search using the hybrid_search() method:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 14,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
|
||||
"\n",
|
||||
"Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n",
|
||||
"\n",
|
||||
"One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
|
||||
"\n",
|
||||
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# Perform a hybrid search \n",
|
||||
"docs = vector_store.similarity_search(query=\"What did the president say about Ketanji Brown Jackson\", k=3)\n",
|
||||
"print(docs[0].page_content)"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.9.13 ('.venv': venv)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.11.3"
|
||||
},
|
||||
"orig_nbformat": 4,
|
||||
"vscode": {
|
||||
"interpreter": {
|
||||
"hash": "645053d6307d413a1a75681b5ebb6449bb2babba4bcb0bf65a1ddc3dbefb108a"
|
||||
}
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
|
||||
@@ -40,12 +40,20 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"execution_count": 10,
|
||||
"id": "47f9b495-88f1-4286-8d5d-1416103931a7",
|
||||
"metadata": {
|
||||
"tags": []
|
||||
},
|
||||
"outputs": [],
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"OpenAI API Key: ········\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"import os\n",
|
||||
"import getpass\n",
|
||||
@@ -58,7 +66,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"execution_count": 2,
|
||||
"id": "aac9563e",
|
||||
"metadata": {
|
||||
"tags": []
|
||||
@@ -73,7 +81,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 10,
|
||||
"execution_count": 7,
|
||||
"id": "a3c3999a",
|
||||
"metadata": {
|
||||
"tags": []
|
||||
@@ -91,7 +99,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 11,
|
||||
"execution_count": 8,
|
||||
"id": "5eabdb75",
|
||||
"metadata": {
|
||||
"tags": []
|
||||
@@ -106,7 +114,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 12,
|
||||
"execution_count": 9,
|
||||
"id": "4b172de8",
|
||||
"metadata": {
|
||||
"tags": []
|
||||
@@ -142,7 +150,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 13,
|
||||
"execution_count": 6,
|
||||
"id": "186ee1d8",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
@@ -152,18 +160,18 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 14,
|
||||
"execution_count": 7,
|
||||
"id": "284e04b5",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"(Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \\n\\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \\n\\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \\n\\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': '../../../state_of_the_union.txt'}),\n",
|
||||
" 0.36913747)"
|
||||
"(Document(page_content='In state after state, new laws have been passed, not only to suppress the vote, but to subvert entire elections. \\n\\nWe cannot let this happen. \\n\\nTonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \\n\\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \\n\\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \\n\\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', lookup_str='', metadata={'source': '../../state_of_the_union.txt'}, lookup_index=0),\n",
|
||||
" 0.3914415)"
|
||||
]
|
||||
},
|
||||
"execution_count": 14,
|
||||
"execution_count": 7,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
@@ -183,7 +191,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 15,
|
||||
"execution_count": 9,
|
||||
"id": "b558ebb7",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
@@ -204,7 +212,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 16,
|
||||
"execution_count": 10,
|
||||
"id": "428a6816",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
@@ -214,7 +222,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 17,
|
||||
"execution_count": 11,
|
||||
"id": "56d1841c",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
@@ -224,7 +232,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 18,
|
||||
"execution_count": 12,
|
||||
"id": "39055525",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
@@ -234,17 +242,17 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 19,
|
||||
"execution_count": 13,
|
||||
"id": "98378c4e",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \\n\\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \\n\\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \\n\\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': '../../../state_of_the_union.txt'})"
|
||||
"Document(page_content='In state after state, new laws have been passed, not only to suppress the vote, but to subvert entire elections. \\n\\nWe cannot let this happen. \\n\\nTonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \\n\\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \\n\\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \\n\\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', lookup_str='', metadata={'source': '../../state_of_the_union.txt'}, lookup_index=0)"
|
||||
]
|
||||
},
|
||||
"execution_count": 19,
|
||||
"execution_count": 13,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
@@ -265,7 +273,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 20,
|
||||
"execution_count": 6,
|
||||
"id": "6dfd2b78",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
@@ -276,17 +284,17 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 21,
|
||||
"execution_count": 8,
|
||||
"id": "29960da7",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"{'068c473b-d420-487a-806b-fb0ccea7f711': Document(page_content='foo', metadata={})}"
|
||||
"{'e0b74348-6c93-4893-8764-943139ec1d17': Document(page_content='foo', lookup_str='', metadata={}, lookup_index=0)}"
|
||||
]
|
||||
},
|
||||
"execution_count": 21,
|
||||
"execution_count": 8,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
@@ -297,17 +305,17 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 22,
|
||||
"execution_count": 9,
|
||||
"id": "83392605",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"{'807e0c63-13f6-4070-9774-5c6f0fbb9866': Document(page_content='bar', metadata={})}"
|
||||
"{'bdc50ae3-a1bb-4678-9260-1b0979578f40': Document(page_content='bar', lookup_str='', metadata={}, lookup_index=0)}"
|
||||
]
|
||||
},
|
||||
"execution_count": 22,
|
||||
"execution_count": 9,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
@@ -318,7 +326,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 23,
|
||||
"execution_count": 10,
|
||||
"id": "a3fcc1c7",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
@@ -328,18 +336,18 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 24,
|
||||
"execution_count": 11,
|
||||
"id": "41c51f89",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"{'068c473b-d420-487a-806b-fb0ccea7f711': Document(page_content='foo', metadata={}),\n",
|
||||
" '807e0c63-13f6-4070-9774-5c6f0fbb9866': Document(page_content='bar', metadata={})}"
|
||||
"{'e0b74348-6c93-4893-8764-943139ec1d17': Document(page_content='foo', lookup_str='', metadata={}, lookup_index=0),\n",
|
||||
" 'd5211050-c777-493d-8825-4800e74cfdb6': Document(page_content='bar', lookup_str='', metadata={}, lookup_index=0)}"
|
||||
]
|
||||
},
|
||||
"execution_count": 24,
|
||||
"execution_count": 11,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
@@ -348,140 +356,13 @@
|
||||
"db1.docstore._dict"
|
||||
]
|
||||
},
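For context, the merge being inspected above can be reproduced with a short sketch (not from the notebook), assuming two FAISS stores built with the same `embeddings`:

```python
# Sketch: merge one FAISS store into another, then inspect the docstore.
from langchain.vectorstores import FAISS

db1 = FAISS.from_texts(["foo"], embeddings)
db2 = FAISS.from_texts(["bar"], embeddings)
db1.merge_from(db2)
print(db1.docstore._dict)  # now holds both 'foo' and 'bar' documents
```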
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"id": "f4294b96",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Similarity Search with filtering\n",
|
||||
"FAISS vectorstore can also support filtering, since the FAISS does not natively support filtering we have to do it manually. This is done by first fetching more results than `k` and then filtering them. You can filter the documents based on metadata. You can also set the `fetch_k` parameter when calling any search method to set how many documents you want to fetch before filtering. Here is a small example:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 25,
|
||||
"id": "d5bf812c",
|
||||
"execution_count": null,
|
||||
"id": "f80b60de",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Content: foo, Metadata: {'page': 1}, Score: 5.159960813797904e-15\n",
|
||||
"Content: foo, Metadata: {'page': 2}, Score: 5.159960813797904e-15\n",
|
||||
"Content: foo, Metadata: {'page': 3}, Score: 5.159960813797904e-15\n",
|
||||
"Content: foo, Metadata: {'page': 4}, Score: 5.159960813797904e-15\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from langchain.schema import Document\n",
|
||||
"list_of_documents = [\n",
|
||||
" Document(page_content=\"foo\", metadata=dict(page=1)),\n",
|
||||
" Document(page_content=\"bar\", metadata=dict(page=1)),\n",
|
||||
" Document(page_content=\"foo\", metadata=dict(page=2)),\n",
|
||||
" Document(page_content=\"barbar\", metadata=dict(page=2)),\n",
|
||||
" Document(page_content=\"foo\", metadata=dict(page=3)),\n",
|
||||
" Document(page_content=\"bar burr\", metadata=dict(page=3)),\n",
|
||||
" Document(page_content=\"foo\", metadata=dict(page=4)),\n",
|
||||
" Document(page_content=\"bar bruh\", metadata=dict(page=4))\n",
|
||||
"]\n",
|
||||
"db = FAISS.from_documents(list_of_documents, embeddings)\n",
|
||||
"results_with_scores = db.similarity_search_with_score(\"foo\")\n",
|
||||
"for doc, score in results_with_scores:\n",
|
||||
" print(f\"Content: {doc.page_content}, Metadata: {doc.metadata}, Score: {score}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"id": "3d33c126",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Now we make the same query call but we filter for only `page = 1` "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 26,
|
||||
"id": "83159330",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Content: foo, Metadata: {'page': 1}, Score: 5.159960813797904e-15\n",
|
||||
"Content: bar, Metadata: {'page': 1}, Score: 0.3131446838378906\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"results_with_scores = db.similarity_search_with_score(\"foo\", filter=dict(page=1))\n",
|
||||
"for doc, score in results_with_scores:\n",
|
||||
" print(f\"Content: {doc.page_content}, Metadata: {doc.metadata}, Score: {score}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"id": "0be136e0",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Same thing can be done with the `max_marginal_relevance_search` as well."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 27,
|
||||
"id": "432c6980",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Content: foo, Metadata: {'page': 1}\n",
|
||||
"Content: bar, Metadata: {'page': 1}\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"results = db.max_marginal_relevance_search(\"foo\", filter=dict(page=1))\n",
|
||||
"for doc in results:\n",
|
||||
" print(f\"Content: {doc.page_content}, Metadata: {doc.metadata}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"id": "1b4ecd86",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Here is an example of how to set `fetch_k` parameter when calling `similarity_search`. Usually you would want the `fetch_k` parameter >> `k` parameter. This is because the `fetch_k` parameter is the number of documents that will be fetched before filtering. If you set `fetch_k` to a low number, you might not get enough documents to filter from."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 28,
|
||||
"id": "1fd60fd1",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Content: foo, Metadata: {'page': 1}, Score: 5.159960813797904e-15\n",
|
||||
"Content: bar, Metadata: {'page': 1}, Score: 0.3131446838378906\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"results = db.similarity_search(\"foo\", filter=dict(page=1), k=1, fetch_k=4)\n",
|
||||
"for doc, score in results_with_scores:\n",
|
||||
" print(f\"Content: {doc.page_content}, Metadata: {doc.metadata}, Score: {score}\")"
|
||||
]
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
@@ -500,7 +381,7 @@
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.9.16"
|
||||
"version": "3.10.6"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
Binary file not shown.
@@ -1,157 +0,0 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Hologres\n",
|
||||
"\n",
|
||||
">[Hologres](https://www.alibabacloud.com/help/en/hologres/latest/introduction) is a unified real-time data warehousing service developed by Alibaba Cloud. You can use Hologres to write, update, process, and analyze large amounts of data in real time. \n",
|
||||
">Hologres supports standard SQL syntax, is compatible with PostgreSQL, and supports most PostgreSQL functions. Hologres supports online analytical processing (OLAP) and ad hoc analysis for up to petabytes of data, and provides high-concurrency and low-latency online data services. \n",
|
||||
"\n",
|
||||
">Hologres provides **vector database** functionality by adopting [Proxima](https://www.alibabacloud.com/help/en/hologres/latest/vector-processing).\n",
|
||||
">Proxima is a high-performance software library developed by Alibaba DAMO Academy. It allows you to search for the nearest neighbors of vectors. Proxima provides higher stability and performance than similar open source software such as Faiss. Proxima allows you to search for similar text or image embeddings with high throughput and low latency. Hologres is deeply integrated with Proxima to provide a high-performance vector search service.\n",
|
||||
"\n",
|
||||
"This notebook shows how to use functionality related to the `Hologres Proxima` vector database.\n",
|
||||
"Click [here](https://www.alibabacloud.com/zh/product/hologres) to fast deploy a Hologres cloud instance."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"metadata": {
|
||||
"tags": []
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.embeddings.openai import OpenAIEmbeddings\n",
|
||||
"from langchain.text_splitter import CharacterTextSplitter\n",
|
||||
"from langchain.vectorstores import Hologres"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Split documents and get embeddings by call OpenAI API"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.document_loaders import TextLoader\n",
|
||||
"\n",
|
||||
"loader = TextLoader(\"../../../state_of_the_union.txt\")\n",
|
||||
"documents = loader.load()\n",
|
||||
"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
|
||||
"docs = text_splitter.split_documents(documents)\n",
|
||||
"\n",
|
||||
"embeddings = OpenAIEmbeddings()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Connect to Hologres by setting related ENVIRONMENTS.\n",
|
||||
"```\n",
|
||||
"export PG_HOST={host}\n",
|
||||
"export PG_PORT={port} # Optional, default is 80\n",
|
||||
"export PG_DATABASE={db_name} # Optional, default is postgres\n",
|
||||
"export PG_USER={username}\n",
|
||||
"export PG_PASSWORD={password}\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"Then store your embeddings and documents into Hologres"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import os\n",
|
||||
"\n",
|
||||
"connection_string = Hologres.connection_string_from_db_params(\n",
|
||||
" host=os.environ.get(\"PGHOST\", \"localhost\"),\n",
|
||||
" port=int(os.environ.get(\"PGPORT\", \"80\")),\n",
|
||||
" database=os.environ.get(\"PGDATABASE\", \"postgres\"),\n",
|
||||
" user=os.environ.get(\"PGUSER\", \"postgres\"),\n",
|
||||
" password=os.environ.get(\"PGPASSWORD\", \"postgres\"),\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"vector_db = Hologres.from_documents(\n",
|
||||
" docs,\n",
|
||||
" embeddings,\n",
|
||||
" connection_string=connection_string,\n",
|
||||
" table_name=\"langchain_example_embeddings\",\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Query and retrieve data"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
|
||||
"docs = vector_db.similarity_search(query)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
|
||||
"\n",
|
||||
"Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n",
|
||||
"\n",
|
||||
"One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
|
||||
"\n",
|
||||
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"print(docs[0].page_content)"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.9.16"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 4
|
||||
}
|
||||
@@ -5,18 +5,16 @@
|
||||
"id": "683953b3",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Commented out until further notice\n",
|
||||
"# MongoDB Atlas Vector Search\n",
|
||||
"\n",
|
||||
"MongoDB Atlas Vector Search\n",
|
||||
">[MongoDB Atlas](https://www.mongodb.com/docs/atlas/) is a document database managed in the cloud. It also enables Lucene and its vector search feature.\n",
|
||||
"\n",
|
||||
">[MongoDB Atlas](https://www.mongodb.com/docs/atlas/) is a fully-managed cloud database available in AWS , Azure, and GCP. It now has support for native Vector Search on your MongoDB document data.\n",
|
||||
"This notebook shows how to use the functionality related to the `MongoDB Atlas Vector Search` feature where you can store your embeddings in MongoDB documents and create a Lucene vector index to perform a KNN search.\n",
|
||||
"\n",
|
||||
"This notebook shows how to use `MongoDB Atlas Vector Search` to store your embeddings in MongoDB documents, create a vector search index, and perform KNN search with an approximate nearest neighbor algorithm.\n",
|
||||
"It uses the [knnBeta Operator](https://www.mongodb.com/docs/atlas/atlas-search/knn-beta) available in MongoDB Atlas Search. This feature is in early access and available only for evaluation purposes, to validate functionality, and to gather feedback from a small closed group of early access users. It is not recommended for production deployments as we may introduce breaking changes.\n",
|
||||
"\n",
|
||||
"It uses the [knnBeta Operator](https://www.mongodb.com/docs/atlas/atlas-search/knn-beta) available in MongoDB Atlas Search. This feature is in Public Preview and available for evaluation purposes, to validate functionality, and to gather feedback from public preview users. It is not recommended for production deployments as we may introduce breaking changes.\n",
|
||||
"\n",
|
||||
"To use MongoDB Atlas, you must first deploy a cluster. We have a Forever-Free tier of clusters available. \n",
|
||||
"To get started head over to Atlas here: [quick start](https://www.mongodb.com/docs/atlas/getting-started/)."
|
||||
"To use MongoDB Atlas, you must have first deployed a cluster. Free clusters are available. \n",
|
||||
"Here is the MongoDB Atlas [quick start](https://www.mongodb.com/docs/atlas/getting-started/)."
|
||||
]
|
||||
},
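The ingestion step itself is not shown in this hunk. As a rough sketch (names are assumptions: `docs` from a text splitter, plus the `collection` and `langchain_demo` index defined below), documents can be embedded and inserted in one call:

```python
# Sketch: embed documents and store them in an Atlas collection (assumed setup).
from langchain.vectorstores import MongoDBAtlasVectorSearch
from langchain.embeddings.openai import OpenAIEmbeddings

vectorstore = MongoDBAtlasVectorSearch.from_documents(
    docs,                         # pre-split documents (assumed)
    OpenAIEmbeddings(),
    collection=collection,        # pymongo collection created below
    index_name="langchain_demo",  # the Atlas Search index defined below
)
```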
|
||||
{
|
||||
@@ -39,29 +37,16 @@
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import os\n",
|
||||
"import getpass\n",
|
||||
"\n",
|
||||
"MONGODB_ATLAS_CLUSTER_URI = getpass.getpass('MongoDB Atlas Cluster URI:')\n",
|
||||
"MONGODB_ATLAS_CLUSTER_URI = os.environ['MONGODB_ATLAS_CLUSTER_URI']"
|
||||
"MONGODB_ATLAS_URI = os.environ['MONGODB_ATLAS_URI']"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "457ace44-1d95-4001-9dd5-78811ab208ad",
|
||||
"id": "320af802-9271-46ee-948f-d2453933d44b",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We want to use `OpenAIEmbeddings` so we need to set up our OpenAI API Key. "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "2d8f240d",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')\n",
|
||||
"OPENAI_API_KEY = os.environ['OPENAI_API_KEY']"
|
||||
"We want to use `OpenAIEmbeddings` so we have to get the OpenAI API Key. Make sure the environment variable `OPENAI_API_KEY` is set up before proceeding."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -69,8 +54,8 @@
|
||||
"id": "1f3ecc42",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Now, let's create a vector search index on your cluster. In the below example, `embedding` is the name of the field that contains the embedding vector. Please refer to the [documentation](https://www.mongodb.com/docs/atlas/atlas-search/define-field-mappings-for-vector-search) to get more details on how to define an Atlas Vector Search index.\n",
|
||||
"You can name the index `langchain_demo` and create the index on the namespace `lanchain_db.langchain_col`. Finally, write the following definition in the JSON editor on MongoDB Atlas:\n",
|
||||
"Now, let's create a Lucene vector index on your cluster. In the below example, `embedding` is the name of the field that contains the embedding vector. Please refer to the [documentation](https://www.mongodb.com/docs/atlas/atlas-search/define-field-mappings-for-vector-search) to get more details on how to define an Atlas Search index.\n",
|
||||
"You can name the index `langchain_demo` and create the index on the namespace `lanchain_db.langchain_col`. Finally, write the following definition in the JSON editor:\n",
|
||||
"\n",
|
||||
"```json\n",
|
||||
"{\n",
|
||||
@@ -129,7 +114,7 @@
|
||||
"from pymongo import MongoClient\n",
|
||||
"\n",
|
||||
"# initialize MongoDB python client\n",
|
||||
"client = MongoClient(MONGODB_ATLAS_CLUSTER_URI)\n",
|
||||
"client = MongoClient(MONGODB_ATLAS_CONNECTION_STRING)\n",
|
||||
"\n",
|
||||
"db_name = \"lanchain_db\"\n",
|
||||
"collection_name = \"langchain_col\"\n",
|
||||
@@ -158,47 +143,6 @@
|
||||
"source": [
|
||||
"print(docs[0].page_content)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "851a2ec9-9390-49a4-8412-3e132c9f789d",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"You can reuse the vector search index you created, make sure the `OPENAI_API_KEY` environment variable is set up, then execute another query."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "6336fe79-3e73-48be-b20a-0ff1bb6a4399",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from pymongo import MongoClient\n",
|
||||
"from langchain.vectorstores import MongoDBAtlasVectorSearch\n",
|
||||
"from langchain.embeddings.openai import OpenAIEmbeddings\n",
|
||||
"import os\n",
|
||||
"\n",
|
||||
"MONGODB_ATLAS_URI = os.environ['MONGODB_ATLAS_URI']\n",
|
||||
"\n",
|
||||
"# initialize MongoDB python client\n",
|
||||
"client = MongoClient(MONGODB_ATLAS_URI)\n",
|
||||
"\n",
|
||||
"db_name = \"langchain_db\"\n",
|
||||
"collection_name = \"langchain_col\"\n",
|
||||
"collection = client[db_name][collection_name]\n",
|
||||
"index_name = \"langchain_index\"\n",
|
||||
"\n",
|
||||
"# initialize vector store\n",
|
||||
"vectorStore = MongoDBAtlasVectorSearch(\n",
|
||||
" collection, OpenAIEmbeddings(), index_name=index_name)\n",
|
||||
"\n",
|
||||
"# perform a similarity search between the embedding of the query and the embeddings of the documents\n",
|
||||
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
|
||||
"docs = vectorStore.similarity_search(query)\n",
|
||||
"\n",
|
||||
"print(docs[0].page_content)"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
@@ -217,7 +161,7 @@
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.9.1"
|
||||
"version": "3.11.3"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
|
||||
@@ -1,139 +0,0 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "2b9582dc",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# SingleStoreDB vector search\n",
|
||||
"[SingleStore DB](https://singlestore.com) is a high-performance distributed database that supports deployment both in the [cloud](https://www.singlestore.com/cloud/) and on-premises. For a significant duration, it has provided support for vector functions such as [dot_product](https://docs.singlestore.com/managed-service/en/reference/sql-reference/vector-functions/dot_product.html), thereby positioning itself as an ideal solution for AI applications that require text similarity matching. \n",
|
||||
"This tutorial illustrates how to utilize the features of the SingleStore DB Vector Store."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "e4a61a4d",
|
||||
"metadata": {
|
||||
"scrolled": true
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Establishing a connection to the database is facilitated through the singlestoredb Python connector.\n",
|
||||
"# Please ensure that this connector is installed in your working environment.\n",
|
||||
"!pip install singlestoredb"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "39a0132a",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import os\n",
|
||||
"import getpass\n",
|
||||
"\n",
|
||||
"# We want to use OpenAIEmbeddings so we have to get the OpenAI API Key.\n",
|
||||
"os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "6104fde8",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.embeddings.openai import OpenAIEmbeddings\n",
|
||||
"from langchain.text_splitter import CharacterTextSplitter\n",
|
||||
"from langchain.vectorstores import SingleStoreDB\n",
|
||||
"from langchain.document_loaders import TextLoader"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "7b45113c",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Load text samples \n",
|
||||
"from langchain.document_loaders import TextLoader\n",
|
||||
"loader = TextLoader('../../../state_of_the_union.txt')\n",
|
||||
"documents = loader.load()\n",
|
||||
"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
|
||||
"docs = text_splitter.split_documents(documents)\n",
|
||||
"\n",
|
||||
"embeddings = OpenAIEmbeddings()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "535b2687",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"There are several ways to establish a [connection](https://singlestoredb-python.labs.singlestore.com/generated/singlestoredb.connect.html) to the database. You can either set up environment variables or pass named parameters to the `SingleStoreDB constructor`. Alternatively, you may provide these parameters to the `from_documents` and `from_texts` methods."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "d0b316bf",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Setup connection url as environment variable\n",
|
||||
"os.environ['SINGLESTOREDB_URL'] = 'root:pass@localhost:3306/db'\n",
|
||||
"\n",
|
||||
"# Load documents to the store\n",
|
||||
"docsearch = SingleStoreDB.from_documents(\n",
|
||||
" docs,\n",
|
||||
" embeddings,\n",
|
||||
" table_name = \"noteook\", # use table with a custom name \n",
|
||||
")"
|
||||
]
|
||||
},
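As mentioned above, connection parameters can also be passed directly instead of via `SINGLESTOREDB_URL`. A hedged sketch; the parameter names (`host`, `port`, `user`, `password`, `database`) are assumptions based on the standard `singlestoredb` connection options, so check the connector docs for your version:

```python
# Sketch: pass connection parameters as named arguments (assumed parameter names).
docsearch = SingleStoreDB.from_documents(
    docs,
    embeddings,
    table_name="notebook",
    host="localhost",  # assumed kwargs forwarded to singlestoredb.connect
    port=3306,
    user="root",
    password="pass",
    database="db",
)
```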
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "0eaa4297",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
|
||||
"docs = docsearch.similarity_search(query) # Find documents that correspond to the query\n",
|
||||
"print(docs[0].page_content)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "86efff90",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.9.2"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
|
||||
@@ -101,7 +101,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"execution_count": 5,
|
||||
"id": "8429667e",
|
||||
"metadata": {
|
||||
"ExecuteTime": {
|
||||
@@ -133,7 +133,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"execution_count": 6,
|
||||
"id": "a8c513ab",
|
||||
"metadata": {
|
||||
"ExecuteTime": {
|
||||
@@ -145,12 +145,12 @@
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
|
||||
"found_docs = vectara.similarity_search(query, n_sentence_context=0)"
|
||||
"found_docs = vectara.similarity_search(query)"
|
||||
]
|
||||
},
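For reference, in the variant of the integration that accepts it, `n_sentence_context` controls how many sentences of surrounding context Vectara returns around each matched passage; `0` returns only the matched sentences. A small sketch (assuming the `vectara` store and `query` from above):

```python
# Sketch: widen the context window around each match.
found_docs = vectara.similarity_search(query, n_sentence_context=2)
print(found_docs[0].page_content)
```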
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"execution_count": 7,
|
||||
"id": "fc516993",
|
||||
"metadata": {
|
||||
"ExecuteTime": {
|
||||
@@ -164,13 +164,7 @@
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
|
||||
"\n",
|
||||
"Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n",
|
||||
"\n",
|
||||
"One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
|
||||
"\n",
|
||||
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.\n"
|
||||
"Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence. A former top litigator in private practice. A former federal public defender.\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
@@ -191,7 +185,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"execution_count": 8,
|
||||
"id": "8804a21d",
|
||||
"metadata": {
|
||||
"ExecuteTime": {
|
||||
@@ -207,7 +201,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 8,
|
||||
"execution_count": 9,
|
||||
"id": "756a6887",
|
||||
"metadata": {
|
||||
"ExecuteTime": {
|
||||
@@ -220,15 +214,9 @@
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
|
||||
"Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence. A former top litigator in private practice. A former federal public defender.\n",
|
||||
"\n",
|
||||
"Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n",
|
||||
"\n",
|
||||
"One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
|
||||
"\n",
|
||||
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.\n",
|
||||
"\n",
|
||||
"Score: 0.7129974\n"
|
||||
"Score: 1.0046461\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
@@ -251,7 +239,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 9,
|
||||
"execution_count": 11,
|
||||
"id": "9427195f",
|
||||
"metadata": {
|
||||
"ExecuteTime": {
|
||||
@@ -263,10 +251,10 @@
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"VectaraRetriever(vectorstore=<langchain.vectorstores.vectara.Vectara object at 0x122db2830>, search_type='similarity', search_kwargs={'lambda_val': 0.025, 'k': 5, 'filter': '', 'n_sentence_context': '0'})"
|
||||
"VectorStoreRetriever(vectorstore=<langchain.vectorstores.vectara.Vectara object at 0x156d3e830>, search_type='similarity', search_kwargs={})"
|
||||
]
|
||||
},
|
||||
"execution_count": 9,
|
||||
"execution_count": 11,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
@@ -278,7 +266,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 10,
|
||||
"execution_count": 15,
|
||||
"id": "f3c70c31",
|
||||
"metadata": {
|
||||
"ExecuteTime": {
|
||||
@@ -290,10 +278,10 @@
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \\n\\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \\n\\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \\n\\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': '../../../state_of_the_union.txt'})"
|
||||
"Document(page_content='Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence. A former top litigator in private practice. A former federal public defender.', metadata={'source': '../../modules/state_of_the_union.txt'})"
|
||||
]
|
||||
},
|
||||
"execution_count": 10,
|
||||
"execution_count": 15,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
@@ -328,7 +316,7 @@
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.10.9"
|
||||
"version": "3.11.3"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
|
||||
@@ -209,6 +209,7 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"id": "8fc3487b",
|
||||
"metadata": {},
|
||||
@@ -217,6 +218,7 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"id": "281c0fcc",
|
||||
"metadata": {},
|
||||
@@ -234,6 +236,7 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"id": "503e2e75",
|
||||
"metadata": {},
|
||||
@@ -270,6 +273,7 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"id": "fbd7a6cb",
|
||||
"metadata": {},
|
||||
@@ -278,6 +282,7 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"id": "f349acb9",
|
||||
"metadata": {},
|
||||
@@ -379,7 +384,7 @@
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.10.9"
|
||||
"version": "3.9.16"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
|
||||
@@ -121,7 +121,7 @@
|
||||
"\n",
|
||||
"Human: Hi there my friend\n",
|
||||
"AI: Hi there, how are you doing today?\n",
|
||||
"Human: Not too bad - how are you?\n",
|
||||
"Human: Not to bad - how are you?\n",
|
||||
"Chatbot:\u001b[0m\n",
|
||||
"\n",
|
||||
"\u001b[1m> Finished LLMChain chain.\u001b[0m\n"
|
||||
|
||||
@@ -118,29 +118,6 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "955f1b15",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## DynamoDBChatMessageHistory with Custom Endpoint URL\n",
|
||||
"\n",
|
||||
"Sometimes it is useful to specify the URL to the AWS endpoint to connect to. For instance, when you are running locally against [Localstack](https://localstack.cloud/). For those cases you can specify the URL via the `endpoint_url` parameter in the constructor."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "225713c8",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.memory.chat_message_histories import DynamoDBChatMessageHistory\n",
|
||||
"\n",
|
||||
"history = DynamoDBChatMessageHistory(table_name=\"SessionTable\", session_id=\"0\", endpoint_url=\"http://localhost.localstack.cloud:4566\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"id": "3b33c988",
|
||||
"metadata": {},
|
||||
|
||||
@@ -1,7 +1,6 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "d31df93e",
"metadata": {},
@@ -10,7 +9,7 @@
"\n",
"This notebook walks through how LangChain thinks about memory. \n",
"\n",
"Memory involves keeping a concept of state around throughout a user's interactions with a language model. A user's interactions with a language model are captured in the concept of ChatMessages, so this boils down to ingesting, capturing, transforming and extracting knowledge from a sequence of chat messages. There are many different ways to do this, each of which exists as its own memory type.\n",
"Memory involves keeping a concept of state around throughout a user's interactions with an language model. A user's interactions with a language model are captured in the concept of ChatMessages, so this boils down to ingesting, capturing, transforming and extracting knowledge from a sequence of chat messages. There are many different ways to do this, each of which exists as its own memory type.\n",
"\n",
"In general, for each type of memory there are two ways to understanding using memory. These are the standalone functions which extract information from a sequence of messages, and then there is the way you can use this type of memory in a chain. \n",
|
||||
"\n",
|
||||
@@ -26,7 +25,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"execution_count": 1,
|
||||
"id": "87235cf1",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
@@ -42,18 +41,18 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"execution_count": 5,
|
||||
"id": "be030822",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"[HumanMessage(content='hi!', additional_kwargs={}, example=False),\n",
|
||||
" AIMessage(content='whats up?', additional_kwargs={}, example=False)]"
|
||||
"[HumanMessage(content='hi!', additional_kwargs={}),\n",
|
||||
" AIMessage(content='whats up?', additional_kwargs={})]"
|
||||
]
|
||||
},
|
||||
"execution_count": 3,
|
||||
"execution_count": 5,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
@@ -76,7 +75,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"execution_count": 7,
|
||||
"id": "a382b160",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
@@ -86,7 +85,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"execution_count": 10,
|
||||
"id": "a280d337",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
@@ -98,7 +97,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"execution_count": 12,
|
||||
"id": "1b739c0a",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
@@ -108,7 +107,7 @@
|
||||
"{'history': 'Human: hi!\\nAI: whats up?'}"
|
||||
]
|
||||
},
|
||||
"execution_count": 7,
|
||||
"execution_count": 12,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
@@ -127,7 +126,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 8,
|
||||
"execution_count": 13,
|
||||
"id": "798ceb1c",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
@@ -139,18 +138,18 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 9,
|
||||
"execution_count": 14,
|
||||
"id": "698688fd",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"{'history': [HumanMessage(content='hi!', additional_kwargs={}, example=False),\n",
|
||||
" AIMessage(content='whats up?', additional_kwargs={}, example=False)]}"
|
||||
"{'history': [HumanMessage(content='hi!', additional_kwargs={}),\n",
|
||||
" AIMessage(content='whats up?', additional_kwargs={})]}"
|
||||
]
|
||||
},
|
||||
"execution_count": 9,
|
||||
"execution_count": 14,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
@@ -170,7 +169,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 10,
|
||||
"execution_count": 15,
|
||||
"id": "54301321",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
@@ -189,7 +188,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 11,
|
||||
"execution_count": 16,
|
||||
"id": "ae046bff",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
@@ -217,7 +216,7 @@
|
||||
"\" Hi there! It's nice to meet you. How can I help you today?\""
|
||||
]
|
||||
},
|
||||
"execution_count": 11,
|
||||
"execution_count": 16,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
@@ -228,7 +227,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 12,
|
||||
"execution_count": 17,
|
||||
"id": "d8e2a6ff",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
@@ -257,7 +256,7 @@
|
||||
"\" That's great! It's always nice to have a conversation with someone new. What would you like to talk about?\""
|
||||
]
|
||||
},
|
||||
"execution_count": 12,
|
||||
"execution_count": 17,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
@@ -268,7 +267,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 13,
|
||||
"execution_count": 18,
|
||||
"id": "15eda316",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
@@ -299,7 +298,7 @@
|
||||
"\" Sure! I'm an AI created to help people with their everyday tasks. I'm programmed to understand natural language and provide helpful information. I'm also constantly learning and updating my knowledge base so I can provide more accurate and helpful answers.\""
|
||||
]
|
||||
},
|
||||
"execution_count": 13,
|
||||
"execution_count": 18,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
@@ -320,7 +319,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 14,
|
||||
"execution_count": 1,
|
||||
"id": "b5acbc4b",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
@@ -339,7 +338,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 15,
|
||||
"execution_count": 2,
|
||||
"id": "7812ee21",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
@@ -349,20 +348,18 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 16,
|
||||
"execution_count": 3,
|
||||
"id": "3ed6e6a0",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"[{'type': 'human',\n",
|
||||
" 'data': {'content': 'hi!', 'additional_kwargs': {}, 'example': False}},\n",
|
||||
" {'type': 'ai',\n",
|
||||
" 'data': {'content': 'whats up?', 'additional_kwargs': {}, 'example': False}}]"
|
||||
"[{'type': 'human', 'data': {'content': 'hi!', 'additional_kwargs': {}}},\n",
|
||||
" {'type': 'ai', 'data': {'content': 'whats up?', 'additional_kwargs': {}}}]"
|
||||
]
|
||||
},
|
||||
"execution_count": 16,
|
||||
"execution_count": 3,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
@@ -373,7 +370,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 17,
|
||||
"execution_count": 4,
|
||||
"id": "cdf4ebd2",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
@@ -383,18 +380,18 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 18,
|
||||
"execution_count": 5,
|
||||
"id": "9724e24b",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"[HumanMessage(content='hi!', additional_kwargs={}, example=False),\n",
|
||||
" AIMessage(content='whats up?', additional_kwargs={}, example=False)]"
|
||||
"[HumanMessage(content='hi!', additional_kwargs={}),\n",
|
||||
" AIMessage(content='whats up?', additional_kwargs={})]"
|
||||
]
|
||||
},
|
||||
"execution_count": 18,
|
||||
"execution_count": 5,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
@@ -410,6 +407,14 @@
|
||||
"source": [
|
||||
"And that's it for the getting started! There are plenty of different types of memory, check out our examples to see them all"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "3dd37d93",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
@@ -428,7 +433,7 @@
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.10.9"
|
||||
"version": "3.9.1"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
|
||||
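The getting-started notebook above exercises `ChatMessageHistory` directly. A minimal sketch of that flow, with the import path and method names taken from the notebook's outputs rather than verified against this branch:

```python
from langchain.memory import ChatMessageHistory

history = ChatMessageHistory()
history.add_user_message("hi!")
history.add_ai_message("whats up?")

# Matches the execute_result shown above:
# [HumanMessage(content='hi!', ...), AIMessage(content='whats up?', ...)]
print(history.messages)
```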
@@ -631,7 +631,7 @@
"id": "56ea6a08",
"metadata": {},
"source": [
"You'll need to get a Momento auth token to use this class. This can either be passed in to a momento.CacheClient if you'd like to instantiate that directly, as a named parameter `auth_token` to `MomentoChatMessageHistory.from_client_params`, or can just be set as an environment variable `MOMENTO_AUTH_TOKEN`."
"You'll need to get a Momemto auth token to use this class. This can either be passed in to a momento.CacheClient if you'd like to instantiate that directly, as a named parameter `auth_token` to `MomentoChatMessageHistory.from_client_params`, or can just be set as an environment variable `MOMENTO_AUTH_TOKEN`."
]
},
{
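For reference, a minimal sketch of the auth-token options described above. The `from_client_params` argument order (session id, cache name, TTL) is an assumption not verified against this branch, and the token and cache values are hypothetical:

```python
from datetime import timedelta

from langchain.memory.chat_message_histories import MomentoChatMessageHistory

# Option 1: rely on the MOMENTO_AUTH_TOKEN environment variable.
# Option 2: pass the token explicitly via auth_token (shown here).
history = MomentoChatMessageHistory.from_client_params(
    "session-0",                    # session id (hypothetical value)
    "langchain",                    # cache name (hypothetical value)
    timedelta(days=1),              # TTL for stored messages
    auth_token="MY_MOMENTO_TOKEN",  # hypothetical token
)
```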
@@ -1,196 +0,0 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Baseten\n",
"\n",
"[Baseten](https://baseten.co) provides all the infrastructure you need to deploy and serve ML models performantly, scalably, and cost-efficiently.\n",
"\n",
"This example demonstrates using Langchain with models deployed on Baseten."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Setup\n",
"\n",
"To run this notebook, you'll need a [Baseten account](https://baseten.co) and an [API key](https://docs.baseten.co/settings/api-keys).\n",
"\n",
"You'll also need to install the Baseten Python package:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install baseten"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import baseten\n",
"\n",
"baseten.login(\"YOUR_API_KEY\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Single model call\n",
"\n",
"First, you'll need to deploy a model to Baseten.\n",
"\n",
"You can deploy foundation models like WizardLM and Alpaca with one click from the [Baseten model library](https://app.baseten.co/explore/) or if you have your own model, [deploy it with this tutorial](https://docs.baseten.co/deploying-models/deploy).\n",
"\n",
"In this example, we'll work with WizardLM. [Deploy WizardLM here](https://app.baseten.co/explore/llama) and follow along with the deployed [model's version ID](https://docs.baseten.co/managing-models/manage)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain.llms import Baseten"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Load the model\n",
"wizardlm = Baseten(model=\"MODEL_VERSION_ID\", verbose=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Prompt the model\n",
"\n",
"wizardlm(\"What is the difference between a Wizard and a Sorcerer?\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Chained model calls\n",
"\n",
"We can chain together multiple calls to one or multiple models, which is the whole point of Langchain!\n",
"\n",
"This example uses WizardLM to plan a meal with an entree, three sides, and an alcoholic and non-alcoholic beverage pairing."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain.chains import SimpleSequentialChain\n",
"from langchain import PromptTemplate, LLMChain"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Build the first link in the chain\n",
"\n",
"prompt = PromptTemplate(\n",
"    input_variables=[\"cuisine\"],\n",
"    template=\"Name a complex entree for a {cuisine} dinner. Respond with just the name of a single dish.\",\n",
")\n",
"\n",
"link_one = LLMChain(llm=wizardlm, prompt=prompt)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Build the second link in the chain\n",
"\n",
"prompt = PromptTemplate(\n",
"    input_variables=[\"entree\"],\n",
"    template=\"What are three sides that would go with {entree}. Respond with only a list of the sides.\",\n",
")\n",
"\n",
"link_two = LLMChain(llm=wizardlm, prompt=prompt)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Build the third link in the chain\n",
"\n",
"prompt = PromptTemplate(\n",
"    input_variables=[\"sides\"],\n",
"    template=\"What is one alcoholic and one non-alcoholic beverage that would go well with this list of sides: {sides}. Respond with only the names of the beverages.\",\n",
")\n",
"\n",
"link_three = LLMChain(llm=wizardlm, prompt=prompt)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Run the full chain!\n",
"\n",
"menu_maker = SimpleSequentialChain(chains=[link_one, link_two, link_three], verbose=True)\n",
"menu_maker.run(\"South Indian\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.4"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}
@@ -1,7 +1,6 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"application/vnd.databricks.v1+cell": {
@@ -44,7 +43,6 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"application/vnd.databricks.v1+cell": {
@@ -165,14 +163,14 @@
],
"source": [
"# Otherwise, you can manually specify the Databricks workspace hostname and personal access token \n",
"# or set `DATABRICKS_HOST` and `DATABRICKS_TOKEN` environment variables, respectively.\n",
"# or set `DATABRICKS_HOST` and `DATABRICKS_API_TOKEN` environment variables, respectively.\n",
"# See https://docs.databricks.com/dev-tools/auth.html#databricks-personal-access-tokens\n",
"# We strongly recommend not exposing the API token explicitly inside a notebook.\n",
"# You can use Databricks secret manager to store your API token securely.\n",
"# See https://docs.databricks.com/dev-tools/databricks-utils.html#secrets-utility-dbutilssecrets\n",
"\n",
"import os\n",
"os.environ[\"DATABRICKS_TOKEN\"] = dbutils.secrets.get(\"myworkspace\", \"api_token\")\n",
"os.environ[\"DATABRICKS_API_TOKEN\"] = dbutils.secrets.get(\"myworkspace\", \"api_token\")\n",
"\n",
"llm = Databricks(host=\"myworkspace.cloud.databricks.com\", endpoint_name=\"dolly\")\n",
"\n",
@@ -259,7 +257,6 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"application/vnd.databricks.v1+cell": {
@@ -276,7 +273,7 @@
"Prerequisites:\n",
"* An LLM loaded on a Databricks interactive cluster in \"single user\" or \"no isolation shared\" mode.\n",
"* A local HTTP server running on the driver node to serve the model at `\"/\"` using HTTP POST with JSON input/output.\n",
"* It uses a port number between `[3000, 8000]` and listens to the driver IP address or simply `0.0.0.0` instead of localhost only.\n",
"* It uses a port number between `[3000, 8000]` and litens to the driver IP address or simply `0.0.0.0` instead of localhost only.\n",
"* You have \"Can Attach To\" permission to the cluster.\n",
"\n",
"The expected server schema (using JSON schema) is:\n",
@@ -1,7 +1,6 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "959300d4",
"metadata": {},
@@ -14,7 +13,6 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "4c1b8450-5eaf-4d34-8341-2d785448a1ff",
"metadata": {
@@ -62,7 +60,6 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "84dd44c1-c428-41f3-a911-520281386c94",
"metadata": {},
@@ -107,7 +104,6 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "ddaa06cf-95ec-48ce-b0ab-d892a7909693",
"metadata": {},
@@ -118,7 +114,6 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "4fa9337e-ccb5-4c52-9b7c-1653148bc256",
"metadata": {},
@@ -163,14 +158,13 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "1a5c97af-89bc-4e59-95c1-223742a9160b",
"metadata": {},
"source": [
"### Dolly, by Databricks\n",
"### Dolly, by DataBricks\n",
"\n",
"See [Databricks](https://huggingface.co/databricks) organization page for a list of available models."
"See [DataBricks](https://huggingface.co/databricks) organization page for a list of available models."
]
},
{
@@ -202,7 +196,6 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "03f6ae52-b5f9-4de6-832c-551cb3fa11ae",
"metadata": {},
@@ -240,7 +233,6 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "2bf838eb-1083-402f-b099-b07c452418c8",
"metadata": {},
@@ -1,83 +0,0 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# DashScope\n",
"\n",
"Let's load the DashScope Embedding class."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain.embeddings import DashScopeEmbeddings"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"embeddings = DashScopeEmbeddings(model='text-embedding-v1', dashscope_api_key='your-dashscope-api-key')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"text = \"This is a test document.\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"query_result = embeddings.embed_query(text)\n",
"print(query_result)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"doc_results = embeddings.embed_documents([\"foo\"])\n",
"print(doc_results)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "chatgpt",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.4"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}
@@ -1,133 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# DeepInfra\n",
"\n",
"[DeepInfra](https://deepinfra.com/?utm_source=langchain) is a serverless inference as a service that provides access to a [variety of LLMs](https://deepinfra.com/models?utm_source=langchain) and [embeddings models](https://deepinfra.com/models?type=embeddings&utm_source=langchain). This notebook goes over how to use LangChain with DeepInfra for text embeddings."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdin",
"output_type": "stream",
"text": [
" ········\n"
]
}
],
"source": [
"# sign up for an account: https://deepinfra.com/login?utm_source=langchain\n",
"\n",
"from getpass import getpass\n",
"\n",
"DEEPINFRA_API_TOKEN = getpass()"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"os.environ[\"DEEPINFRA_API_TOKEN\"] = DEEPINFRA_API_TOKEN"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"from langchain.embeddings import DeepInfraEmbeddings"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"embeddings = DeepInfraEmbeddings(\n",
"    model_id=\"sentence-transformers/clip-ViT-B-32\",\n",
"    query_instruction=\"\",\n",
"    embed_instruction=\"\",\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"docs = [\"Dog is not a cat\",\n",
"        \"Beta is the second letter of Greek alphabet\"]\n",
"document_result = embeddings.embed_documents(docs)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"query = \"What is the first letter of Greek alphabet\"\n",
"query_result = embeddings.embed_query(query)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Cosine similarity between \"Dog is not a cat\" and query: 0.7489097144129355\n",
"Cosine similarity between \"Beta is the second letter of Greek alphabet\" and query: 0.9519380640702013\n"
]
}
],
"source": [
"import numpy as np\n",
"\n",
"query_numpy = np.array(query_result)\n",
"for doc_res, doc in zip(document_result, docs):\n",
"    document_numpy = np.array(doc_res)\n",
"    similarity = np.dot(query_numpy, document_numpy) / (np.linalg.norm(query_numpy)*np.linalg.norm(document_numpy))\n",
"    print(f\"Cosine similarity between \\\"{doc}\\\" and query: {similarity}\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.10"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
@@ -1,144 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Embaas\n",
"\n",
"[embaas](https://embaas.io) is a fully managed NLP API service that offers features like embedding generation, document text extraction, document to embeddings and more. You can choose a [variety of pre-trained models](https://embaas.io/docs/models/embeddings).\n",
"\n",
"In this tutorial, we will show you how to use the embaas Embeddings API to generate embeddings for a given text.\n",
"\n",
"### Prerequisites\n",
"Create your free embaas account at [https://embaas.io/register](https://embaas.io/register) and generate an [API key](https://embaas.io/dashboard/api-keys)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Set API key\n",
"embaas_api_key = \"YOUR_API_KEY\"\n",
"# or set environment variable\n",
"os.environ[\"EMBAAS_API_KEY\"] = \"YOUR_API_KEY\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain.embeddings import EmbaasEmbeddings"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"embeddings = EmbaasEmbeddings()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2023-06-10T11:17:55.940265Z",
"start_time": "2023-06-10T11:17:55.938517Z"
}
},
"outputs": [],
"source": [
"# Create embeddings for a single document\n",
"doc_text = \"This is a test document.\"\n",
"doc_text_embedding = embeddings.embed_query(doc_text)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Print created embedding\n",
"print(doc_text_embedding)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"ExecuteTime": {
"end_time": "2023-06-10T11:19:25.237161Z",
"start_time": "2023-06-10T11:19:25.235320Z"
}
},
"outputs": [],
"source": [
"# Create embeddings for multiple documents\n",
"doc_texts = [\"This is a test document.\", \"This is another test document.\"]\n",
"doc_texts_embeddings = embeddings.embed_documents(doc_texts)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Print created embeddings\n",
"for i, doc_text_embedding in enumerate(doc_texts_embeddings):\n",
"    print(f\"Embedding for document {i + 1}: {doc_text_embedding}\")"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"ExecuteTime": {
"end_time": "2023-06-10T11:22:26.139769Z",
"start_time": "2023-06-10T11:22:26.138357Z"
}
},
"outputs": [],
"source": [
"# Using a different model and/or custom instruction\n",
"embeddings = EmbaasEmbeddings(model=\"instructor-large\", instruction=\"Represent the Wikipedia document for retrieval\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For more detailed information about the embaas Embeddings API, please refer to [the official embaas API documentation](https://embaas.io/api-reference)."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
@@ -864,11 +864,7 @@ class AgentExecutor(Chain):
raise e
text = str(e)
if isinstance(self.handle_parsing_errors, bool):
if e.send_to_llm:
observation = str(e.observation)
text = str(e.llm_output)
else:
observation = "Invalid or incomplete response"
observation = "Invalid or incomplete response"
elif isinstance(self.handle_parsing_errors, str):
observation = self.handle_parsing_errors
elif callable(self.handle_parsing_errors):
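The branches above accept a bool, a str, or a callable for `handle_parsing_errors`. A hedged usage sketch, where `agent` and `tools` are assumed to already exist:

```python
from langchain.agents import AgentExecutor

executor = AgentExecutor.from_agent_and_tools(
    agent=agent,
    tools=tools,
    handle_parsing_errors=True,  # bool: fall back to the default recovery message
    # handle_parsing_errors="Check your output and try again!",  # str: fixed message
    # handle_parsing_errors=lambda e: str(e)[:100],  # callable: derive it from the error
)
```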
@@ -34,7 +34,6 @@ from langchain.tools.requests.tool import (
from langchain.tools.scenexplain.tool import SceneXplainTool
from langchain.tools.searx_search.tool import SearxSearchResults, SearxSearchRun
from langchain.tools.shell.tool import ShellTool
from langchain.tools.sleep.tool import SleepTool
from langchain.tools.wikipedia.tool import WikipediaQueryRun
from langchain.tools.wolfram_alpha.tool import WolframAlphaQueryRun
from langchain.tools.openweathermap.tool import OpenWeatherMapQueryRun
@@ -83,10 +82,6 @@ def _get_terminal() -> BaseTool:
return ShellTool()


def _get_sleep() -> BaseTool:
return SleepTool()


_BASE_TOOLS: Dict[str, Callable[[], BaseTool]] = {
"python_repl": _get_python_repl,
"requests": _get_tools_requests_get, # preserved for backwards compatability
|
||||
@@ -96,7 +91,6 @@ _BASE_TOOLS: Dict[str, Callable[[], BaseTool]] = {
"requests_put": _get_tools_requests_put,
"requests_delete": _get_tools_requests_delete,
"terminal": _get_terminal,
"sleep": _get_sleep,
}
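Registering `_get_sleep` in `_BASE_TOOLS` should make the new tool loadable by name through `load_tools`; a short sketch of the assumed behavior:

```python
from langchain.agents import load_tools

# "sleep" resolves through the _BASE_TOOLS registry shown above.
tools = load_tools(["terminal", "sleep"])
```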
@@ -2,10 +2,11 @@
from __future__ import annotations

from abc import ABC, abstractmethod
from typing import Any, List, Optional, Sequence, Set
from typing import List, Optional, Sequence, Set

from pydantic import BaseModel

from langchain.callbacks.manager import Callbacks
from langchain.load.serializable import Serializable
from langchain.schema import BaseMessage, LLMResult, PromptValue, get_buffer_string

@@ -28,14 +29,13 @@ def _get_token_ids_default_method(text: str) -> List[int]:
return tokenizer.encode(text)


class BaseLanguageModel(Serializable, ABC):
class BaseLanguageModel(BaseModel, ABC):
@abstractmethod
def generate_prompt(
self,
prompts: List[PromptValue],
stop: Optional[List[str]] = None,
callbacks: Callbacks = None,
**kwargs: Any,
) -> LLMResult:
"""Take in a list of prompt values and return an LLMResult."""

@@ -45,39 +45,26 @@ class BaseLanguageModel(Serializable, ABC):
prompts: List[PromptValue],
stop: Optional[List[str]] = None,
callbacks: Callbacks = None,
**kwargs: Any,
) -> LLMResult:
"""Take in a list of prompt values and return an LLMResult."""

@abstractmethod
def predict(
self, text: str, *, stop: Optional[Sequence[str]] = None, **kwargs: Any
) -> str:
def predict(self, text: str, *, stop: Optional[Sequence[str]] = None) -> str:
"""Predict text from text."""

@abstractmethod
def predict_messages(
self,
messages: List[BaseMessage],
*,
stop: Optional[Sequence[str]] = None,
**kwargs: Any,
self, messages: List[BaseMessage], *, stop: Optional[Sequence[str]] = None
) -> BaseMessage:
"""Predict message from messages."""

@abstractmethod
async def apredict(
self, text: str, *, stop: Optional[Sequence[str]] = None, **kwargs: Any
) -> str:
async def apredict(self, text: str, *, stop: Optional[Sequence[str]] = None) -> str:
"""Predict text from text."""

@abstractmethod
async def apredict_messages(
self,
messages: List[BaseMessage],
*,
stop: Optional[Sequence[str]] = None,
**kwargs: Any,
self, messages: List[BaseMessage], *, stop: Optional[Sequence[str]] = None
) -> BaseMessage:
"""Predict message from messages."""
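The signatures above add a `**kwargs` passthrough to the `predict` family. A minimal call sketch; the model choice is an assumption, not something this diff specifies:

```python
from langchain.llms import OpenAI

llm = OpenAI()
# Extra keyword arguments would be forwarded through the new **kwargs.
text = llm.predict("Say hi", stop=["\n"])
```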
@@ -133,11 +133,15 @@ def trace_as_chain_group(
*,
session_name: Optional[str] = None,
example_id: Optional[Union[str, UUID]] = None,
tenant_id: Optional[str] = None,
session_extra: Optional[Dict[str, Any]] = None,
) -> Generator[CallbackManager, None, None]:
"""Get a callback manager for a chain group in a context manager."""
cb = LangChainTracer(
tenant_id=tenant_id,
session_name=session_name,
example_id=example_id,
session_extra=session_extra,
)
cm = CallbackManager.configure(
inheritable_callbacks=[cb],
@@ -154,11 +158,15 @@ async def atrace_as_chain_group(
*,
session_name: Optional[str] = None,
example_id: Optional[Union[str, UUID]] = None,
tenant_id: Optional[str] = None,
session_extra: Optional[Dict[str, Any]] = None,
) -> AsyncGenerator[AsyncCallbackManager, None]:
"""Get a callback manager for a chain group in a context manager."""
cb = LangChainTracer(
tenant_id=tenant_id,
session_name=session_name,
example_id=example_id,
session_extra=session_extra,
)
cm = AsyncCallbackManager.configure(
inheritable_callbacks=[cb],
@@ -204,7 +212,7 @@ def _handle_event(
except Exception as e:
if handler.raise_error:
raise e
logger.warning(f"Error in {event_name} callback: {e}")
logging.warning(f"Error in {event_name} callback: {e}")


async def _ahandle_event_for_handler(
@@ -238,8 +246,6 @@
else:
logger.warning(f"Error in {event_name} callback: {e}")
except Exception as e:
if handler.raise_error:
raise e
logger.warning(f"Error in {event_name} callback: {e}")


@@ -872,16 +878,6 @@ class AsyncCallbackManager(BaseCallbackManager):
T = TypeVar("T", CallbackManager, AsyncCallbackManager)


def env_var_is_set(env_var: str) -> bool:
"""Check if an environment variable is set."""
return env_var in os.environ and os.environ[env_var] not in (
"",
"0",
"false",
"False",
)


def _configure(
callback_manager_cls: Type[T],
inheritable_callbacks: Callbacks = None,
@@ -915,17 +911,18 @@ def _configure(
wandb_tracer = wandb_tracing_callback_var.get()
open_ai = openai_callback_var.get()
tracing_enabled_ = (
env_var_is_set("LANGCHAIN_TRACING")
os.environ.get("LANGCHAIN_TRACING") is not None
or tracer is not None
or env_var_is_set("LANGCHAIN_HANDLER")
or os.environ.get("LANGCHAIN_HANDLER") is not None
)
wandb_tracing_enabled_ = (
env_var_is_set("LANGCHAIN_WANDB_TRACING") or wandb_tracer is not None
os.environ.get("LANGCHAIN_WANDB_TRACING") is not None
or wandb_tracer is not None
)

tracer_v2 = tracing_v2_callback_var.get()
tracing_v2_enabled_ = (
env_var_is_set("LANGCHAIN_TRACING_V2") or tracer_v2 is not None
os.environ.get("LANGCHAIN_TRACING_V2") is not None or tracer_v2 is not None
)
tracer_session = os.environ.get("LANGCHAIN_SESSION")
debug = _get_debug()
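`env_var_is_set` tightens the old `is not None` checks: an empty string and the values `"0"`, `"false"`, and `"False"` now count as unset. A quick illustration:

```python
import os

os.environ["LANGCHAIN_TRACING_V2"] = "false"  # old check: tracing on; new check: off
os.environ["LANGCHAIN_TRACING_V2"] = "true"   # enables v2 tracing under both checks
```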
@@ -93,6 +93,7 @@ class BaseTracer(BaseCallbackHandler, ABC):
execution_order = self._get_execution_order(parent_run_id_)
llm_run = Run(
id=run_id,
name=serialized.get("name"),
parent_run_id=parent_run_id,
serialized=serialized,
inputs={"prompts": prompts},
@@ -153,6 +154,7 @@ class BaseTracer(BaseCallbackHandler, ABC):
execution_order = self._get_execution_order(parent_run_id_)
chain_run = Run(
id=run_id,
name=serialized.get("name"),
parent_run_id=parent_run_id,
serialized=serialized,
inputs=inputs,
@@ -214,6 +216,7 @@ class BaseTracer(BaseCallbackHandler, ABC):
execution_order = self._get_execution_order(parent_run_id_)
tool_run = Run(
id=run_id,
name=serialized.get("name"),
parent_run_id=parent_run_id,
serialized=serialized,
inputs={"input": input_str},
@@ -8,28 +8,59 @@ from datetime import datetime
from typing import Any, Dict, List, Optional, Union
from uuid import UUID

from langchainplus_sdk import LangChainPlusClient
import requests
from requests.exceptions import HTTPError
from tenacity import (
before_sleep_log,
retry,
retry_if_exception_type,
stop_after_attempt,
wait_exponential,
)

from langchain.callbacks.tracers.base import BaseTracer
from langchain.callbacks.tracers.schemas import (
Run,
RunCreate,
RunTypeEnum,
RunUpdate,
TracerSession,
)
from langchain.env import get_runtime_environment
from langchain.schema import BaseMessage, messages_to_dict

logger = logging.getLogger(__name__)
_LOGGED = set()


def log_error_once(method: str, exception: Exception) -> None:
"""Log an error once."""
global _LOGGED
if (method, type(exception)) in _LOGGED:
return
_LOGGED.add((method, type(exception)))
logger.error(exception)
def get_headers() -> Dict[str, Any]:
"""Get the headers for the LangChain API."""
headers: Dict[str, Any] = {"Content-Type": "application/json"}
if os.getenv("LANGCHAIN_API_KEY"):
headers["x-api-key"] = os.getenv("LANGCHAIN_API_KEY")
return headers


def get_endpoint() -> str:
return os.getenv("LANGCHAIN_ENDPOINT", "http://localhost:1984")


class LangChainTracerAPIError(Exception):
"""An error occurred while communicating with the LangChain API."""


class LangChainTracerUserError(Exception):
"""An error occurred while communicating with the LangChain API."""


class LangChainTracerError(Exception):
"""An error occurred while communicating with the LangChain API."""


retry_decorator = retry(
stop=stop_after_attempt(3),
wait=wait_exponential(multiplier=1, min=4, max=10),
retry=retry_if_exception_type(LangChainTracerAPIError),
before_sleep=before_sleep_log(logger, logging.WARNING),
)


class LangChainTracer(BaseTracer):
@@ -39,19 +70,19 @@ class LangChainTracer(BaseTracer):
self,
example_id: Optional[Union[UUID, str]] = None,
session_name: Optional[str] = None,
client: Optional[LangChainPlusClient] = None,
**kwargs: Any,
) -> None:
"""Initialize the LangChain tracer."""
super().__init__(**kwargs)
self.session: Optional[TracerSession] = None
self._endpoint = get_endpoint()
self._headers = get_headers()
self.example_id = (
UUID(example_id) if isinstance(example_id, str) else example_id
)
self.session_name = session_name or os.getenv("LANGCHAIN_SESSION", "default")
# set max_workers to 1 to process tasks in order
self.executor = ThreadPoolExecutor(max_workers=1)
self.client = client or LangChainPlusClient()

def on_chat_model_start(
self,
@@ -67,6 +98,7 @@ class LangChainTracer(BaseTracer):
execution_order = self._get_execution_order(parent_run_id_)
chat_model_run = Run(
id=run_id,
name=serialized.get("name"),
parent_run_id=parent_run_id,
serialized=serialized,
inputs={"messages": [messages_to_dict(batch) for batch in messages]},
@@ -82,29 +114,60 @@ class LangChainTracer(BaseTracer):
def _persist_run(self, run: Run) -> None:
"""The Langchain Tracer uses Post/Patch rather than persist."""

@retry_decorator
def _persist_run_single(self, run: Run) -> None:
"""Persist a run."""
if run.parent_run_id is None:
run.reference_example_id = self.example_id
run_dict = run.dict(exclude={"child_runs"})
extra = run_dict.get("extra", {})
extra["runtime"] = get_runtime_environment()
run_dict["extra"] = extra
run_dict = run.dict()
del run_dict["child_runs"]
run_create = RunCreate(**run_dict, session_name=self.session_name)
response = None
try:
run = self.client.create_run(**run_dict, session_name=self.session_name)
# TODO: Add retries when async
response = requests.post(
f"{self._endpoint}/runs",
data=run_create.json(),
headers=self._headers,
)
response.raise_for_status()
except HTTPError as e:
if response is not None and response.status_code == 500:
raise LangChainTracerAPIError(
f"Failed to upsert persist run to LangChain API. {e}"
)
else:
raise LangChainTracerUserError(
f"Failed to persist run to LangChain API. {e}"
)
except Exception as e:
# Errors are swallowed by the thread executor so we need to log them here
log_error_once("post", e)
raise
raise LangChainTracerError(
f"Failed to persist run to LangChain API. {e}"
) from e

@retry_decorator
def _update_run_single(self, run: Run) -> None:
"""Update a run."""
run_update = RunUpdate(**run.dict())
response = None
try:
self.client.update_run(run.id, **run.dict())
response = requests.patch(
f"{self._endpoint}/runs/{run.id}",
data=run_update.json(),
headers=self._headers,
)
response.raise_for_status()
except HTTPError as e:
if response is not None and response.status_code == 500:
raise LangChainTracerAPIError(
f"Failed to update run to LangChain API. {e}"
)
else:
raise LangChainTracerUserError(f"Failed to run to LangChain API. {e}")
except Exception as e:
# Errors are swallowed by the thread executor so we need to log them here
log_error_once("patch", e)
raise
raise LangChainTracerError(
f"Failed to update run to LangChain API. {e}"
) from e

def _on_llm_start(self, run: Run) -> None:
"""Persist an LLM run."""
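The `retry_decorator` above is plain `tenacity`; a standalone sketch of the same policy, where the decorated function and exception class are hypothetical stand-ins:

```python
import logging

from tenacity import (
    before_sleep_log,
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential,
)

logger = logging.getLogger(__name__)


class TransientAPIError(Exception):
    """Stand-in for LangChainTracerAPIError (5xx responses)."""


@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=10),
    retry=retry_if_exception_type(TransientAPIError),
    before_sleep=before_sleep_log(logger, logging.WARNING),
)
def post_run() -> None:
    ...  # only TransientAPIError triggers a retry; user errors fail fast
```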
@@ -2,11 +2,12 @@ from __future__ import annotations

import logging
import os
from typing import Any, Dict, Optional, Union
from typing import Any, Optional, Union

import requests

from langchain.callbacks.tracers.base import BaseTracer
from langchain.callbacks.tracers.langchain import get_headers
from langchain.callbacks.tracers.schemas import (
ChainRun,
LLMRun,
@@ -20,14 +21,6 @@ from langchain.schema import get_buffer_string
from langchain.utils import raise_for_status_with_text


def get_headers() -> Dict[str, Any]:
"""Get the headers for the LangChain API."""
headers: Dict[str, Any] = {"Content-Type": "application/json"}
if os.getenv("LANGCHAIN_API_KEY"):
headers["x-api-key"] = os.getenv("LANGCHAIN_API_KEY")
return headers


def _get_endpoint() -> str:
return os.getenv("LANGCHAIN_ENDPOINT", "http://localhost:8000")
@@ -2,13 +2,13 @@
from __future__ import annotations

import datetime
from enum import Enum
from typing import Any, Dict, List, Optional
from uuid import UUID

from langchainplus_sdk.schemas import RunBase as BaseRunV2
from langchainplus_sdk.schemas import RunTypeEnum
from pydantic import BaseModel, Field, root_validator

from langchain.env import get_runtime_environment
from langchain.schema import LLMResult


@@ -88,37 +88,66 @@ class ToolRun(BaseRun):
# Begin V2 API Schemas


class Run(BaseRunV2):
"""Run schema for the V2 API in the Tracer."""
class RunTypeEnum(str, Enum):
"""Enum for run types."""

tool = "tool"
chain = "chain"
llm = "llm"


class RunBase(BaseModel):
"""Base Run schema."""

id: Optional[UUID]
start_time: datetime.datetime = Field(default_factory=datetime.datetime.utcnow)
end_time: datetime.datetime = Field(default_factory=datetime.datetime.utcnow)
extra: Optional[Dict[str, Any]] = None
error: Optional[str]
execution_order: int
child_execution_order: int
child_execution_order: Optional[int]
serialized: dict
inputs: dict
outputs: Optional[dict]
reference_example_id: Optional[UUID]
run_type: RunTypeEnum
parent_run_id: Optional[UUID]


class Run(RunBase):
"""Run schema when loading from the DB."""

name: str
child_runs: List[Run] = Field(default_factory=list)

@root_validator(pre=True)
def assign_name(cls, values: dict) -> dict:
"""Assign name to the run."""
if values.get("name") is None:
if "name" in values["serialized"]:
values["name"] = values["serialized"]["name"]
elif "id" in values["serialized"]:
values["name"] = values["serialized"]["id"][-1]
if "name" not in values:
values["name"] = values["serialized"]["name"]
return values


class RunCreate(RunBase):
name: str
session_name: Optional[str] = None

@root_validator(pre=True)
def add_runtime_env(cls, values: Dict[str, Any]) -> Dict[str, Any]:
"""Add env info to the run."""
extra = values.get("extra", {})
extra["runtime"] = get_runtime_environment()
values["extra"] = extra
return values


class RunUpdate(BaseModel):
end_time: Optional[datetime.datetime]
error: Optional[str]
outputs: Optional[dict]
parent_run_id: Optional[UUID]
reference_example_id: Optional[UUID]


ChainRun.update_forward_refs()
ToolRun.update_forward_refs()

__all__ = [
"BaseRun",
"ChainRun",
"LLMRun",
"Run",
"RunTypeEnum",
"ToolRun",
"TracerSession",
"TracerSessionBase",
"TracerSessionV1",
"TracerSessionV1Base",
"TracerSessionV1Create",
]
@@ -8,7 +8,7 @@ from langchain.input import get_bolded_text, get_colored_text

def try_json_stringify(obj: Any, fallback: str) -> str:
try:
return json.dumps(obj, indent=2, ensure_ascii=False)
return json.dumps(obj, indent=2)
except Exception:
return fallback


@@ -60,20 +60,10 @@ def _convert_llm_run_to_wb_span(trace_tree: Any, run: Run) -> trace_tree.Span:
return base_span


def _serialize_inputs(run_inputs: dict) -> Union[dict, list]:
if "input_documents" in run_inputs:
docs = run_inputs["input_documents"]
return [doc.json() for doc in docs]
else:
return run_inputs


def _convert_chain_run_to_wb_span(trace_tree: Any, run: Run) -> trace_tree.Span:
base_span = _convert_run_to_wb_span(trace_tree, run)

base_span.results = [
trace_tree.Result(inputs=_serialize_inputs(run.inputs), outputs=run.outputs)
]
base_span.results = [trace_tree.Result(inputs=run.inputs, outputs=run.outputs)]
base_span.child_spans = [
_convert_lc_run_to_wb_span(trace_tree, child_run)
for child_run in run.child_runs
@@ -89,9 +79,7 @@ def _convert_chain_run_to_wb_span(trace_tree: Any, run: Run) -> trace_tree.Span:

def _convert_tool_run_to_wb_span(trace_tree: Any, run: Run) -> trace_tree.Span:
base_span = _convert_run_to_wb_span(trace_tree, run)
base_span.results = [
trace_tree.Result(inputs=_serialize_inputs(run.inputs), outputs=run.outputs)
]
base_span.results = [trace_tree.Result(inputs=run.inputs, outputs=run.outputs)]
base_span.child_spans = [
_convert_lc_run_to_wb_span(trace_tree, child_run)
for child_run in run.child_runs
@@ -11,7 +11,6 @@ from langchain.chains.conversational_retrieval.base import (
from langchain.chains.flare.base import FlareChain
from langchain.chains.graph_qa.base import GraphQAChain
from langchain.chains.graph_qa.cypher import GraphCypherQAChain
from langchain.chains.graph_qa.nebulagraph import NebulaGraphQAChain
from langchain.chains.hyde.base import HypotheticalDocumentEmbedder
from langchain.chains.llm import LLMChain
from langchain.chains.llm_bash.base import LLMBashChain
@@ -68,5 +67,4 @@ __all__ = [
"ConversationalRetrievalChain",
"OpenAPIEndpointChain",
"FlareChain",
"NebulaGraphQAChain",
]
@@ -78,7 +78,6 @@ class APIChain(Chain):
callbacks=_run_manager.get_child(),
)
_run_manager.on_text(api_url, color="green", end="\n", verbose=self.verbose)
api_url = api_url.strip()
api_response = self.requests_wrapper.get(api_url)
_run_manager.on_text(
api_response, color="yellow", end="\n", verbose=self.verbose
@@ -107,7 +106,6 @@
await _run_manager.on_text(
api_url, color="green", end="\n", verbose=self.verbose
)
api_url = api_url.strip()
api_response = await self.requests_wrapper.aget(api_url)
await _run_manager.on_text(
api_response, color="yellow", end="\n", verbose=self.verbose
@@ -7,7 +7,7 @@ from pathlib import Path
from typing import Any, Dict, List, Optional, Union

import yaml
from pydantic import Field, root_validator, validator
from pydantic import BaseModel, Field, root_validator, validator

import langchain
from langchain.callbacks.base import BaseCallbackManager

@@ -18,16 +18,14 @@ from langchain.callbacks.manager import (
    CallbackManagerForChainRun,
    Callbacks,
)
from langchain.load.dump import dumpd
from langchain.load.serializable import Serializable
from langchain.schema import RUN_KEY, BaseMemory, RunInfo
from langchain.schema import BaseMemory


def _get_verbosity() -> bool:
    return langchain.verbose


class Chain(Serializable, ABC):
class Chain(BaseModel, ABC):
    """Base interface that all chains should implement."""

    memory: Optional[BaseMemory] = None

@@ -110,8 +108,6 @@ class Chain(Serializable, ABC):
        inputs: Union[Dict[str, Any], Any],
        return_only_outputs: bool = False,
        callbacks: Callbacks = None,
        *,
        include_run_info: bool = False,
    ) -> Dict[str, Any]:
        """Run the logic of this chain and add to output if desired.

@@ -122,10 +118,7 @@ class Chain(Serializable, ABC):
                response. If True, only new keys generated by this chain will be
                returned. If False, both input keys and new keys generated by this
                chain will be returned. Defaults to False.
            callbacks: Callbacks to use for this chain run. If not provided, will
                use the callbacks provided to the chain.
            include_run_info: Whether to include run info in the response. Defaults
                to False.

        """
        inputs = self.prep_inputs(inputs)
        callback_manager = CallbackManager.configure(

@@ -133,7 +126,7 @@ class Chain(Serializable, ABC):
        )
        new_arg_supported = inspect.signature(self._call).parameters.get("run_manager")
        run_manager = callback_manager.on_chain_start(
            dumpd(self),
            {"name": self.__class__.__name__},
            inputs,
        )
        try:

@@ -146,20 +139,13 @@ class Chain(Serializable, ABC):
            run_manager.on_chain_error(e)
            raise e
        run_manager.on_chain_end(outputs)
        final_outputs: Dict[str, Any] = self.prep_outputs(
            inputs, outputs, return_only_outputs
        )
        if include_run_info:
            final_outputs[RUN_KEY] = RunInfo(run_id=run_manager.run_id)
        return final_outputs
        return self.prep_outputs(inputs, outputs, return_only_outputs)

    async def acall(
        self,
        inputs: Union[Dict[str, Any], Any],
        return_only_outputs: bool = False,
        callbacks: Callbacks = None,
        *,
        include_run_info: bool = False,
    ) -> Dict[str, Any]:
        """Run the logic of this chain and add to output if desired.

@@ -170,10 +156,7 @@ class Chain(Serializable, ABC):
                response. If True, only new keys generated by this chain will be
                returned. If False, both input keys and new keys generated by this
                chain will be returned. Defaults to False.
            callbacks: Callbacks to use for this chain run. If not provided, will
                use the callbacks provided to the chain.
            include_run_info: Whether to include run info in the response. Defaults
                to False.

        """
        inputs = self.prep_inputs(inputs)
        callback_manager = AsyncCallbackManager.configure(

@@ -181,7 +164,7 @@ class Chain(Serializable, ABC):
        )
        new_arg_supported = inspect.signature(self._acall).parameters.get("run_manager")
        run_manager = await callback_manager.on_chain_start(
            dumpd(self),
            {"name": self.__class__.__name__},
            inputs,
        )
        try:

@@ -194,12 +177,7 @@ class Chain(Serializable, ABC):
            await run_manager.on_chain_error(e)
            raise e
        await run_manager.on_chain_end(outputs)
        final_outputs: Dict[str, Any] = self.prep_outputs(
            inputs, outputs, return_only_outputs
        )
        if include_run_info:
            final_outputs[RUN_KEY] = RunInfo(run_id=run_manager.run_id)
        return final_outputs
        return self.prep_outputs(inputs, outputs, return_only_outputs)

    def prep_outputs(
        self,
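The `include_run_info` flag threaded through `__call__` and `acall` above attaches a `RunInfo` carrying the run's UUID under the `RUN_KEY` output key. A hedged usage sketch; it assumes `chain` is some configured `Chain` instance on the branch that keeps this API:

```python
# Hypothetical usage; `chain` is assumed to be any configured Chain instance.
from langchain.schema import RUN_KEY

outputs = chain({"question": "What is LangChain?"}, include_run_info=True)
run_info = outputs[RUN_KEY]  # RunInfo(run_id=<UUID assigned by the callback manager>)
print(run_info.run_id)       # useful for correlating outputs with tracing backends
```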
@@ -12,7 +12,6 @@ from langchain.base_language import BaseLanguageModel
from langchain.callbacks.manager import (
    AsyncCallbackManagerForChainRun,
    CallbackManagerForChainRun,
    Callbacks,
)
from langchain.chains.base import Chain
from langchain.chains.combine_documents.base import BaseCombineDocumentsChain

@@ -205,7 +204,6 @@ class ConversationalRetrievalChain(BaseConversationalRetrievalChain):
        verbose: bool = False,
        condense_question_llm: Optional[BaseLanguageModel] = None,
        combine_docs_chain_kwargs: Optional[Dict] = None,
        callbacks: Callbacks = None,
        **kwargs: Any,
    ) -> BaseConversationalRetrievalChain:
        """Load chain from LLM."""

@@ -214,22 +212,17 @@ class ConversationalRetrievalChain(BaseConversationalRetrievalChain):
            llm,
            chain_type=chain_type,
            verbose=verbose,
            callbacks=callbacks,
            **combine_docs_chain_kwargs,
        )

        _llm = condense_question_llm or llm
        condense_question_chain = LLMChain(
            llm=_llm,
            prompt=condense_question_prompt,
            verbose=verbose,
            callbacks=callbacks,
            llm=_llm, prompt=condense_question_prompt, verbose=verbose
        )
        return cls(
            retriever=retriever,
            combine_docs_chain=doc_chain,
            question_generator=condense_question_chain,
            callbacks=callbacks,
            **kwargs,
        )

@@ -271,7 +264,6 @@ class ChatVectorDBChain(BaseConversationalRetrievalChain):
        condense_question_prompt: BasePromptTemplate = CONDENSE_QUESTION_PROMPT,
        chain_type: str = "stuff",
        combine_docs_chain_kwargs: Optional[Dict] = None,
        callbacks: Callbacks = None,
        **kwargs: Any,
    ) -> BaseConversationalRetrievalChain:
        """Load chain from LLM."""

@@ -279,16 +271,12 @@ class ChatVectorDBChain(BaseConversationalRetrievalChain):
        doc_chain = load_qa_chain(
            llm,
            chain_type=chain_type,
            callbacks=callbacks,
            **combine_docs_chain_kwargs,
        )
        condense_question_chain = LLMChain(
            llm=llm, prompt=condense_question_prompt, callbacks=callbacks
        )
        condense_question_chain = LLMChain(llm=llm, prompt=condense_question_prompt)
        return cls(
            vectorstore=vectorstore,
            combine_docs_chain=doc_chain,
            question_generator=condense_question_chain,
            callbacks=callbacks,
            **kwargs,
        )
@@ -14,8 +14,6 @@ from langchain.chains.llm import LLMChain
from langchain.graphs.neo4j_graph import Neo4jGraph
from langchain.prompts.base import BasePromptTemplate

INTERMEDIATE_STEPS_KEY = "intermediate_steps"


def extract_cypher(text: str) -> str:
    # The pattern to find Cypher code enclosed in triple backticks

@@ -35,12 +33,6 @@ class GraphCypherQAChain(Chain):
    qa_chain: LLMChain
    input_key: str = "query"  #: :meta private:
    output_key: str = "result"  #: :meta private:
    top_k: int = 10
    """Number of results to return from the query"""
    return_intermediate_steps: bool = False
    """Whether or not to return the intermediate steps along with the final answer."""
    return_direct: bool = False
    """Whether or not to return the result of querying the graph directly."""

    @property
    def input_keys(self) -> List[str]:

@@ -82,14 +74,12 @@ class GraphCypherQAChain(Chain):
        self,
        inputs: Dict[str, Any],
        run_manager: Optional[CallbackManagerForChainRun] = None,
    ) -> Dict[str, Any]:
    ) -> Dict[str, str]:
        """Generate Cypher statement, use it to look up in db and answer question."""
        _run_manager = run_manager or CallbackManagerForChainRun.get_noop_manager()
        callbacks = _run_manager.get_child()
        question = inputs[self.input_key]

        intermediate_steps: List = []

        generated_cypher = self.cypher_generation_chain.run(
            {"question": question, "schema": self.graph.get_schema}, callbacks=callbacks
        )

@@ -101,30 +91,14 @@ class GraphCypherQAChain(Chain):
        _run_manager.on_text(
            generated_cypher, color="green", end="\n", verbose=self.verbose
        )
        context = self.graph.query(generated_cypher)

        intermediate_steps.append({"query": generated_cypher})

        # Retrieve and limit the number of results
        context = self.graph.query(generated_cypher)[: self.top_k]

        if self.return_direct:
            final_result = context
        else:
            _run_manager.on_text("Full Context:", end="\n", verbose=self.verbose)
            _run_manager.on_text(
                str(context), color="green", end="\n", verbose=self.verbose
            )

            intermediate_steps.append({"context": context})

            result = self.qa_chain(
                {"question": question, "context": context},
                callbacks=callbacks,
            )
            final_result = result[self.qa_chain.output_key]

        chain_result: Dict[str, Any] = {self.output_key: final_result}
        if self.return_intermediate_steps:
            chain_result[INTERMEDIATE_STEPS_KEY] = intermediate_steps

        return chain_result
        _run_manager.on_text("Full Context:", end="\n", verbose=self.verbose)
        _run_manager.on_text(
            str(context), color="green", end="\n", verbose=self.verbose
        )
        result = self.qa_chain(
            {"question": question, "context": context},
            callbacks=callbacks,
        )
        return {self.output_key: result[self.qa_chain.output_key]}
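With `return_intermediate_steps=True`, the reworked `_call` above returns the generated Cypher and the (now `top_k`-limited) graph context alongside the answer. A sketch of the resulting output shape, with invented placeholder values:

```python
# Hypothetical output shape under this branch's semantics (values invented):
chain_result = {
    "result": "Tom Hanks played Forrest Gump.",
    "intermediate_steps": [
        {"query": "MATCH (a:Actor)-[:ACTED_IN]->(m:Movie) RETURN a.name"},  # generated Cypher
        {"context": [{"a.name": "Tom Hanks"}]},  # already sliced to at most top_k rows
    ],
}
```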
@@ -1,91 +0,0 @@
"""Question answering over a graph."""
from __future__ import annotations

from typing import Any, Dict, List, Optional

from pydantic import Field

from langchain.base_language import BaseLanguageModel
from langchain.callbacks.manager import CallbackManagerForChainRun
from langchain.chains.base import Chain
from langchain.chains.graph_qa.prompts import CYPHER_QA_PROMPT, NGQL_GENERATION_PROMPT
from langchain.chains.llm import LLMChain
from langchain.graphs.nebula_graph import NebulaGraph
from langchain.prompts.base import BasePromptTemplate


class NebulaGraphQAChain(Chain):
    """Chain for question-answering against a graph by generating nGQL statements."""

    graph: NebulaGraph = Field(exclude=True)
    ngql_generation_chain: LLMChain
    qa_chain: LLMChain
    input_key: str = "query"  #: :meta private:
    output_key: str = "result"  #: :meta private:

    @property
    def input_keys(self) -> List[str]:
        """Return the input keys.

        :meta private:
        """
        return [self.input_key]

    @property
    def output_keys(self) -> List[str]:
        """Return the output keys.

        :meta private:
        """
        _output_keys = [self.output_key]
        return _output_keys

    @classmethod
    def from_llm(
        cls,
        llm: BaseLanguageModel,
        *,
        qa_prompt: BasePromptTemplate = CYPHER_QA_PROMPT,
        ngql_prompt: BasePromptTemplate = NGQL_GENERATION_PROMPT,
        **kwargs: Any,
    ) -> NebulaGraphQAChain:
        """Initialize from LLM."""
        qa_chain = LLMChain(llm=llm, prompt=qa_prompt)
        ngql_generation_chain = LLMChain(llm=llm, prompt=ngql_prompt)

        return cls(
            qa_chain=qa_chain,
            ngql_generation_chain=ngql_generation_chain,
            **kwargs,
        )

    def _call(
        self,
        inputs: Dict[str, Any],
        run_manager: Optional[CallbackManagerForChainRun] = None,
    ) -> Dict[str, str]:
        """Generate nGQL statement, use it to look up in db and answer question."""
        _run_manager = run_manager or CallbackManagerForChainRun.get_noop_manager()
        callbacks = _run_manager.get_child()
        question = inputs[self.input_key]

        generated_ngql = self.ngql_generation_chain.run(
            {"question": question, "schema": self.graph.get_schema}, callbacks=callbacks
        )

        _run_manager.on_text("Generated nGQL:", end="\n", verbose=self.verbose)
        _run_manager.on_text(
            generated_ngql, color="green", end="\n", verbose=self.verbose
        )
        context = self.graph.query(generated_ngql)

        _run_manager.on_text("Full Context:", end="\n", verbose=self.verbose)
        _run_manager.on_text(
            str(context), color="green", end="\n", verbose=self.verbose
        )

        result = self.qa_chain(
            {"question": question, "context": context},
            callbacks=callbacks,
        )
        return {self.output_key: result[self.qa_chain.output_key]}
@@ -49,29 +49,6 @@ CYPHER_GENERATION_PROMPT = PromptTemplate(
    input_variables=["schema", "question"], template=CYPHER_GENERATION_TEMPLATE
)

NEBULAGRAPH_EXTRA_INSTRUCTIONS = """
Instructions:

First, generate cypher then convert it to NebulaGraph Cypher dialect(rather than standard):
1. it requires explicit label specification when referring to node properties: v.`Foo`.name
2. it uses double equals sign for comparison: `==` rather than `=`
For instance:
```diff
< MATCH (p:person)-[:directed]->(m:movie) WHERE m.name = 'The Godfather II'
< RETURN p.name;
---
> MATCH (p:`person`)-[:directed]->(m:`movie`) WHERE m.`movie`.`name` == 'The Godfather II'
> RETURN p.`person`.`name`;
```\n"""

NGQL_GENERATION_TEMPLATE = CYPHER_GENERATION_TEMPLATE.replace(
    "Generate Cypher", "Generate NebulaGraph Cypher"
).replace("Instructions:", NEBULAGRAPH_EXTRA_INSTRUCTIONS)

NGQL_GENERATION_PROMPT = PromptTemplate(
    input_variables=["schema", "question"], template=NGQL_GENERATION_TEMPLATE
)

CYPHER_QA_TEMPLATE = """You are an assistant that helps to form nice and human understandable answers.
The information part contains the provided information that you must use to construct an answer.
The provided information is authorative, you must never doubt it or try to use your internal knowledge to correct it.
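The nGQL prompt above is not written from scratch; it is derived from the Cypher template by plain string replacement. A runnable sketch of that derivation with abbreviated stand-in templates (the real templates live in the prompts module shown above):

```python
# Stand-in templates (abbreviated and invented for illustration).
CYPHER_GENERATION_TEMPLATE = (
    "Task: Generate Cypher statement to query a graph database.\n"
    "Instructions:\n"
    "Schema: {schema}\n"
    "Question: {question}"
)
NEBULAGRAPH_EXTRA_INSTRUCTIONS = (
    "Instructions:\n"
    "First, generate cypher then convert it to NebulaGraph Cypher dialect...\n"
)

# Same two-step replace as in the diff: retitle the task, then splice the
# NebulaGraph-specific instructions in at the "Instructions:" anchor.
NGQL_GENERATION_TEMPLATE = CYPHER_GENERATION_TEMPLATE.replace(
    "Generate Cypher", "Generate NebulaGraph Cypher"
).replace("Instructions:", NEBULAGRAPH_EXTRA_INSTRUCTIONS)

print(NGQL_GENERATION_TEMPLATE)
```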
@@ -15,7 +15,6 @@ from langchain.callbacks.manager import (
)
from langchain.chains.base import Chain
from langchain.input import get_colored_text
from langchain.load.dump import dumpd
from langchain.prompts.base import BasePromptTemplate
from langchain.prompts.prompt import PromptTemplate
from langchain.schema import LLMResult, PromptValue

@@ -35,10 +34,6 @@ class LLMChain(Chain):
            llm = LLMChain(llm=OpenAI(), prompt=prompt)
    """

    @property
    def lc_serializable(self) -> bool:
        return True

    prompt: BasePromptTemplate
    """Prompt object to use."""
    llm: BaseLanguageModel

@@ -152,7 +147,7 @@ class LLMChain(Chain):
            callbacks, self.callbacks, self.verbose
        )
        run_manager = callback_manager.on_chain_start(
            dumpd(self),
            {"name": self.__class__.__name__},
            {"input_list": input_list},
        )
        try:

@@ -172,7 +167,7 @@ class LLMChain(Chain):
            callbacks, self.callbacks, self.verbose
        )
        run_manager = await callback_manager.on_chain_start(
            dumpd(self),
            {"name": self.__class__.__name__},
            {"input_list": input_list},
        )
        try:
@@ -20,7 +20,7 @@ from langchain.chains.llm_requests import LLMRequestsChain
from langchain.chains.pal.base import PALChain
from langchain.chains.qa_with_sources.base import QAWithSourcesChain
from langchain.chains.qa_with_sources.vector_db import VectorDBQAWithSourcesChain
from langchain.chains.retrieval_qa.base import RetrievalQA, VectorDBQA
from langchain.chains.retrieval_qa.base import VectorDBQA
from langchain.chains.sql_database.base import SQLDatabaseChain
from langchain.llms.loading import load_llm, load_llm_from_config
from langchain.prompts.loading import load_prompt, load_prompt_from_config

@@ -372,28 +372,6 @@ def _load_vector_db_qa_with_sources_chain(
    )


def _load_retrieval_qa(config: dict, **kwargs: Any) -> RetrievalQA:
    if "retriever" in kwargs:
        retriever = kwargs.pop("retriever")
    else:
        raise ValueError("`retriever` must be present.")
    if "combine_documents_chain" in config:
        combine_documents_chain_config = config.pop("combine_documents_chain")
        combine_documents_chain = load_chain_from_config(combine_documents_chain_config)
    elif "combine_documents_chain_path" in config:
        combine_documents_chain = load_chain(config.pop("combine_documents_chain_path"))
    else:
        raise ValueError(
            "One of `combine_documents_chain` or "
            "`combine_documents_chain_path` must be present."
        )
    return RetrievalQA(
        combine_documents_chain=combine_documents_chain,
        retriever=retriever,
        **config,
    )


def _load_vector_db_qa(config: dict, **kwargs: Any) -> VectorDBQA:
    if "vectorstore" in kwargs:
        vectorstore = kwargs.pop("vectorstore")

@@ -481,7 +459,6 @@ type_to_loader_dict = {
    "sql_database_chain": _load_sql_database_chain,
    "vector_db_qa_with_sources_chain": _load_vector_db_qa_with_sources_chain,
    "vector_db_qa": _load_vector_db_qa,
    "retrieval_qa": _load_retrieval_qa,
}
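The `_load_retrieval_qa` loader added or removed here treats the retriever as runtime state: it must arrive via `**kwargs`, while only the combine-documents chain is rebuilt from serialized config. A minimal self-contained sketch of that split (the function and config keys below are invented stand-ins, not the real loader):

```python
from typing import Any

def load_with_runtime_dep(config: dict, **kwargs: Any) -> dict:
    # Mirrors the pattern above: runtime objects (the retriever) come from
    # kwargs; everything serializable comes from config.
    if "retriever" in kwargs:
        retriever = kwargs.pop("retriever")
    else:
        raise ValueError("`retriever` must be present.")
    return {"retriever": retriever, **config}

print(load_with_runtime_dep({"k": 4}, retriever=object()))
```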
@@ -149,7 +149,7 @@ def load_qa_with_sources_chain(
    Args:
        llm: Language Model to use in the chain.
        chain_type: Type of document combining chain to use. Should be one of "stuff",
            "map_reduce", "refine" and "map_rerank".
            "map_reduce", and "refine".
        verbose: Whether chains should be run in verbose mode or not. Note that this
            applies to all chains that make up the final chain.
@@ -3,7 +3,6 @@ from typing import Any, Mapping, Optional, Protocol

from langchain.base_language import BaseLanguageModel
from langchain.callbacks.base import BaseCallbackManager
from langchain.callbacks.manager import Callbacks
from langchain.chains.combine_documents.base import BaseCombineDocumentsChain
from langchain.chains.combine_documents.map_reduce import MapReduceDocumentsChain
from langchain.chains.combine_documents.map_rerank import MapRerankDocumentsChain

@@ -36,15 +35,10 @@ def _load_map_rerank_chain(
    rank_key: str = "score",
    answer_key: str = "answer",
    callback_manager: Optional[BaseCallbackManager] = None,
    callbacks: Callbacks = None,
    **kwargs: Any,
) -> MapRerankDocumentsChain:
    llm_chain = LLMChain(
        llm=llm,
        prompt=prompt,
        verbose=verbose,
        callback_manager=callback_manager,
        callbacks=callbacks,
        llm=llm, prompt=prompt, verbose=verbose, callback_manager=callback_manager
    )
    return MapRerankDocumentsChain(
        llm_chain=llm_chain,

@@ -63,16 +57,11 @@ def _load_stuff_chain(
    document_variable_name: str = "context",
    verbose: Optional[bool] = None,
    callback_manager: Optional[BaseCallbackManager] = None,
    callbacks: Callbacks = None,
    **kwargs: Any,
) -> StuffDocumentsChain:
    _prompt = prompt or stuff_prompt.PROMPT_SELECTOR.get_prompt(llm)
    llm_chain = LLMChain(
        llm=llm,
        prompt=_prompt,
        verbose=verbose,
        callback_manager=callback_manager,
        callbacks=callbacks,
        llm=llm, prompt=_prompt, verbose=verbose, callback_manager=callback_manager
    )
    # TODO: document prompt
    return StuffDocumentsChain(

@@ -95,7 +84,6 @@ def _load_map_reduce_chain(
    collapse_llm: Optional[BaseLanguageModel] = None,
    verbose: Optional[bool] = None,
    callback_manager: Optional[BaseCallbackManager] = None,
    callbacks: Callbacks = None,
    **kwargs: Any,
) -> MapReduceDocumentsChain:
    _question_prompt = (

@@ -109,7 +97,6 @@ def _load_map_reduce_chain(
        prompt=_question_prompt,
        verbose=verbose,
        callback_manager=callback_manager,
        callbacks=callbacks,
    )
    _reduce_llm = reduce_llm or llm
    reduce_chain = LLMChain(

@@ -117,7 +104,6 @@ def _load_map_reduce_chain(
        prompt=_combine_prompt,
        verbose=verbose,
        callback_manager=callback_manager,
        callbacks=callbacks,
    )
    # TODO: document prompt
    combine_document_chain = StuffDocumentsChain(

@@ -125,7 +111,6 @@ def _load_map_reduce_chain(
        document_variable_name=combine_document_variable_name,
        verbose=verbose,
        callback_manager=callback_manager,
        callbacks=callbacks,
    )
    if collapse_prompt is None:
        collapse_chain = None

@@ -142,7 +127,6 @@ def _load_map_reduce_chain(
                prompt=collapse_prompt,
                verbose=verbose,
                callback_manager=callback_manager,
                callbacks=callbacks,
            ),
            document_variable_name=combine_document_variable_name,
            verbose=verbose,

@@ -155,7 +139,6 @@ def _load_map_reduce_chain(
        collapse_document_chain=collapse_chain,
        verbose=verbose,
        callback_manager=callback_manager,
        callbacks=callbacks,
        **kwargs,
    )


@@ -169,7 +152,6 @@ def _load_refine_chain(
    refine_llm: Optional[BaseLanguageModel] = None,
    verbose: Optional[bool] = None,
    callback_manager: Optional[BaseCallbackManager] = None,
    callbacks: Callbacks = None,
    **kwargs: Any,
) -> RefineDocumentsChain:
    _question_prompt = (

@@ -183,7 +165,6 @@ def _load_refine_chain(
        prompt=_question_prompt,
        verbose=verbose,
        callback_manager=callback_manager,
        callbacks=callbacks,
    )
    _refine_llm = refine_llm or llm
    refine_chain = LLMChain(

@@ -191,7 +172,6 @@ def _load_refine_chain(
        prompt=_refine_prompt,
        verbose=verbose,
        callback_manager=callback_manager,
        callbacks=callbacks,
    )
    return RefineDocumentsChain(
        initial_llm_chain=initial_chain,
@@ -183,11 +183,6 @@ class RetrievalQA(BaseRetrievalQA):
    async def _aget_docs(self, question: str) -> List[Document]:
        return await self.retriever.aget_relevant_documents(question)

    @property
    def _chain_type(self) -> str:
        """Return the chain type."""
        return "retrieval_qa"


class VectorDBQA(BaseRetrievalQA):
    """Chain for question-answering against a vector database."""
@@ -44,10 +44,6 @@ class ChatAnthropic(BaseChatModel, _AnthropicCommon):
        """Return type of chat model."""
        return "anthropic-chat"

    @property
    def lc_serializable(self) -> bool:
        return True

    def _convert_one_message_to_text(self, message: BaseMessage) -> str:
        if isinstance(message, ChatMessage):
            message_text = f"\n\n{message.role.capitalize()}: {message.content}"

@@ -98,10 +94,9 @@ class ChatAnthropic(BaseChatModel, _AnthropicCommon):
        messages: List[BaseMessage],
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> ChatResult:
        prompt = self._convert_messages_to_prompt(messages)
        params: Dict[str, Any] = {"prompt": prompt, **self._default_params, **kwargs}
        params: Dict[str, Any] = {"prompt": prompt, **self._default_params}
        if stop:
            params["stop_sequences"] = stop

@@ -126,10 +121,9 @@ class ChatAnthropic(BaseChatModel, _AnthropicCommon):
        messages: List[BaseMessage],
        stop: Optional[List[str]] = None,
        run_manager: Optional[AsyncCallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> ChatResult:
        prompt = self._convert_messages_to_prompt(messages)
        params: Dict[str, Any] = {"prompt": prompt, **self._default_params, **kwargs}
        params: Dict[str, Any] = {"prompt": prompt, **self._default_params}
        if stop:
            params["stop_sequences"] = stop
@@ -53,33 +53,33 @@ class AzureChatOpenAI(ChatOpenAI):
    @root_validator()
    def validate_environment(cls, values: Dict) -> Dict:
        """Validate that api key and python package exists in environment."""
        values["openai_api_key"] = get_from_dict_or_env(
        openai_api_key = get_from_dict_or_env(
            values,
            "openai_api_key",
            "OPENAI_API_KEY",
        )
        values["openai_api_base"] = get_from_dict_or_env(
        openai_api_base = get_from_dict_or_env(
            values,
            "openai_api_base",
            "OPENAI_API_BASE",
        )
        values["openai_api_version"] = get_from_dict_or_env(
        openai_api_version = get_from_dict_or_env(
            values,
            "openai_api_version",
            "OPENAI_API_VERSION",
        )
        values["openai_api_type"] = get_from_dict_or_env(
        openai_api_type = get_from_dict_or_env(
            values,
            "openai_api_type",
            "OPENAI_API_TYPE",
        )
        values["openai_organization"] = get_from_dict_or_env(
        openai_organization = get_from_dict_or_env(
            values,
            "openai_organization",
            "OPENAI_ORGANIZATION",
            default="",
        )
        values["openai_proxy"] = get_from_dict_or_env(
        openai_proxy = get_from_dict_or_env(
            values,
            "openai_proxy",
            "OPENAI_PROXY",

@@ -88,6 +88,14 @@ class AzureChatOpenAI(ChatOpenAI):
        try:
            import openai

            openai.api_type = openai_api_type
            openai.api_base = openai_api_base
            openai.api_version = openai_api_version
            openai.api_key = openai_api_key
            if openai_organization:
                openai.organization = openai_organization
            if openai_proxy:
                openai.proxy = {"http": openai_proxy, "https": openai_proxy}  # type: ignore[assignment]  # noqa: E501
        except ImportError:
            raise ImportError(
                "Could not import openai python package. "

@@ -120,14 +128,6 @@ class AzureChatOpenAI(ChatOpenAI):
        """Get the identifying parameters."""
        return {**self._default_params}

    @property
    def _invocation_params(self) -> Mapping[str, Any]:
        openai_creds = {
            "api_type": self.openai_api_type,
            "api_version": self.openai_api_version,
        }
        return {**openai_creds, **super()._invocation_params}

    @property
    def _llm_type(self) -> str:
        return "azure-openai-chat"
@@ -17,7 +17,6 @@ from langchain.callbacks.manager import (
    CallbackManagerForLLMRun,
    Callbacks,
)
from langchain.load.dump import dumpd
from langchain.schema import (
    AIMessage,
    BaseMessage,

@@ -26,7 +25,6 @@ from langchain.schema import (
    HumanMessage,
    LLMResult,
    PromptValue,
    RunInfo,
)


@@ -65,19 +63,17 @@ class BaseChatModel(BaseLanguageModel, ABC):
        messages: List[List[BaseMessage]],
        stop: Optional[List[str]] = None,
        callbacks: Callbacks = None,
        **kwargs: Any,
    ) -> LLMResult:
        """Top Level call"""

        params = self.dict()
        params["stop"] = stop
        options = {"stop": stop}

        callback_manager = CallbackManager.configure(
            callbacks, self.callbacks, self.verbose
        )
        run_manager = callback_manager.on_chat_model_start(
            dumpd(self), messages, invocation_params=params, options=options
            {"name": self.__class__.__name__}, messages, invocation_params=params
        )

        new_arg_supported = inspect.signature(self._generate).parameters.get(

@@ -85,7 +81,7 @@ class BaseChatModel(BaseLanguageModel, ABC):
        )
        try:
            results = [
                self._generate(m, stop=stop, run_manager=run_manager, **kwargs)
                self._generate(m, stop=stop, run_manager=run_manager)
                if new_arg_supported
                else self._generate(m, stop=stop)
                for m in messages

@@ -97,8 +93,6 @@ class BaseChatModel(BaseLanguageModel, ABC):
        generations = [res.generations for res in results]
        output = LLMResult(generations=generations, llm_output=llm_output)
        run_manager.on_llm_end(output)
        if run_manager:
            output.run = RunInfo(run_id=run_manager.run_id)
        return output

    async def agenerate(

@@ -106,18 +100,16 @@ class BaseChatModel(BaseLanguageModel, ABC):
        messages: List[List[BaseMessage]],
        stop: Optional[List[str]] = None,
        callbacks: Callbacks = None,
        **kwargs: Any,
    ) -> LLMResult:
        """Top Level call"""
        params = self.dict()
        params["stop"] = stop
        options = {"stop": stop}

        callback_manager = AsyncCallbackManager.configure(
            callbacks, self.callbacks, self.verbose
        )
        run_manager = await callback_manager.on_chat_model_start(
            dumpd(self), messages, invocation_params=params, options=options
            {"name": self.__class__.__name__}, messages, invocation_params=params
        )

        new_arg_supported = inspect.signature(self._agenerate).parameters.get(

@@ -126,7 +118,7 @@ class BaseChatModel(BaseLanguageModel, ABC):
        try:
            results = await asyncio.gather(
                *[
                    self._agenerate(m, stop=stop, run_manager=run_manager, **kwargs)
                    self._agenerate(m, stop=stop, run_manager=run_manager)
                    if new_arg_supported
                    else self._agenerate(m, stop=stop)
                    for m in messages

@@ -139,8 +131,6 @@ class BaseChatModel(BaseLanguageModel, ABC):
        generations = [res.generations for res in results]
        output = LLMResult(generations=generations, llm_output=llm_output)
        await run_manager.on_llm_end(output)
        if run_manager:
            output.run = RunInfo(run_id=run_manager.run_id)
        return output

    def generate_prompt(

@@ -148,22 +138,18 @@ class BaseChatModel(BaseLanguageModel, ABC):
        prompts: List[PromptValue],
        stop: Optional[List[str]] = None,
        callbacks: Callbacks = None,
        **kwargs: Any,
    ) -> LLMResult:
        prompt_messages = [p.to_messages() for p in prompts]
        return self.generate(prompt_messages, stop=stop, callbacks=callbacks, **kwargs)
        return self.generate(prompt_messages, stop=stop, callbacks=callbacks)

    async def agenerate_prompt(
        self,
        prompts: List[PromptValue],
        stop: Optional[List[str]] = None,
        callbacks: Callbacks = None,
        **kwargs: Any,
    ) -> LLMResult:
        prompt_messages = [p.to_messages() for p in prompts]
        return await self.agenerate(
            prompt_messages, stop=stop, callbacks=callbacks, **kwargs
        )
        return await self.agenerate(prompt_messages, stop=stop, callbacks=callbacks)

    @abstractmethod
    def _generate(

@@ -171,7 +157,6 @@ class BaseChatModel(BaseLanguageModel, ABC):
        messages: List[BaseMessage],
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> ChatResult:
        """Top Level call"""

@@ -181,7 +166,6 @@ class BaseChatModel(BaseLanguageModel, ABC):
        messages: List[BaseMessage],
        stop: Optional[List[str]] = None,
        run_manager: Optional[AsyncCallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> ChatResult:
        """Top Level call"""

@@ -190,10 +174,9 @@ class BaseChatModel(BaseLanguageModel, ABC):
        messages: List[BaseMessage],
        stop: Optional[List[str]] = None,
        callbacks: Callbacks = None,
        **kwargs: Any,
    ) -> BaseMessage:
        generation = self.generate(
            [messages], stop=stop, callbacks=callbacks, **kwargs
            [messages], stop=stop, callbacks=callbacks
        ).generations[0][0]
        if isinstance(generation, ChatGeneration):
            return generation.message

@@ -205,69 +188,50 @@ class BaseChatModel(BaseLanguageModel, ABC):
        messages: List[BaseMessage],
        stop: Optional[List[str]] = None,
        callbacks: Callbacks = None,
        **kwargs: Any,
    ) -> BaseMessage:
        result = await self.agenerate(
            [messages], stop=stop, callbacks=callbacks, **kwargs
        )
        result = await self.agenerate([messages], stop=stop, callbacks=callbacks)
        generation = result.generations[0][0]
        if isinstance(generation, ChatGeneration):
            return generation.message
        else:
            raise ValueError("Unexpected generation type")

    def call_as_llm(
        self, message: str, stop: Optional[List[str]] = None, **kwargs: Any
    ) -> str:
        return self.predict(message, stop=stop, **kwargs)
    def call_as_llm(self, message: str, stop: Optional[List[str]] = None) -> str:
        return self.predict(message, stop=stop)

    def predict(
        self, text: str, *, stop: Optional[Sequence[str]] = None, **kwargs: Any
    ) -> str:
    def predict(self, text: str, *, stop: Optional[Sequence[str]] = None) -> str:
        if stop is None:
            _stop = None
        else:
            _stop = list(stop)
        result = self([HumanMessage(content=text)], stop=_stop, **kwargs)
        result = self([HumanMessage(content=text)], stop=_stop)
        return result.content

    def predict_messages(
        self,
        messages: List[BaseMessage],
        *,
        stop: Optional[Sequence[str]] = None,
        **kwargs: Any,
        self, messages: List[BaseMessage], *, stop: Optional[Sequence[str]] = None
    ) -> BaseMessage:
        if stop is None:
            _stop = None
        else:
            _stop = list(stop)
        return self(messages, stop=_stop, **kwargs)
        return self(messages, stop=_stop)

    async def apredict(
        self, text: str, *, stop: Optional[Sequence[str]] = None, **kwargs: Any
    ) -> str:
    async def apredict(self, text: str, *, stop: Optional[Sequence[str]] = None) -> str:
        if stop is None:
            _stop = None
        else:
            _stop = list(stop)
        result = await self._call_async(
            [HumanMessage(content=text)], stop=_stop, **kwargs
        )
        result = await self._call_async([HumanMessage(content=text)], stop=_stop)
        return result.content

    async def apredict_messages(
        self,
        messages: List[BaseMessage],
        *,
        stop: Optional[Sequence[str]] = None,
        **kwargs: Any,
        self, messages: List[BaseMessage], *, stop: Optional[Sequence[str]] = None
    ) -> BaseMessage:
        if stop is None:
            _stop = None
        else:
            _stop = list(stop)
        return await self._call_async(messages, stop=_stop, **kwargs)
        return await self._call_async(messages, stop=_stop)

    @property
    def _identifying_params(self) -> Mapping[str, Any]:

@@ -292,9 +256,8 @@ class SimpleChatModel(BaseChatModel):
        messages: List[BaseMessage],
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> ChatResult:
        output_str = self._call(messages, stop=stop, run_manager=run_manager, **kwargs)
        output_str = self._call(messages, stop=stop, run_manager=run_manager)
        message = AIMessage(content=output_str)
        generation = ChatGeneration(message=message)
        return ChatResult(generations=[generation])

@@ -305,7 +268,6 @@ class SimpleChatModel(BaseChatModel):
        messages: List[BaseMessage],
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> str:
        """Simpler interface."""

@@ -314,9 +276,6 @@ class SimpleChatModel(BaseChatModel):
        messages: List[BaseMessage],
        stop: Optional[List[str]] = None,
        run_manager: Optional[AsyncCallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> ChatResult:
        func = partial(
            self._generate, messages, stop=stop, run_manager=run_manager, **kwargs
        )
        func = partial(self._generate, messages, stop=stop, run_manager=run_manager)
        return await asyncio.get_event_loop().run_in_executor(None, func)
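One side of these hunks threads `**kwargs` from `generate` down to `_generate`, but only after checking the implementation's signature, so subclasses written against the older interface keep working. A standalone sketch of that dispatch pattern (function names invented):

```python
import inspect

def dispatch(fn, messages, stop=None, run_manager=None, **kwargs):
    # Only pass run_manager (and extra kwargs) when the implementation declares
    # a run_manager parameter, so old-style subclasses are still callable.
    if inspect.signature(fn).parameters.get("run_manager"):
        return fn(messages, stop=stop, run_manager=run_manager, **kwargs)
    return fn(messages, stop=stop)

def new_style(messages, stop=None, run_manager=None, **kwargs):
    return ("new", messages, stop, run_manager, kwargs)

def old_style(messages, stop=None):
    return ("old", messages, stop)

print(dispatch(new_style, ["hi"], stop=["\n"], run_manager="rm", temperature=0))
print(dispatch(old_style, ["hi"], stop=["\n"]))
```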
@@ -280,7 +280,6 @@ class ChatGooglePalm(BaseChatModel, BaseModel):
        messages: List[BaseMessage],
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> ChatResult:
        prompt = _messages_to_prompt_dict(messages)

@@ -292,7 +291,6 @@ class ChatGooglePalm(BaseChatModel, BaseModel):
            top_p=self.top_p,
            top_k=self.top_k,
            candidate_count=self.n,
            **kwargs,
        )

        return _response_to_result(response, stop)

@@ -302,7 +300,6 @@ class ChatGooglePalm(BaseChatModel, BaseModel):
        messages: List[BaseMessage],
        stop: Optional[List[str]] = None,
        run_manager: Optional[AsyncCallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> ChatResult:
        prompt = _messages_to_prompt_dict(messages)
@@ -92,17 +92,12 @@ async def acompletion_with_retry(llm: ChatOpenAI, **kwargs: Any) -> Any:
    return await _completion_with_retry(**kwargs)


def _convert_dict_to_message(_dict: Mapping[str, Any]) -> BaseMessage:
def _convert_dict_to_message(_dict: dict) -> BaseMessage:
    role = _dict["role"]
    if role == "user":
        return HumanMessage(content=_dict["content"])
    elif role == "assistant":
        content = _dict["content"] or ""  # OpenAI returns None for tool invocations
        if _dict.get("function_call"):
            additional_kwargs = {"function_call": dict(_dict["function_call"])}
        else:
            additional_kwargs = {}
        return AIMessage(content=content, additional_kwargs=additional_kwargs)
        return AIMessage(content=_dict["content"])
    elif role == "system":
        return SystemMessage(content=_dict["content"])
    else:

@@ -116,8 +111,6 @@ def _convert_message_to_dict(message: BaseMessage) -> dict:
        message_dict = {"role": "user", "content": message.content}
    elif isinstance(message, AIMessage):
        message_dict = {"role": "assistant", "content": message.content}
        if "function_call" in message.additional_kwargs:
            message_dict["function_call"] = message.additional_kwargs["function_call"]
    elif isinstance(message, SystemMessage):
        message_dict = {"role": "system", "content": message.content}
    else:

@@ -143,10 +136,6 @@ class ChatOpenAI(BaseChatModel):
            openai = ChatOpenAI(model_name="gpt-3.5-turbo")
    """

    @property
    def lc_serializable(self) -> bool:
        return True

    client: Any  #: :meta private:
    model_name: str = Field(default="gpt-3.5-turbo", alias="model")
    """Model name to use."""

@@ -207,22 +196,22 @@ class ChatOpenAI(BaseChatModel):
    @root_validator()
    def validate_environment(cls, values: Dict) -> Dict:
        """Validate that api key and python package exists in environment."""
        values["openai_api_key"] = get_from_dict_or_env(
        openai_api_key = get_from_dict_or_env(
            values, "openai_api_key", "OPENAI_API_KEY"
        )
        values["openai_organization"] = get_from_dict_or_env(
        openai_organization = get_from_dict_or_env(
            values,
            "openai_organization",
            "OPENAI_ORGANIZATION",
            default="",
        )
        values["openai_api_base"] = get_from_dict_or_env(
        openai_api_base = get_from_dict_or_env(
            values,
            "openai_api_base",
            "OPENAI_API_BASE",
            default="",
        )
        values["openai_proxy"] = get_from_dict_or_env(
        openai_proxy = get_from_dict_or_env(
            values,
            "openai_proxy",
            "OPENAI_PROXY",

@@ -236,6 +225,13 @@ class ChatOpenAI(BaseChatModel):
                "Could not import openai python package. "
                "Please install it with `pip install openai`."
            )
        openai.api_key = openai_api_key
        if openai_organization:
            openai.organization = openai_organization
        if openai_api_base:
            openai.api_base = openai_api_base
        if openai_proxy:
            openai.proxy = {"http": openai_proxy, "https": openai_proxy}  # type: ignore[assignment]  # noqa: E501
        try:
            values["client"] = openai.ChatCompletion
        except AttributeError:

@@ -313,10 +309,8 @@ class ChatOpenAI(BaseChatModel):
        messages: List[BaseMessage],
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> ChatResult:
        message_dicts, params = self._create_message_dicts(messages, stop)
        params = {**params, **kwargs}
        if self.streaming:
            inner_completion = ""
            role = "assistant"

@@ -339,7 +333,7 @@ class ChatOpenAI(BaseChatModel):
    def _create_message_dicts(
        self, messages: List[BaseMessage], stop: Optional[List[str]]
    ) -> Tuple[List[Dict[str, Any]], Dict[str, Any]]:
        params = dict(self._invocation_params)
        params: Dict[str, Any] = {**{"model": self.model_name}, **self._default_params}
        if stop is not None:
            if "stop" in params:
                raise ValueError("`stop` found in both the input and default params.")

@@ -361,10 +355,8 @@ class ChatOpenAI(BaseChatModel):
        messages: List[BaseMessage],
        stop: Optional[List[str]] = None,
        run_manager: Optional[AsyncCallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> ChatResult:
        message_dicts, params = self._create_message_dicts(messages, stop)
        params = {**params, **kwargs}
        if self.streaming:
            inner_completion = ""
            role = "assistant"

@@ -392,21 +384,6 @@ class ChatOpenAI(BaseChatModel):
        """Get the identifying parameters."""
        return {**{"model_name": self.model_name}, **self._default_params}

    @property
    def _invocation_params(self) -> Mapping[str, Any]:
        """Get the parameters used to invoke the model."""
        openai_creds: Dict[str, Any] = {
            "api_key": self.openai_api_key,
            "api_base": self.openai_api_base,
            "organization": self.openai_organization,
            "model": self.model_name,
        }
        if self.openai_proxy:
            import openai

            openai.proxy = {"http": self.openai_proxy, "https": self.openai_proxy}  # type: ignore[assignment]  # noqa: E501
        return {**openai_creds, **self._default_params}

    @property
    def _llm_type(self) -> str:
        """Return type of chat model."""
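The richer side of `_convert_dict_to_message` handles OpenAI function calling: `content` can be `None` for tool invocations, and any `function_call` payload is preserved in `additional_kwargs`. A standalone sketch; the `AIMessage` below is a stand-in dataclass, not the langchain type:

```python
from dataclasses import dataclass, field
from typing import Any, Dict

@dataclass
class AIMessage:  # stand-in for langchain.schema.AIMessage
    content: str
    additional_kwargs: Dict[str, Any] = field(default_factory=dict)

def convert_assistant(_dict: dict) -> AIMessage:
    content = _dict["content"] or ""  # OpenAI returns None for tool invocations
    if _dict.get("function_call"):
        return AIMessage(content, {"function_call": dict(_dict["function_call"])})
    return AIMessage(content)

print(convert_assistant({"role": "assistant", "content": None,
                         "function_call": {"name": "search", "arguments": "{}"}}))
```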
@@ -42,7 +42,6 @@ class PromptLayerChatOpenAI(ChatOpenAI):
        messages: List[BaseMessage],
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any
    ) -> ChatResult:
        """Call ChatOpenAI generate and then call PromptLayer API to log the request."""
        from promptlayer.utils import get_api_key, promptlayer_api_request

@@ -55,7 +54,6 @@ class PromptLayerChatOpenAI(ChatOpenAI):
            response_dict, params = super()._create_message_dicts(
                [generation.message], stop
            )
            params = {**params, **kwargs}
            pl_request_id = promptlayer_api_request(
                "langchain.PromptLayerChatOpenAI",
                "langchain",

@@ -81,7 +79,6 @@ class PromptLayerChatOpenAI(ChatOpenAI):
        messages: List[BaseMessage],
        stop: Optional[List[str]] = None,
        run_manager: Optional[AsyncCallbackManagerForLLMRun] = None,
        **kwargs: Any
    ) -> ChatResult:
        """Call ChatOpenAI agenerate and then call PromptLayer to log."""
        from promptlayer.utils import get_api_key, promptlayer_api_request_async

@@ -94,7 +91,6 @@ class PromptLayerChatOpenAI(ChatOpenAI):
            response_dict, params = super()._create_message_dicts(
                [generation.message], stop
            )
            params = {**params, **kwargs}
            pl_request_id = await promptlayer_api_request_async(
                "langchain.PromptLayerChatOpenAI.async",
                "langchain",
@@ -1,6 +1,6 @@
"""Wrapper around Google VertexAI chat-based models."""
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional
from typing import Dict, List, Optional

from pydantic import root_validator

@@ -93,7 +93,6 @@ class ChatVertexAI(_VertexAICommon, BaseChatModel):
        messages: List[BaseMessage],
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> ChatResult:
        """Generate next turn in the conversation.

@@ -120,8 +119,7 @@ class ChatVertexAI(_VertexAICommon, BaseChatModel):

        history = _parse_chat_history(messages[:-1])
        context = history.system_message.content if history.system_message else None
        params = {**self._default_params, **kwargs}
        chat = self.client.start_chat(context=context, **params)
        chat = self.client.start_chat(context=context, **self._default_params)
        for pair in history.history:
            chat._history.append((pair.question.content, pair.answer.content))
        response = chat.send_message(question.content, **self._default_params)

@@ -133,7 +131,6 @@ class ChatVertexAI(_VertexAICommon, BaseChatModel):
        messages: List[BaseMessage],
        stop: Optional[List[str]] = None,
        run_manager: Optional[AsyncCallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> ChatResult:
        raise NotImplementedError(
            """Vertex AI doesn't support async requests at the moment."""
langchain/docstore/artifact_stores.py (new file, 213 lines)
@@ -0,0 +1,213 @@
"""Implement artifact storage using the file system.

This is a simple implementation that stores artifacts in a directory and
metadata in a JSON file. It's used for prototyping.

Metadata should move into SQLite.
"""
from __future__ import annotations

import abc
import json
from pathlib import Path
from typing import (
    TypedDict,
    Sequence,
    Optional,
    Iterator,
    Union,
    List,
    Iterable,
)

from langchain.docstore.base import ArtifactStore, Selector, Artifact, ArtifactWithData
from langchain.docstore.serialization import serialize_document, deserialize_document
from langchain.embeddings.base import Embeddings
from langchain.schema import Document

MaybeDocument = Optional[Document]

PathLike = Union[str, Path]


class Metadata(TypedDict):
    """Metadata format"""

    artifacts: List[Artifact]


class MetadataStore(abc.ABC):
    """Abstract metadata store.

    Need to populate with all required methods.
    """

    @abc.abstractmethod
    def upsert(self, artifact: Artifact):
        """Add the given artifact to the store."""

    @abc.abstractmethod
    def select(self, selector: Selector) -> Iterable[str]:
        """Select the artifacts matching the given selector."""
        raise NotImplementedError


class CacheBackedEmbedder:
    """Interface for embedding models."""

    def __init__(
        self,
        artifact_store: ArtifactStore,
        underlying_embedder: Embeddings,
    ) -> None:
        """Initialize the embedder."""
        self.artifact_store = artifact_store
        self.underlying_embedder = underlying_embedder

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        """Embed search docs."""
        raise NotImplementedError()

    def embed_query(self, text: str) -> List[float]:
        """Embed query text."""
        raise NotImplementedError()


class InMemoryStore(MetadataStore):
    """In-memory metadata store backed by a file.

    In its current form, this store will be really slow for large collections of files.
    """

    def __init__(self, data: Metadata) -> None:
        """Initialize the in-memory store."""
        super().__init__()
        self.data = data
        self.artifacts = data["artifacts"]
        # indexes for speed
        self.artifact_uids = {artifact["uid"]: artifact for artifact in self.artifacts}

    def exists_by_uids(self, uids: Sequence[str]) -> List[bool]:
        """Order preserving check if the artifact with the given id exists."""
        return [bool(uid in self.artifact_uids) for uid in uids]

    def get_by_uids(self, uids: Sequence[str]) -> List[Artifact]:
        """Return the documents with the given uuids."""
        return [self.artifact_uids[uid] for uid in uids]

    def select(self, selector: Selector) -> Iterable[str]:
        """Return the hashes of the artifacts matching the given selector."""
        # Inefficient implementation that loops through all artifacts.
        # Optimize later.
        for artifact in self.data["artifacts"]:
            uid = artifact["uid"]
            # Implement conjunctive normal form
            if selector.uids and artifact["uid"] in selector.uids:
                yield uid
                continue

            if selector.parent_uids and set(artifact["parent_uids"]).intersection(
                selector.parent_uids
            ):
                yield uid
                continue

    def save(self, path: PathLike) -> None:
        """Save the metadata to the given path."""
        with open(path, "w") as f:
            json.dump(self.data, f)

    def upsert(self, artifact: Artifact) -> None:
        """Add the given artifact to the store."""
        uid = artifact["uid"]
        if uid not in self.artifact_uids:
            self.data["artifacts"].append(artifact)
            self.artifact_uids[artifact["uid"]] = artifact

    def remove(self, selector: Selector) -> None:
        """Remove the given artifacts from the store."""
        uids = list(self.select(selector))
        self.remove_by_uuids(uids)

    def remove_by_uuids(self, uids: Sequence[str]) -> None:
        """Remove the given artifacts from the store."""
        for uid in uids:
            del self.artifact_uids[uid]
        raise NotImplementedError("Need to delete artifacts as well")

    @classmethod
    def from_file(cls, path: PathLike) -> InMemoryStore:
        """Load store metadata from the given path."""
        with open(path, "r") as f:
            content = json.load(f)
        return cls(content)


class FileSystemArtifactLayer(ArtifactStore):
    """An artifact layer for storing artifacts on the file system."""

    def __init__(self, root: PathLike) -> None:
        """Initialize the file system artifact layer."""
        _root = root if isinstance(root, Path) else Path(root)
        self.root = _root
        # Metadata file will be kept in memory for now and updated with
        # each call.
        # This is error-prone due to race conditions (if multiple
        # processes are writing), but OK for prototyping / simple use cases.
        metadata_path = _root / "metadata.json"
        self.metadata_path = metadata_path

        if metadata_path.exists():
            self.metadata_store = InMemoryStore.from_file(self.metadata_path)
        else:
            self.metadata_store = InMemoryStore({"artifacts": []})

    def exists_by_uid(self, uuids: Sequence[str]) -> List[bool]:
        """Check if the artifacts with the given uuid exist."""
        return self.metadata_store.exists_by_uids(uuids)

    def _get_file_path(self, uid: str) -> Path:
        """Get path to file for the given uuid."""
        return self.root / f"{uid}"

    def upsert(
        self,
        artifacts_with_data: Sequence[ArtifactWithData],
    ) -> None:
        """Add the given artifacts."""
        # Write the documents to the file system
        for artifact_with_data in artifacts_with_data:
            # Use the document hash to write the contents to the file system
            document = artifact_with_data["document"]
            file_path = self.root / f"{document.hash_}"
            with open(file_path, "w") as f:
                f.write(serialize_document(document))

            artifact = artifact_with_data["artifact"].copy()
            # Storing at a file -- can clean up the artifact with data request
            # later
            artifact["location"] = str(file_path)
            self.metadata_store.upsert(artifact)

        self.metadata_store.save(self.metadata_path)

    def list_document_ids(self, selector: Selector) -> Iterator[str]:
        """List the document ids matching the given selector."""
        yield from self.metadata_store.select(selector)

    def list_documents(self, selector: Selector) -> Iterator[Document]:
        """Can even use JQ here!"""
        uuids = self.metadata_store.select(selector)

        for uuid in uuids:
            artifact = self.metadata_store.get_by_uids([uuid])[0]
            path = artifact["location"]
            with open(path, "r") as f:
                page_content = deserialize_document(f.read()).page_content
            yield Document(
                uid=artifact["uid"],
                parent_uids=artifact["parent_uids"],
                metadata=artifact["metadata"],
                tags=artifact["tags"],
                page_content=page_content,
            )
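A rough usage sketch of the prototype layer above. Everything here is hypothetical: the path is invented, and `artifact` and `doc` are assumed to be an `Artifact` dict and a `Document` with a `hash_` field prepared upstream, as the code above expects:

```python
# Hypothetical usage (paths and values invented; assumes this branch's API).
layer = FileSystemArtifactLayer("/tmp/artifacts")  # loads or creates metadata.json

# `artifact` is an Artifact TypedDict and `doc` a Document with a hash_ field,
# both prepared by an upstream pipeline step.
layer.upsert([{"artifact": artifact, "document": doc}])

for d in layer.list_documents(Selector(parent_uids=["root-doc"])):
    print(d.page_content)
```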
@@ -1,8 +1,21 @@
|
||||
"""Interface to access to place that stores documents."""
|
||||
import abc
|
||||
import dataclasses
|
||||
from abc import ABC, abstractmethod
|
||||
from typing import Dict, Union
|
||||
from typing import (
|
||||
Dict,
|
||||
Sequence,
|
||||
Iterator,
|
||||
Optional,
|
||||
List,
|
||||
Literal,
|
||||
TypedDict,
|
||||
Tuple,
|
||||
Union,
|
||||
Any,
|
||||
)
|
||||
|
||||
from langchain.docstore.document import Document
|
||||
from langchain.schema import Document
|
||||
|
||||
|
||||
class Docstore(ABC):
|
||||
@@ -23,3 +36,126 @@ class AddableMixin(ABC):
|
||||
@abstractmethod
|
||||
def add(self, texts: Dict[str, Document]) -> None:
|
||||
"""Add more documents."""
|
||||
|
||||
|
||||
@dataclasses.dataclass(frozen=True)
|
||||
class Selector:
|
||||
"""Selection criteria represented in conjunctive normal form.
|
||||
|
||||
https://en.wikipedia.org/wiki/Conjunctive_normal_form
|
||||
|
||||
At the moment, the explicit representation is used for simplicity / prototyping.
|
||||
|
||||
It may be replaced by an ability of specifying selection with jq
|
||||
if operating on JSON metadata or else something free form like SQL.
|
||||
"""
|
||||
|
||||
parent_uids: Optional[Sequence[str]] = None
|
||||
uids: Optional[Sequence[str]] = None
|
||||
# Pick up all artifacts with the given tags.
|
||||
# Maybe we should call this transformations.
|
||||
tags: Optional[Sequence[str]] = None # <-- WE DONT WANT TO DO IT THIS WAY
|
||||
transformation_path: Sequence[str] = None
|
||||
"""Use to specify a transformation path according to which we select documents"""
|
||||
|
||||
|
||||
# KNOWN WAYS THIS CAN FAIL:
# 1) If the process crashes while text splitting, creating only some of the artifacts,
#    the new pipeline will not re-create the missing artifacts! (at least for now)
#    It will use the ones that exist and assume that all of them have been created.


# TODO: MAJOR MAJOR MAJOR MAJOR
# 1. FIX SEMANTICS WITH REGARDS TO ID, UUID, AND POTENTIALLY ARTIFACT_ID.
#    NEED TO REASON THROUGH USE CASES CAREFULLY TO REASON ABOUT WHAT'S MINIMALLY SUFFICIENT.
# 2. Using hashes throughout for implementation simplicity, but may want to switch
#    to ids assigned by a database? The probability of collision is really small.
class Artifact(TypedDict):
    """A representation of an artifact."""

    uid: str  # This has to be handled carefully -- we'll eventually get collisions
    """A unique identifier for the artifact."""
    type_: Union[Literal["document"], Literal["embedding"], Literal["blob"]]
    """The type of the artifact."""  # THIS MAY NEED TO BE CHANGED
    data_hash: str
    """A hash of the data of the artifact."""
    metadata_hash: str
    """A hash of the metadata of the artifact."""
    parent_uids: Tuple[str, ...]
    """A tuple of uids representing the parent artifacts."""
    parent_hashes: Tuple[str, ...]
    """A tuple of hashes representing the parent artifacts at time of transformation."""
    transformation_hash: str
    """A hash of the transformation that was applied to generate the artifact.

    This parameterizes the transformation logic together with any transformation
    parameters.
    """
    created_at: str  # ISO-8601
    """The time the artifact was created."""
    updated_at: str  # ISO-8601
    """The time the artifact was last updated."""
    metadata: Any
    """A dictionary representing the metadata of the artifact."""
    tags: Tuple[str, ...]
    """A tuple of tags associated with the artifact.

    Can use tags to add information about the transformation that was applied
    to the given artifact.

    THIS IS NOT A GOOD REPRESENTATION.
    """
    data: Optional[bytes]
    """The data of the artifact when the artifact contains the data by value.

    Will likely change somehow.

    * For the first pass, contains embedding data.
    * Document data and blob data are stored externally.
    """
    location: Optional[str]
    # Location specifies the location of the artifact when
    # the artifact contains the data by reference (used for documents / blobs).


class ArtifactWithData(TypedDict):
    """An artifact paired with the document that it represents."""

    artifact: Artifact
    document: Document


class ArtifactStore(abc.ABC):
    """Use to keep track of artifacts generated while processing content.

    The first version of the artifact store is used to work with Documents
    rather than Blobs.

    We will likely want to evolve this into Blobs, but it is faster to prototype
    with Documents.
    """

    def exists_by_uid(self, uids: Sequence[str]) -> List[bool]:
        """Check if the artifacts with the given uids exist."""
        raise NotImplementedError()

    def exists_by_parent_uids(self, uids: Sequence[str]) -> List[bool]:
        """Check if artifacts with the given parent uids exist."""
        raise NotImplementedError()

    def upsert(
        self,
        artifacts_with_data: Sequence[ArtifactWithData],
    ) -> None:
        """Upsert the given artifacts."""
        raise NotImplementedError()

    def list_documents(self, selector: Selector) -> Iterator[Document]:
        """Yield documents matching the given selector."""
        raise NotImplementedError()

    def list_document_ids(self, selector: Selector) -> Iterator[str]:
        """Yield document ids matching the given selector."""
        raise NotImplementedError()
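
To make the interface concrete, here is a minimal in-memory sketch of an ArtifactStore (an illustration under the definitions above; the class name and the exact selector semantics are assumptions, not part of this branch):

class InMemoryArtifactStore(ArtifactStore):
    """Toy store keeping artifacts and their documents in dictionaries."""

    def __init__(self) -> None:
        self._artifacts: Dict[str, Artifact] = {}
        self._documents: Dict[str, Document] = {}

    def exists_by_uid(self, uids: Sequence[str]) -> List[bool]:
        return [uid in self._artifacts for uid in uids]

    def exists_by_parent_uids(self, uids: Sequence[str]) -> List[bool]:
        parents = {p for a in self._artifacts.values() for p in a["parent_uids"]}
        return [uid in parents for uid in uids]

    def upsert(self, artifacts_with_data: Sequence[ArtifactWithData]) -> None:
        for item in artifacts_with_data:
            uid = item["artifact"]["uid"]
            self._artifacts[uid] = item["artifact"]
            self._documents[uid] = item["document"]

    def list_document_ids(self, selector: Selector) -> Iterator[str]:
        for uid, artifact in self._artifacts.items():
            if selector.uids is not None and uid not in selector.uids:
                continue
            if selector.parent_uids is not None and not (
                set(artifact["parent_uids"]) & set(selector.parent_uids)
            ):
                continue
            yield uid

    def list_documents(self, selector: Selector) -> Iterator[Document]:
        for uid in self.list_document_ids(selector):
            yield self._documents[uid]
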
langchain/docstore/pipeline.py
Normal file
@@ -0,0 +1,133 @@
"""Module implements a pipeline.

There might be a better name for this.
"""
from __future__ import annotations

import datetime
from typing import Sequence, Optional, Iterator, Iterable, List

from langchain.docstore.base import ArtifactWithData, ArtifactStore, Selector
from langchain.document_loaders.base import BaseLoader
from langchain.schema import Document, BaseDocumentTransformer
from langchain.text_splitter import TextSplitter


def _convert_document_to_artifact_upsert(
    document: Document, parent_documents: Sequence[Document], transformation_hash: str
) -> ArtifactWithData:
    """Convert the given document to an artifact for upserting."""
    dt = datetime.datetime.now().isoformat()
    parent_uids = [str(parent_doc.uid) for parent_doc in parent_documents]
    parent_hashes = [str(parent_doc.hash_) for parent_doc in parent_documents]

    return {
        "artifact": {
            "uid": str(document.uid),
            "parent_uids": parent_uids,
            "metadata": document.metadata,
            "parent_hashes": parent_hashes,
            "tags": tuple(),
            "type_": "document",
            "data": None,
            "location": None,
            "data_hash": str(document.hash_),
            "metadata_hash": "N/A",
            "created_at": dt,
            "updated_at": dt,
            "transformation_hash": transformation_hash,
        },
        "document": document,
    }


class Pipeline(BaseLoader):  # MAY NOT WANT TO INHERIT FROM LOADER
    def __init__(
        self,
        loader: BaseLoader,
        *,
        transformers: Optional[Sequence[BaseDocumentTransformer]] = None,
        artifact_store: Optional[ArtifactStore] = None,
    ) -> None:
        """Initialize the document pipeline.

        Args:
            loader: The loader to use for loading the documents.
            transformers: The transformers to use for transforming the documents.
            artifact_store: The artifact store to use for storing the artifacts.
        """
        self.loader = loader
        self.transformers = transformers
        self.artifact_store = artifact_store

    def lazy_load(
        self,
    ) -> Iterator[Document]:
        """Lazy load the documents."""
        transformations = self.transformers or []
        # Need syntax for determining whether this should be cached.

        try:
            doc_iterator = self.loader.lazy_load()
        except NotImplementedError:
            doc_iterator = self.loader.load()

        for document in doc_iterator:
            new_documents = [document]
            for transformation in transformations:
                # Batched for now here -- lots of optimization possible
                # but not needed for now and is likely going to get complex
                new_documents = list(
                    self._propagate_documents(new_documents, transformation)
                )

            yield from new_documents

    def _propagate_documents(
        self, documents: Sequence[Document], transformation: BaseDocumentTransformer
    ) -> Iterable[Document]:
        """Transform the given documents using the transformation with caching."""
        docs_exist = self.artifact_store.exists_by_uid(
            [document.uid for document in documents]
        )

        for document, exists in zip(documents, docs_exist):
            if exists:
                existing_docs = self.artifact_store.list_documents(
                    Selector(parent_uids=[document.uid])
                )

                materialized_docs = list(existing_docs)

                if materialized_docs:
                    yield from materialized_docs
                    continue

            transformed_docs = transformation.transform_documents([document])

            # MAJOR: Hash should encapsulate transformation parameters
            transformation_hash = transformation.__class__.__name__

            artifacts_with_data = [
                _convert_document_to_artifact_upsert(
                    transformed_doc, [document], transformation_hash
                )
                for transformed_doc in transformed_docs
            ]

            self.artifact_store.upsert(artifacts_with_data)
            yield from transformed_docs

    def load(self) -> List[Document]:
        """Load the documents."""
        return list(self.lazy_load())

    def run(self) -> None:  # BAD API NEED
        """Execute the pipeline, returning nothing."""
        for _ in self.lazy_load():
            pass

    def load_and_split(
        self, text_splitter: Optional[TextSplitter] = None
    ) -> List[Document]:
        raise NotImplementedError("This method will never be implemented.")
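
A minimal end-to-end sketch of driving the pipeline (illustrative only: TextLoader and CharacterTextSplitter are existing langchain components, while InMemoryArtifactStore stands in for any ArtifactStore implementation, such as the one sketched earlier):

from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter

pipeline = Pipeline(
    TextLoader("state_of_the_union.txt"),
    transformers=[CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)],
    artifact_store=InMemoryArtifactStore(),
)

# The first pass computes and caches the splits; a second pass would re-use
# the cached artifacts instead of re-running the transformation.
for doc in pipeline.lazy_load():
    print(doc.page_content[:80])
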
langchain/docstore/serialization.py
Normal file
@@ -0,0 +1,48 @@
"""Module for serialization code.

This code will likely be replaced by Nuno's serialization method.
"""
import json
from json import JSONEncoder
from uuid import UUID

from langchain.schema import Document


class UUIDEncoder(JSONEncoder):
    """Will either be replaced by Nuno's serialization method or something else.

    Potentially there will be no serialization for a document object since
    the document can be broken into 2 pieces:

    * the content -> saved on disk or in database
    * the metadata -> saved in metadata store

    It may not make sense to keep the metadata together with the document
    for persistence.
    """

    def default(self, obj):
        if isinstance(obj, UUID):
            return str(obj)  # Convert UUID to string
        return super().default(obj)


# PUBLIC API


def serialize_document(document: Document) -> str:
    """Serialize the given document to a string."""
    try:
        # Serialize only the content.
        # Metadata always stored separately.
        return json.dumps(document.page_content)
    except TypeError:
        # json.dumps raises TypeError (not JSONDecodeError) on unserializable data.
        raise ValueError(f"Could not serialize document with ID: {document.uid}")


def deserialize_document(serialized_document: str) -> Document:
    """Deserialize the given document from a string."""
    return Document(
        page_content=json.loads(serialized_document),
    )
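
A round-trip sketch (assuming the Document schema on this branch, where metadata lives in the metadata store and only the content is serialized):

doc = Document(page_content="hello world")
serialized = serialize_document(doc)        # '"hello world"'
restored = deserialize_document(serialized)
assert restored.page_content == doc.page_content
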
langchain/docstore/sync.py
Normal file
@@ -0,0 +1,69 @@
"""Module contains code for syncing from docstore to vectorstores."""
from __future__ import annotations

from itertools import islice
from typing import TypedDict, Sequence, Optional, TypeVar, Iterable, Iterator, List

from langchain.docstore.base import ArtifactStore, Selector
from langchain.vectorstores import VectorStore


class SyncResult(TypedDict):
    """Syncing result."""

    first_n_errors: Sequence[str]
    """First n errors that occurred during syncing."""
    num_added: Optional[int]
    """Number of added documents."""
    num_updated: Optional[int]
    """Number of updated documents because they were not up to date."""
    num_deleted: Optional[int]
    """Number of deleted documents."""
    num_skipped: Optional[int]
    """Number of skipped documents because they were already up to date."""


T = TypeVar("T")


def _batch(size: int, iterable: Iterable[T]) -> Iterator[List[T]]:
    """Utility batching function."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


# SYNC IMPLEMENTATION


def sync(
    artifact_store: ArtifactStore,
    vector_store: VectorStore,
    selector: Selector,
    *,
    batch_size: int = 1000,
) -> SyncResult:
    """Sync the given artifact layer with the given vector store."""
    document_uids = artifact_store.list_document_ids(selector)

    all_uids = []
    # IDs must fit into memory for this to work.
    for uid_batch in _batch(batch_size, document_uids):
        all_uids.extend(uid_batch)
        document_batch = list(artifact_store.list_documents(Selector(uids=uid_batch)))
        vector_store.upsert_by_id(documents=document_batch, batch_size=batch_size)
    # Non-intuitive interface, but simple to implement
    # (maybe we can have a better solution though)
    num_deleted = vector_store.delete_non_matching_ids(all_uids)

    return {
        "first_n_errors": [],
        "num_added": None,
        "num_updated": None,
        "num_skipped": None,
        "num_deleted": num_deleted,
    }
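
A usage sketch, given an artifact_store and a vector_store (illustrative; upsert_by_id and delete_non_matching_ids are the vector-store hooks assumed above, not part of the stock VectorStore API):

result = sync(
    artifact_store,
    vector_store,
    Selector(parent_uids=["source-doc-uid"]),
    batch_size=500,
)
print(result["num_deleted"])
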
@@ -1,7 +1,6 @@
"""All different types of document loaders."""

from langchain.document_loaders.airbyte_json import AirbyteJSONLoader
from langchain.document_loaders.airtable import AirtableLoader
from langchain.document_loaders.apify_dataset import ApifyDatasetLoader
from langchain.document_loaders.arxiv import ArxivLoader
from langchain.document_loaders.azlyrics import AZLyricsLoader
@@ -20,7 +19,7 @@ from langchain.document_loaders.chatgpt import ChatGPTLoader
from langchain.document_loaders.college_confidential import CollegeConfidentialLoader
from langchain.document_loaders.confluence import ConfluenceLoader
from langchain.document_loaders.conllu import CoNLLULoader
from langchain.document_loaders.csv_loader import CSVLoader, UnstructuredCSVLoader
from langchain.document_loaders.csv_loader import CSVLoader
from langchain.document_loaders.dataframe import DataFrameLoader
from langchain.document_loaders.diffbot import DiffbotLoader
from langchain.document_loaders.directory import DirectoryLoader
@@ -31,12 +30,10 @@ from langchain.document_loaders.email import (
    OutlookMessageLoader,
    UnstructuredEmailLoader,
)
from langchain.document_loaders.embaas import EmbaasBlobLoader, EmbaasLoader
from langchain.document_loaders.epub import UnstructuredEPubLoader
from langchain.document_loaders.evernote import EverNoteLoader
from langchain.document_loaders.excel import UnstructuredExcelLoader
from langchain.document_loaders.facebook_chat import FacebookChatLoader
from langchain.document_loaders.fauna import FaunaLoader
from langchain.document_loaders.figma import FigmaFileLoader
from langchain.document_loaders.gcs_directory import GCSDirectoryLoader
from langchain.document_loaders.gcs_file import GCSFileLoader
@@ -92,7 +89,6 @@ from langchain.document_loaders.s3_directory import S3DirectoryLoader
from langchain.document_loaders.s3_file import S3FileLoader
from langchain.document_loaders.sitemap import SitemapLoader
from langchain.document_loaders.slack_directory import SlackDirectoryLoader
from langchain.document_loaders.snowflake_loader import SnowflakeLoader
from langchain.document_loaders.spreedly import SpreedlyLoader
from langchain.document_loaders.srt import SRTLoader
from langchain.document_loaders.stripe import StripeLoader
@@ -122,7 +118,6 @@ from langchain.document_loaders.word_document import (
    Docx2txtLoader,
    UnstructuredWordDocumentLoader,
)
from langchain.document_loaders.xml import UnstructuredXMLLoader
from langchain.document_loaders.youtube import (
    GoogleApiClient,
    GoogleApiYoutubeLoader,
@@ -138,7 +133,6 @@ TelegramChatLoader = TelegramChatFileLoader
__all__ = [
    "AZLyricsLoader",
    "AirbyteJSONLoader",
    "AirtableLoader",
    "ApifyDatasetLoader",
    "ArxivLoader",
    "AzureBlobStorageContainerLoader",
@@ -161,7 +155,6 @@ __all__ = [
    "DocugamiLoader",
    "Docx2txtLoader",
    "DuckDBLoader",
    "FaunaLoader",
    "EverNoteLoader",
    "FacebookChatLoader",
    "FigmaFileLoader",
@@ -229,7 +222,6 @@ __all__ = [
    "TwitterTweetLoader",
    "UnstructuredAPIFileIOLoader",
    "UnstructuredAPIFileLoader",
    "UnstructuredCSVLoader",
    "UnstructuredEPubLoader",
    "UnstructuredEmailLoader",
    "UnstructuredExcelLoader",
@@ -244,13 +236,9 @@ __all__ = [
    "UnstructuredRTFLoader",
    "UnstructuredURLLoader",
    "UnstructuredWordDocumentLoader",
    "UnstructuredXMLLoader",
    "WeatherDataLoader",
    "WebBaseLoader",
    "WhatsAppChatLoader",
    "WikipediaLoader",
    "YoutubeLoader",
    "SnowflakeLoader",
    "EmbaasLoader",
    "EmbaasBlobLoader",
]
@@ -1,36 +0,0 @@
from typing import Iterator, List

from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader


class AirtableLoader(BaseLoader):
    """Loader that loads rows from an Airtable table."""

    def __init__(self, api_token: str, table_id: str, base_id: str):
        """Initialize with API token and the IDs for table and base."""
        self.api_token = api_token
        self.table_id = table_id
        self.base_id = base_id

    def lazy_load(self) -> Iterator[Document]:
        """Load Table."""

        from pyairtable import Table

        table = Table(self.api_token, self.base_id, self.table_id)
        records = table.all()
        for record in records:
            # Need to convert record from dict to str
            yield Document(
                page_content=str(record),
                metadata={
                    "source": self.base_id + "_" + self.table_id,
                    "base_id": self.base_id,
                    "table_id": self.table_id,
                },
            )

    def load(self) -> List[Document]:
        """Load Table."""
        return list(self.lazy_load())
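
A usage sketch for the loader above (pyairtable must be installed; the token and ids are placeholders):

loader = AirtableLoader(api_token="pat-...", table_id="tbl-...", base_id="app-...")
docs = loader.load()  # one Document per Airtable record
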
@@ -1,5 +1,4 @@
from langchain.document_loaders.blob_loaders.file_system import FileSystemBlobLoader
from langchain.document_loaders.blob_loaders.schema import Blob, BlobLoader
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader

__all__ = ["BlobLoader", "Blob", "FileSystemBlobLoader", "YoutubeAudioLoader"]
__all__ = ["BlobLoader", "Blob", "FileSystemBlobLoader"]
@@ -1,50 +0,0 @@
from typing import Iterable, List

from langchain.document_loaders.blob_loaders import FileSystemBlobLoader
from langchain.document_loaders.blob_loaders.schema import Blob, BlobLoader


class YoutubeAudioLoader(BlobLoader):
    """Load YouTube urls as audio file(s)."""

    def __init__(self, urls: List[str], save_dir: str):
        if not isinstance(urls, list):
            raise TypeError("urls must be a list")

        self.urls = urls
        self.save_dir = save_dir

    def yield_blobs(self) -> Iterable[Blob]:
        """Yield audio blobs for each url."""

        try:
            import yt_dlp
        except ImportError:
            raise ValueError(
                "yt_dlp package not found, please install it with "
                "`pip install yt_dlp`"
            )

        # Use yt_dlp to download audio given a YouTube url
        ydl_opts = {
            "format": "m4a/bestaudio/best",
            "noplaylist": True,
            "outtmpl": self.save_dir + "/%(title)s.%(ext)s",
            "postprocessors": [
                {
                    "key": "FFmpegExtractAudio",
                    "preferredcodec": "m4a",
                }
            ],
        }

        for url in self.urls:
            # Download file
            with yt_dlp.YoutubeDL(ydl_opts) as ydl:
                ydl.download(url)

        # Yield the written blobs
        loader = FileSystemBlobLoader(self.save_dir, glob="*.m4a")
        for blob in loader.yield_blobs():
            yield blob
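
A usage sketch for the removed loader (requires yt_dlp and ffmpeg; the URL is a placeholder):

loader = YoutubeAudioLoader(["https://www.youtube.com/watch?v=..."], save_dir="audio/")
blobs = list(loader.yield_blobs())  # one Blob per downloaded .m4a file
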
@@ -1,7 +1,7 @@
"""Load Data from a Confluence Space"""
import logging
from io import BytesIO
from typing import Any, Callable, Dict, List, Optional, Union
from typing import Any, Callable, List, Optional, Union

from tenacity import (
    before_sleep_log,
@@ -180,7 +180,6 @@ class ConfluenceLoader(BaseLoader):
        include_comments: bool = False,
        limit: Optional[int] = 50,
        max_pages: Optional[int] = 1000,
        ocr_languages: Optional[str] = None,
    ) -> List[Document]:
        """
        :param space_key: Space key retrieved from a confluence URL, defaults to None
@@ -204,10 +203,6 @@ class ConfluenceLoader(BaseLoader):
        :type limit: int, optional
        :param max_pages: Maximum number of pages to retrieve in total, defaults 1000
        :type max_pages: int, optional
        :param ocr_languages: The languages to use for the Tesseract agent. To use a
                              language, you'll first need to install the appropriate
                              Tesseract language pack.
        :type ocr_languages: str, optional
        :raises ValueError: _description_
        :raises ImportError: _description_
        :return: _description_
@@ -231,11 +226,7 @@ class ConfluenceLoader(BaseLoader):
                expand="body.storage.value",
            )
            docs += self.process_pages(
                pages,
                include_restricted_content,
                include_attachments,
                include_comments,
                ocr_languages,
                pages, include_restricted_content, include_attachments, include_comments
            )

        if label:
@@ -253,7 +244,7 @@ class ConfluenceLoader(BaseLoader):

        if cql:
            pages = self.paginate_request(
                self._search_content_by_cql,
                self.confluence.cql,
                cql=cql,
                limit=limit,
                max_pages=max_pages,
@@ -261,11 +252,7 @@ class ConfluenceLoader(BaseLoader):
                expand="body.storage.value",
            )
            docs += self.process_pages(
                pages,
                include_restricted_content,
                include_attachments,
                include_comments,
                ocr_languages,
                pages, include_restricted_content, include_attachments, include_comments
            )

        if page_ids:
@@ -285,26 +272,11 @@ class ConfluenceLoader(BaseLoader):
                page = get_page(page_id=page_id, expand="body.storage.value")
                if not include_restricted_content and not self.is_public_page(page):
                    continue
                doc = self.process_page(
                    page, include_attachments, include_comments, ocr_languages
                )
                doc = self.process_page(page, include_attachments, include_comments)
                docs.append(doc)

        return docs

    def _search_content_by_cql(
        self, cql: str, include_archived_spaces: Optional[bool] = None, **kwargs: Any
    ) -> List[dict]:
        url = "rest/api/content/search"

        params: Dict[str, Any] = {"cql": cql}
        params.update(kwargs)
        if include_archived_spaces is not None:
            params["includeArchivedSpaces"] = include_archived_spaces

        response = self.confluence.get(url, params=params)
        return response.get("results", [])

    def paginate_request(self, retrieval_method: Callable, **kwargs: Any) -> List:
        """Paginate the various methods to retrieve groups of pages.

@@ -363,16 +335,13 @@ class ConfluenceLoader(BaseLoader):
        include_restricted_content: bool,
        include_attachments: bool,
        include_comments: bool,
        ocr_languages: Optional[str] = None,
    ) -> List[Document]:
        """Process a list of pages into a list of documents."""
        docs = []
        for page in pages:
            if not include_restricted_content and not self.is_public_page(page):
                continue
            doc = self.process_page(
                page, include_attachments, include_comments, ocr_languages
            )
            doc = self.process_page(page, include_attachments, include_comments)
            docs.append(doc)

        return docs
@@ -382,7 +351,6 @@ class ConfluenceLoader(BaseLoader):
        page: dict,
        include_attachments: bool,
        include_comments: bool,
        ocr_languages: Optional[str] = None,
    ) -> Document:
        try:
            from bs4 import BeautifulSoup  # type: ignore
@@ -393,7 +361,7 @@ class ConfluenceLoader(BaseLoader):
            )

        if include_attachments:
            attachment_texts = self.process_attachment(page["id"], ocr_languages)
            attachment_texts = self.process_attachment(page["id"])
        else:
            attachment_texts = []
        text = BeautifulSoup(page["body"]["storage"]["value"], "lxml").get_text(
@@ -420,11 +388,7 @@ class ConfluenceLoader(BaseLoader):
            },
        )

    def process_attachment(
        self,
        page_id: str,
        ocr_languages: Optional[str] = None,
    ) -> List[str]:
    def process_attachment(self, page_id: str) -> List[str]:
        try:
            from PIL import Image  # noqa: F401
        except ImportError:
@@ -441,13 +405,13 @@ class ConfluenceLoader(BaseLoader):
            absolute_url = self.base_url + attachment["_links"]["download"]
            title = attachment["title"]
            if media_type == "application/pdf":
                text = title + self.process_pdf(absolute_url, ocr_languages)
                text = title + self.process_pdf(absolute_url)
            elif (
                media_type == "image/png"
                or media_type == "image/jpg"
                or media_type == "image/jpeg"
            ):
                text = title + self.process_image(absolute_url, ocr_languages)
                text = title + self.process_image(absolute_url)
            elif (
                media_type == "application/vnd.openxmlformats-officedocument"
                ".wordprocessingml.document"
@@ -456,18 +420,14 @@ class ConfluenceLoader(BaseLoader):
            elif media_type == "application/vnd.ms-excel":
                text = title + self.process_xls(absolute_url)
            elif media_type == "image/svg+xml":
                text = title + self.process_svg(absolute_url, ocr_languages)
                text = title + self.process_svg(absolute_url)
            else:
                continue
            texts.append(text)

        return texts

    def process_pdf(
        self,
        link: str,
        ocr_languages: Optional[str] = None,
    ) -> str:
    def process_pdf(self, link: str) -> str:
        try:
            import pytesseract  # noqa: F401
            from pdf2image import convert_from_bytes  # noqa: F401
@@ -492,16 +452,12 @@ class ConfluenceLoader(BaseLoader):
            return text

        for i, image in enumerate(images):
            image_text = pytesseract.image_to_string(image, lang=ocr_languages)
            image_text = pytesseract.image_to_string(image)
            text += f"Page {i + 1}:\n{image_text}\n\n"

        return text

    def process_image(
        self,
        link: str,
        ocr_languages: Optional[str] = None,
    ) -> str:
    def process_image(self, link: str) -> str:
        try:
            import pytesseract  # noqa: F401
            from PIL import Image  # noqa: F401
@@ -525,7 +481,7 @@ class ConfluenceLoader(BaseLoader):
        except OSError:
            return text

        return pytesseract.image_to_string(image, lang=ocr_languages)
        return pytesseract.image_to_string(image)

    def process_doc(self, link: str) -> str:
        try:
@@ -575,11 +531,7 @@ class ConfluenceLoader(BaseLoader):

        return text

    def process_svg(
        self,
        link: str,
        ocr_languages: Optional[str] = None,
    ) -> str:
    def process_svg(self, link: str) -> str:
        try:
            import pytesseract  # noqa: F401
            from PIL import Image  # noqa: F401
@@ -608,4 +560,4 @@ class ConfluenceLoader(BaseLoader):
        img_data.seek(0)
        image = Image.open(img_data)

        return pytesseract.image_to_string(image, lang=ocr_languages)
        return pytesseract.image_to_string(image)

@@ -1,12 +1,8 @@
import csv
from typing import Any, Dict, List, Optional
from typing import Dict, List, Optional

from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader
from langchain.document_loaders.unstructured import (
    UnstructuredFileLoader,
    validate_unstructured_version,
)


class CSVLoader(BaseLoader):
@@ -65,18 +61,3 @@ class CSVLoader(BaseLoader):
            docs.append(doc)

        return docs


class UnstructuredCSVLoader(UnstructuredFileLoader):
    """Loader that uses unstructured to load CSV files."""

    def __init__(
        self, file_path: str, mode: str = "single", **unstructured_kwargs: Any
    ):
        validate_unstructured_version(min_unstructured_version="0.6.8")
        super().__init__(file_path=file_path, mode=mode, **unstructured_kwargs)

    def _get_elements(self) -> List:
        from unstructured.partition.csv import partition_csv

        return partition_csv(filename=self.file_path, **self.unstructured_kwargs)
@@ -1,234 +0,0 @@
import base64
import warnings
from typing import Any, Dict, Iterator, List, Optional

import requests
from pydantic import BaseModel, root_validator, validator
from typing_extensions import NotRequired, TypedDict

from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseBlobParser, BaseLoader
from langchain.document_loaders.blob_loaders import Blob
from langchain.text_splitter import TextSplitter
from langchain.utils import get_from_dict_or_env

EMBAAS_DOC_API_URL = "https://api.embaas.io/v1/document/extract-text/bytes/"


class EmbaasDocumentExtractionParameters(TypedDict):
    """Parameters for the embaas document extraction API."""

    mime_type: NotRequired[str]
    """The mime type of the document."""
    file_extension: NotRequired[str]
    """The file extension of the document."""
    file_name: NotRequired[str]
    """The file name of the document."""

    should_chunk: NotRequired[bool]
    """Whether to chunk the document into pages."""
    chunk_size: NotRequired[int]
    """The maximum size of the text chunks."""
    chunk_overlap: NotRequired[int]
    """The maximum overlap allowed between chunks."""
    chunk_splitter: NotRequired[str]
    """The text splitter class name for creating chunks."""
    separators: NotRequired[List[str]]
    """The separators for chunks."""

    should_embed: NotRequired[bool]
    """Whether to create embeddings for the document in the response."""
    model: NotRequired[str]
    """The model to pass to the Embaas document extraction API."""
    instruction: NotRequired[str]
    """The instruction to pass to the Embaas document extraction API."""


class EmbaasDocumentExtractionPayload(EmbaasDocumentExtractionParameters):
    bytes: str
    """The base64 encoded bytes of the document to extract text from."""


class BaseEmbaasLoader(BaseModel):
    embaas_api_key: Optional[str] = None
    api_url: str = EMBAAS_DOC_API_URL
    """The URL of the embaas document extraction API."""
    params: EmbaasDocumentExtractionParameters = EmbaasDocumentExtractionParameters()
    """Additional parameters to pass to the embaas document extraction API."""

    @root_validator(pre=True)
    def validate_environment(cls, values: Dict) -> Dict:
        """Validate that the api key and python package exist in the environment."""
        embaas_api_key = get_from_dict_or_env(
            values, "embaas_api_key", "EMBAAS_API_KEY"
        )
        values["embaas_api_key"] = embaas_api_key
        return values


class EmbaasBlobLoader(BaseEmbaasLoader, BaseBlobParser):
    """Wrapper around embaas's document byte loader service.

    To use, you should have the
    environment variable ``EMBAAS_API_KEY`` set with your API key, or pass
    it as a named parameter to the constructor.

    Example:
        .. code-block:: python

            # Default parsing
            from langchain.document_loaders.embaas import EmbaasBlobLoader
            loader = EmbaasBlobLoader()
            blob = Blob.from_path(path="example.mp3")
            documents = loader.parse(blob=blob)

            # Custom api parameters (create embeddings automatically)
            from langchain.document_loaders.embaas import EmbaasBlobLoader
            loader = EmbaasBlobLoader(
                params={
                    "should_embed": True,
                    "model": "e5-large-v2",
                    "chunk_size": 256,
                    "chunk_splitter": "CharacterTextSplitter"
                }
            )
            blob = Blob.from_path(path="example.pdf")
            documents = loader.parse(blob=blob)
    """

    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        yield from self._get_documents(blob=blob)

    @staticmethod
    def _api_response_to_documents(chunks: List[Dict[str, Any]]) -> List[Document]:
        """Convert the API response to a list of documents."""
        docs = []
        for chunk in chunks:
            metadata = chunk["metadata"]
            if chunk.get("embedding", None) is not None:
                metadata["embedding"] = chunk["embedding"]
            doc = Document(page_content=chunk["text"], metadata=metadata)
            docs.append(doc)

        return docs

    def _generate_payload(self, blob: Blob) -> EmbaasDocumentExtractionPayload:
        """Generates payload for the API request."""
        base64_byte_str = base64.b64encode(blob.as_bytes()).decode()
        payload: EmbaasDocumentExtractionPayload = EmbaasDocumentExtractionPayload(
            bytes=base64_byte_str,
            # Workaround for mypy issue: https://github.com/python/mypy/issues/9408
            # type: ignore
            **self.params,
        )

        if blob.mimetype is not None and payload.get("mime_type", None) is None:
            payload["mime_type"] = blob.mimetype

        return payload

    def _handle_request(
        self, payload: EmbaasDocumentExtractionPayload
    ) -> List[Document]:
        """Sends a request to the embaas API and handles the response."""
        headers = {
            "Authorization": f"Bearer {self.embaas_api_key}",
            "Content-Type": "application/json",
        }

        response = requests.post(self.api_url, headers=headers, json=payload)
        response.raise_for_status()

        parsed_response = response.json()
        return EmbaasBlobLoader._api_response_to_documents(
            chunks=parsed_response["data"]["chunks"]
        )

    def _get_documents(self, blob: Blob) -> Iterator[Document]:
        """Get the documents from the blob."""
        payload = self._generate_payload(blob=blob)

        try:
            documents = self._handle_request(payload=payload)
        except requests.exceptions.RequestException as e:
            if e.response is None or not e.response.text:
                raise ValueError(
                    f"Error raised by embaas document text extraction API: {e}"
                )

            parsed_response = e.response.json()
            if "message" in parsed_response:
                raise ValueError(
                    f"Validation Error raised by embaas document text extraction API:"
                    f" {parsed_response['message']}"
                )
            raise

        yield from documents


class EmbaasLoader(BaseEmbaasLoader, BaseLoader):
    """Wrapper around embaas's document loader service.

    To use, you should have the
    environment variable ``EMBAAS_API_KEY`` set with your API key, or pass
    it as a named parameter to the constructor.

    Example:
        .. code-block:: python

            # Default parsing
            from langchain.document_loaders.embaas import EmbaasLoader
            loader = EmbaasLoader(file_path="example.mp3")
            documents = loader.load()

            # Custom api parameters (create embeddings automatically)
            from langchain.document_loaders.embaas import EmbaasBlobLoader
            loader = EmbaasBlobLoader(
                file_path="example.pdf",
                params={
                    "should_embed": True,
                    "model": "e5-large-v2",
                    "chunk_size": 256,
                    "chunk_splitter": "CharacterTextSplitter"
                }
            )
            documents = loader.load()
    """

    file_path: str
    """The path to the file to load."""
    blob_loader: Optional[EmbaasBlobLoader]
    """The blob loader to use. If not provided, a default one will be created."""

    @validator("blob_loader", always=True)
    def validate_blob_loader(
        cls, v: EmbaasBlobLoader, values: Dict
    ) -> EmbaasBlobLoader:
        return v or EmbaasBlobLoader(
            embaas_api_key=values["embaas_api_key"],
            api_url=values["api_url"],
            params=values["params"],
        )

    def lazy_load(self) -> Iterator[Document]:
        """Load the documents from the file path lazily."""
        blob = Blob.from_path(path=self.file_path)

        assert self.blob_loader is not None
        # Should never be None, but mypy doesn't know that.
        yield from self.blob_loader.lazy_parse(blob=blob)

    def load(self) -> List[Document]:
        return list(self.lazy_load())

    def load_and_split(
        self, text_splitter: Optional[TextSplitter] = None
    ) -> List[Document]:
        if self.params.get("should_embed", False):
            warnings.warn(
                "Embeddings are not supported with load_and_split."
                " Use the API splitter to properly generate embeddings."
                " For more information see embaas.io docs."
            )
        return super().load_and_split(text_splitter=text_splitter)
@@ -1,63 +0,0 @@
from typing import Iterator, List, Optional, Sequence

from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader


class FaunaLoader(BaseLoader):
    """
    Attributes:
        query (str): The FQL query string to execute.
        page_content_field (str): The field that contains the content of each page.
        secret (str): The secret key for authenticating to FaunaDB.
        metadata_fields (Optional[Sequence[str]]):
            Optional list of field names to include in metadata.
    """

    def __init__(
        self,
        query: str,
        page_content_field: str,
        secret: str,
        metadata_fields: Optional[Sequence[str]] = None,
    ):
        self.query = query
        self.page_content_field = page_content_field
        self.secret = secret
        self.metadata_fields = metadata_fields

    def load(self) -> List[Document]:
        return list(self.lazy_load())

    def lazy_load(self) -> Iterator[Document]:
        try:
            from fauna import Page, fql
            from fauna.client import Client
            from fauna.encoding import QuerySuccess
        except ImportError:
            raise ImportError(
                "Could not import fauna python package. "
                "Please install it with `pip install fauna`."
            )
        # Create Fauna Client
        client = Client(secret=self.secret)
        # Run FQL Query
        response: QuerySuccess = client.query(fql(self.query))
        page: Page = response.data
        for result in page:
            if result is not None:
                document_dict = dict(result.items())
                page_content = ""
                for key, value in document_dict.items():
                    if key == self.page_content_field:
                        page_content = value
                document: Document = Document(
                    page_content=page_content,
                    metadata={"id": result.id, "ts": result.ts},
                )
                yield document
        if page.after is not None:
            yield Document(
                page_content="Next Page Exists",
                metadata={"after": page.after},
            )
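
A usage sketch for the removed loader (the secret is a placeholder; the FQL query follows the loader's docstring):

loader = FaunaLoader(
    query="Item.all()",
    page_content_field="text",
    secret="<fauna-secret-key>",
)
docs = loader.load()
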
@@ -12,45 +12,10 @@ class OpenAIWhisperParser(BaseBlobParser):
    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        """Lazily parse the blob."""

        import io

        try:
            import openai
        except ImportError:
            raise ValueError(
                "openai package not found, please install it with "
                "`pip install openai`"
            )
        try:
            from pydub import AudioSegment
        except ImportError:
            raise ValueError(
                "pydub package not found, please install it with " "`pip install pydub`"
            )

        # Audio file from disk
        audio = AudioSegment.from_file(blob.path)

        # Define the duration of each chunk in minutes
        # Need to meet 25MB size limit for Whisper API
        chunk_duration = 20
        chunk_duration_ms = chunk_duration * 60 * 1000

        # Split the audio into chunk_duration_ms chunks
        for split_number, i in enumerate(range(0, len(audio), chunk_duration_ms)):
            # Audio chunk
            chunk = audio[i : i + chunk_duration_ms]
            file_obj = io.BytesIO(chunk.export(format="mp3").read())
            if blob.source is not None:
                file_obj.name = blob.source + f"_part_{split_number}.mp3"
            else:
                file_obj.name = f"part_{split_number}.mp3"

            # Transcribe
            print(f"Transcribing part {split_number+1}!")
            transcript = openai.Audio.transcribe("whisper-1", file_obj)
        import openai

        with blob.as_bytes_io() as f:
            transcript = openai.Audio.transcribe("whisper-1", f)
            yield Document(
                page_content=transcript.text,
                metadata={"source": blob.source, "chunk": split_number},
                page_content=transcript.text, metadata={"source": blob.source}
            )
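
A usage sketch for the simplified parser (the audio path is a placeholder; OPENAI_API_KEY must be set in the environment):

from langchain.document_loaders.blob_loaders import Blob

parser = OpenAIWhisperParser()
docs = list(parser.lazy_parse(Blob.from_path("speech.mp3")))
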
@@ -1,126 +0,0 @@
from __future__ import annotations

from typing import Any, Dict, Iterator, List, Optional, Tuple

from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader


class SnowflakeLoader(BaseLoader):
    """Loads a query result from Snowflake into a list of documents.

    Each document represents one row of the result. The `page_content_columns`
    are written into the `page_content` of the document. The `metadata_columns`
    are written into the `metadata` of the document. By default, all columns
    are written into the `page_content` and none into the `metadata`.

    """

    def __init__(
        self,
        query: str,
        user: str,
        password: str,
        account: str,
        warehouse: str,
        role: str,
        database: str,
        schema: str,
        parameters: Optional[Dict[str, Any]] = None,
        page_content_columns: Optional[List[str]] = None,
        metadata_columns: Optional[List[str]] = None,
    ):
        """Initialize Snowflake document loader.

        Args:
            query: The query to run in Snowflake.
            user: Snowflake user.
            password: Snowflake password.
            account: Snowflake account.
            warehouse: Snowflake warehouse.
            role: Snowflake role.
            database: Snowflake database.
            schema: Snowflake schema.
            page_content_columns: Optional. Columns written to Document `page_content`.
            metadata_columns: Optional. Columns written to Document `metadata`.
        """
        self.query = query
        self.user = user
        self.password = password
        self.account = account
        self.warehouse = warehouse
        self.role = role
        self.database = database
        self.schema = schema
        self.parameters = parameters
        self.page_content_columns = (
            page_content_columns if page_content_columns is not None else ["*"]
        )
        self.metadata_columns = metadata_columns if metadata_columns is not None else []

    def _execute_query(self) -> List[Dict[str, Any]]:
        try:
            import snowflake.connector
        except ImportError as ex:
            raise ValueError(
                "Could not import snowflake-connector-python package. "
                "Please install it with `pip install snowflake-connector-python`."
            ) from ex

        conn = snowflake.connector.connect(
            user=self.user,
            password=self.password,
            account=self.account,
            warehouse=self.warehouse,
            role=self.role,
            database=self.database,
            schema=self.schema,
            parameters=self.parameters,
        )
        try:
            cur = conn.cursor()
            cur.execute("USE DATABASE " + self.database)
            cur.execute("USE SCHEMA " + self.schema)
            cur.execute(self.query, self.parameters)
            query_result = cur.fetchall()
            column_names = [column[0] for column in cur.description]
            query_result = [dict(zip(column_names, row)) for row in query_result]
        except Exception as e:
            print(f"An error occurred: {e}")
            query_result = []
        finally:
            cur.close()
        return query_result

    def _get_columns(
        self, query_result: List[Dict[str, Any]]
    ) -> Tuple[List[str], List[str]]:
        page_content_columns = (
            self.page_content_columns if self.page_content_columns else []
        )
        metadata_columns = self.metadata_columns if self.metadata_columns else []
        if page_content_columns is None and query_result:
            page_content_columns = list(query_result[0].keys())
        if metadata_columns is None:
            metadata_columns = []
        return page_content_columns or [], metadata_columns

    def lazy_load(self) -> Iterator[Document]:
        query_result = self._execute_query()
        if isinstance(query_result, Exception):
            print(f"An error occurred during the query: {query_result}")
            return []
        page_content_columns, metadata_columns = self._get_columns(query_result)
        if "*" in page_content_columns:
            page_content_columns = list(query_result[0].keys())
        for row in query_result:
            page_content = "\n".join(
                f"{k}: {v}" for k, v in row.items() if k in page_content_columns
            )
            metadata = {k: v for k, v in row.items() if k in metadata_columns}
            doc = Document(page_content=page_content, metadata=metadata)
            yield doc

    def load(self) -> List[Document]:
        """Load data into document objects."""
        return list(self.lazy_load())
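
A usage sketch for the removed loader (the credentials and the query are placeholders; it follows the row-to-document mapping described in the docstring):

loader = SnowflakeLoader(
    query="SELECT text, survey_id FROM my_table",
    user="<user>",
    password="<password>",
    account="<account>",
    warehouse="<warehouse>",
    role="<role>",
    database="<database>",
    schema="<schema>",
    metadata_columns=["survey_id"],
)
docs = loader.load()  # one Document per row
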
@@ -1,22 +0,0 @@
"""Loader that loads XML files."""
from typing import Any, List

from langchain.document_loaders.unstructured import (
    UnstructuredFileLoader,
    validate_unstructured_version,
)


class UnstructuredXMLLoader(UnstructuredFileLoader):
    """Loader that uses unstructured to load XML files."""

    def __init__(
        self, file_path: str, mode: str = "single", **unstructured_kwargs: Any
    ):
        validate_unstructured_version(min_unstructured_version="0.6.7")
        super().__init__(file_path=file_path, mode=mode, **unstructured_kwargs)

    def _get_elements(self) -> List:
        from unstructured.partition.xml import partition_xml

        return partition_xml(filename=self.file_path, **self.unstructured_kwargs)
@@ -8,10 +8,7 @@ from langchain.embeddings.aleph_alpha import (
)
from langchain.embeddings.bedrock import BedrockEmbeddings
from langchain.embeddings.cohere import CohereEmbeddings
from langchain.embeddings.dashscope import DashScopeEmbeddings
from langchain.embeddings.deepinfra import DeepInfraEmbeddings
from langchain.embeddings.elasticsearch import ElasticsearchEmbeddings
from langchain.embeddings.embaas import EmbaasEmbeddings
from langchain.embeddings.fake import FakeEmbeddings
from langchain.embeddings.google_palm import GooglePalmEmbeddings
from langchain.embeddings.huggingface import (
@@ -61,9 +58,6 @@ __all__ = [
    "MiniMaxEmbeddings",
    "VertexAIEmbeddings",
    "BedrockEmbeddings",
    "DeepInfraEmbeddings",
    "DashScopeEmbeddings",
    "EmbaasEmbeddings",
]
Some files were not shown because too many files have changed in this diff.