templates: Add Ollama multi-modal templates (#14868)

Templates for [local multi-modal LLMs](https://llava-vl.github.io/llava-interactive/) using:
* Image summaries
* Multi-modal embeddings

---------

Co-authored-by: Erick Friis <erick@langchain.dev>
Lance Martin authored 2023-12-20 15:28:53 -08:00, committed by GitHub
parent 57d1eb733f
commit 320c3ae4c8
31 changed files with 7478 additions and 56 deletions


@@ -0,0 +1,2 @@
docs/img_*.jpg
chroma_db_multi_modal


@@ -0,0 +1,21 @@
MIT License
Copyright (c) 2023 LangChain, Inc.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


@@ -0,0 +1,122 @@
# rag-multi-modal-local
Visual search is a familiar application to anyone with an iPhone or Android device: use natural language to search across your photo collection.
With the release of open-source, multi-modal LLMs, it's possible to build this kind of application for yourself and run it entirely on your personal laptop.
This template demonstrates how to perform visual search and question-answering over a collection of photos.
Given a set of photos, it will use OpenCLIP embeddings to index them, retrieve the photos relevant to a user question, and use Ollama to run a local, open-source multi-modal LLM that answers questions about the retrieved photos.
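Under the hood, the flow looks roughly like the sketch below, a simplified, standalone version of what this template's `ingest.py` and `chain.py` do (the image path is a hypothetical placeholder):
```python
from langchain.chat_models import ChatOllama
from langchain.vectorstores import Chroma
from langchain_core.messages import HumanMessage
from langchain_experimental.open_clip import OpenCLIPEmbeddings

# Index photos with multi-modal (image + text) OpenCLIP embeddings.
vectorstore = Chroma(
    collection_name="multi-modal-rag",
    persist_directory="chroma_db_multi_modal",
    embedding_function=OpenCLIPEmbeddings(
        model_name="ViT-H-14", checkpoint="laion2b_s32b_b79k"
    ),
)
vectorstore.add_images(uris=["docs/example.jpg"])  # hypothetical image path

# Retrieve the photo most relevant to the question (stored as base64 page_content).
question = "What kind of soft serve did I have?"
docs = vectorstore.as_retriever().get_relevant_documents(question)

# Ask a local multi-modal LLM served by Ollama about the retrieved photo.
llm = ChatOllama(model="bakllava", temperature=0)
message = HumanMessage(
    content=[
        {
            "type": "image_url",
            "image_url": f"data:image/jpeg;base64,{docs[0].page_content}",
        },
        {"type": "text", "text": question},
    ]
)
print(llm.invoke([message]).content)
```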
## Input
Supply a set of photos in the `/docs` directory.
By default, this template has a toy collection of 3 food pictures.
Example questions you might ask:
```
What kind of soft serve did I have?
```
In practice, a larger corpus of images can be tested.
To create an index of the images, run:
```shell
poetry install
python ingest.py
```
## Storage
This template will use [OpenCLIP](https://github.com/mlfoundations/open_clip) multi-modal embeddings to embed the images.
You can select different embedding model options (see results [here](https://github.com/mlfoundations/open_clip/blob/main/docs/openclip_results.csv)).
The first time you run the app, it will automatically download the multimodal embedding model.
By default, LangChain will use an embedding model with moderate performance but lower memory requirements, `ViT-H-14`.
You can choose alternative `OpenCLIPEmbeddings` models in `ingest.py` (the same embedding model is also set in `rag_multi_modal_local/chain.py`):
```python
vectorstore_mmembd = Chroma(
collection_name="multi-modal-rag",
persist_directory=str(re_vectorstore_path),
embedding_function=OpenCLIPEmbeddings(
model_name="ViT-H-14", checkpoint="laion2b_s32b_b79k"
),
)
```
## LLM
This template will use [Ollama](https://python.langchain.com/docs/integrations/chat/ollama#multi-modal).
Download the latest version of Ollama: https://ollama.ai/
Pull an open-source multi-modal LLM, e.g., [BakLLaVA](https://ollama.ai/library/bakllava):
```shell
ollama pull bakllava
```
The app is configured for `bakllava` by default, but you can change this in `chain.py` to use a different model that you have downloaded.
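For example, to use [LLaVA](https://ollama.ai/library/llava) instead (a hypothetical swap, assuming you have pulled it with `ollama pull llava`), the model line in `chain.py` becomes:
```python
from langchain.chat_models import ChatOllama

# Swap in any multi-modal model you have pulled with Ollama (hypothetical example).
model = ChatOllama(model="llava", temperature=0)
```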
## Usage
To use this package, you should first have the LangChain CLI installed:
```shell
pip install -U langchain-cli
```
To create a new LangChain project and install this as the only package, you can do:
```shell
langchain app new my-app --package rag-multi-modal-local
```
If you want to add this to an existing project, you can just run:
```shell
langchain app add rag-multi-modal-local
```
And add the following code to your `server.py` file:
```python
from rag_multi_modal_local import chain as rag_multi_modal_local_chain
add_routes(app, rag_multi_modal_local_chain, path="/rag-multi-modal-local")
```
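If you are creating `server.py` from scratch, a minimal file might look like the following sketch (the host and port here are assumptions; adjust as needed):
```python
from fastapi import FastAPI
from langserve import add_routes

from rag_multi_modal_local import chain as rag_multi_modal_local_chain

app = FastAPI()
add_routes(app, rag_multi_modal_local_chain, path="/rag-multi-modal-local")

if __name__ == "__main__":
    import uvicorn

    uvicorn.run(app, host="0.0.0.0", port=8000)
```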
(Optional) Let's now configure LangSmith.
LangSmith will help us trace, monitor, and debug LangChain applications.
LangSmith is currently in private beta; you can sign up [here](https://smith.langchain.com/).
If you don't have access, you can skip this section.
```shell
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=<your-api-key>
export LANGCHAIN_PROJECT=<your-project> # if not specified, defaults to "default"
```
If you are inside this directory, then you can spin up a LangServe instance directly by running:
```shell
langchain serve
```
This will start the FastAPI app with a server running locally at
[http://localhost:8000](http://localhost:8000).
We can see all templates at [http://127.0.0.1:8000/docs](http://127.0.0.1:8000/docs)
We can access the playground at [http://127.0.0.1:8000/rag-multi-modal-local/playground](http://127.0.0.1:8000/rag-multi-modal-local/playground)
We can access the template from code with:
```python
from langserve.client import RemoteRunnable
runnable = RemoteRunnable("http://localhost:8000/rag-multi-modal-local")
```
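You can then invoke it with a natural-language question about your photos, for example:
```python
runnable.invoke("What kind of soft serve did I have?")
```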

Three binary image files added (60 KiB, 94 KiB, 56 KiB); previews not shown.


@@ -0,0 +1,35 @@
import os
from pathlib import Path
from langchain.vectorstores import Chroma
from langchain_experimental.open_clip import OpenCLIPEmbeddings
# Load images
img_dump_path = Path(__file__).parent / "docs/"
rel_img_dump_path = img_dump_path.relative_to(Path.cwd())
image_uris = sorted(
[
os.path.join(rel_img_dump_path, image_name)
for image_name in os.listdir(rel_img_dump_path)
if image_name.endswith(".jpg")
]
)
# Index
vectorstore = Path(__file__).parent / "chroma_db_multi_modal"
re_vectorstore_path = vectorstore.relative_to(Path.cwd())
# Load embedding function
print("Loading embedding function")
embedding = OpenCLIPEmbeddings(model_name="ViT-H-14", checkpoint="laion2b_s32b_b79k")
# Create chroma
vectorstore_mmembd = Chroma(
collection_name="multi-modal-rag",
persist_directory=str(Path(__file__).parent / "chroma_db_multi_modal"),
embedding_function=embedding,
)
# Add images
print("Embedding images")
vectorstore_mmembd.add_images(uris=image_uris)
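# Note: with chromadb>=0.4 (as required in pyproject.toml), added documents are
# persisted to the persist_directory automatically; no explicit persist() call is needed.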

templates/rag-multi-modal-local/poetry.lock (generated, 3490 lines)

File diff suppressed because it is too large.


@@ -0,0 +1,38 @@
[tool.poetry]
name = "rag-multi-modal-local"
version = "0.1.0"
description = "Multi-modal RAG using Chroma"
authors = [
"Lance Martin <lance@langchain.dev>",
]
readme = "README.md"
[tool.poetry.dependencies]
python = ">=3.8.1,<4.0"
langchain = ">=0.0.351"
openai = "<2"
tiktoken = ">=0.5.1"
chromadb = ">=0.4.14"
open-clip-torch = ">=2.23.0"
torch = ">=2.1.0"
langchain-experimental = "^0.0.43"
langchain-community = ">=0.0.4"
[tool.poetry.group.dev.dependencies]
langchain-cli = ">=0.0.15"
[tool.langserve]
export_module = "rag_multi_modal_local"
export_attr = "chain"
[tool.templates-hub]
use-case = "rag"
author = "LangChain"
integrations = ["Ollama", "Chroma"]
tags = ["multi-modal"]
[build-system]
requires = [
"poetry-core",
]
build-backend = "poetry.core.masonry.api"


@@ -0,0 +1,52 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "681a5d1e",
"metadata": {},
"source": [
"## Run Template\n",
"\n",
"In `server.py`, set -\n",
"```\n",
"add_routes(app, chain_rag_conv, path=\"/rag-multi-modal-local\")\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d774be2a",
"metadata": {},
"outputs": [],
"source": [
"from langserve.client import RemoteRunnable\n",
"\n",
"rag_app = RemoteRunnable(\"http://localhost:8001/rag-multi-modal-local\")\n",
"rag_app.invoke(\" < keywords here > \")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
}
},
"nbformat": 4,
"nbformat_minor": 5
}


@@ -0,0 +1,3 @@
from rag_multi_modal_local.chain import chain
__all__ = ["chain"]


@@ -0,0 +1,122 @@
import base64
import io
from pathlib import Path
from langchain.chat_models import ChatOllama
from langchain.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_core.messages import HumanMessage
from langchain_core.output_parsers import StrOutputParser
from langchain_core.pydantic_v1 import BaseModel
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_experimental.open_clip import OpenCLIPEmbeddings
from PIL import Image
def resize_base64_image(base64_string, size=(128, 128)):
"""
Resize an image encoded as a Base64 string.
:param base64_string: A Base64 encoded string of the image to be resized.
:param size: A tuple representing the new size (width, height) for the image.
:return: A Base64 encoded string of the resized image.
"""
img_data = base64.b64decode(base64_string)
img = Image.open(io.BytesIO(img_data))
resized_img = img.resize(size, Image.LANCZOS)
buffered = io.BytesIO()
resized_img.save(buffered, format=img.format)
return base64.b64encode(buffered.getvalue()).decode("utf-8")
def get_resized_images(docs):
    """
    Collect (and optionally resize) base64-encoded images from the retrieved docs.
    :param docs: A list of Documents or base64-encoded image strings.
    :return: Dict containing a list of base64-encoded image strings.
    """
b64_images = []
for doc in docs:
if isinstance(doc, Document):
doc = doc.page_content
# Optional: re-size image
# resized_image = resize_base64_image(doc, size=(1280, 720))
b64_images.append(doc)
return {"images": b64_images}
def img_prompt_func(data_dict, num_images=1):
    """
    Multi-modal LLM prompt for image analysis.
    :param data_dict: A dict with images and a user-provided question.
    :param num_images: Number of images to include in the prompt.
    :return: A list containing message objects for each image and the text prompt.
    """
messages = []
if data_dict["context"]["images"]:
for image in data_dict["context"]["images"][:num_images]:
image_message = {
"type": "image_url",
"image_url": f"data:image/jpeg;base64,{image}",
}
messages.append(image_message)
text_message = {
"type": "text",
"text": (
"You are a helpful assistant that gives a description of food pictures.\n"
"Give a detailed summary of the image.\n"
"Give reccomendations for similar foods to try.\n"
),
}
messages.append(text_message)
return [HumanMessage(content=messages)]
def multi_modal_rag_chain(retriever):
"""
    Multi-modal RAG chain.
:param retriever: A function that retrieves the necessary context for the model.
:return: A chain of functions representing the multi-modal RAG process.
"""
# Initialize the multi-modal Large Language Model with specific parameters
model = ChatOllama(model="bakllava", temperature=0)
# Define the RAG pipeline
chain = (
{
"context": retriever | RunnableLambda(get_resized_images),
"question": RunnablePassthrough(),
}
| RunnableLambda(img_prompt_func)
| model
| StrOutputParser()
)
return chain
# Load chroma
vectorstore_mmembd = Chroma(
collection_name="multi-modal-rag",
persist_directory=str(Path(__file__).parent.parent / "chroma_db_multi_modal"),
embedding_function=OpenCLIPEmbeddings(
model_name="ViT-H-14", checkpoint="laion2b_s32b_b79k"
),
)
# Make retriever
retriever_mmembd = vectorstore_mmembd.as_retriever()
# Create RAG chain
chain = multi_modal_rag_chain(retriever_mmembd)
# Add typing for input
class Question(BaseModel):
__root__: str
chain = chain.with_types(input_type=Question)
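
# Example of local use: with Ollama running, `bakllava` pulled, and the images
# indexed via `python ingest.py`, the chain can be invoked directly, e.g.:
#   chain.invoke("What kind of soft serve did I have?")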