templates: Add Ollama multi-modal templates (#14868)

Templates for [local multi-modal LLMs](https://llava-vl.github.io/llava-interactive/) using:
* Image summaries
* Multi-modal embeddings

---------

Co-authored-by: Erick Friis <erick@langchain.dev>
Lance Martin
2023-12-20 15:28:53 -08:00
committed by GitHub
parent 57d1eb733f
commit 320c3ae4c8
31 changed files with 7478 additions and 56 deletions

@@ -1,34 +1,43 @@
# rag-gemini-multi-modal
Presentations (slide decks, etc.) contain visual content that challenges conventional RAG.
Multi-modal LLMs enable text-to-image retrieval and question-answering over images, unlocking new ways to build apps over visual content like presentations.
You can ask questions in natural language about a collection of photos, retrieve relevant ones, and have a multi-modal LLM answer questions about the retrieved images.
This template performs multi-modal RAG over a slide deck, which often contains visual elements that are not captured in standard RAG.
It uses Chroma with multi-modal OpenCLIP embeddings for text-to-image retrieval and [Google Gemini](https://deepmind.google/technologies/gemini/#introduction) for answer synthesis.
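As a rough sketch of that flow (not the template's packaged chain; the persist directory, `k`, and prompt below are illustrative assumptions), a question is embedded with OpenCLIP, matching slide images are pulled back from Chroma as base64 strings, and the images are passed to Gemini for the answer:
```
# Hedged sketch of the retrieve-then-synthesize flow; assumes the Chroma collection
# has already been built by ingest.py (see below) and that GOOGLE_API_KEY is set.
from langchain_community.vectorstores import Chroma
from langchain_core.messages import HumanMessage
from langchain_experimental.open_clip import OpenCLIPEmbeddings
from langchain_google_genai import ChatGoogleGenerativeAI

vectorstore = Chroma(
    collection_name="multi-modal-rag",
    persist_directory="chroma_db_multi_modal",  # assumption: match whatever ingest.py uses
    embedding_function=OpenCLIPEmbeddings(),  # model choice is covered in the Storage section below
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 1})

question = "How many customers does Datadog have?"
# Images added to Chroma via add_images() come back as base64 strings in page_content.
docs = retriever.get_relevant_documents(question)

llm = ChatGoogleGenerativeAI(model="gemini-pro-vision")
msg = HumanMessage(
    content=[
        {"type": "text", "text": f"Answer using the slide image: {question}"},
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{docs[0].page_content}"}},
    ]
)
print(llm.invoke([msg]).content)
```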
## Input
Supply a slide deck as a PDF in the `/docs` directory.
By default, this template has a slide deck about Q3 earnings from Datadog, a public technology company.
Example questions to ask can be:
```
How many customers does Datadog have?
What is Datadog platform % Y/Y growth in FY20, FY21, and FY22?
```
To create an index of the slide deck, run:
```
poetry install
python ingest.py
```
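Under the hood, ingestion amounts to rendering each PDF page to an image, embedding the images with OpenCLIP, and storing them in Chroma. Here is a rough sketch of that step (the template's actual `ingest.py` may differ; the file name and paths below are placeholders):
```
from pathlib import Path

import pypdfium2 as pdfium
from langchain_community.vectorstores import Chroma
from langchain_experimental.open_clip import OpenCLIPEmbeddings

# Placeholders: point these at your own slide deck and output locations.
pdf_path = Path("docs/slides.pdf")
img_dir = Path("docs/images")
img_dir.mkdir(parents=True, exist_ok=True)

# Render every slide to a JPEG.
pdf = pdfium.PdfDocument(str(pdf_path))
image_uris = []
for i in range(len(pdf)):
    out = img_dir / f"slide_{i}.jpg"
    pdf[i].render(scale=2).to_pil().save(out, "JPEG")
    image_uris.append(str(out))

# Embed the slide images with OpenCLIP and persist them in Chroma.
vectorstore = Chroma(
    collection_name="multi-modal-rag",
    persist_directory="chroma_db_multi_modal",  # assumption: match the retrieval side
    embedding_function=OpenCLIPEmbeddings(),
)
vectorstore.add_images(uris=image_uris)
```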
## Storage
This template will use [OpenCLIP](https://github.com/mlfoundations/open_clip) multi-modal embeddings to embed the images.
You can select different embedding model options (see results [here](https://github.com/mlfoundations/open_clip/blob/main/docs/openclip_results.csv)).
The first time you run the app, it will automatically download the multi-modal embedding model.
By default, LangChain will use an embedding model with moderate performance but lower memory requirements, `ViT-H-14`.
You can choose alternative `OpenCLIPEmbeddings` models in `rag_chroma_multi_modal/ingest.py`:
```
vectorstore_mmembd = Chroma(
    collection_name="multi-modal-rag",
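    # The remaining arguments are a hedged sketch, not copied verbatim from the template:
    # swap model_name/checkpoint for any pair from the OpenCLIP results table linked above.
    embedding_function=OpenCLIPEmbeddings(
        model_name="ViT-H-14", checkpoint="laion2b_s32b_b79k"
    ),
)
```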
@@ -45,7 +54,7 @@ The app will retrieve images using multi-modal embeddings, and pass them to Goog
## Environment Setup
Set your `GOOGLE_API_KEY` environment variable to access Gemini.
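For example (a minimal sketch; in practice you would usually `export` the key in your shell before running the app, and the placeholder value is yours to fill in):
```
import os

# Assumes you already have a Gemini API key; setting it in-process is an
# alternative to exporting GOOGLE_API_KEY in your shell before launching the app.
os.environ["GOOGLE_API_KEY"] = "<your-google-api-key>"
```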
## Usage