Compare commits

...

5 Commits

Author SHA1 Message Date
Erick Friis 3ed4b14e37 deps 2023-11-15 12:42:00 -08:00
Lance Martin 1d09c24396 fmt 2023-11-15 09:48:02 -08:00
Lance Martin 24edaf2307 Merge branch 'master' into rlm/mm_template 2023-11-15 08:50:21 -08:00
Lance Martin 0e5f0ea935 Fmt 2023-11-15 08:43:12 -08:00
Lance Martin d348b84eab Multi-modal RAG 2023-11-14 22:12:16 -08:00
9 changed files with 5467 additions and 0 deletions

templates/rag-multi-modal/LICENSE

@@ -0,0 +1,21 @@
MIT License
Copyright (c) 2023 LangChain, Inc.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

templates/rag-multi-modal/README.md

@@ -0,0 +1,94 @@
# rag-multi-modal
This template performs RAG on documents with images.
It is configured to ingest a PDF file that contains images:
* It uses [Unstructured](https://unstructured-io.github.io/unstructured/) for PDF parsing.
* It uses [Chroma](https://www.trychroma.com/) for storage.
The file is supplied in `docs/` and set in `chain.py`:
```
fpath = "../docs/"
fname = "cj.pdf"
```
By default, it runs on a `.pdf` of [this blog post](https://cloudedjudgement.substack.com/p/clouded-judgement-111023).
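Under the hood, ingestion is driven by Unstructured's `partition_pdf`; a rough sketch of the call (the parameter values here mirror those used in `chain.py`):
```python
from unstructured.partition.pdf import partition_pdf

# Chunk the PDF by title, extracting embedded images to disk
raw_pdf_elements = partition_pdf(
    filename="../docs/cj.pdf",
    extract_images_in_pdf=True,
    infer_table_structure=True,
    chunking_strategy="by_title",
)
```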
## Environment Setup
Set the `OPENAI_API_KEY` environment variable to access the OpenAI models.
[Unstructured](https://unstructured-io.github.io/unstructured/) requires some system-level packages. You will need the following installed on your system:
* `poppler` ([installation instructions](https://pdf2image.readthedocs.io/en/latest/installation.html))
* `tesseract` ([installation instructions](https://tesseract-ocr.github.io/tessdoc/Installation.html))
On Mac, you can install the necessary packages with the following:
```shell
brew install tesseract poppler
```
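To confirm both binaries are available (a quick sanity check, assuming they are on your `PATH`):
```shell
tesseract --version
pdftoppm -v  # pdftoppm is part of poppler
```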
## Usage
To use this package, you should first have the LangChain CLI installed:
```shell
pip install -U langchain-cli
```
To create a new LangChain project and install this as the only package, you can do:
```shell
langchain app new my-app --package rag-multi-modal
```
If you want to add this to an existing project, you can just run:
```shell
langchain app add rag-multi-modal
```
And add the following code to your `server.py` file:
```python
from rag_multi_modal import chain as rag_multi_modal_chain

add_routes(app, rag_multi_modal_chain, path="/rag-multi-modal")
```
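If your `server.py` does not exist yet, a minimal sketch looks like the following (the FastAPI `app` and the `uvicorn` runner are standard LangServe scaffolding, not specific to this template):
```python
from fastapi import FastAPI
from langserve import add_routes

from rag_multi_modal import chain as rag_multi_modal_chain

app = FastAPI()

# Expose the chain at /rag-multi-modal
add_routes(app, rag_multi_modal_chain, path="/rag-multi-modal")

if __name__ == "__main__":
    import uvicorn

    uvicorn.run(app, host="0.0.0.0", port=8000)
```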
(Optional) Let's now configure LangSmith.
LangSmith will help us trace, monitor and debug LangChain applications.
LangSmith is currently in private beta; you can sign up [here](https://smith.langchain.com/).
If you don't have access, you can skip this section.
```shell
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=<your-api-key>
export LANGCHAIN_PROJECT=<your-project> # if not specified, defaults to "default"
```
If you are inside this directory, then you can spin up a LangServe instance directly by running:
```shell
langchain serve
```
This will start the FastAPI app with a server running locally at
[http://localhost:8000](http://localhost:8000).
We can see all templates at [http://127.0.0.1:8000/docs](http://127.0.0.1:8000/docs)
We can access the playground at [http://127.0.0.1:8000/rag-multi-modal/playground](http://127.0.0.1:8000/rag-multi-modal/playground)
We can access the template from code with:
```python
from langserve.client import RemoteRunnable
runnable = RemoteRunnable("http://localhost:8000/rag-multi-modal")
```
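You can then invoke the chain directly; for example (the question below is illustrative only):
```python
response = runnable.invoke("What does the blog post say about cloud software valuations?")
print(response)
```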
For more details on how to connect to the template, refer to the Jupyter notebook `rag-multi-modal`.

templates/rag-multi-modal/docs/cj.pdf
Binary file not shown.

templates/rag-multi-modal/poetry.lock (generated, 4916 lines)

File diff suppressed because it is too large.

templates/rag-multi-modal/pyproject.toml

@@ -0,0 +1,31 @@
[tool.poetry]
name = "rag-multi-modal"
version = "0.1.0"
description = ""
authors = [
"Lance Martin <lance@langchain.dev>",
]
readme = "README.md"
[tool.poetry.dependencies]
python = ">=3.8.1,<4.0"
langchain = ">=0.0.334"
tiktoken = ">=0.5.1"
chromadb = ">=0.4.14"
openai = ">=1.1.1"
unstructured = {extras = ["all-docs"], version = "^0.10.30"}
pdf2image = ">=1.16.3"
pillow = ">=10.0.1"
[tool.poetry.group.dev.dependencies]
langchain-cli = ">=0.0.15"
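# Tells the LangChain CLI / LangServe where to find the chain to serve:
# module `rag_multi_modal`, attribute `chain`.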
[tool.langserve]
export_module = "rag_multi_modal"
export_attr = "chain"
[build-system]
requires = [
"poetry-core",
]
build-backend = "poetry.core.masonry.api"

templates/rag-multi-modal/rag_multi_modal/__init__.py

@@ -0,0 +1,3 @@
from rag_multi_modal.chain import chain
__all__ = ["chain"]

templates/rag-multi-modal/rag_multi_modal/chain.py

@@ -0,0 +1,351 @@
import base64
import io
import os
import re
import uuid
from pathlib import Path
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.prompts import ChatPromptTemplate
from langchain.pydantic_v1 import BaseModel
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.schema.document import Document
from langchain.schema.messages import HumanMessage
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnableLambda, RunnablePassthrough
from langchain.storage import InMemoryStore
from langchain.vectorstores import Chroma
from PIL import Image
from unstructured.partition.pdf import partition_pdf
# Extract elements from PDF
def extract_pdf_elements(path, fname):
"""
Extract images, tables, and chunk text from a PDF file.
path: File path, which is used to dump image files
fname: File name
"""
    return partition_pdf(
filename=path + fname,
extract_images_in_pdf=True,
infer_table_structure=True,
chunking_strategy="by_title",
max_characters=4000,
new_after_n_chars=3800,
combine_text_under_n_chars=2000,
image_output_dir_path=path,
)
# Categorize elements by type
def categorize_elements(raw_pdf_elements):
"""
Categorize extracted elements from a PDF into tables and texts.
raw_pdf_elements: List of unstructured.documents.elements
"""
tables = []
texts = []
for element in raw_pdf_elements:
if "unstructured.documents.elements.Table" in str(type(element)):
tables.append(str(element))
elif "unstructured.documents.elements.CompositeElement" in str(type(element)):
texts.append(str(element))
return texts, tables
# Generate summaries of text elements
def generate_text_summaries(texts, tables, summarize_texts=False):
"""
Summarize text elements
texts: List of str
tables: List of str
summarize_texts: Bool to summarize texts
"""
# Prompt
prompt_text = """You are an assistant tasked with summarizing tables and text for \
retrieval. These summaries will be embedded and used to retrieve the raw text or \
table elements. Give a concise summary of the table or text that is well \
optimized for retrieval. Table or text: {element} """
prompt = ChatPromptTemplate.from_template(prompt_text)
# Text summary chain
model = ChatOpenAI(temperature=0, model="gpt-4")
summarize_chain = {"element": lambda x: x} | prompt | model | StrOutputParser()
# Initialize empty summaries
text_summaries = []
table_summaries = []
# Apply to text if texts are provided and summarization is requested
if texts and summarize_texts:
text_summaries = summarize_chain.batch(texts, {"max_concurrency": 5})
elif texts:
text_summaries = (
texts # Directly assign texts if summarization is not requested
)
# Apply to tables if tables are provided
if tables:
table_summaries = summarize_chain.batch(tables, {"max_concurrency": 5})
return text_summaries, table_summaries
def encode_image(image_path):
"""Getting the base64 string"""
with open(image_path, "rb") as image_file:
return base64.b64encode(image_file.read()).decode("utf-8")
def image_summarize(img_base64, prompt):
"""Make image summary"""
chat = ChatOpenAI(model="gpt-4-vision-preview", max_tokens=1024)
msg = chat.invoke(
[
HumanMessage(
content=[
{"type": "text", "text": prompt},
{
"type": "image_url",
"image_url": {"url": f"data:image/jpeg;base64,{img_base64}"},
},
]
)
]
)
return msg.content
def generate_img_summaries(path):
"""
Generate summaries and base64 encoded strings for images
path: Path to list of .jpg files extracted by Unstructured
"""
# Store base64 encoded images
img_base64_list = []
# Store image summaries
image_summaries = []
# Prompt
prompt = """You are an assistant tasked with summarizing images for retrieval. \
These summaries will be embedded and used to retrieve the raw image. \
Give a concise summary of the image that is well optimized for retrieval."""
# Apply to images
for img_file in sorted(os.listdir(path)):
if img_file.endswith(".jpg"):
img_path = os.path.join(path, img_file)
base64_image = encode_image(img_path)
img_base64_list.append(base64_image)
image_summaries.append(image_summarize(base64_image, prompt))
return img_base64_list, image_summaries
def create_multi_vector_retriever(
vectorstore, text_summaries, texts, table_summaries, tables, image_summaries, images
):
"""
Create retriever that indexes summaries, but returns raw images or texts
"""
# Initialize the storage layer
store = InMemoryStore()
id_key = "doc_id"
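    # The vectorstore indexes the embedded summaries; the docstore holds the
    # raw text / table / image payloads, linked back to them by id_key.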
# Create the multi-vector retriever
retriever = MultiVectorRetriever(
vectorstore=vectorstore,
docstore=store,
id_key=id_key,
)
# Helper function to add documents to the vectorstore and docstore
def add_documents(retriever, doc_summaries, doc_contents):
doc_ids = [str(uuid.uuid4()) for _ in doc_contents]
summary_docs = [
Document(page_content=s, metadata={id_key: doc_ids[i]})
for i, s in enumerate(doc_summaries)
]
retriever.vectorstore.add_documents(summary_docs)
retriever.docstore.mset(list(zip(doc_ids, doc_contents)))
# Add texts, tables, and images
# Check that text_summaries is not empty before adding
if text_summaries:
add_documents(retriever, text_summaries, texts)
# Check that table_summaries is not empty before adding
if table_summaries:
add_documents(retriever, table_summaries, tables)
# Check that image_summaries is not empty before adding
if image_summaries:
add_documents(retriever, image_summaries, images)
return retriever
def looks_like_base64(sb):
"""
Check if the string looks like base64.
"""
return re.match("^[A-Za-z0-9+/]+[=]{0,2}$", sb) is not None
def is_image_data(b64data):
"""
Check if the base64 data is an image by looking at the start of the data.
"""
image_signatures = {
b"\xFF\xD8\xFF": "jpg",
b"\x89\x50\x4E\x47\x0D\x0A\x1A\x0A": "png",
b"\x47\x49\x46\x38": "gif",
b"\x52\x49\x46\x46": "webp",
}
try:
header = base64.b64decode(b64data)[:8] # Decode and get the first 8 bytes
        for sig in image_signatures:
if header.startswith(sig):
return True
return False
except Exception:
return False
def resize_base64_image(base64_string, size=(128, 128)):
"""
Resize an image encoded as a Base64 string.
base64_string (str): Base64 string of the original image.
size (tuple): Desired size of the image as (width, height).
"""
# Decode the Base64 string
img_data = base64.b64decode(base64_string)
img = Image.open(io.BytesIO(img_data))
# Resize the image
resized_img = img.resize(size, Image.LANCZOS)
# Save the resized image to a bytes buffer
buffered = io.BytesIO()
resized_img.save(buffered, format=img.format)
# Encode the resized image to Base64
return base64.b64encode(buffered.getvalue()).decode("utf-8")
def split_image_text_types(docs):
"""
Split base64-encoded images and texts.
"""
b64_images = []
texts = []
for doc in docs:
# Check if the document is of type Document and extract page_content if so
if isinstance(doc, Document):
doc = doc.page_content
if looks_like_base64(doc) and is_image_data(doc):
doc = resize_base64_image(doc, size=(250, 250))
b64_images.append(doc)
else:
texts.append(doc)
return {"images": b64_images, "texts": texts}
def img_prompt_func(data_dict):
# Joining the context texts into a single string
formatted_texts = "\n".join(data_dict["context"]["texts"])
messages = []
# Adding image(s) to the messages if present
if data_dict["context"]["images"]:
for image in data_dict["context"]["images"]:
image_message = {
"type": "image_url",
"image_url": {"url": f"data:image/jpeg;base64,{image}"},
}
messages.append(image_message)
# Adding the text message for analysis
text_message = {
"type": "text",
"text": (
"Answer the question based only on the provided context, "
"which can include text, tables, and image(s). If an image is "
"provided, analyze it carefully to help answer the question.\n"
f"User-provided question / keywords: {data_dict['question']}\n\n"
"Text and / or tables:\n"
f"{formatted_texts}"
),
}
messages.append(text_message)
return [HumanMessage(content=messages)]
def multi_modal_rag_chain(retriever):
"""
Multi-modal RAG chain
"""
# Multi-modal LLM
model = ChatOpenAI(temperature=0, model="gpt-4-vision-preview", max_tokens=1024)
# RAG pipeline
chain = (
{
"context": retriever | RunnableLambda(split_image_text_types),
"question": RunnablePassthrough(),
}
| RunnableLambda(img_prompt_func)
| model
| StrOutputParser()
)
return chain
# File path
fpath = str(Path(__file__).parent.parent / "docs") + "/"
fname = "cj.pdf"
# Get elements
raw_pdf_elements = extract_pdf_elements(fpath, fname)
# Get text, tables
texts, tables = categorize_elements(raw_pdf_elements)
# Get text, table summaries
text_summaries, table_summaries = generate_text_summaries(texts, tables)
# Image summaries
img_base64_list, image_summaries = generate_img_summaries(fpath)
# The vectorstore to use to index the summaries
vectorstore = Chroma(
collection_name="multi_vector_img", embedding_function=OpenAIEmbeddings()
)
# Create retriever
retriever_multi_vector_img = create_multi_vector_retriever(
vectorstore,
text_summaries,
texts,
table_summaries,
tables,
image_summaries,
img_base64_list,
)
# Create RAG chain
chain = multi_modal_rag_chain(retriever_multi_vector_img)
# Add typing for input
class Question(BaseModel):
__root__: str
chain = chain.with_types(input_type=Question)

templates/rag-multi-modal/rag_multi_modal.ipynb

@@ -0,0 +1,51 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "30fc2c27",
"metadata": {},
"source": [
"## Run Template\n",
"\n",
"In `server.py`, set -\n",
"```\n",
"add_routes(app, chain_rag_conv, path=\"/rag-semi-structured\")\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "65f5b560",
"metadata": {},
"outputs": [],
"source": [
"from langserve.client import RemoteRunnable\n",
"\n",
"rag_app = RemoteRunnable(\"http://localhost:8001/rag-semi-structured\")\n",
"rag_app.invoke(\"How does agent memory work?\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
}
},
"nbformat": 4,
"nbformat_minor": 5
}