diff --git a/docs/docs/contributing/how_to/documentation/setup.mdx b/docs/docs/contributing/how_to/documentation/setup.mdx
index edbcb3ca1b0..d7ad896d74a 100644
--- a/docs/docs/contributing/how_to/documentation/setup.mdx
+++ b/docs/docs/contributing/how_to/documentation/setup.mdx
@@ -50,11 +50,6 @@ locally to ensure that it looks good and is free of errors.
If you're unable to build it locally that's okay as well, as you will be able to
see a preview of the documentation on the pull request page.
-From the **monorepo root**, run the following command to install the dependencies:
-
-```bash
-poetry install --with lint,docs --no-root
-````
### Building
@@ -158,14 +153,6 @@ the working directory to the `langchain-community` directory:
cd [root]/libs/langchain-community
```
-Set up a virtual environment for the package if you haven't done so already.
-
-Install the dependencies for the package.
-
-```bash
-poetry install --with lint
-```
-
Then you can run the following commands to lint and format the in-code documentation:
```bash
diff --git a/docs/docs/integrations/document_loaders/pymupdf4llm.ipynb b/docs/docs/integrations/document_loaders/pymupdf4llm.ipynb
new file mode 100644
index 00000000000..674701ee59b
--- /dev/null
+++ b/docs/docs/integrations/document_loaders/pymupdf4llm.ipynb
@@ -0,0 +1,721 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "sidebar_label: PyMuPDF4LLM\n",
+ "---"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# PyMuPDF4LLMLoader\n",
+ "\n",
+ "This notebook provides a quick overview for getting started with PyMuPDF4LLM [document loader](https://python.langchain.com/docs/concepts/#document-loaders). For detailed documentation of all PyMuPDF4LLMLoader features and configurations head to the [GitHub repository](https://github.com/lakinduboteju/langchain-pymupdf4llm).\n",
+ "\n",
+ "## Overview\n",
+ "\n",
+ "### Integration details\n",
+ "\n",
+ "| Class | Package | Local | Serializable | JS support |\n",
+ "| :--- | :--- | :---: | :---: | :---: |\n",
+ "| [PyMuPDF4LLMLoader](https://github.com/lakinduboteju/langchain-pymupdf4llm) | [langchain_pymupdf4llm](https://pypi.org/project/langchain-pymupdf4llm) | ✅ | ❌ | ❌ |\n",
+ "\n",
+ "### Loader features\n",
+ "\n",
+ "| Source | Document Lazy Loading | Native Async Support | Extract Images | Extract Tables |\n",
+ "| :---: | :---: | :---: | :---: | :---: |\n",
+ "| PyMuPDF4LLMLoader | ✅ | ❌ | ✅ | ✅ |\n",
+ "\n",
+ "## Setup\n",
+ "\n",
+ "To access PyMuPDF4LLM document loader you'll need to install the `langchain-pymupdf4llm` integration package.\n",
+ "\n",
+ "### Credentials\n",
+ "\n",
+ "No credentials are required to use PyMuPDF4LLMLoader."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "If you want to get automated best in-class tracing of your model calls you can also set your [LangSmith](https://docs.smith.langchain.com/) API key by uncommenting below:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# os.environ[\"LANGSMITH_API_KEY\"] = getpass.getpass(\"Enter your LangSmith API key: \")\n",
+ "# os.environ[\"LANGSMITH_TRACING\"] = \"true\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Installation\n",
+ "\n",
+ "Install **langchain_community** and **langchain-pymupdf4llm**."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Note: you may need to restart the kernel to use updated packages.\n"
+ ]
+ }
+ ],
+ "source": [
+ "%pip install -qU langchain_community langchain-pymupdf4llm"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Initialization\n",
+ "\n",
+ "Now we can instantiate our model object and load documents:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from langchain_pymupdf4llm import PyMuPDF4LLMLoader\n",
+ "\n",
+ "file_path = \"./example_data/layout-parser-paper.pdf\"\n",
+ "loader = PyMuPDF4LLMLoader(file_path)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Load"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "Document(metadata={'producer': 'pdfTeX-1.40.21', 'creator': 'LaTeX with hyperref', 'creationdate': '2021-06-22T01:27:10+00:00', 'source': './example_data/layout-parser-paper.pdf', 'file_path': './example_data/layout-parser-paper.pdf', 'total_pages': 16, 'format': 'PDF 1.5', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'moddate': '2021-06-22T01:27:10+00:00', 'trapped': '', 'modDate': 'D:20210622012710Z', 'creationDate': 'D:20210622012710Z', 'page': 0}, page_content='```\\nLayoutParser: A Unified Toolkit for Deep\\n\\n## Learning Based Document Image Analysis\\n\\n```\\n\\nZejiang Shen[1] (�), Ruochen Zhang[2], Melissa Dell[3], Benjamin Charles Germain\\nLee[4], Jacob Carlson[3], and Weining Li[5]\\n\\n1 Allen Institute for AI\\n```\\n shannons@allenai.org\\n\\n```\\n2 Brown University\\n```\\n ruochen zhang@brown.edu\\n\\n```\\n3 Harvard University\\n_{melissadell,jacob carlson}@fas.harvard.edu_\\n4 University of Washington\\n```\\n bcgl@cs.washington.edu\\n\\n```\\n5 University of Waterloo\\n```\\n w422li@uwaterloo.ca\\n\\n```\\n\\n**Abstract. Recent advances in document image analysis (DIA) have been**\\nprimarily driven by the application of neural networks. Ideally, research\\noutcomes could be easily deployed in production and extended for further\\ninvestigation. However, various factors like loosely organized codebases\\nand sophisticated model configurations complicate the easy reuse of important innovations by a wide audience. Though there have been on-going\\nefforts to improve reusability and simplify deep learning (DL) model\\ndevelopment in disciplines like natural language processing and computer\\nvision, none of them are optimized for challenges in the domain of DIA.\\nThis represents a major gap in the existing toolkit, as DIA is central to\\nacademic research across a wide range of disciplines in the social sciences\\nand humanities. 
This paper introduces LayoutParser, an open-source\\nlibrary for streamlining the usage of DL in DIA research and applications. The core LayoutParser library comes with a set of simple and\\nintuitive interfaces for applying and customizing DL models for layout detection, character recognition, and many other document processing tasks.\\nTo promote extensibility, LayoutParser also incorporates a community\\nplatform for sharing both pre-trained models and full document digitization pipelines. We demonstrate that LayoutParser is helpful for both\\nlightweight and large-scale digitization pipelines in real-word use cases.\\n[The library is publicly available at https://layout-parser.github.io.](https://layout-parser.github.io)\\n\\n**Keywords: Document Image Analysis · Deep Learning · Layout Analysis**\\n\\n - Character Recognition · Open Source library · Toolkit.\\n\\n### 1 Introduction\\n\\n\\nDeep Learning(DL)-based approaches are the state-of-the-art for a wide range of\\ndocument image analysis (DIA) tasks including document image classification [11,\\n\\n')"
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "docs = loader.load()\n",
+ "docs[0]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "{'producer': 'pdfTeX-1.40.21',\n",
+ " 'creator': 'LaTeX with hyperref',\n",
+ " 'creationdate': '2021-06-22T01:27:10+00:00',\n",
+ " 'source': './example_data/layout-parser-paper.pdf',\n",
+ " 'file_path': './example_data/layout-parser-paper.pdf',\n",
+ " 'total_pages': 16,\n",
+ " 'format': 'PDF 1.5',\n",
+ " 'title': '',\n",
+ " 'author': '',\n",
+ " 'subject': '',\n",
+ " 'keywords': '',\n",
+ " 'moddate': '2021-06-22T01:27:10+00:00',\n",
+ " 'trapped': '',\n",
+ " 'modDate': 'D:20210622012710Z',\n",
+ " 'creationDate': 'D:20210622012710Z',\n",
+ " 'page': 0}\n"
+ ]
+ }
+ ],
+ "source": [
+ "import pprint\n",
+ "\n",
+ "pprint.pp(docs[0].metadata)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Lazy Load"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "6"
+ ]
+ },
+ "execution_count": 6,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "pages = []\n",
+ "for doc in loader.lazy_load():\n",
+ " pages.append(doc)\n",
+ " if len(pages) >= 10:\n",
+ " # do some paged operation, e.g.\n",
+ " # index.upsert(page)\n",
+ "\n",
+ " pages = []\n",
+ "len(pages)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from IPython.display import Markdown, display\n",
+ "\n",
+ "part = pages[0].page_content[778:1189]\n",
+ "print(part)\n",
+ "# Markdown rendering\n",
+ "display(Markdown(part))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 23,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "{'producer': 'pdfTeX-1.40.21',\n",
+ " 'creator': 'LaTeX with hyperref',\n",
+ " 'creationdate': '2021-06-22T01:27:10+00:00',\n",
+ " 'source': './example_data/layout-parser-paper.pdf',\n",
+ " 'file_path': './example_data/layout-parser-paper.pdf',\n",
+ " 'total_pages': 16,\n",
+ " 'format': 'PDF 1.5',\n",
+ " 'title': '',\n",
+ " 'author': '',\n",
+ " 'subject': '',\n",
+ " 'keywords': '',\n",
+ " 'moddate': '2021-06-22T01:27:10+00:00',\n",
+ " 'trapped': '',\n",
+ " 'modDate': 'D:20210622012710Z',\n",
+ " 'creationDate': 'D:20210622012710Z',\n",
+ " 'page': 10}\n"
+ ]
+ }
+ ],
+ "source": [
+ "pprint.pp(pages[0].metadata)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The metadata attribute contains at least the following keys:\n",
+ "- source\n",
+ "- page (if in mode *page*)\n",
+ "- total_page\n",
+ "- creationdate\n",
+ "- creator\n",
+ "- producer\n",
+ "\n",
+ "Additional metadata are specific to each parser.\n",
+ "These pieces of information can be helpful (to categorize your PDFs for example)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Splitting mode & custom pages delimiter"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "When loading the PDF file you can split it in two different ways:\n",
+ "- By page\n",
+ "- As a single text flow\n",
+ "\n",
+ "By default PyMuPDF4LLMLoader will split the PDF by page."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Extract the PDF by page. Each page is extracted as a langchain Document object:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "16\n",
+ "{'producer': 'pdfTeX-1.40.21',\n",
+ " 'creator': 'LaTeX with hyperref',\n",
+ " 'creationdate': '2021-06-22T01:27:10+00:00',\n",
+ " 'source': './example_data/layout-parser-paper.pdf',\n",
+ " 'file_path': './example_data/layout-parser-paper.pdf',\n",
+ " 'total_pages': 16,\n",
+ " 'format': 'PDF 1.5',\n",
+ " 'title': '',\n",
+ " 'author': '',\n",
+ " 'subject': '',\n",
+ " 'keywords': '',\n",
+ " 'moddate': '2021-06-22T01:27:10+00:00',\n",
+ " 'trapped': '',\n",
+ " 'modDate': 'D:20210622012710Z',\n",
+ " 'creationDate': 'D:20210622012710Z',\n",
+ " 'page': 0}\n"
+ ]
+ }
+ ],
+ "source": [
+ "loader = PyMuPDF4LLMLoader(\n",
+ " \"./example_data/layout-parser-paper.pdf\",\n",
+ " mode=\"page\",\n",
+ ")\n",
+ "docs = loader.load()\n",
+ "\n",
+ "print(len(docs))\n",
+ "pprint.pp(docs[0].metadata)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "In this mode the pdf is split by pages and the resulting Documents metadata contains the `page` (page number). But in some cases we could want to process the pdf as a single text flow (so we don't cut some paragraphs in half). In this case you can use the *single* mode :"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Extract the whole PDF as a single langchain Document object:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "1\n",
+ "{'producer': 'pdfTeX-1.40.21',\n",
+ " 'creator': 'LaTeX with hyperref',\n",
+ " 'creationdate': '2021-06-22T01:27:10+00:00',\n",
+ " 'source': './example_data/layout-parser-paper.pdf',\n",
+ " 'file_path': './example_data/layout-parser-paper.pdf',\n",
+ " 'total_pages': 16,\n",
+ " 'format': 'PDF 1.5',\n",
+ " 'title': '',\n",
+ " 'author': '',\n",
+ " 'subject': '',\n",
+ " 'keywords': '',\n",
+ " 'moddate': '2021-06-22T01:27:10+00:00',\n",
+ " 'trapped': '',\n",
+ " 'modDate': 'D:20210622012710Z',\n",
+ " 'creationDate': 'D:20210622012710Z'}\n"
+ ]
+ }
+ ],
+ "source": [
+ "loader = PyMuPDF4LLMLoader(\n",
+ " \"./example_data/layout-parser-paper.pdf\",\n",
+ " mode=\"single\",\n",
+ ")\n",
+ "docs = loader.load()\n",
+ "\n",
+ "print(len(docs))\n",
+ "pprint.pp(docs[0].metadata)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Logically, in this mode, the `page` (page_number) metadata disappears. Here's how to clearly identify where pages end in the text flow :"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Add a custom *pages_delimiter* to identify where are ends of pages in *single* mode:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "loader = PyMuPDF4LLMLoader(\n",
+ " \"./example_data/layout-parser-paper.pdf\",\n",
+ " mode=\"single\",\n",
+ " pages_delimiter=\"\\n-------THIS IS A CUSTOM END OF PAGE-------\\n\\n\",\n",
+ ")\n",
+ "docs = loader.load()\n",
+ "\n",
+ "part = docs[0].page_content[10663:11317]\n",
+ "print(part)\n",
+ "display(Markdown(part))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The default `pages_delimiter` is \\n-----\\n\\n.\n",
+ "But this could simply be \\n, or \\f to clearly indicate a page change, or \\ for seamless injection in a Markdown viewer without a visual effect."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Extract images from the PDF"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "You can extract images from your PDFs (in text form) with a choice of three different solutions:\n",
+ "- rapidOCR (lightweight Optical Character Recognition tool)\n",
+ "- Tesseract (OCR tool with high precision)\n",
+ "- Multimodal language model\n",
+ "\n",
+ "The result is inserted at the end of text of the page."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Extract images from the PDF with rapidOCR:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Note: you may need to restart the kernel to use updated packages.\n"
+ ]
+ }
+ ],
+ "source": [
+ "%pip install -qU rapidocr-onnxruntime pillow"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from langchain_community.document_loaders.parsers import RapidOCRBlobParser\n",
+ "\n",
+ "loader = PyMuPDF4LLMLoader(\n",
+ " \"./example_data/layout-parser-paper.pdf\",\n",
+ " mode=\"page\",\n",
+ " extract_images=True,\n",
+ " images_parser=RapidOCRBlobParser(),\n",
+ ")\n",
+ "docs = loader.load()\n",
+ "\n",
+ "part = docs[5].page_content[1863:]\n",
+ "print(part)\n",
+ "display(Markdown(part))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Be careful, RapidOCR is designed to work with Chinese and English, not other languages."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Extract images from the PDF with Tesseract:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Note: you may need to restart the kernel to use updated packages.\n"
+ ]
+ }
+ ],
+ "source": [
+ "%pip install -qU pytesseract"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from langchain_community.document_loaders.parsers import TesseractBlobParser\n",
+ "\n",
+ "loader = PyMuPDF4LLMLoader(\n",
+ " \"./example_data/layout-parser-paper.pdf\",\n",
+ " mode=\"page\",\n",
+ " extract_images=True,\n",
+ " images_parser=TesseractBlobParser(),\n",
+ ")\n",
+ "docs = loader.load()\n",
+ "\n",
+ "print(docs[5].page_content[1863:])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Extract images from the PDF with multimodal model:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 38,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Note: you may need to restart the kernel to use updated packages.\n"
+ ]
+ }
+ ],
+ "source": [
+ "%pip install -qU langchain_openai"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 39,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "True"
+ ]
+ },
+ "execution_count": 39,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "import os\n",
+ "\n",
+ "from dotenv import load_dotenv\n",
+ "\n",
+ "load_dotenv()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 40,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from getpass import getpass\n",
+ "\n",
+ "if not os.environ.get(\"OPENAI_API_KEY\"):\n",
+ " os.environ[\"OPENAI_API_KEY\"] = getpass(\"OpenAI API key =\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from langchain_community.document_loaders.parsers import LLMImageBlobParser\n",
+ "from langchain_openai import ChatOpenAI\n",
+ "\n",
+ "loader = PyMuPDF4LLMLoader(\n",
+ " \"./example_data/layout-parser-paper.pdf\",\n",
+ " mode=\"page\",\n",
+ " extract_images=True,\n",
+ " images_parser=LLMImageBlobParser(\n",
+ " model=ChatOpenAI(model=\"gpt-4o-mini\", max_tokens=1024)\n",
+ " ),\n",
+ ")\n",
+ "docs = loader.load()\n",
+ "\n",
+ "print(docs[5].page_content[1863:])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Extract tables from the PDF\n",
+ "\n",
+ "With PyMUPDF4LLM you can extract tables from your PDFs in *markdown* format :"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "loader = PyMuPDF4LLMLoader(\n",
+ " \"./example_data/layout-parser-paper.pdf\",\n",
+ " mode=\"page\",\n",
+ " # \"lines_strict\" is the default strategy and\n",
+ " # is the most accurate for tables with column and row lines,\n",
+ " # but may not work well with all documents.\n",
+ " # \"lines\" is a less strict strategy that may work better with\n",
+ " # some documents.\n",
+ " # \"text\" is the least strict strategy and may work better\n",
+ " # with documents that do not have tables with lines.\n",
+ " table_strategy=\"lines\",\n",
+ ")\n",
+ "docs = loader.load()\n",
+ "\n",
+ "part = docs[4].page_content[3210:]\n",
+ "print(part)\n",
+ "display(Markdown(part))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Working with Files\n",
+ "\n",
+ "Many document loaders involve parsing files. The difference between such loaders usually stems from how the file is parsed, rather than how the file is loaded. For example, you can use `open` to read the binary content of either a PDF or a markdown file, but you need different parsing logic to convert that binary data into text.\n",
+ "\n",
+ "As a result, it can be helpful to decouple the parsing logic from the loading logic, which makes it easier to re-use a given parser regardless of how the data was loaded.\n",
+ "You can use this strategy to analyze different files, with the same parsing parameters."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from langchain_community.document_loaders import FileSystemBlobLoader\n",
+ "from langchain_community.document_loaders.generic import GenericLoader\n",
+ "from langchain_pymupdf4llm import PyMuPDF4LLMParser\n",
+ "\n",
+ "loader = GenericLoader(\n",
+ " blob_loader=FileSystemBlobLoader(\n",
+ " path=\"./example_data/\",\n",
+ " glob=\"*.pdf\",\n",
+ " ),\n",
+ " blob_parser=PyMuPDF4LLMParser(),\n",
+ ")\n",
+ "docs = loader.load()\n",
+ "\n",
+ "part = docs[0].page_content[:562]\n",
+ "print(part)\n",
+ "display(Markdown(part))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## API reference\n",
+ "\n",
+ "For detailed documentation of all PyMuPDF4LLMLoader features and configurations head to the GitHub repository: https://github.com/lakinduboteju/langchain-pymupdf4llm"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.9.21"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/docs/docs/integrations/providers/kinetica.mdx b/docs/docs/integrations/providers/kinetica.mdx
index 23da3da8bc2..df14e1ae8bc 100644
--- a/docs/docs/integrations/providers/kinetica.mdx
+++ b/docs/docs/integrations/providers/kinetica.mdx
@@ -20,7 +20,7 @@ from langchain_community.chat_models.kinetica import ChatKinetica
 The Kinetica vectorstore wrapper leverages Kinetica's native support for [vector
similarity search](https://docs.kinetica.com/7.2/vector_search/).
-See [Kinetica Vectorsore API](/docs/integrations/vectorstores/kinetica) for usage.
+See [Kinetica Vectorstore API](/docs/integrations/vectorstores/kinetica) for usage.
```python
from langchain_community.vectorstores import Kinetica
@@ -28,8 +28,8 @@ from langchain_community.vectorstores import Kinetica
## Document Loader
-The Kinetica Document loader can be used to load LangChain Documents from the
-Kinetica database.
+The Kinetica Document loader can be used to load LangChain [Documents](https://python.langchain.com/api_reference/core/documents/langchain_core.documents.base.Document.html) from the
+[Kinetica](https://www.kinetica.com/) database.
See [Kinetica Document Loader](/docs/integrations/document_loaders/kinetica) for usage
diff --git a/docs/docs/integrations/providers/pymupdf4llm.ipynb b/docs/docs/integrations/providers/pymupdf4llm.ipynb
new file mode 100644
index 00000000000..17ddcc85dc2
--- /dev/null
+++ b/docs/docs/integrations/providers/pymupdf4llm.ipynb
@@ -0,0 +1,59 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# PyMuPDF4LLM\n",
+ "\n",
+ "[PyMuPDF4LLM](https://pymupdf.readthedocs.io/en/latest/pymupdf4llm) is aimed to make it easier to extract PDF content in Markdown format, needed for LLM & RAG applications.\n",
+ "\n",
+ "[langchain-pymupdf4llm](https://github.com/lakinduboteju/langchain-pymupdf4llm) integrates PyMuPDF4LLM to LangChain as a Document Loader."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "%pip install -qU langchain-pymupdf4llm"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "y8ku6X96sebl"
+ },
+ "outputs": [],
+ "source": [
+ "from langchain_pymupdf4llm import PyMuPDF4LLMLoader, PyMuPDF4LLMParser"
+ ]
+ }
+ ],
+ "metadata": {
+ "colab": {
+ "provenance": []
+ },
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.11"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 1
+}
diff --git a/docs/docs/integrations/vectorstores/milvus.ipynb b/docs/docs/integrations/vectorstores/milvus.ipynb
index 77aad50e555..e147541a04a 100644
--- a/docs/docs/integrations/vectorstores/milvus.ipynb
+++ b/docs/docs/integrations/vectorstores/milvus.ipynb
@@ -1,613 +1,734 @@
{
- "cells": [
- {
- "cell_type": "markdown",
- "id": "683953b3",
- "metadata": {
- "id": "683953b3"
- },
- "source": [
- "# Milvus\n",
- "\n",
- ">[Milvus](https://milvus.io/docs/overview.md) is a database that stores, indexes, and manages massive embedding vectors generated by deep neural networks and other machine learning (ML) models.\n",
- "\n",
- "This notebook shows how to use functionality related to the Milvus vector database.\n",
- "\n",
- "## Setup\n",
- "\n",
- "You'll need to install `langchain-milvus` with `pip install -qU langchain-milvus` to use this integration.\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "a62cff8a-bcf7-4e33-bbbc-76999c2e3e20",
- "metadata": {
- "id": "a62cff8a-bcf7-4e33-bbbc-76999c2e3e20",
- "tags": []
- },
- "outputs": [],
- "source": [
- "%pip install -qU langchain_milvus"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "633addc3",
- "metadata": {
- "id": "633addc3"
- },
- "source": [
- "The latest version of `pymilvus` comes with a local vector database called Milvus Lite, which is good for prototyping. If you have a large amount of data (e.g., more than a million vectors), we recommend setting up a more performant Milvus server on [Docker](https://milvus.io/docs/install_standalone-docker.md#Start-Milvus) or [Kubernetes](https://milvus.io/docs/install_cluster-milvusoperator.md).\n",
- "\n",
- "### Credentials\n",
- "\n",
- "No credentials are needed to use the `Milvus` vector store.\n",
- "\n",
- "## Initialization\n",
- "\n",
- "import EmbeddingTabs from \"@theme/EmbeddingTabs\";\n",
- "\n",
- "\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "a7dd253f",
- "metadata": {
- "id": "a7dd253f"
- },
- "outputs": [],
- "source": [
- "# | output: false\n",
- "# | echo: false\n",
- "from langchain_openai import OpenAIEmbeddings\n",
- "\n",
- "embeddings = OpenAIEmbeddings(model=\"text-embedding-3-large\")"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "dcf88bdf",
- "metadata": {
- "id": "dcf88bdf",
- "tags": []
- },
- "outputs": [],
- "source": [
- "from langchain_milvus import Milvus\n",
- "\n",
- "# The easiest way is to use Milvus Lite where everything is stored in a local file.\n",
- "# If you have a Milvus server you can use the server URI such as \"http://localhost:19530\".\n",
- "URI = \"./milvus_example.db\"\n",
- "\n",
- "vector_store = Milvus(\n",
- " embedding_function=embeddings,\n",
- " connection_args={\"uri\": URI},\n",
- " # Set index_params if needed\n",
- " index_params={\"index_type\": \"FLAT\", \"metric_type\": \"L2\"},\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "cae1a7d5",
- "metadata": {
- "id": "cae1a7d5"
- },
- "source": [
- "### Compartmentalize the data with Milvus Collections\n",
- "\n",
- "You can store unrelated documents in different collections within the same Milvus instance."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "c07cd24b",
- "metadata": {
- "id": "c07cd24b"
- },
- "source": [
- "Here's how you can create a new collection:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "c6f4973d",
- "metadata": {
- "id": "c6f4973d"
- },
- "outputs": [],
- "source": [
- "from langchain_core.documents import Document\n",
- "\n",
- "vector_store_saved = Milvus.from_documents(\n",
- " [Document(page_content=\"foo!\")],\n",
- " embeddings,\n",
- " collection_name=\"langchain_example\",\n",
- " connection_args={\"uri\": URI},\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "3b12df8c",
- "metadata": {
- "id": "3b12df8c"
- },
- "source": [
- "And here is how you retrieve that stored collection:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "12817d16",
- "metadata": {
- "id": "12817d16"
- },
- "outputs": [],
- "source": [
- "vector_store_loaded = Milvus(\n",
- " embeddings,\n",
- " connection_args={\"uri\": URI},\n",
- " collection_name=\"langchain_example\",\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "f1fc3818",
- "metadata": {
- "id": "f1fc3818"
- },
- "source": [
- "## Manage vector store\n",
- "\n",
- "Once you have created your vector store, we can interact with it by adding and deleting different items.\n",
- "\n",
- "### Add items to vector store\n",
- "\n",
- "We can add items to our vector store by using the `add_documents` function."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "3ced24f6",
- "metadata": {
- "id": "3ced24f6",
- "outputId": "9c57a6bb-86eb-456c-f007-6cabd6865299"
- },
- "outputs": [
- {
- "data": {
- "text/plain": [
- "['b0248595-2a41-4f6b-9c25-3a24c1278bb3',\n",
- " 'fa642726-5329-4495-a072-187e948dd71f',\n",
- " '9905001c-a4a3-455e-ab94-72d0ed11b476',\n",
- " 'eacc7256-d7fa-4036-b1f7-83d7a4bee0c5',\n",
- " '7508f7ff-c0c9-49ea-8189-634f8a0244d8',\n",
- " '2e179609-3ff7-4c6a-9e05-08978903fe26',\n",
- " 'fab1f2ac-43e1-45f9-b81b-fc5d334c6508',\n",
- " '1206d237-ee3a-484f-baf2-b5ac38eeb314',\n",
- " 'd43cbf9a-a772-4c40-993b-9439065fec01',\n",
- " '25e667bb-6f09-4574-a368-661069301906']"
- ]
- },
- "execution_count": 31,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "from uuid import uuid4\n",
- "\n",
- "from langchain_core.documents import Document\n",
- "\n",
- "document_1 = Document(\n",
- " page_content=\"I had chocalate chip pancakes and scrambled eggs for breakfast this morning.\",\n",
- " metadata={\"source\": \"tweet\"},\n",
- ")\n",
- "\n",
- "document_2 = Document(\n",
- " page_content=\"The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.\",\n",
- " metadata={\"source\": \"news\"},\n",
- ")\n",
- "\n",
- "document_3 = Document(\n",
- " page_content=\"Building an exciting new project with LangChain - come check it out!\",\n",
- " metadata={\"source\": \"tweet\"},\n",
- ")\n",
- "\n",
- "document_4 = Document(\n",
- " page_content=\"Robbers broke into the city bank and stole $1 million in cash.\",\n",
- " metadata={\"source\": \"news\"},\n",
- ")\n",
- "\n",
- "document_5 = Document(\n",
- " page_content=\"Wow! That was an amazing movie. I can't wait to see it again.\",\n",
- " metadata={\"source\": \"tweet\"},\n",
- ")\n",
- "\n",
- "document_6 = Document(\n",
- " page_content=\"Is the new iPhone worth the price? Read this review to find out.\",\n",
- " metadata={\"source\": \"website\"},\n",
- ")\n",
- "\n",
- "document_7 = Document(\n",
- " page_content=\"The top 10 soccer players in the world right now.\",\n",
- " metadata={\"source\": \"website\"},\n",
- ")\n",
- "\n",
- "document_8 = Document(\n",
- " page_content=\"LangGraph is the best framework for building stateful, agentic applications!\",\n",
- " metadata={\"source\": \"tweet\"},\n",
- ")\n",
- "\n",
- "document_9 = Document(\n",
- " page_content=\"The stock market is down 500 points today due to fears of a recession.\",\n",
- " metadata={\"source\": \"news\"},\n",
- ")\n",
- "\n",
- "document_10 = Document(\n",
- " page_content=\"I have a bad feeling I am going to get deleted :(\",\n",
- " metadata={\"source\": \"tweet\"},\n",
- ")\n",
- "\n",
- "documents = [\n",
- " document_1,\n",
- " document_2,\n",
- " document_3,\n",
- " document_4,\n",
- " document_5,\n",
- " document_6,\n",
- " document_7,\n",
- " document_8,\n",
- " document_9,\n",
- " document_10,\n",
- "]\n",
- "uuids = [str(uuid4()) for _ in range(len(documents))]\n",
- "\n",
- "vector_store.add_documents(documents=documents, ids=uuids)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "e23c22d8",
- "metadata": {
- "id": "e23c22d8"
- },
- "source": [
- "### Delete items from vector store"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "1f387fa8",
- "metadata": {
- "id": "1f387fa8",
- "outputId": "62fee30d-92c9-4efd-df8a-453545ff61d0"
- },
- "outputs": [
- {
- "data": {
- "text/plain": [
- "(insert count: 0, delete count: 1, upsert count: 0, timestamp: 0, success count: 0, err count: 0, cost: 0)"
- ]
- },
- "execution_count": 32,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "vector_store.delete(ids=[uuids[-1]])"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "fb12fa75",
- "metadata": {
- "id": "fb12fa75"
- },
- "source": [
- "## Query vector store\n",
- "\n",
- "Once your vector store has been created and the relevant documents have been added, you will most likely wish to query it during the running of your chain or agent.\n",
- "\n",
- "### Query directly\n",
- "\n",
- "#### Similarity search\n",
- "\n",
- "Performing a simple similarity search with filtering on metadata can be done as follows:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "35801a55",
- "metadata": {
- "id": "35801a55",
- "outputId": "13865abb-11a2-41ae-9ad7-44e8586fd099"
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "* Building an exciting new project with LangChain - come check it out! [{'pk': '9905001c-a4a3-455e-ab94-72d0ed11b476', 'source': 'tweet'}]\n",
- "* LangGraph is the best framework for building stateful, agentic applications! [{'pk': '1206d237-ee3a-484f-baf2-b5ac38eeb314', 'source': 'tweet'}]\n"
- ]
- }
- ],
- "source": [
- "results = vector_store.similarity_search(\n",
- " \"LangChain provides abstractions to make working with LLMs easy\",\n",
- " k=2,\n",
- " expr='source == \"tweet\"',\n",
- ")\n",
- "for res in results:\n",
- " print(f\"* {res.page_content} [{res.metadata}]\")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "35574409",
- "metadata": {
- "id": "35574409"
- },
- "source": [
- "#### Similarity search with score\n",
- "\n",
- "You can also search with score:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "c360af3d",
- "metadata": {
- "id": "c360af3d",
- "outputId": "16cb1961-9f4a-494a-9500-27b98a1158d8"
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "* [SIM=21192.628906] bar [{'pk': '2', 'source': 'https://example.com'}]\n"
- ]
- }
- ],
- "source": [
- "results = vector_store.similarity_search_with_score(\n",
- " \"Will it be hot tomorrow?\", k=1, expr='source == \"news\"'\n",
- ")\n",
- "for res, score in results:\n",
- " print(f\"* [SIM={score:3f}] {res.page_content} [{res.metadata}]\")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "14db337f",
- "metadata": {
- "id": "14db337f"
- },
- "source": [
- "For a full list of all the search options available when using the `Milvus` vector store, you can visit the [API reference](https://python.langchain.com/api_reference/milvus/vectorstores/langchain_milvus.vectorstores.milvus.Milvus.html).\n",
- "\n",
- "### Query by turning into retriever\n",
- "\n",
- "You can also transform the vector store into a retriever for easier usage in your chains."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "f6d9357c",
- "metadata": {
- "id": "f6d9357c",
- "outputId": "bcaa7620-a1c0-418f-9f54-684a472b0b55"
- },
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[Document(metadata={'pk': 'eacc7256-d7fa-4036-b1f7-83d7a4bee0c5', 'source': 'news'}, page_content='Robbers broke into the city bank and stole $1 million in cash.')]"
- ]
- },
- "execution_count": 34,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "retriever = vector_store.as_retriever(search_type=\"mmr\", search_kwargs={\"k\": 1})\n",
- "retriever.invoke(\"Stealing from the bank is a crime\", filter={\"source\": \"news\"})"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "8ac953f1",
- "metadata": {
- "id": "8ac953f1"
- },
- "source": [
- "## Usage for retrieval-augmented generation\n",
- "\n",
- "For guides on how to use this vector store for retrieval-augmented generation (RAG), see the following sections:\n",
- "\n",
- "- [Tutorials](/docs/tutorials/)\n",
- "- [How-to: Question and answer with RAG](https://python.langchain.com/docs/how_to/#qa-with-rag)\n",
- "- [Retrieval conceptual docs](https://python.langchain.com/docs/concepts/retrieval)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "7fb27b941602401d91542211134fc71a",
- "metadata": {
- "id": "7fb27b941602401d91542211134fc71a",
- "pycharm": {
- "name": "#%% md\n"
- }
- },
- "source": [
- "### Per-User Retrieval\n",
- "\n",
- "When building a retrieval app, you often have to build it with multiple users in mind. This means that you may be storing data not just for one user, but for many different users, and they should not be able to see each other’s data.\n",
- "\n",
- "Milvus recommends using [partition_key](https://milvus.io/docs/multi_tenancy.md#Partition-key-based-multi-tenancy) to implement multi-tenancy. Here is an example:\n",
- "> The feature of Partition key is now not available in Milvus Lite, if you want to use it, you need to start Milvus server, as mentioned above."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "acae54e37e7d407bbb7b55eff062a284",
- "metadata": {
- "id": "acae54e37e7d407bbb7b55eff062a284",
- "pycharm": {
- "name": "#%%\n"
- }
- },
- "outputs": [],
- "source": [
- "from langchain_core.documents import Document\n",
- "\n",
- "docs = [\n",
- " Document(page_content=\"i worked at kensho\", metadata={\"namespace\": \"harrison\"}),\n",
- " Document(page_content=\"i worked at facebook\", metadata={\"namespace\": \"ankush\"}),\n",
- "]\n",
- "vectorstore = Milvus.from_documents(\n",
- " docs,\n",
- " embeddings,\n",
- " connection_args={\"uri\": URI},\n",
- " drop_old=True,\n",
- " partition_key_field=\"namespace\", # Use the \"namespace\" field as the partition key\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "9a63283cbaf04dbcab1f6479b197f3a8",
- "metadata": {
- "id": "9a63283cbaf04dbcab1f6479b197f3a8",
- "pycharm": {
- "name": "#%% md\n"
- }
- },
- "source": [
- "To conduct a search using the partition key, you should include either of the following in the boolean expression of the search request:\n",
- "\n",
- "`search_kwargs={\"expr\": ' == \"xxxx\"'}`\n",
- "\n",
- "`search_kwargs={\"expr\": ' == in [\"xxx\", \"xxx\"]'}`\n",
- "\n",
- "Do replace `` with the name of the field that is designated as the partition key.\n",
- "\n",
- "Milvus changes to a partition based on the specified partition key, filters entities according to the partition key, and searches among the filtered entities.\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "8dd0d8092fe74a7c96281538738b07e2",
- "metadata": {
- "id": "8dd0d8092fe74a7c96281538738b07e2",
- "outputId": "e38ff0ea-1425-4f12-cfb5-7767d040397b",
- "pycharm": {
- "name": "#%%\n"
- }
- },
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[Document(page_content='i worked at facebook', metadata={'namespace': 'ankush'})]"
- ]
- },
- "execution_count": 3,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# This will only get documents for Ankush\n",
- "vectorstore.as_retriever(search_kwargs={\"expr\": 'namespace == \"ankush\"'}).invoke(\n",
- " \"where did i work?\"\n",
- ")"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "72eea5119410473aa328ad9291626812",
- "metadata": {
- "id": "72eea5119410473aa328ad9291626812",
- "outputId": "9d3ad63e-fcb9-4f9a-bdf1-1bc263ce832b",
- "pycharm": {
- "name": "#%%\n"
- }
- },
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[Document(page_content='i worked at kensho', metadata={'namespace': 'harrison'})]"
- ]
- },
- "execution_count": 4,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# This will only get documents for Harrison\n",
- "vectorstore.as_retriever(search_kwargs={\"expr\": 'namespace == \"harrison\"'}).invoke(\n",
- " \"where did i work?\"\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "f1a873c5",
- "metadata": {
- "id": "f1a873c5"
- },
- "source": [
- "## API reference\n",
- "\n",
- "For detailed documentation of all __ModuleName__VectorStore features and configurations head to the API reference: https://python.langchain.com/api_reference/milvus/vectorstores/langchain_milvus.vectorstores.milvus.Milvus.html"
- ]
- }
- ],
- "metadata": {
- "colab": {
- "provenance": []
- },
- "kernelspec": {
- "display_name": "Python 3 (ipykernel)",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.11.9"
- }
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "683953b3",
+ "metadata": {
+ "id": "683953b3"
+ },
+ "source": [
+ "# Milvus\n",
+ "\n",
+ ">[Milvus](https://milvus.io/docs/overview.md) is a database that stores, indexes, and manages massive embedding vectors generated by deep neural networks and other machine learning (ML) models.\n",
+ "\n",
+ "This notebook shows how to use functionality related to the Milvus vector database.\n",
+ "\n",
+ "## Setup\n",
+ "\n",
+ "You'll need to install `langchain-milvus` with `pip install -qU langchain-milvus` to use this integration.\n"
+ ]
},
- "nbformat": 4,
- "nbformat_minor": 5
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "id": "a62cff8a-bcf7-4e33-bbbc-76999c2e3e20",
+ "metadata": {
+ "id": "a62cff8a-bcf7-4e33-bbbc-76999c2e3e20",
+ "tags": []
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Note: you may need to restart the kernel to use updated packages.\n"
+ ]
+ }
+ ],
+ "source": [
+ "%pip install -qU langchain_milvus"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "dfd17253",
+ "metadata": {},
+ "source": [
+ "### Credentials\n",
+ "\n",
+ "No credentials are needed to use the `Milvus` vector store."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "633addc3",
+ "metadata": {
+ "id": "633addc3"
+ },
+ "source": [
+ "## Initialization\n",
+ "\n",
+ "import EmbeddingTabs from \"@theme/EmbeddingTabs\";\n",
+ "\n",
+ "<EmbeddingTabs/>\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "id": "a7dd253f",
+ "metadata": {
+ "id": "a7dd253f"
+ },
+ "outputs": [],
+ "source": [
+ "# | output: false\n",
+ "# | echo: false\n",
+ "from langchain_openai import OpenAIEmbeddings\n",
+ "\n",
+ "embeddings = OpenAIEmbeddings(model=\"text-embedding-3-large\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "50e55d42",
+ "metadata": {},
+ "source": [
+ "### Milvus Lite\n",
+ "\n",
+ "The easiest way to prototype is to use Milvus Lite, where everything is stored in a local vector database file. Only the FLAT index type is supported."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "id": "dcf88bdf",
+ "metadata": {
+ "id": "dcf88bdf",
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "from langchain_milvus import Milvus\n",
+ "\n",
+ "URI = \"./milvus_example.db\"\n",
+ "\n",
+ "vector_store = Milvus(\n",
+ " embedding_function=embeddings,\n",
+ " connection_args={\"uri\": URI},\n",
+ " index_params={\"index_type\": \"FLAT\", \"metric_type\": \"L2\"},\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "df34e8f4",
+ "metadata": {},
+ "source": [
+ "### Milvus Standalone\n",
+ "\n",
+ "If you have a large amount of data (e.g., more than a million vectors), we recommend setting up a more performant Milvus server on [Docker](https://milvus.io/docs/install_standalone-docker.md#Start-Milvus) or [Kubernetes](https://milvus.io/docs/install_cluster-milvusoperator.md).\n",
+ "\n",
+ "Milvus Standalone also supports different [indexes](https://milvus.io/docs/index.md?tab=floating), if you want to improve retrieval functionality.\n",
+ "\n",
+ "To launch the Docker container, run:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "0fddc9a6",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Password:"
+ ]
+ }
+ ],
+ "source": [
+ "!curl -sfL https://raw.githubusercontent.com/milvus-io/milvus/master/scripts/standalone_embed.sh -o standalone_embed.sh\n",
+ "\n",
+ "!bash standalone_embed.sh start"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9045ef4a",
+ "metadata": {},
+ "source": [
+ "Here we create a Milvus database:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "id": "fcff1834",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Database 'milvus_demo' does not exist.\n",
+ "Database 'milvus_demo' created successfully.\n"
+ ]
+ }
+ ],
+ "source": [
+ "from pymilvus import Collection, MilvusException, connections, db, utility\n",
+ "\n",
+ "conn = connections.connect(host=\"127.0.0.1\", port=19530)\n",
+ "\n",
+ "# Check if the database exists\n",
+ "db_name = \"milvus_demo\"\n",
+ "try:\n",
+ " existing_databases = db.list_database()\n",
+ " if db_name in existing_databases:\n",
+ " print(f\"Database '{db_name}' already exists.\")\n",
+ "\n",
+ " # Use the database context\n",
+ " db.using_database(db_name)\n",
+ "\n",
+ " # Drop all collections in the database\n",
+ " collections = utility.list_collections()\n",
+ " for collection_name in collections:\n",
+ " collection = Collection(name=collection_name)\n",
+ " collection.drop()\n",
+ " print(f\"Collection '{collection_name}' has been dropped.\")\n",
+ "\n",
+ " db.drop_database(db_name)\n",
+ " print(f\"Database '{db_name}' has been deleted.\")\n",
+ " else:\n",
+ " print(f\"Database '{db_name}' does not exist.\")\n",
+ " database = db.create_database(db_name)\n",
+ " print(f\"Database '{db_name}' created successfully.\")\n",
+ "except MilvusException as e:\n",
+ " print(f\"An error occurred: {e}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b50e3ff7",
+ "metadata": {},
+ "source": [
+ "Note the change in the URI below. Once the instance is initialized, navigate to http://127.0.0.1:9091/webui to view the local web UI.\n",
+ "\n",
+ "Here is an example of how you would use a dense embedding + the Milvus BM25 built-in function to assemble a hybrid retrieval vector store instance:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "id": "07460732",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from langchain_milvus import BM25BuiltInFunction, Milvus\n",
+ "\n",
+ "dense_index_param = {\n",
+ " \"metric_type\": \"COSINE\",\n",
+ " \"index_type\": \"HNSW\",\n",
+ "}\n",
+ "sparse_index_param = {\n",
+ " \"metric_type\": \"BM25\",\n",
+ " \"index_type\": \"AUTOINDEX\",\n",
+ "}\n",
+ "\n",
+ "URI = \"http://localhost:19530\"\n",
+ "\n",
+ "vectorstore = Milvus(\n",
+ " embedding_function=embeddings,\n",
+ " builtin_function=BM25BuiltInFunction(output_field_names=\"sparse\"),\n",
+ " index_params=[dense_index_param, sparse_index_param],\n",
+ " vector_field=[\"dense\", \"sparse\"],\n",
+ " connection_args={\"uri\": URI, \"token\": \"root:Milvus\", \"db_name\": \"milvus_demo\"},\n",
+ " consistency_level=\"Strong\",\n",
+ " drop_old=False, # set to True to drop an existing collection with this name\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "cae1a7d5",
+ "metadata": {
+ "id": "cae1a7d5"
+ },
+ "source": [
+ "### Compartmentalize the data with Milvus Collections\n",
+ "\n",
+ "You can store unrelated documents in different collections within the same Milvus instance.\n",
+ "\n",
+ "Here's how you can create a new collection:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c6f4973d",
+ "metadata": {
+ "id": "c6f4973d"
+ },
+ "outputs": [],
+ "source": [
+ "from langchain_core.documents import Document\n",
+ "\n",
+ "vector_store_saved = Milvus.from_documents(\n",
+ " [Document(page_content=\"foo!\")],\n",
+ " embeddings,\n",
+ " collection_name=\"langchain_example\",\n",
+ " connection_args={\"uri\": URI},\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3b12df8c",
+ "metadata": {
+ "id": "3b12df8c"
+ },
+ "source": [
+ "And here is how you retrieve that stored collection:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "12817d16",
+ "metadata": {
+ "id": "12817d16"
+ },
+ "outputs": [],
+ "source": [
+ "vector_store_loaded = Milvus(\n",
+ " embeddings,\n",
+ " connection_args={\"uri\": URI},\n",
+ " collection_name=\"langchain_example\",\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f1fc3818",
+ "metadata": {
+ "id": "f1fc3818"
+ },
+ "source": [
+ "## Manage vector store\n",
+ "\n",
+ "Once you have created your vector store, you can interact with it by adding and deleting items.\n",
+ "\n",
+ "### Add items to vector store\n",
+ "\n",
+ "We can add items to our vector store by using the `add_documents` function."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "3ced24f6",
+ "metadata": {
+ "id": "3ced24f6",
+ "outputId": "9c57a6bb-86eb-456c-f007-6cabd6865299"
+ },
+ "outputs": [],
+ "source": [
+ "from uuid import uuid4\n",
+ "\n",
+ "from langchain_core.documents import Document\n",
+ "\n",
+ "document_1 = Document(\n",
+ " page_content=\"I had chocolate chip pancakes and scrambled eggs for breakfast this morning.\",\n",
+ " metadata={\"source\": \"tweet\"},\n",
+ ")\n",
+ "\n",
+ "document_2 = Document(\n",
+ " page_content=\"The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.\",\n",
+ " metadata={\"source\": \"news\"},\n",
+ ")\n",
+ "\n",
+ "document_3 = Document(\n",
+ " page_content=\"Building an exciting new project with LangChain - come check it out!\",\n",
+ " metadata={\"source\": \"tweet\"},\n",
+ ")\n",
+ "\n",
+ "document_4 = Document(\n",
+ " page_content=\"Robbers broke into the city bank and stole $1 million in cash.\",\n",
+ " metadata={\"source\": \"news\"},\n",
+ ")\n",
+ "\n",
+ "document_5 = Document(\n",
+ " page_content=\"Wow! That was an amazing movie. I can't wait to see it again.\",\n",
+ " metadata={\"source\": \"tweet\"},\n",
+ ")\n",
+ "\n",
+ "document_6 = Document(\n",
+ " page_content=\"Is the new iPhone worth the price? Read this review to find out.\",\n",
+ " metadata={\"source\": \"website\"},\n",
+ ")\n",
+ "\n",
+ "document_7 = Document(\n",
+ " page_content=\"The top 10 soccer players in the world right now.\",\n",
+ " metadata={\"source\": \"website\"},\n",
+ ")\n",
+ "\n",
+ "document_8 = Document(\n",
+ " page_content=\"LangGraph is the best framework for building stateful, agentic applications!\",\n",
+ " metadata={\"source\": \"tweet\"},\n",
+ ")\n",
+ "\n",
+ "document_9 = Document(\n",
+ " page_content=\"The stock market is down 500 points today due to fears of a recession.\",\n",
+ " metadata={\"source\": \"news\"},\n",
+ ")\n",
+ "\n",
+ "document_10 = Document(\n",
+ " page_content=\"I have a bad feeling I am going to get deleted :(\",\n",
+ " metadata={\"source\": \"tweet\"},\n",
+ ")\n",
+ "\n",
+ "documents = [\n",
+ " document_1,\n",
+ " document_2,\n",
+ " document_3,\n",
+ " document_4,\n",
+ " document_5,\n",
+ " document_6,\n",
+ " document_7,\n",
+ " document_8,\n",
+ " document_9,\n",
+ " document_10,\n",
+ "]\n",
+ "uuids = [str(uuid4()) for _ in range(len(documents))]\n",
+ "\n",
+ "vector_store.add_documents(documents=documents, ids=uuids)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e23c22d8",
+ "metadata": {
+ "id": "e23c22d8"
+ },
+ "source": [
+ "### Delete items from vector store"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "1f387fa8",
+ "metadata": {
+ "id": "1f387fa8",
+ "outputId": "62fee30d-92c9-4efd-df8a-453545ff61d0"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "(insert count: 0, delete count: 1, upsert count: 0, timestamp: 0, success count: 0, err count: 0, cost: 0)"
+ ]
+ },
+ "execution_count": 32,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "vector_store.delete(ids=[uuids[-1]])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "fb12fa75",
+ "metadata": {
+ "id": "fb12fa75"
+ },
+ "source": [
+ "## Query vector store\n",
+ "\n",
+ "Once your vector store has been created and the relevant documents have been added, you will most likely wish to query it while running your chain or agent.\n",
+ "\n",
+ "### Query directly\n",
+ "\n",
+ "#### Similarity search\n",
+ "\n",
+ "Performing a simple similarity search with filtering on metadata can be done as follows:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "35801a55",
+ "metadata": {
+ "id": "35801a55",
+ "outputId": "13865abb-11a2-41ae-9ad7-44e8586fd099"
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "* Building an exciting new project with LangChain - come check it out! [{'pk': '9905001c-a4a3-455e-ab94-72d0ed11b476', 'source': 'tweet'}]\n",
+ "* LangGraph is the best framework for building stateful, agentic applications! [{'pk': '1206d237-ee3a-484f-baf2-b5ac38eeb314', 'source': 'tweet'}]\n"
+ ]
+ }
+ ],
+ "source": [
+ "results = vector_store.similarity_search(\n",
+ " \"LangChain provides abstractions to make working with LLMs easy\",\n",
+ " k=2,\n",
+ " expr='source == \"tweet\"',\n",
+ ")\n",
+ "for res in results:\n",
+ " print(f\"* {res.page_content} [{res.metadata}]\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "35574409",
+ "metadata": {
+ "id": "35574409"
+ },
+ "source": [
+ "#### Similarity search with score\n",
+ "\n",
+ "You can also search with score:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c360af3d",
+ "metadata": {
+ "id": "c360af3d",
+ "outputId": "16cb1961-9f4a-494a-9500-27b98a1158d8"
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "* [SIM=21192.628906] bar [{'pk': '2', 'source': 'https://example.com'}]\n"
+ ]
+ }
+ ],
+ "source": [
+ "results = vector_store.similarity_search_with_score(\n",
+ " \"Will it be hot tomorrow?\", k=1, expr='source == \"news\"'\n",
+ ")\n",
+ "for res, score in results:\n",
+ " print(f\"* [SIM={score:3f}] {res.page_content} [{res.metadata}]\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "14db337f",
+ "metadata": {
+ "id": "14db337f"
+ },
+ "source": [
+ "For a full list of all the search options available when using the `Milvus` vector store, you can visit the [API reference](https://python.langchain.com/api_reference/milvus/vectorstores/langchain_milvus.vectorstores.milvus.Milvus.html).\n",
+ "\n",
+ "### Query by turning into retriever\n",
+ "\n",
+ "You can also transform the vector store into a retriever for easier usage in your chains."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f6d9357c",
+ "metadata": {
+ "id": "f6d9357c",
+ "outputId": "bcaa7620-a1c0-418f-9f54-684a472b0b55"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[Document(metadata={'pk': 'eacc7256-d7fa-4036-b1f7-83d7a4bee0c5', 'source': 'news'}, page_content='Robbers broke into the city bank and stole $1 million in cash.')]"
+ ]
+ },
+ "execution_count": 34,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "retriever = vector_store.as_retriever(search_type=\"mmr\", search_kwargs={\"k\": 1})\n",
+ "retriever.invoke(\"Stealing from the bank is a crime\", filter={\"source\": \"news\"})"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8ac953f1",
+ "metadata": {
+ "id": "8ac953f1"
+ },
+ "source": [
+ "## Usage for retrieval-augmented generation\n",
+ "\n",
+ "For guides on how to use this vector store for retrieval-augmented generation (RAG), see the following sections:\n",
+ "\n",
+ "- [Tutorials](/docs/tutorials/)\n",
+ "- [How-to: Question and answer with RAG](https://python.langchain.com/docs/how_to/#qa-with-rag)\n",
+ "- [Retrieval conceptual docs](https://python.langchain.com/docs/concepts/retrieval)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7fb27b941602401d91542211134fc71a",
+ "metadata": {
+ "id": "7fb27b941602401d91542211134fc71a",
+ "pycharm": {
+ "name": "#%% md\n"
+ }
+ },
+ "source": [
+ "### Per-User Retrieval\n",
+ "\n",
+ "When building a retrieval app, you often have to build it with multiple users in mind. This means that you may be storing data not just for one user, but for many different users, and they should not be able to see each other’s data.\n",
+ "\n",
+ "Milvus recommends using [partition_key](https://milvus.io/docs/multi_tenancy.md#Partition-key-based-multi-tenancy) to implement multi-tenancy. Here is an example:\n",
+ "> The partition key feature is not available in Milvus Lite; to use it, you need to start a Milvus server, as mentioned above."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "acae54e37e7d407bbb7b55eff062a284",
+ "metadata": {
+ "id": "acae54e37e7d407bbb7b55eff062a284",
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "from langchain_core.documents import Document\n",
+ "\n",
+ "docs = [\n",
+ " Document(page_content=\"i worked at kensho\", metadata={\"namespace\": \"harrison\"}),\n",
+ " Document(page_content=\"i worked at facebook\", metadata={\"namespace\": \"ankush\"}),\n",
+ "]\n",
+ "vectorstore = Milvus.from_documents(\n",
+ " docs,\n",
+ " embeddings,\n",
+ " connection_args={\"uri\": URI},\n",
+ " drop_old=True,\n",
+ " partition_key_field=\"namespace\", # Use the \"namespace\" field as the partition key\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9a63283cbaf04dbcab1f6479b197f3a8",
+ "metadata": {
+ "id": "9a63283cbaf04dbcab1f6479b197f3a8",
+ "pycharm": {
+ "name": "#%% md\n"
+ }
+ },
+ "source": [
+ "To conduct a search using the partition key, include either of the following in the boolean expression of the search request:\n",
+ "\n",
+ "`search_kwargs={\"expr\": '<partition_key> == \"xxxx\"'}`\n",
+ "\n",
+ "`search_kwargs={\"expr\": '<partition_key> in [\"xxx\", \"xxx\"]'}`\n",
+ "\n",
+ "Replace `<partition_key>` with the name of the field designated as the partition key.\n",
+ "\n",
+ "Milvus switches to the partition that matches the specified partition key, filters entities by that key, and searches among the filtered entities.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "8dd0d8092fe74a7c96281538738b07e2",
+ "metadata": {
+ "id": "8dd0d8092fe74a7c96281538738b07e2",
+ "outputId": "e38ff0ea-1425-4f12-cfb5-7767d040397b",
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[Document(page_content='i worked at facebook', metadata={'namespace': 'ankush'})]"
+ ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# This will only get documents for Ankush\n",
+ "vectorstore.as_retriever(search_kwargs={\"expr\": 'namespace == \"ankush\"'}).invoke(\n",
+ " \"where did i work?\"\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "72eea5119410473aa328ad9291626812",
+ "metadata": {
+ "id": "72eea5119410473aa328ad9291626812",
+ "outputId": "9d3ad63e-fcb9-4f9a-bdf1-1bc263ce832b",
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[Document(page_content='i worked at kensho', metadata={'namespace': 'harrison'})]"
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# This will only get documents for Harrison\n",
+ "vectorstore.as_retriever(search_kwargs={\"expr\": 'namespace == \"harrison\"'}).invoke(\n",
+ " \"where did i work?\"\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f1a873c5",
+ "metadata": {
+ "id": "f1a873c5"
+ },
+ "source": [
+ "## API reference\n",
+ "\n",
+ "For detailed documentation of all `Milvus` vector store features and configurations head to the API reference: https://python.langchain.com/api_reference/milvus/vectorstores/langchain_milvus.vectorstores.milvus.Milvus.html"
+ ]
+ }
+ ],
+ "metadata": {
+ "colab": {
+ "provenance": []
+ },
+ "kernelspec": {
+ "display_name": ".venv",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.11.9"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
}
diff --git a/docs/src/theme/FeatureTables.js b/docs/src/theme/FeatureTables.js
index c6420a29f49..73e9bfaa090 100644
--- a/docs/src/theme/FeatureTables.js
+++ b/docs/src/theme/FeatureTables.js
@@ -888,6 +888,13 @@ const FEATURE_TABLES = {
api: "Package",
apiLink: "https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.pdf.PyMuPDFLoader.html"
},
+ {
+ name: "PyMuPDF4LLM",
+ link: "pymupdf4llm",
+ source: "Load PDF content to Markdown using PyMuPDF4LLM",
+ api: "Package",
+ apiLink: "https://github.com/lakinduboteju/langchain-pymupdf4llm"
+ },
{
name: "PDFMiner",
link: "pdfminer",
diff --git a/libs/community/langchain_community/vectorstores/sqlitevec.py b/libs/community/langchain_community/vectorstores/sqlitevec.py
index 52da1942f5a..e8ea7b60ec6 100644
--- a/libs/community/langchain_community/vectorstores/sqlitevec.py
+++ b/libs/community/langchain_community/vectorstores/sqlitevec.py
@@ -95,7 +95,7 @@ class SQLiteVec(VectorStore):
)
self._connection.execute(
f"""
- CREATE TRIGGER IF NOT EXISTS embed_text
+ CREATE TRIGGER IF NOT EXISTS {self._table}_embed_text
AFTER INSERT ON {self._table}
BEGIN
INSERT INTO {self._table}_vec(rowid, text_embedding)
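The rename from `embed_text` to `{self._table}_embed_text` above matters because SQLite trigger names live in a single database-wide namespace, not per table: with a fixed name, the second `CREATE TRIGGER IF NOT EXISTS` is silently skipped, and inserts into the second table are never mirrored. A minimal stdlib sketch of the failure and the fix (the table and log names here are hypothetical stand-ins):

```python
import sqlite3


def make_tables(conn, trigger_name_for):
    """Create two tables, each with a log table and an AFTER INSERT trigger."""
    for table in ("table_1", "table_2"):
        conn.execute(f"CREATE TABLE {table}(text TEXT)")
        conn.execute(f"CREATE TABLE {table}_log(text TEXT)")
        conn.execute(
            f"""
            CREATE TRIGGER IF NOT EXISTS {trigger_name_for(table)}
            AFTER INSERT ON {table}
            BEGIN
                INSERT INTO {table}_log(text) VALUES (new.text);
            END
            """
        )


# Before the fix: one shared trigger name, so the second CREATE is skipped.
before = sqlite3.connect(":memory:")
make_tables(before, lambda table: "embed_text")
before.execute("INSERT INTO table_2(text) VALUES ('foo')")
print(before.execute("SELECT count(*) FROM table_2_log").fetchone()[0])  # 0

# After the fix: table-qualified trigger names, so both triggers exist.
after = sqlite3.connect(":memory:")
make_tables(after, lambda table: f"{table}_embed_text")
after.execute("INSERT INTO table_2(text) VALUES ('foo')")
print(after.execute("SELECT count(*) FROM table_2_log").fetchone()[0])  # 1
```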
diff --git a/libs/community/tests/integration_tests/vectorstores/test_sqlitevec.py b/libs/community/tests/integration_tests/vectorstores/test_sqlitevec.py
index f7c67ba5299..01073f4c11a 100644
--- a/libs/community/tests/integration_tests/vectorstores/test_sqlitevec.py
+++ b/libs/community/tests/integration_tests/vectorstores/test_sqlitevec.py
@@ -56,3 +56,27 @@ def test_sqlitevec_add_extra() -> None:
docsearch.add_texts(texts, metadatas)
output = docsearch.similarity_search("foo", k=10)
assert len(output) == 6
+
+
+@pytest.mark.requires("sqlite-vec")
+def test_sqlitevec_search_multiple_tables() -> None:
+ """Test end to end construction and search with multiple tables."""
+ docsearch_1 = SQLiteVec.from_texts(
+ fake_texts,
+ FakeEmbeddings(),
+ table="table_1",
+ db_file=":memory:", ## change to local storage for testing
+ )
+
+ docsearch_2 = SQLiteVec.from_texts(
+ fake_texts,
+ FakeEmbeddings(),
+ table="table_2",
+ db_file=":memory:",
+ )
+
+ output_1 = docsearch_1.similarity_search("foo", k=1)
+ output_2 = docsearch_2.similarity_search("foo", k=1)
+
+ assert output_1 == [Document(page_content="foo", metadata={})]
+ assert output_2 == [Document(page_content="foo", metadata={})]
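One caveat about the test above: each `sqlite3` connection to `":memory:"` opens its own private database, so the two stores never actually share state (presumably why the comment suggests switching to local storage). A quick stdlib sketch of that isolation:

```python
import sqlite3

# Every connection to ":memory:" gets a brand-new, private database,
# so state created on one connection is invisible to the other.
conn_a = sqlite3.connect(":memory:")
conn_b = sqlite3.connect(":memory:")

conn_a.execute("CREATE TABLE t(x INTEGER)")
conn_a.execute("INSERT INTO t VALUES (1)")

# conn_b has no table `t`: its sqlite_master is empty.
tables_b = conn_b.execute(
    "SELECT name FROM sqlite_master WHERE type='table'"
).fetchall()
print(tables_b)  # []
```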
diff --git a/libs/core/langchain_core/callbacks/base.py b/libs/core/langchain_core/callbacks/base.py
index 98fd824ec18..1c0fd19de59 100644
--- a/libs/core/langchain_core/callbacks/base.py
+++ b/libs/core/langchain_core/callbacks/base.py
@@ -3,13 +3,14 @@
from __future__ import annotations
import logging
-from collections.abc import Sequence
from typing import TYPE_CHECKING, Any, Optional, TypeVar, Union
-from uuid import UUID
-
-from tenacity import RetryCallState
if TYPE_CHECKING:
+ from collections.abc import Sequence
+ from uuid import UUID
+
+ from tenacity import RetryCallState
+
from langchain_core.agents import AgentAction, AgentFinish
from langchain_core.documents import Document
from langchain_core.messages import BaseMessage
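The import reshuffle above applies the standard `TYPE_CHECKING` idiom: names used only in annotations are imported solely for static analyzers, trimming runtime import cost and avoiding circular imports. Combined with `from __future__ import annotations`, the annotations stay strings, so the names need not exist at runtime. A minimal sketch (the function is illustrative only):

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Imported only for static type checkers; never executed at runtime.
    from collections.abc import Sequence


def first_len(items: Sequence[str]) -> int:
    # With postponed evaluation, the annotation stays a string, so
    # Sequence does not need to be importable when this runs.
    return len(items[0])


print(first_len(["abc", "d"]))  # 3
```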
diff --git a/libs/core/langchain_core/callbacks/file.py b/libs/core/langchain_core/callbacks/file.py
index 961b0c9bc24..cd20fbe4f71 100644
--- a/libs/core/langchain_core/callbacks/file.py
+++ b/libs/core/langchain_core/callbacks/file.py
@@ -2,12 +2,14 @@
from __future__ import annotations
-from typing import Any, Optional, TextIO, cast
+from typing import TYPE_CHECKING, Any, Optional, TextIO, cast
-from langchain_core.agents import AgentAction, AgentFinish
from langchain_core.callbacks import BaseCallbackHandler
from langchain_core.utils.input import print_text
+if TYPE_CHECKING:
+ from langchain_core.agents import AgentAction, AgentFinish
+
class FileCallbackHandler(BaseCallbackHandler):
"""Callback Handler that writes to a file.
@@ -45,9 +47,15 @@ class FileCallbackHandler(BaseCallbackHandler):
inputs (Dict[str, Any]): The inputs to the chain.
**kwargs (Any): Additional keyword arguments.
"""
- class_name = serialized.get("name", serialized.get("id", [""])[-1])
+        if "name" in kwargs:
+            name = kwargs["name"]
+        elif serialized:
+            name = serialized.get("name", serialized.get("id", [""])[-1])
+        else:
+            name = ""
print_text(
- f"\n\n\033[1m> Entering new {class_name} chain...\033[0m",
+ f"\n\n\033[1m> Entering new {name} chain...\033[0m",
end="\n",
file=self.file,
)
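The `on_chain_start` change makes chain-name resolution tolerant of a missing `serialized` dict: an explicit `name` kwarg wins, then the serialized `name`/`id` fields, then an empty string. The lookup order can be sketched as a standalone helper (the function name is hypothetical, not part of the PR):

```python
from typing import Any, Optional


def resolve_chain_name(
    serialized: Optional[dict[str, Any]], **kwargs: Any
) -> str:
    """Mirror the fallback order used by FileCallbackHandler.on_chain_start."""
    if "name" in kwargs:
        return kwargs["name"]  # an explicit name always wins
    if serialized:
        # Fall back to the serialized name, then the last id segment.
        return serialized.get("name", serialized.get("id", [""])[-1])
    return ""  # serialized may now be None or empty


print(resolve_chain_name({"id": ["langchain", "chains", "LLMChain"]}))  # LLMChain
print(resolve_chain_name(None, name="my_chain"))  # my_chain
print(repr(resolve_chain_name(None)))  # ''
```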
diff --git a/libs/core/langchain_core/callbacks/manager.py b/libs/core/langchain_core/callbacks/manager.py
index f675b230c87..689003de3e6 100644
--- a/libs/core/langchain_core/callbacks/manager.py
+++ b/libs/core/langchain_core/callbacks/manager.py
@@ -5,7 +5,6 @@ import functools
import logging
import uuid
from abc import ABC, abstractmethod
-from collections.abc import AsyncGenerator, Coroutine, Generator, Sequence
from concurrent.futures import ThreadPoolExecutor
from contextlib import asynccontextmanager, contextmanager
from contextvars import copy_context
@@ -21,7 +20,6 @@ from typing import (
from uuid import UUID
from langsmith.run_helpers import get_tracing_context
-from tenacity import RetryCallState
from langchain_core.callbacks.base import (
BaseCallbackHandler,
@@ -39,6 +37,10 @@ from langchain_core.tracers.schemas import Run
from langchain_core.utils.env import env_var_is_set
if TYPE_CHECKING:
+ from collections.abc import AsyncGenerator, Coroutine, Generator, Sequence
+
+ from tenacity import RetryCallState
+
from langchain_core.agents import AgentAction, AgentFinish
from langchain_core.documents import Document
from langchain_core.outputs import ChatGenerationChunk, GenerationChunk, LLMResult
diff --git a/libs/core/langchain_core/chat_history.py b/libs/core/langchain_core/chat_history.py
index 1b5c1e7f782..9e74dfca2bb 100644
--- a/libs/core/langchain_core/chat_history.py
+++ b/libs/core/langchain_core/chat_history.py
@@ -17,8 +17,7 @@
from __future__ import annotations
from abc import ABC, abstractmethod
-from collections.abc import Sequence
-from typing import Union
+from typing import TYPE_CHECKING, Union
from pydantic import BaseModel, Field
@@ -29,6 +28,9 @@ from langchain_core.messages import (
get_buffer_string,
)
+if TYPE_CHECKING:
+ from collections.abc import Sequence
+
class BaseChatMessageHistory(ABC):
"""Abstract base class for storing chat message history.
diff --git a/libs/core/langchain_core/document_loaders/base.py b/libs/core/langchain_core/document_loaders/base.py
index b2cd20038eb..d889fbec8f9 100644
--- a/libs/core/langchain_core/document_loaders/base.py
+++ b/libs/core/langchain_core/document_loaders/base.py
@@ -3,16 +3,17 @@
from __future__ import annotations
from abc import ABC, abstractmethod
-from collections.abc import AsyncIterator, Iterator
from typing import TYPE_CHECKING, Optional
-from langchain_core.documents import Document
from langchain_core.runnables import run_in_executor
if TYPE_CHECKING:
+ from collections.abc import AsyncIterator, Iterator
+
from langchain_text_splitters import TextSplitter
-from langchain_core.documents.base import Blob
+ from langchain_core.documents import Document
+ from langchain_core.documents.base import Blob
class BaseLoader(ABC): # noqa: B024
diff --git a/libs/core/langchain_core/document_loaders/blob_loaders.py b/libs/core/langchain_core/document_loaders/blob_loaders.py
index 3c0d1986f73..098c325a50b 100644
--- a/libs/core/langchain_core/document_loaders/blob_loaders.py
+++ b/libs/core/langchain_core/document_loaders/blob_loaders.py
@@ -8,12 +8,15 @@ In addition, content loading code should provide a lazy loading interface by def
from __future__ import annotations
from abc import ABC, abstractmethod
-from collections.abc import Iterable
+from typing import TYPE_CHECKING
# Re-export Blob and PathLike for backwards compatibility
from langchain_core.documents.base import Blob as Blob
from langchain_core.documents.base import PathLike as PathLike
+if TYPE_CHECKING:
+ from collections.abc import Iterable
+
class BlobLoader(ABC):
"""Abstract interface for blob loaders implementation.
diff --git a/libs/core/langchain_core/documents/base.py b/libs/core/langchain_core/documents/base.py
index 2adfe1a7183..fb4fcd0987e 100644
--- a/libs/core/langchain_core/documents/base.py
+++ b/libs/core/langchain_core/documents/base.py
@@ -2,15 +2,17 @@ from __future__ import annotations
import contextlib
import mimetypes
-from collections.abc import Generator
from io import BufferedReader, BytesIO
from pathlib import PurePath
-from typing import Any, Literal, Optional, Union, cast
+from typing import TYPE_CHECKING, Any, Literal, Optional, Union, cast
from pydantic import ConfigDict, Field, field_validator, model_validator
from langchain_core.load.serializable import Serializable
+if TYPE_CHECKING:
+ from collections.abc import Generator
+
PathLike = Union[str, PurePath]
diff --git a/libs/core/langchain_core/documents/compressor.py b/libs/core/langchain_core/documents/compressor.py
index 31ae2901a7b..3c9217e4f7e 100644
--- a/libs/core/langchain_core/documents/compressor.py
+++ b/libs/core/langchain_core/documents/compressor.py
@@ -1,15 +1,18 @@
from __future__ import annotations
from abc import ABC, abstractmethod
-from collections.abc import Sequence
-from typing import Optional
+from typing import TYPE_CHECKING, Optional
from pydantic import BaseModel
-from langchain_core.callbacks import Callbacks
-from langchain_core.documents import Document
from langchain_core.runnables import run_in_executor
+if TYPE_CHECKING:
+ from collections.abc import Sequence
+
+ from langchain_core.callbacks import Callbacks
+ from langchain_core.documents import Document
+
class BaseDocumentCompressor(BaseModel, ABC):
"""Base class for document compressors.
diff --git a/libs/core/langchain_core/documents/transformers.py b/libs/core/langchain_core/documents/transformers.py
index 12167f820f9..5be5e98c060 100644
--- a/libs/core/langchain_core/documents/transformers.py
+++ b/libs/core/langchain_core/documents/transformers.py
@@ -1,12 +1,13 @@
from __future__ import annotations
from abc import ABC, abstractmethod
-from collections.abc import Sequence
from typing import TYPE_CHECKING, Any
from langchain_core.runnables.config import run_in_executor
if TYPE_CHECKING:
+ from collections.abc import Sequence
+
from langchain_core.documents import Document
diff --git a/libs/core/langchain_core/example_selectors/semantic_similarity.py b/libs/core/langchain_core/example_selectors/semantic_similarity.py
index b27122ec36d..cd6f82bc8c0 100644
--- a/libs/core/langchain_core/example_selectors/semantic_similarity.py
+++ b/libs/core/langchain_core/example_selectors/semantic_similarity.py
@@ -7,11 +7,11 @@ from typing import TYPE_CHECKING, Any, Optional
from pydantic import BaseModel, ConfigDict
-from langchain_core.documents import Document
from langchain_core.example_selectors.base import BaseExampleSelector
from langchain_core.vectorstores import VectorStore
if TYPE_CHECKING:
+ from langchain_core.documents import Document
from langchain_core.embeddings import Embeddings
diff --git a/libs/core/langchain_core/indexing/base.py b/libs/core/langchain_core/indexing/base.py
index d2a7d09e58a..62e0a720c0a 100644
--- a/libs/core/langchain_core/indexing/base.py
+++ b/libs/core/langchain_core/indexing/base.py
@@ -3,14 +3,17 @@ from __future__ import annotations
import abc
import time
from abc import ABC, abstractmethod
-from collections.abc import Sequence
-from typing import Any, Optional, TypedDict
+from typing import TYPE_CHECKING, Any, Optional, TypedDict
from langchain_core._api import beta
-from langchain_core.documents import Document
from langchain_core.retrievers import BaseRetriever
from langchain_core.runnables import run_in_executor
+if TYPE_CHECKING:
+ from collections.abc import Sequence
+
+ from langchain_core.documents import Document
+
class RecordManager(ABC):
"""Abstract base class representing the interface for a record manager.
diff --git a/libs/core/langchain_core/language_models/chat_models.py b/libs/core/langchain_core/language_models/chat_models.py
index 1bba3ffd5c8..d380833ae46 100644
--- a/libs/core/langchain_core/language_models/chat_models.py
+++ b/libs/core/langchain_core/language_models/chat_models.py
@@ -4,7 +4,6 @@ import asyncio
import inspect
import json
import typing
-import uuid
import warnings
from abc import ABC, abstractmethod
from collections.abc import AsyncIterator, Iterator, Sequence
@@ -70,6 +69,8 @@ from langchain_core.utils.function_calling import convert_to_openai_tool
from langchain_core.utils.pydantic import TypeBaseModel, is_basemodel_subclass
if TYPE_CHECKING:
+ import uuid
+
from langchain_core.output_parsers.base import OutputParserLike
from langchain_core.runnables import Runnable, RunnableConfig
from langchain_core.tools import BaseTool
diff --git a/libs/core/langchain_core/language_models/llms.py b/libs/core/langchain_core/language_models/llms.py
index 4ba16f51696..9af4d4e9ac5 100644
--- a/libs/core/langchain_core/language_models/llms.py
+++ b/libs/core/langchain_core/language_models/llms.py
@@ -7,12 +7,12 @@ import functools
import inspect
import json
import logging
-import uuid
import warnings
from abc import ABC, abstractmethod
from collections.abc import AsyncIterator, Iterator, Sequence
from pathlib import Path
from typing import (
+ TYPE_CHECKING,
Any,
Callable,
Optional,
@@ -61,6 +61,9 @@ from langchain_core.prompt_values import ChatPromptValue, PromptValue, StringPro
from langchain_core.runnables import RunnableConfig, ensure_config, get_config_list
from langchain_core.runnables.config import run_in_executor
+if TYPE_CHECKING:
+ import uuid
+
logger = logging.getLogger(__name__)
diff --git a/libs/core/langchain_core/messages/base.py b/libs/core/langchain_core/messages/base.py
index 5614f6fb7dd..a4693e48c32 100644
--- a/libs/core/langchain_core/messages/base.py
+++ b/libs/core/langchain_core/messages/base.py
@@ -1,6 +1,5 @@
from __future__ import annotations
-from collections.abc import Sequence
from typing import TYPE_CHECKING, Any, Optional, Union, cast
from pydantic import ConfigDict, Field, field_validator
@@ -11,6 +10,8 @@ from langchain_core.utils._merge import merge_dicts, merge_lists
from langchain_core.utils.interactive_env import is_interactive_env
if TYPE_CHECKING:
+ from collections.abc import Sequence
+
from langchain_core.prompts.chat import ChatPromptTemplate
diff --git a/libs/core/langchain_core/output_parsers/list.py b/libs/core/langchain_core/output_parsers/list.py
index bedbdf47b7a..6977079f5ae 100644
--- a/libs/core/langchain_core/output_parsers/list.py
+++ b/libs/core/langchain_core/output_parsers/list.py
@@ -4,14 +4,16 @@ import csv
import re
from abc import abstractmethod
from collections import deque
-from collections.abc import AsyncIterator, Iterator
from io import StringIO
+from typing import TYPE_CHECKING, TypeVar, Union
from typing import Optional as Optional
-from typing import TypeVar, Union
from langchain_core.messages import BaseMessage
from langchain_core.output_parsers.transform import BaseTransformOutputParser
+if TYPE_CHECKING:
+ from collections.abc import AsyncIterator, Iterator
+
T = TypeVar("T")
diff --git a/libs/core/langchain_core/output_parsers/transform.py b/libs/core/langchain_core/output_parsers/transform.py
index 0636d2d661b..d484b6d06cd 100644
--- a/libs/core/langchain_core/output_parsers/transform.py
+++ b/libs/core/langchain_core/output_parsers/transform.py
@@ -1,6 +1,5 @@
from __future__ import annotations
-from collections.abc import AsyncIterator, Iterator
from typing import (
TYPE_CHECKING,
Any,
@@ -19,6 +18,8 @@ from langchain_core.outputs import (
from langchain_core.runnables.config import run_in_executor
if TYPE_CHECKING:
+ from collections.abc import AsyncIterator, Iterator
+
from langchain_core.runnables import RunnableConfig
diff --git a/libs/core/langchain_core/outputs/chat_generation.py b/libs/core/langchain_core/outputs/chat_generation.py
index d40e1fd5362..9eac7514d4e 100644
--- a/libs/core/langchain_core/outputs/chat_generation.py
+++ b/libs/core/langchain_core/outputs/chat_generation.py
@@ -1,14 +1,16 @@
from __future__ import annotations
-from typing import Literal, Union
+from typing import TYPE_CHECKING, Literal, Union
from pydantic import model_validator
-from typing_extensions import Self
from langchain_core.messages import BaseMessage, BaseMessageChunk
from langchain_core.outputs.generation import Generation
from langchain_core.utils._merge import merge_dicts
+if TYPE_CHECKING:
+ from typing_extensions import Self
+
class ChatGeneration(Generation):
"""A single chat generation output.
diff --git a/libs/core/langchain_core/prompts/chat.py b/libs/core/langchain_core/prompts/chat.py
index 80d427c8be6..c819bbe06d7 100644
--- a/libs/core/langchain_core/prompts/chat.py
+++ b/libs/core/langchain_core/prompts/chat.py
@@ -3,9 +3,8 @@
from __future__ import annotations
from abc import ABC, abstractmethod
-from collections.abc import Sequence
-from pathlib import Path
from typing import (
+ TYPE_CHECKING,
Annotated,
Any,
Optional,
@@ -47,6 +46,10 @@ from langchain_core.prompts.string import (
from langchain_core.utils import get_colored_text
from langchain_core.utils.interactive_env import is_interactive_env
+if TYPE_CHECKING:
+ from collections.abc import Sequence
+ from pathlib import Path
+
class BaseMessagePromptTemplate(Serializable, ABC):
"""Base class for message prompt templates."""
diff --git a/libs/core/langchain_core/prompts/few_shot.py b/libs/core/langchain_core/prompts/few_shot.py
index 5794a913325..f14699b47bf 100644
--- a/libs/core/langchain_core/prompts/few_shot.py
+++ b/libs/core/langchain_core/prompts/few_shot.py
@@ -2,8 +2,7 @@
from __future__ import annotations
-from pathlib import Path
-from typing import Any, Literal, Optional, Union
+from typing import TYPE_CHECKING, Any, Literal, Optional, Union
from pydantic import (
BaseModel,
@@ -11,7 +10,6 @@ from pydantic import (
Field,
model_validator,
)
-from typing_extensions import Self
from langchain_core.example_selectors import BaseExampleSelector
from langchain_core.messages import BaseMessage, get_buffer_string
@@ -27,6 +25,11 @@ from langchain_core.prompts.string import (
get_template_variables,
)
+if TYPE_CHECKING:
+ from pathlib import Path
+
+ from typing_extensions import Self
+
class _FewShotPromptTemplateMixin(BaseModel):
"""Prompt template that contains few shot examples."""
diff --git a/libs/core/langchain_core/prompts/prompt.py b/libs/core/langchain_core/prompts/prompt.py
index 37f7eda64ac..888fc9ccbc9 100644
--- a/libs/core/langchain_core/prompts/prompt.py
+++ b/libs/core/langchain_core/prompts/prompt.py
@@ -3,8 +3,7 @@
from __future__ import annotations
import warnings
-from pathlib import Path
-from typing import Any, Optional, Union
+from typing import TYPE_CHECKING, Any, Optional, Union
from pydantic import BaseModel, model_validator
@@ -16,7 +15,11 @@ from langchain_core.prompts.string import (
get_template_variables,
mustache_schema,
)
-from langchain_core.runnables.config import RunnableConfig
+
+if TYPE_CHECKING:
+ from pathlib import Path
+
+ from langchain_core.runnables.config import RunnableConfig
class PromptTemplate(StringPromptTemplate):
diff --git a/libs/core/langchain_core/runnables/base.py b/libs/core/langchain_core/runnables/base.py
index 1840e3dbe7b..33f12137776 100644
--- a/libs/core/langchain_core/runnables/base.py
+++ b/libs/core/langchain_core/runnables/base.py
@@ -60,7 +60,6 @@ from langchain_core.runnables.config import (
run_in_executor,
)
from langchain_core.runnables.graph import Graph
-from langchain_core.runnables.schema import StreamEvent
from langchain_core.runnables.utils import (
AddableDict,
AnyConfigurableField,
@@ -94,6 +93,7 @@ if TYPE_CHECKING:
from langchain_core.runnables.fallbacks import (
RunnableWithFallbacks as RunnableWithFallbacksT,
)
+ from langchain_core.runnables.schema import StreamEvent
from langchain_core.tools import BaseTool
from langchain_core.tracers.log_stream import (
RunLog,
diff --git a/libs/core/langchain_core/runnables/configurable.py b/libs/core/langchain_core/runnables/configurable.py
index b59d0239fb1..79473e5d268 100644
--- a/libs/core/langchain_core/runnables/configurable.py
+++ b/libs/core/langchain_core/runnables/configurable.py
@@ -7,6 +7,7 @@ from collections.abc import AsyncIterator, Iterator, Sequence
from collections.abc import Mapping as Mapping
from functools import wraps
from typing import (
+ TYPE_CHECKING,
Any,
Callable,
Optional,
@@ -26,7 +27,6 @@ from langchain_core.runnables.config import (
get_executor_for_config,
merge_configs,
)
-from langchain_core.runnables.graph import Graph
from langchain_core.runnables.utils import (
AnyConfigurableField,
ConfigurableField,
@@ -39,6 +39,9 @@ from langchain_core.runnables.utils import (
get_unique_config_specs,
)
+if TYPE_CHECKING:
+ from langchain_core.runnables.graph import Graph
+
class DynamicRunnable(RunnableSerializable[Input, Output]):
"""Serializable Runnable that can be dynamically configured.
diff --git a/libs/core/langchain_core/runnables/graph.py b/libs/core/langchain_core/runnables/graph.py
index 84b86994dbf..99bcae5abf3 100644
--- a/libs/core/langchain_core/runnables/graph.py
+++ b/libs/core/langchain_core/runnables/graph.py
@@ -2,7 +2,6 @@ from __future__ import annotations
import inspect
from collections import defaultdict
-from collections.abc import Sequence
from dataclasses import dataclass, field
from enum import Enum
from typing import (
@@ -18,11 +17,13 @@ from typing import (
)
from uuid import UUID, uuid4
-from pydantic import BaseModel
-
from langchain_core.utils.pydantic import _IgnoreUnserializable, is_basemodel_subclass
if TYPE_CHECKING:
+ from collections.abc import Sequence
+
+ from pydantic import BaseModel
+
from langchain_core.runnables.base import Runnable as RunnableType
diff --git a/libs/core/langchain_core/runnables/passthrough.py b/libs/core/langchain_core/runnables/passthrough.py
index b0da175ae32..95c27e7bcc6 100644
--- a/libs/core/langchain_core/runnables/passthrough.py
+++ b/libs/core/langchain_core/runnables/passthrough.py
@@ -5,7 +5,7 @@ from __future__ import annotations
import asyncio
import inspect
import threading
-from collections.abc import AsyncIterator, Awaitable, Iterator, Mapping
+from collections.abc import Awaitable
from typing import (
TYPE_CHECKING,
Any,
@@ -32,7 +32,6 @@ from langchain_core.runnables.config import (
get_executor_for_config,
patch_config,
)
-from langchain_core.runnables.graph import Graph
from langchain_core.runnables.utils import (
AddableDict,
ConfigurableFieldSpec,
@@ -42,10 +41,13 @@ from langchain_core.utils.iter import safetee
from langchain_core.utils.pydantic import create_model_v2
if TYPE_CHECKING:
+ from collections.abc import AsyncIterator, Iterator, Mapping
+
from langchain_core.callbacks.manager import (
AsyncCallbackManagerForChainRun,
CallbackManagerForChainRun,
)
+ from langchain_core.runnables.graph import Graph
def identity(x: Other) -> Other:
diff --git a/libs/core/langchain_core/runnables/router.py b/libs/core/langchain_core/runnables/router.py
index 29c6359c69a..8d679ad35c9 100644
--- a/libs/core/langchain_core/runnables/router.py
+++ b/libs/core/langchain_core/runnables/router.py
@@ -1,8 +1,9 @@
from __future__ import annotations
-from collections.abc import AsyncIterator, Iterator, Mapping
+from collections.abc import Mapping
from itertools import starmap
from typing import (
+ TYPE_CHECKING,
Any,
Callable,
Optional,
@@ -31,6 +32,9 @@ from langchain_core.runnables.utils import (
get_unique_config_specs,
)
+if TYPE_CHECKING:
+ from collections.abc import AsyncIterator, Iterator
+
class RouterInput(TypedDict):
"""Router input.
diff --git a/libs/core/langchain_core/runnables/schema.py b/libs/core/langchain_core/runnables/schema.py
index dcfd32b8b39..20ad580070a 100644
--- a/libs/core/langchain_core/runnables/schema.py
+++ b/libs/core/langchain_core/runnables/schema.py
@@ -2,11 +2,13 @@
from __future__ import annotations
-from collections.abc import Sequence
-from typing import Any, Literal, Union
+from typing import TYPE_CHECKING, Any, Literal, Union
from typing_extensions import NotRequired, TypedDict
+if TYPE_CHECKING:
+ from collections.abc import Sequence
+
class EventData(TypedDict, total=False):
"""Data associated with a streaming event."""
diff --git a/libs/core/langchain_core/runnables/utils.py b/libs/core/langchain_core/runnables/utils.py
index 75063c7db58..6ed4e7d6816 100644
--- a/libs/core/langchain_core/runnables/utils.py
+++ b/libs/core/langchain_core/runnables/utils.py
@@ -6,19 +6,11 @@ import ast
import asyncio
import inspect
import textwrap
-from collections.abc import (
- AsyncIterable,
- AsyncIterator,
- Awaitable,
- Coroutine,
- Iterable,
- Mapping,
- Sequence,
-)
from functools import lru_cache
from inspect import signature
from itertools import groupby
from typing import (
+ TYPE_CHECKING,
Any,
Callable,
NamedTuple,
@@ -30,11 +22,22 @@ from typing import (
from typing_extensions import TypeGuard, override
-from langchain_core.runnables.schema import StreamEvent
-
# Re-export create-model for backwards compatibility
from langchain_core.utils.pydantic import create_model as create_model
+if TYPE_CHECKING:
+ from collections.abc import (
+ AsyncIterable,
+ AsyncIterator,
+ Awaitable,
+ Coroutine,
+ Iterable,
+ Mapping,
+ Sequence,
+ )
+
+ from langchain_core.runnables.schema import StreamEvent
+
Input = TypeVar("Input", contravariant=True)
# Output type should implement __concat__, as eg str, list, dict do
Output = TypeVar("Output", covariant=True)
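Some hunks deliberately keep part of an import at runtime, e.g. `passthrough.py` retains `Awaitable` and `router.py` retains `Mapping` at module level. Names bound only under `TYPE_CHECKING` do not exist when the program runs, so anything evaluated at runtime (an `isinstance` check, a default value, a base class) must stay a real import. A small sketch of the distinction (illustrative module only):

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from collections.abc import Mapping  # fine for annotations only


def describe(value: "Mapping[str, int] | list[int]") -> str:
    # ...but a runtime isinstance() needs the real name, so import it
    # here (or keep it at module level, as the diff does in such cases).
    from collections.abc import Mapping as RuntimeMapping

    return "mapping" if isinstance(value, RuntimeMapping) else "other"


print(describe({"a": 1}))  # mapping
print(describe([1, 2]))  # other
```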
diff --git a/libs/core/langchain_core/structured_query.py b/libs/core/langchain_core/structured_query.py
index 8aacbfbcc60..c0c502da005 100644
--- a/libs/core/langchain_core/structured_query.py
+++ b/libs/core/langchain_core/structured_query.py
@@ -3,12 +3,14 @@
from __future__ import annotations
from abc import ABC, abstractmethod
-from collections.abc import Sequence
from enum import Enum
-from typing import Any, Optional, Union
+from typing import TYPE_CHECKING, Any, Optional, Union
from pydantic import BaseModel
+if TYPE_CHECKING:
+ from collections.abc import Sequence
+
class Visitor(ABC):
"""Defines interface for IR translation using a visitor pattern."""
diff --git a/libs/core/langchain_core/tools/base.py b/libs/core/langchain_core/tools/base.py
index 5a618f6a160..5e1a8a47523 100644
--- a/libs/core/langchain_core/tools/base.py
+++ b/libs/core/langchain_core/tools/base.py
@@ -4,13 +4,12 @@ import asyncio
import functools
import inspect
import json
-import uuid
import warnings
from abc import ABC, abstractmethod
-from collections.abc import Sequence
from contextvars import copy_context
from inspect import signature
from typing import (
+ TYPE_CHECKING,
Annotated,
Any,
Callable,
@@ -68,6 +67,10 @@ from langchain_core.utils.pydantic import (
is_pydantic_v2_subclass,
)
+if TYPE_CHECKING:
+ import uuid
+ from collections.abc import Sequence
+
FILTERED_ARGS = ("run_manager", "callbacks")
diff --git a/libs/core/langchain_core/tools/retriever.py b/libs/core/langchain_core/tools/retriever.py
index b1036058ca5..0bc26ce2a77 100644
--- a/libs/core/langchain_core/tools/retriever.py
+++ b/libs/core/langchain_core/tools/retriever.py
@@ -1,21 +1,23 @@
from __future__ import annotations
from functools import partial
-from typing import Literal, Optional, Union
+from typing import TYPE_CHECKING, Literal, Optional, Union
from pydantic import BaseModel, Field
-from langchain_core.callbacks import Callbacks
-from langchain_core.documents import Document
from langchain_core.prompts import (
BasePromptTemplate,
PromptTemplate,
aformat_document,
format_document,
)
-from langchain_core.retrievers import BaseRetriever
from langchain_core.tools.simple import Tool
+if TYPE_CHECKING:
+ from langchain_core.callbacks import Callbacks
+ from langchain_core.documents import Document
+ from langchain_core.retrievers import BaseRetriever
+
class RetrieverInput(BaseModel):
"""Input to the retriever."""
diff --git a/libs/core/langchain_core/tools/simple.py b/libs/core/langchain_core/tools/simple.py
index 370d2091a68..d9806c3e94b 100644
--- a/libs/core/langchain_core/tools/simple.py
+++ b/libs/core/langchain_core/tools/simple.py
@@ -3,6 +3,7 @@ from __future__ import annotations
from collections.abc import Awaitable
from inspect import signature
from typing import (
+ TYPE_CHECKING,
Any,
Callable,
Optional,
@@ -13,7 +14,6 @@ from langchain_core.callbacks import (
AsyncCallbackManagerForToolRun,
CallbackManagerForToolRun,
)
-from langchain_core.messages import ToolCall
from langchain_core.runnables import RunnableConfig, run_in_executor
from langchain_core.tools.base import (
ArgsSchema,
@@ -22,6 +22,9 @@ from langchain_core.tools.base import (
_get_runnable_config_param,
)
+if TYPE_CHECKING:
+ from langchain_core.messages import ToolCall
+
class Tool(BaseTool):
"""Tool that takes in function or coroutine directly."""
diff --git a/libs/core/langchain_core/tools/structured.py b/libs/core/langchain_core/tools/structured.py
index f168289ecde..6d7620e7627 100644
--- a/libs/core/langchain_core/tools/structured.py
+++ b/libs/core/langchain_core/tools/structured.py
@@ -4,6 +4,7 @@ import textwrap
from collections.abc import Awaitable
from inspect import signature
from typing import (
+ TYPE_CHECKING,
Annotated,
Any,
Callable,
@@ -18,7 +19,6 @@ from langchain_core.callbacks import (
AsyncCallbackManagerForToolRun,
CallbackManagerForToolRun,
)
-from langchain_core.messages import ToolCall
from langchain_core.runnables import RunnableConfig, run_in_executor
from langchain_core.tools.base import (
FILTERED_ARGS,
@@ -29,6 +29,9 @@ from langchain_core.tools.base import (
)
from langchain_core.utils.pydantic import is_basemodel_subclass
+if TYPE_CHECKING:
+ from langchain_core.messages import ToolCall
+
class StructuredTool(BaseTool):
"""Tool that can operate on any number of inputs."""
diff --git a/libs/core/langchain_core/tracers/base.py b/libs/core/langchain_core/tracers/base.py
index f3ae965f602..f452d586f50 100644
--- a/libs/core/langchain_core/tracers/base.py
+++ b/libs/core/langchain_core/tracers/base.py
@@ -5,26 +5,27 @@ from __future__ import annotations
import asyncio
import logging
from abc import ABC, abstractmethod
-from collections.abc import Sequence
from typing import (
TYPE_CHECKING,
Any,
Optional,
Union,
)
-from uuid import UUID
-
-from tenacity import RetryCallState
from langchain_core.callbacks.base import AsyncCallbackHandler, BaseCallbackHandler
from langchain_core.exceptions import TracerException # noqa
-from langchain_core.messages import BaseMessage
-from langchain_core.outputs import ChatGenerationChunk, GenerationChunk, LLMResult
from langchain_core.tracers.core import _TracerCore
-from langchain_core.tracers.schemas import Run
if TYPE_CHECKING:
+ from collections.abc import Sequence
+ from uuid import UUID
+
+ from tenacity import RetryCallState
+
from langchain_core.documents import Document
+ from langchain_core.messages import BaseMessage
+ from langchain_core.outputs import ChatGenerationChunk, GenerationChunk, LLMResult
+ from langchain_core.tracers.schemas import Run
logger = logging.getLogger(__name__)
diff --git a/libs/core/langchain_core/tracers/context.py b/libs/core/langchain_core/tracers/context.py
index d36adc9bc8e..a947f162fcd 100644
--- a/libs/core/langchain_core/tracers/context.py
+++ b/libs/core/langchain_core/tracers/context.py
@@ -1,6 +1,5 @@
from __future__ import annotations
-from collections.abc import Generator
from contextlib import contextmanager
from contextvars import ContextVar
from typing import (
@@ -18,13 +17,15 @@ from langsmith import utils as ls_utils
from langchain_core.tracers.langchain import LangChainTracer
from langchain_core.tracers.run_collector import RunCollectorCallbackHandler
-from langchain_core.tracers.schemas import TracerSessionV1
if TYPE_CHECKING:
+ from collections.abc import Generator
+
from langsmith import Client as LangSmithClient
from langchain_core.callbacks.base import BaseCallbackHandler, Callbacks
from langchain_core.callbacks.manager import AsyncCallbackManager, CallbackManager
+ from langchain_core.tracers.schemas import TracerSessionV1
# for backwards partial compatibility if this is imported by users but unused
tracing_callback_var: Any = None
diff --git a/libs/core/langchain_core/tracers/core.py b/libs/core/langchain_core/tracers/core.py
index d3544df04e3..599c4000cd6 100644
--- a/libs/core/langchain_core/tracers/core.py
+++ b/libs/core/langchain_core/tracers/core.py
@@ -6,7 +6,6 @@ import logging
import sys
import traceback
from abc import ABC, abstractmethod
-from collections.abc import Coroutine, Sequence
from datetime import datetime, timezone
from typing import (
TYPE_CHECKING,
@@ -16,13 +15,9 @@ from typing import (
Union,
cast,
)
-from uuid import UUID
-
-from tenacity import RetryCallState
from langchain_core.exceptions import TracerException
from langchain_core.load import dumpd
-from langchain_core.messages import BaseMessage
from langchain_core.outputs import (
ChatGeneration,
ChatGenerationChunk,
@@ -32,7 +27,13 @@ from langchain_core.outputs import (
from langchain_core.tracers.schemas import Run
if TYPE_CHECKING:
+ from collections.abc import Coroutine, Sequence
+ from uuid import UUID
+
+ from tenacity import RetryCallState
+
from langchain_core.documents import Document
+ from langchain_core.messages import BaseMessage
logger = logging.getLogger(__name__)
diff --git a/libs/core/langchain_core/tracers/evaluation.py b/libs/core/langchain_core/tracers/evaluation.py
index 425c1c01222..0a918cd2e9d 100644
--- a/libs/core/langchain_core/tracers/evaluation.py
+++ b/libs/core/langchain_core/tracers/evaluation.py
@@ -5,9 +5,8 @@ from __future__ import annotations
import logging
import threading
import weakref
-from collections.abc import Sequence
from concurrent.futures import Future, ThreadPoolExecutor, wait
-from typing import Any, Optional, Union, cast
+from typing import TYPE_CHECKING, Any, Optional, Union, cast
from uuid import UUID
import langsmith
@@ -17,7 +16,11 @@ from langchain_core.tracers import langchain as langchain_tracer
from langchain_core.tracers.base import BaseTracer
from langchain_core.tracers.context import tracing_v2_enabled
from langchain_core.tracers.langchain import _get_executor
-from langchain_core.tracers.schemas import Run
+
+if TYPE_CHECKING:
+ from collections.abc import Sequence
+
+ from langchain_core.tracers.schemas import Run
logger = logging.getLogger(__name__)
diff --git a/libs/core/langchain_core/tracers/event_stream.py b/libs/core/langchain_core/tracers/event_stream.py
index b7f3db8595d..2a5f6b5f840 100644
--- a/libs/core/langchain_core/tracers/event_stream.py
+++ b/libs/core/langchain_core/tracers/event_stream.py
@@ -5,7 +5,6 @@ from __future__ import annotations
import asyncio
import contextlib
import logging
-from collections.abc import AsyncIterator, Iterator, Sequence
from typing import (
TYPE_CHECKING,
Any,
@@ -37,13 +36,15 @@ from langchain_core.runnables.utils import (
_RootEventFilter,
)
from langchain_core.tracers._streaming import _StreamingCallbackHandler
-from langchain_core.tracers.log_stream import LogEntry
from langchain_core.tracers.memory_stream import _MemoryStream
from langchain_core.utils.aiter import aclosing, py_anext
if TYPE_CHECKING:
+ from collections.abc import AsyncIterator, Iterator, Sequence
+
from langchain_core.documents import Document
from langchain_core.runnables import Runnable, RunnableConfig
+ from langchain_core.tracers.log_stream import LogEntry
logger = logging.getLogger(__name__)
diff --git a/libs/core/langchain_core/tracers/langchain.py b/libs/core/langchain_core/tracers/langchain.py
index 56981bc81d8..7cdf5c92a32 100644
--- a/libs/core/langchain_core/tracers/langchain.py
+++ b/libs/core/langchain_core/tracers/langchain.py
@@ -22,12 +22,12 @@ from tenacity import (
from langchain_core.env import get_runtime_environment
from langchain_core.load import dumpd
-from langchain_core.outputs import ChatGenerationChunk, GenerationChunk
from langchain_core.tracers.base import BaseTracer
from langchain_core.tracers.schemas import Run
if TYPE_CHECKING:
from langchain_core.messages import BaseMessage
+ from langchain_core.outputs import ChatGenerationChunk, GenerationChunk
logger = logging.getLogger(__name__)
_LOGGED = set()
diff --git a/libs/core/langchain_core/tracers/log_stream.py b/libs/core/langchain_core/tracers/log_stream.py
index 2284ff7022f..1e0c26aa795 100644
--- a/libs/core/langchain_core/tracers/log_stream.py
+++ b/libs/core/langchain_core/tracers/log_stream.py
@@ -5,8 +5,8 @@ import contextlib
import copy
import threading
from collections import defaultdict
-from collections.abc import AsyncIterator, Iterator, Sequence
from typing import (
+ TYPE_CHECKING,
Any,
Literal,
Optional,
@@ -14,7 +14,6 @@ from typing import (
Union,
overload,
)
-from uuid import UUID
import jsonpatch # type: ignore[import]
from typing_extensions import NotRequired, TypedDict
@@ -23,11 +22,16 @@ from langchain_core.load import dumps
from langchain_core.load.load import load
from langchain_core.outputs import ChatGenerationChunk, GenerationChunk
from langchain_core.runnables import Runnable, RunnableConfig, ensure_config
-from langchain_core.runnables.utils import Input, Output
from langchain_core.tracers._streaming import _StreamingCallbackHandler
from langchain_core.tracers.base import BaseTracer
from langchain_core.tracers.memory_stream import _MemoryStream
-from langchain_core.tracers.schemas import Run
+
+if TYPE_CHECKING:
+ from collections.abc import AsyncIterator, Iterator, Sequence
+ from uuid import UUID
+
+ from langchain_core.runnables.utils import Input, Output
+ from langchain_core.tracers.schemas import Run
class LogEntry(TypedDict):
diff --git a/libs/core/langchain_core/tracers/root_listeners.py b/libs/core/langchain_core/tracers/root_listeners.py
index 1530598fb75..c08861164ac 100644
--- a/libs/core/langchain_core/tracers/root_listeners.py
+++ b/libs/core/langchain_core/tracers/root_listeners.py
@@ -1,6 +1,5 @@
from collections.abc import Awaitable
-from typing import Callable, Optional, Union
-from uuid import UUID
+from typing import TYPE_CHECKING, Callable, Optional, Union
from langchain_core.runnables.config import (
RunnableConfig,
@@ -10,6 +9,9 @@ from langchain_core.runnables.config import (
from langchain_core.tracers.base import AsyncBaseTracer, BaseTracer
from langchain_core.tracers.schemas import Run
+if TYPE_CHECKING:
+ from uuid import UUID
+
Listener = Union[Callable[[Run], None], Callable[[Run, RunnableConfig], None]]
AsyncListener = Union[
Callable[[Run], Awaitable[None]], Callable[[Run, RunnableConfig], Awaitable[None]]
diff --git a/libs/core/langchain_core/utils/json_schema.py b/libs/core/langchain_core/utils/json_schema.py
index 38fab589909..72f20c38d22 100644
--- a/libs/core/langchain_core/utils/json_schema.py
+++ b/libs/core/langchain_core/utils/json_schema.py
@@ -1,8 +1,10 @@
from __future__ import annotations
-from collections.abc import Sequence
from copy import deepcopy
-from typing import Any, Optional
+from typing import TYPE_CHECKING, Any, Optional
+
+if TYPE_CHECKING:
+ from collections.abc import Sequence
def _retrieve_ref(path: str, schema: dict) -> dict:
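The hunks above all apply the same refactor: imports used only in type annotations move under an `if TYPE_CHECKING:` block, which is safe because `from __future__ import annotations` makes every annotation a string evaluated lazily. A minimal standalone sketch of the pattern (the function and values are illustrative, not from the diff):

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen by static type checkers only; never imported at runtime,
    # so module import time and runtime dependencies both shrink.
    from collections.abc import Sequence


def first(items: Sequence[int]) -> int:
    # The __future__ import turns the annotation into a string,
    # so Sequence does not need to exist when this runs.
    return items[0]


print(first([3, 1, 2]))  # → 3
```

The `ruff` TC rules enabled later in this diff enforce exactly this separation.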
diff --git a/libs/core/langchain_core/utils/mustache.py b/libs/core/langchain_core/utils/mustache.py
index ee2ed8f2528..4c42c47fed9 100644
--- a/libs/core/langchain_core/utils/mustache.py
+++ b/libs/core/langchain_core/utils/mustache.py
@@ -8,6 +8,7 @@ import logging
from collections.abc import Iterator, Mapping, Sequence
from types import MappingProxyType
from typing import (
+ TYPE_CHECKING,
Any,
Literal,
Optional,
@@ -15,7 +16,8 @@ from typing import (
cast,
)
-from typing_extensions import TypeAlias
+if TYPE_CHECKING:
+ from typing_extensions import TypeAlias
logger = logging.getLogger(__name__)
diff --git a/libs/core/langchain_core/utils/pydantic.py b/libs/core/langchain_core/utils/pydantic.py
index 6c4b4cb8c2f..0aff29c1bb1 100644
--- a/libs/core/langchain_core/utils/pydantic.py
+++ b/libs/core/langchain_core/utils/pydantic.py
@@ -9,6 +9,7 @@ from contextlib import nullcontext
from functools import lru_cache, wraps
from types import GenericAlias
from typing import (
+ TYPE_CHECKING,
Any,
Callable,
Optional,
@@ -29,13 +30,16 @@ from pydantic import (
from pydantic import (
create_model as _create_model_base,
)
+from pydantic.fields import FieldInfo as FieldInfoV2
from pydantic.json_schema import (
DEFAULT_REF_TEMPLATE,
GenerateJsonSchema,
JsonSchemaMode,
JsonSchemaValue,
)
-from pydantic_core import core_schema
+
+if TYPE_CHECKING:
+ from pydantic_core import core_schema
def get_pydantic_major_version() -> int:
@@ -71,8 +75,8 @@ elif PYDANTIC_MAJOR_VERSION == 2:
from pydantic.v1.fields import FieldInfo as FieldInfoV1 # type: ignore[assignment]
# Union type needs to be last assignment to PydanticBaseModel to make mypy happy.
- PydanticBaseModel = Union[BaseModel, pydantic.BaseModel] # type: ignore
- TypeBaseModel = Union[type[BaseModel], type[pydantic.BaseModel]] # type: ignore
+ PydanticBaseModel = Union[BaseModel, pydantic.BaseModel] # type: ignore[assignment,misc]
+ TypeBaseModel = Union[type[BaseModel], type[pydantic.BaseModel]] # type: ignore[misc]
else:
msg = f"Unsupported Pydantic version: {PYDANTIC_MAJOR_VERSION}"
raise ValueError(msg)
@@ -357,7 +361,6 @@ def _create_subset_model(
if PYDANTIC_MAJOR_VERSION == 2:
from pydantic import BaseModel as BaseModelV2
- from pydantic.fields import FieldInfo as FieldInfoV2
from pydantic.v1 import BaseModel as BaseModelV1
@overload
diff --git a/libs/core/langchain_core/vectorstores/base.py b/libs/core/langchain_core/vectorstores/base.py
index b154a14b981..1cf1fd529fb 100644
--- a/libs/core/langchain_core/vectorstores/base.py
+++ b/libs/core/langchain_core/vectorstores/base.py
@@ -25,7 +25,6 @@ import logging
import math
import warnings
from abc import ABC, abstractmethod
-from collections.abc import Collection, Iterable, Iterator, Sequence
from itertools import cycle
from typing import (
TYPE_CHECKING,
@@ -43,6 +42,8 @@ from langchain_core.retrievers import BaseRetriever, LangSmithRetrieverParams
from langchain_core.runnables.config import run_in_executor
if TYPE_CHECKING:
+ from collections.abc import Collection, Iterable, Iterator, Sequence
+
from langchain_core.callbacks.manager import (
AsyncCallbackManagerForRetrieverRun,
CallbackManagerForRetrieverRun,
diff --git a/libs/core/langchain_core/vectorstores/in_memory.py b/libs/core/langchain_core/vectorstores/in_memory.py
index 5fb385a21c3..daf1dcdfc69 100644
--- a/libs/core/langchain_core/vectorstores/in_memory.py
+++ b/libs/core/langchain_core/vectorstores/in_memory.py
@@ -2,7 +2,6 @@ from __future__ import annotations
import json
import uuid
-from collections.abc import Iterator, Sequence
from pathlib import Path
from typing import (
TYPE_CHECKING,
@@ -13,13 +12,15 @@ from typing import (
from langchain_core._api import deprecated
from langchain_core.documents import Document
-from langchain_core.embeddings import Embeddings
from langchain_core.load import dumpd, load
from langchain_core.vectorstores import VectorStore
from langchain_core.vectorstores.utils import _cosine_similarity as cosine_similarity
from langchain_core.vectorstores.utils import maximal_marginal_relevance
if TYPE_CHECKING:
+ from collections.abc import Iterator, Sequence
+
+ from langchain_core.embeddings import Embeddings
from langchain_core.indexing import UpsertResponse
diff --git a/libs/core/pyproject.toml b/libs/core/pyproject.toml
index bfd3eec1d40..dc4a8bbc473 100644
--- a/libs/core/pyproject.toml
+++ b/libs/core/pyproject.toml
@@ -77,8 +77,9 @@ target-version = "py39"
[tool.ruff.lint]
-select = [ "ANN", "ASYNC", "B", "C4", "COM", "DJ", "E", "EM", "EXE", "F", "FLY", "FURB", "I", "ICN", "INT", "LOG", "N", "NPY", "PD", "PIE", "Q", "RSE", "S", "SIM", "SLOT", "T10", "T201", "TID", "TRY", "UP", "W", "YTT",]
-ignore = [ "ANN401", "COM812", "UP007", "S110", "S112",]
+select = [ "ANN", "ASYNC", "B", "C4", "COM", "DJ", "E", "EM", "EXE", "F", "FLY", "FURB", "I", "ICN", "INT", "LOG", "N", "NPY", "PD", "PIE", "Q", "RSE", "S", "SIM", "SLOT", "T10", "T201", "TC", "TID", "TRY", "UP", "W", "YTT",]
+ignore = [ "ANN401", "COM812", "UP007", "S110", "S112", "TC001", "TC002", "TC003"]
+flake8-type-checking.runtime-evaluated-base-classes = ["pydantic.BaseModel","langchain_core.load.serializable.Serializable","langchain_core.runnables.base.RunnableSerializable"]
flake8-annotations.allow-star-arg-any = true
flake8-annotations.mypy-init-return = true
diff --git a/libs/core/tests/unit_tests/language_models/chat_models/test_base.py b/libs/core/tests/unit_tests/language_models/chat_models/test_base.py
index 420b60cebf3..284ffb23d18 100644
--- a/libs/core/tests/unit_tests/language_models/chat_models/test_base.py
+++ b/libs/core/tests/unit_tests/language_models/chat_models/test_base.py
@@ -2,7 +2,7 @@
import uuid
from collections.abc import AsyncIterator, Iterator
-from typing import Any, Literal, Optional, Union
+from typing import TYPE_CHECKING, Any, Literal, Optional, Union
import pytest
@@ -30,6 +30,9 @@ from tests.unit_tests.fake.callbacks import (
)
from tests.unit_tests.stubs import _any_id_ai_message, _any_id_ai_message_chunk
+if TYPE_CHECKING:
+ from langchain_core.outputs.llm_result import LLMResult
+
@pytest.fixture
def messages() -> list:
diff --git a/libs/core/tests/unit_tests/vectorstores/test_vectorstore.py b/libs/core/tests/unit_tests/vectorstores/test_vectorstore.py
index c9c6592545b..2ebee79cf9d 100644
--- a/libs/core/tests/unit_tests/vectorstores/test_vectorstore.py
+++ b/libs/core/tests/unit_tests/vectorstores/test_vectorstore.py
@@ -7,8 +7,7 @@ the relevant methods.
from __future__ import annotations
import uuid
-from collections.abc import Iterable, Sequence
-from typing import Any, Optional
+from typing import TYPE_CHECKING, Any, Optional
import pytest
@@ -16,6 +15,9 @@ from langchain_core.documents import Document
from langchain_core.embeddings import Embeddings, FakeEmbeddings
from langchain_core.vectorstores import VectorStore
+if TYPE_CHECKING:
+ from collections.abc import Iterable, Sequence
+
class CustomAddTextsVectorstore(VectorStore):
"""A vectorstore that only implements add texts."""
diff --git a/libs/langchain/langchain/chains/flare/base.py b/libs/langchain/langchain/chains/flare/base.py
index 430ad0baf36..04173a6199e 100644
--- a/libs/langchain/langchain/chains/flare/base.py
+++ b/libs/langchain/langchain/chains/flare/base.py
@@ -1,9 +1,9 @@
from __future__ import annotations
+import logging
import re
from typing import Any, Dict, List, Optional, Sequence, Tuple
-import numpy as np
from langchain_core.callbacks import (
CallbackManagerForChainRun,
)
@@ -23,6 +23,8 @@ from langchain.chains.flare.prompts import (
)
from langchain.chains.llm import LLMChain
+logger = logging.getLogger(__name__)
+
def _extract_tokens_and_log_probs(response: AIMessage) -> Tuple[List[str], List[float]]:
"""Extract tokens and log probabilities from chat model response."""
@@ -57,7 +59,24 @@ def _low_confidence_spans(
min_token_gap: int,
num_pad_tokens: int,
) -> List[str]:
- _low_idx = np.where(np.exp(log_probs) < min_prob)[0]
+ try:
+ import numpy as np
+
+ _low_idx = np.where(np.exp(log_probs) < min_prob)[0]
+ except ImportError:
+ logger.warning(
+ "NumPy not found in the current Python environment. FlareChain will use a "
+ "pure Python implementation for internal calculations, which may "
+ "significantly impact performance, especially for large datasets. For "
+ "optimal speed and efficiency, consider installing NumPy: pip install numpy"
+ )
+ import math
+
+ _low_idx = [ # type: ignore[assignment]
+ idx
+ for idx, log_prob in enumerate(log_probs)
+ if math.exp(log_prob) < min_prob
+ ]
low_idx = [i for i in _low_idx if re.search(r"\w", tokens[i])]
if len(low_idx) == 0:
return []
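The fallback added to `_low_confidence_spans` replaces `np.where(np.exp(log_probs) < min_prob)[0]` with a list comprehension over `math.exp`. A self-contained sketch of that equivalence, using hypothetical log-probabilities:

```python
import math

log_probs = [-0.1, -3.0, -0.05, -2.5]  # hypothetical token log-probabilities
min_prob = 0.2

# Pure-Python equivalent of np.where(np.exp(log_probs) < min_prob)[0]:
# indices of tokens whose probability falls below the threshold.
low_idx = [i for i, lp in enumerate(log_probs) if math.exp(lp) < min_prob]
print(low_idx)  # → [1, 3]
```

Both forms yield the same indices; only the NumPy version vectorizes the exponential.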
diff --git a/libs/langchain/langchain/chains/hyde/base.py b/libs/langchain/langchain/chains/hyde/base.py
index f48ca4db6b6..0dade1ab9d0 100644
--- a/libs/langchain/langchain/chains/hyde/base.py
+++ b/libs/langchain/langchain/chains/hyde/base.py
@@ -5,9 +5,9 @@ https://arxiv.org/abs/2212.10496
from __future__ import annotations
+import logging
from typing import Any, Dict, List, Optional
-import numpy as np
from langchain_core.callbacks import CallbackManagerForChainRun
from langchain_core.embeddings import Embeddings
from langchain_core.language_models import BaseLanguageModel
@@ -20,6 +20,8 @@ from langchain.chains.base import Chain
from langchain.chains.hyde.prompts import PROMPT_MAP
from langchain.chains.llm import LLMChain
+logger = logging.getLogger(__name__)
+
class HypotheticalDocumentEmbedder(Chain, Embeddings):
"""Generate hypothetical document for query, and then embed that.
@@ -54,7 +56,22 @@ class HypotheticalDocumentEmbedder(Chain, Embeddings):
def combine_embeddings(self, embeddings: List[List[float]]) -> List[float]:
"""Combine embeddings into final embeddings."""
- return list(np.array(embeddings).mean(axis=0))
+ try:
+ import numpy as np
+
+ return list(np.array(embeddings).mean(axis=0))
+ except ImportError:
+ logger.warning(
+ "NumPy not found in the current Python environment. "
+ "HypotheticalDocumentEmbedder will use a pure Python implementation "
+ "for internal calculations, which may significantly impact "
+ "performance, especially for large datasets. For optimal speed and "
+ "efficiency, consider installing NumPy: pip install numpy"
+ )
+ if not embeddings:
+ return []
+ num_vectors = len(embeddings)
+ return [sum(dim_values) / num_vectors for dim_values in zip(*embeddings)]
def embed_query(self, text: str) -> List[float]:
"""Generate a hypothetical document and embedded it."""
diff --git a/libs/langchain/langchain/evaluation/embedding_distance/base.py b/libs/langchain/langchain/evaluation/embedding_distance/base.py
index 4a1340d0695..c3b3f805bde 100644
--- a/libs/langchain/langchain/evaluation/embedding_distance/base.py
+++ b/libs/langchain/langchain/evaluation/embedding_distance/base.py
@@ -1,9 +1,11 @@
"""A chain for comparing the output of two models using embeddings."""
+import functools
+import logging
from enum import Enum
+from importlib import util
from typing import Any, Dict, List, Optional
-import numpy as np
from langchain_core.callbacks.manager import (
AsyncCallbackManagerForChainRun,
CallbackManagerForChainRun,
@@ -18,6 +20,34 @@ from langchain.evaluation.schema import PairwiseStringEvaluator, StringEvaluator
from langchain.schema import RUN_KEY
+def _import_numpy() -> Any:
+ try:
+ import numpy as np
+
+ return np
+ except ImportError as e:
+ raise ImportError(
+ "Could not import numpy, please install with `pip install numpy`."
+ ) from e
+
+
+logger = logging.getLogger(__name__)
+
+
+@functools.lru_cache(maxsize=1)
+def _check_numpy() -> bool:
+ if bool(util.find_spec("numpy")):
+ return True
+ logger.warning(
+ "NumPy not found in the current Python environment. "
+ "langchain will use a pure Python implementation for embedding distance "
+ "operations, which may significantly impact performance, especially for large "
+ "datasets. For optimal speed and efficiency, consider installing NumPy: "
+ "pip install numpy"
+ )
+ return False
+
+
def _embedding_factory() -> Embeddings:
"""Create an Embeddings object.
Returns:
@@ -158,7 +188,7 @@ class _EmbeddingDistanceChainMixin(Chain):
raise ValueError(f"Invalid metric: {metric}")
@staticmethod
- def _cosine_distance(a: np.ndarray, b: np.ndarray) -> np.ndarray:
+ def _cosine_distance(a: Any, b: Any) -> Any:
"""Compute the cosine distance between two vectors.
Args:
@@ -179,7 +209,7 @@ class _EmbeddingDistanceChainMixin(Chain):
return 1.0 - cosine_similarity(a, b)
@staticmethod
- def _euclidean_distance(a: np.ndarray, b: np.ndarray) -> np.floating:
+ def _euclidean_distance(a: Any, b: Any) -> Any:
"""Compute the Euclidean distance between two vectors.
Args:
@@ -189,10 +219,15 @@ class _EmbeddingDistanceChainMixin(Chain):
Returns:
np.floating: The Euclidean distance.
"""
- return np.linalg.norm(a - b)
+ if _check_numpy():
+ import numpy as np
+
+ return np.linalg.norm(a - b)
+
+ return sum((x - y) * (x - y) for x, y in zip(a, b)) ** 0.5
@staticmethod
- def _manhattan_distance(a: np.ndarray, b: np.ndarray) -> np.floating:
+ def _manhattan_distance(a: Any, b: Any) -> Any:
"""Compute the Manhattan distance between two vectors.
Args:
@@ -202,10 +237,14 @@ class _EmbeddingDistanceChainMixin(Chain):
Returns:
np.floating: The Manhattan distance.
"""
- return np.sum(np.abs(a - b))
+ if _check_numpy():
+ np = _import_numpy()
+ return np.sum(np.abs(a - b))
+
+ return sum(abs(x - y) for x, y in zip(a, b))
@staticmethod
- def _chebyshev_distance(a: np.ndarray, b: np.ndarray) -> np.floating:
+ def _chebyshev_distance(a: Any, b: Any) -> Any:
"""Compute the Chebyshev distance between two vectors.
Args:
@@ -215,10 +254,14 @@ class _EmbeddingDistanceChainMixin(Chain):
Returns:
np.floating: The Chebyshev distance.
"""
- return np.max(np.abs(a - b))
+ if _check_numpy():
+ np = _import_numpy()
+ return np.max(np.abs(a - b))
+
+ return max(abs(x - y) for x, y in zip(a, b))
@staticmethod
- def _hamming_distance(a: np.ndarray, b: np.ndarray) -> np.floating:
+ def _hamming_distance(a: Any, b: Any) -> Any:
"""Compute the Hamming distance between two vectors.
Args:
@@ -228,9 +271,13 @@ class _EmbeddingDistanceChainMixin(Chain):
Returns:
np.floating: The Hamming distance.
"""
- return np.mean(a != b)
+ if _check_numpy():
+ np = _import_numpy()
+ return np.mean(a != b)
- def _compute_score(self, vectors: np.ndarray) -> float:
+ return sum(1 for x, y in zip(a, b) if x != y) / len(a)
+
+ def _compute_score(self, vectors: Any) -> float:
"""Compute the score based on the distance metric.
Args:
@@ -240,8 +287,11 @@ class _EmbeddingDistanceChainMixin(Chain):
float: The computed score.
"""
metric = self._get_metric(self.distance_metric)
- score = metric(vectors[0].reshape(1, -1), vectors[1].reshape(1, -1)).item()
- return score
+ if _check_numpy() and isinstance(vectors, _import_numpy().ndarray):
+ score = metric(vectors[0].reshape(1, -1), vectors[1].reshape(1, -1)).item()
+ else:
+ score = metric(vectors[0], vectors[1])
+ return float(score)
class EmbeddingDistanceEvalChain(_EmbeddingDistanceChainMixin, StringEvaluator):
@@ -292,9 +342,12 @@ class EmbeddingDistanceEvalChain(_EmbeddingDistanceChainMixin, StringEvaluator):
Returns:
Dict[str, Any]: The computed score.
"""
- vectors = np.array(
- self.embeddings.embed_documents([inputs["prediction"], inputs["reference"]])
+ vectors = self.embeddings.embed_documents(
+ [inputs["prediction"], inputs["reference"]]
)
+ if _check_numpy():
+ np = _import_numpy()
+ vectors = np.array(vectors)
score = self._compute_score(vectors)
return {"score": score}
@@ -313,13 +366,15 @@ class EmbeddingDistanceEvalChain(_EmbeddingDistanceChainMixin, StringEvaluator):
Returns:
Dict[str, Any]: The computed score.
"""
- embedded = await self.embeddings.aembed_documents(
+ vectors = await self.embeddings.aembed_documents(
[
inputs["prediction"],
inputs["reference"],
]
)
- vectors = np.array(embedded)
+ if _check_numpy():
+ np = _import_numpy()
+ vectors = np.array(vectors)
score = self._compute_score(vectors)
return {"score": score}
@@ -432,14 +487,15 @@ class PairwiseEmbeddingDistanceEvalChain(
Returns:
Dict[str, Any]: The computed score.
"""
- vectors = np.array(
- self.embeddings.embed_documents(
- [
- inputs["prediction"],
- inputs["prediction_b"],
- ]
- )
+ vectors = self.embeddings.embed_documents(
+ [
+ inputs["prediction"],
+ inputs["prediction_b"],
+ ]
)
+ if _check_numpy():
+ np = _import_numpy()
+ vectors = np.array(vectors)
score = self._compute_score(vectors)
return {"score": score}
@@ -458,13 +514,15 @@ class PairwiseEmbeddingDistanceEvalChain(
Returns:
Dict[str, Any]: The computed score.
"""
- embedded = await self.embeddings.aembed_documents(
+ vectors = await self.embeddings.aembed_documents(
[
inputs["prediction"],
inputs["prediction_b"],
]
)
- vectors = np.array(embedded)
+ if _check_numpy():
+ np = _import_numpy()
+ vectors = np.array(vectors)
score = self._compute_score(vectors)
return {"score": score}
diff --git a/libs/langchain/langchain/retrievers/document_compressors/embeddings_filter.py b/libs/langchain/langchain/retrievers/document_compressors/embeddings_filter.py
index f7acd800ac1..8cb6b082f9c 100644
--- a/libs/langchain/langchain/retrievers/document_compressors/embeddings_filter.py
+++ b/libs/langchain/langchain/retrievers/document_compressors/embeddings_filter.py
@@ -1,6 +1,5 @@
from typing import Callable, Dict, Optional, Sequence
-import numpy as np
from langchain_core.callbacks.manager import Callbacks
from langchain_core.documents import Document
from langchain_core.embeddings import Embeddings
@@ -69,6 +68,13 @@ class EmbeddingsFilter(BaseDocumentCompressor):
"To use please install langchain-community "
"with `pip install langchain-community`."
)
+
+ try:
+ import numpy as np
+ except ImportError as e:
+ raise ImportError(
+ "Could not import numpy, please install with `pip install numpy`."
+ ) from e
stateful_documents = get_stateful_documents(documents)
embedded_documents = _get_embeddings_from_stateful_docs(
self.embeddings, stateful_documents
@@ -104,6 +110,13 @@ class EmbeddingsFilter(BaseDocumentCompressor):
"To use please install langchain-community "
"with `pip install langchain-community`."
)
+
+ try:
+ import numpy as np
+ except ImportError as e:
+ raise ImportError(
+ "Could not import numpy, please install with `pip install numpy`."
+ ) from e
stateful_documents = get_stateful_documents(documents)
embedded_documents = await _aget_embeddings_from_stateful_docs(
self.embeddings, stateful_documents
diff --git a/libs/langchain/pyproject.toml b/libs/langchain/pyproject.toml
index 2b181170089..4bec2664c7d 100644
--- a/libs/langchain/pyproject.toml
+++ b/libs/langchain/pyproject.toml
@@ -14,8 +14,6 @@ dependencies = [
"SQLAlchemy<3,>=1.4",
"requests<3,>=2",
"PyYAML>=5.3",
- "numpy<2,>=1.26.4; python_version < \"3.12\"",
- "numpy<3,>=1.26.2; python_version >= \"3.12\"",
"async-timeout<5.0.0,>=4.0.0; python_version < \"3.11\"",
]
name = "langchain"
@@ -74,6 +72,7 @@ test = [
"langchain-openai",
"toml>=0.10.2",
"packaging>=24.2",
+ "numpy<3,>=1.26.4",
]
codespell = ["codespell<3.0.0,>=2.2.0"]
test_integration = [
@@ -102,6 +101,7 @@ typing = [
"mypy-protobuf<4.0.0,>=3.0.0",
"langchain-core",
"langchain-text-splitters",
+ "numpy<3,>=1.26.4",
]
dev = [
"jupyter<2.0.0,>=1.0.0",
diff --git a/libs/langchain/tests/unit_tests/callbacks/test_file.py b/libs/langchain/tests/unit_tests/callbacks/test_file.py
new file mode 100644
index 00000000000..7d739af8a65
--- /dev/null
+++ b/libs/langchain/tests/unit_tests/callbacks/test_file.py
@@ -0,0 +1,45 @@
+import pathlib
+from typing import Any, Dict, List, Optional
+
+import pytest
+
+from langchain.callbacks import FileCallbackHandler
+from langchain.chains.base import CallbackManagerForChainRun, Chain
+
+
+class FakeChain(Chain):
+ """Fake chain class for testing purposes."""
+
+ be_correct: bool = True
+ the_input_keys: List[str] = ["foo"]
+ the_output_keys: List[str] = ["bar"]
+
+ @property
+ def input_keys(self) -> List[str]:
+ """Input keys."""
+ return self.the_input_keys
+
+ @property
+ def output_keys(self) -> List[str]:
+ """Output key of bar."""
+ return self.the_output_keys
+
+ def _call(
+ self,
+ inputs: Dict[str, str],
+ run_manager: Optional[CallbackManagerForChainRun] = None,
+ ) -> Dict[str, str]:
+ return {"bar": "bar"}
+
+
+def test_filecallback(capsys: pytest.CaptureFixture, tmp_path: pathlib.Path) -> Any:
+ """Test the file callback handler."""
+ p = tmp_path / "output.log"
+ handler = FileCallbackHandler(str(p))
+ chain_test = FakeChain(callbacks=[handler])
+ chain_test.invoke({"foo": "bar"})
+ # Assert the output is as expected
+ assert p.read_text() == (
+ "\n\n\x1b[1m> Entering new FakeChain "
+ "chain...\x1b[0m\n\n\x1b[1m> Finished chain.\x1b[0m\n"
+ )
diff --git a/libs/langchain/tests/unit_tests/test_dependencies.py b/libs/langchain/tests/unit_tests/test_dependencies.py
index e04ebbd00c1..579601c6e11 100644
--- a/libs/langchain/tests/unit_tests/test_dependencies.py
+++ b/libs/langchain/tests/unit_tests/test_dependencies.py
@@ -37,7 +37,6 @@ def test_required_dependencies(uv_conf: Mapping[str, Any]) -> None:
"langchain-core",
"langchain-text-splitters",
"langsmith",
- "numpy",
"pydantic",
"requests",
]
@@ -82,5 +81,6 @@ def test_test_group_dependencies(uv_conf: Mapping[str, Any]) -> None:
"requests-mock",
# TODO: temporary hack since cffi 1.17.1 doesn't work with py 3.9.
"cffi",
+ "numpy",
]
)
diff --git a/libs/langchain/uv.lock b/libs/langchain/uv.lock
index 8fb43b3cd29..5a90ecd654c 100644
--- a/libs/langchain/uv.lock
+++ b/libs/langchain/uv.lock
@@ -2247,8 +2247,6 @@ dependencies = [
{ name = "langchain-core" },
{ name = "langchain-text-splitters" },
{ name = "langsmith" },
- { name = "numpy", version = "1.26.4", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.12'" },
- { name = "numpy", version = "2.2.3", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version >= '3.12'" },
{ name = "pydantic" },
{ name = "pyyaml" },
{ name = "requests" },
@@ -2329,6 +2327,8 @@ test = [
{ name = "langchain-tests" },
{ name = "langchain-text-splitters" },
{ name = "lark" },
+ { name = "numpy", version = "1.26.4", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.12'" },
+ { name = "numpy", version = "2.2.3", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version >= '3.12'" },
{ name = "packaging" },
{ name = "pandas" },
{ name = "pytest" },
@@ -2359,6 +2359,8 @@ typing = [
{ name = "langchain-text-splitters" },
{ name = "mypy" },
{ name = "mypy-protobuf" },
+ { name = "numpy", version = "1.26.4", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.12'" },
+ { name = "numpy", version = "2.2.3", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version >= '3.12'" },
{ name = "types-chardet" },
{ name = "types-pytz" },
{ name = "types-pyyaml" },
@@ -2389,8 +2391,6 @@ requires-dist = [
{ name = "langchain-together", marker = "extra == 'together'" },
{ name = "langchain-xai", marker = "extra == 'xai'" },
{ name = "langsmith", specifier = ">=0.1.17,<0.4" },
- { name = "numpy", marker = "python_full_version < '3.12'", specifier = ">=1.26.4,<2" },
- { name = "numpy", marker = "python_full_version >= '3.12'", specifier = ">=1.26.2,<3" },
{ name = "pydantic", specifier = ">=2.7.4,<3.0.0" },
{ name = "pyyaml", specifier = ">=5.3" },
{ name = "requests", specifier = ">=2,<3" },
@@ -2422,6 +2422,7 @@ test = [
{ name = "langchain-tests", editable = "../standard-tests" },
{ name = "langchain-text-splitters", editable = "../text-splitters" },
{ name = "lark", specifier = ">=1.1.5,<2.0.0" },
+ { name = "numpy", specifier = ">=1.26.4,<3" },
{ name = "packaging", specifier = ">=24.2" },
{ name = "pandas", specifier = ">=2.0.0,<3.0.0" },
{ name = "pytest", specifier = ">=8,<9" },
@@ -2452,6 +2453,7 @@ typing = [
{ name = "langchain-text-splitters", editable = "../text-splitters" },
{ name = "mypy", specifier = ">=1.10,<2.0" },
{ name = "mypy-protobuf", specifier = ">=3.0.0,<4.0.0" },
+ { name = "numpy", specifier = ">=1.26.4,<3" },
{ name = "types-chardet", specifier = ">=5.0.4.6,<6.0.0.0" },
{ name = "types-pytz", specifier = ">=2023.3.0.0,<2024.0.0.0" },
{ name = "types-pyyaml", specifier = ">=6.0.12.2,<7.0.0.0" },
diff --git a/libs/packages.yml b/libs/packages.yml
index 6f433f16c98..b39a0984260 100644
--- a/libs/packages.yml
+++ b/libs/packages.yml
@@ -462,8 +462,11 @@ packages:
- name: langchain-permit
path: .
repo: permitio/langchain-permit
+- name: langchain-pymupdf4llm
+ path: .
+ repo: lakinduboteju/langchain-pymupdf4llm
- name: langchain-writer
path: .
repo: writer/langchain-writer
downloads: 0
- downloads_updated_at: '2025-02-24T13:19:19.816059+00:00'
\ No newline at end of file
+ downloads_updated_at: '2025-02-24T13:19:19.816059+00:00'