docs: Add doc for Vectorize provider (#30436)

This pull request adds documentation and a tutorial for integrating the [Vectorize](https://vectorize.io/) service with LangChain. The most important changes include adding a new documentation page for Vectorize and creating a Jupyter notebook that demonstrates how to use the Vectorize retriever. The source code for the langchain-vectorize package can be found [here](https://github.com/vectorize-io/integrations-python/tree/main/langchain). Previews: * https://langchain-git-fork-cbornet-vectorize-langchain.vercel.app/docs/integrations/providers/vectorize/ * https://langchain-git-fork-cbornet-vectorize-langchain.vercel.app/docs/integrations/retrievers/vectorize/ Documentation updates: * [`docs/docs/integrations/providers/vectorize.mdx`](diffhunk://#diff-7e00d4ce4768f73b4d381a7c7b1f94d138f1b27ebd08e3666b942630a0285606R1-R40): Added a new documentation page for Vectorize, including an overview of its features, installation instructions, and a basic usage example. Tutorial updates: * [`docs/docs/integrations/retrievers/vectorize.ipynb`](diffhunk://#diff-ba5bb9a1b4586db7740944b001bcfeadc88be357640ded0c82a329b11d8d6e29R1-R294): Created a Jupyter notebook tutorial that shows how to set up the Vectorize environment, create a RAG pipeline, and use the LangChain Vectorize retriever. The notebook includes steps for account creation, token generation, environment setup, and pipeline deployment.
2025-08-08 20:41:52 +00:00 · 2025-03-28 20:25:21 +01:00 · 2025-03-28 20:25:21 +01:00 · 86beb64b50
commit 86beb64b50
parent 6f8735592b
2 changed files with 465 additions and 0 deletions
--- a/docs/docs/integrations/providers/vectorize.mdx
+++ b/docs/docs/integrations/providers/vectorize.mdx
@ -0,0 +1,40 @@
+# Vectorize
+
+> [Vectorize](https://vectorize.io/) helps you build AI apps faster and with less hassle.
+> It automates data extraction, finds the best vectorization strategy using RAG evaluation,
+> and lets you quickly deploy real-time RAG pipelines for your unstructured data.
+> Your vector search indexes stay up-to-date, and it integrates with your existing vector database,
+> so you maintain full control of your data.
+> Vectorize handles the heavy lifting, freeing you to focus on building robust AI solutions without getting bogged down by data management.
+
+# Installation and Setup
+
+Install the following Python package:
+```bash
+pip install langchain-vectorize
+```
+
+Sign up for a free Vectorize account [here](https://platform.vectorize.io/)
+Generate an access token in the [Access Token](https://docs.vectorize.io/rag-pipelines/retrieval-endpoint#access-tokens) section
+Gather your organization ID. From the browser url, extract the UUID from the URL after /organization/
+
+Set up the following variables:
+```python
+VECTORIZE_ORG_ID="your-organization-id"
+VECTORIZE_API_TOKEN="your-api-token"
+```
+
+## Retriever
+
+```python
+from langchain_vectorize import VectorizeRetriever
+
+retriever = VectorizeRetriever(
+    api_token=VECTORIZE_API_TOKEN,
+    organization=VECTORIZE_ORG_ID,
+    pipeline_id="...",
+)
+retriever.invoke("query")
+```
+
+Learn more in the [example notebook](/docs/integrations/retrievers/vectorize).
--- a/docs/docs/integrations/retrievers/vectorize.ipynb
+++ b/docs/docs/integrations/retrievers/vectorize.ipynb
@ -0,0 +1,425 @@
+{
+ "cells": [
+  {
+   "metadata": {},
+   "cell_type": "raw",
+   "source": [
+    "---\n",
+    "sidebar_label: Vectorize\n",
+    "---"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "zvHrM3wa7IE1"
+   },
+   "source": [
+    "# VectorizeRetriever\n",
+    "\n",
+    "This notebook shows how to use the LangChain Vectorize retriever.\n",
+    "\n",
+    "> [Vectorize](https://vectorize.io/) helps you build AI apps faster and with less hassle.\n",
+    "> It automates data extraction, finds the best vectorization strategy using RAG evaluation,\n",
+    "> and lets you quickly deploy real-time RAG pipelines for your unstructured data.\n",
+    "> Your vector search indexes stay up-to-date, and it integrates with your existing vector database,\n",
+    "> so you maintain full control of your data.\n",
+    "> Vectorize handles the heavy lifting, freeing you to focus on building robust AI solutions without getting bogged down by data management.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "r-RswOO5o4K_"
+   },
+   "source": [
+    "## Setup\n",
+    "\n",
+    "In the following steps, we'll setup the Vectorize environment and create a RAG pipeline.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "FhvmvFKh4Rlh"
+   },
+   "source": [
+    "### Create a Vectorize Account & Get Your Access Token\n",
+    "\n",
+    "Sign up for a free Vectorize account [here](https://platform.vectorize.io/)\n",
+    "Generate an access token in the [Access Token](https://docs.vectorize.io/rag-pipelines/retrieval-endpoint#access-tokens) section\n",
+    "Gather your organization ID. From the browser url, extract the UUID from the URL after /organization/"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "L2SULMfWpWFX"
+   },
+   "source": [
+    "### Configure token and organization ID\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "BnF8KoDZpg2O"
+   },
+   "outputs": [],
+   "source": [
+    "import getpass\n",
+    "\n",
+    "VECTORIZE_ORG_ID = getpass.getpass(\"Enter Vectorize organization ID: \")\n",
+    "VECTORIZE_API_TOKEN = getpass.getpass(\"Enter Vectorize API Token: \")"
+   ]
+  },
+  {
+   "metadata": {
+    "id": "JdZ5vlzjoDVr"
+   },
+   "cell_type": "markdown",
+   "source": [
+    "### Installation\n",
+    "\n",
+    "This retriever lives in the `langchain-vectorize` package:"
+   ]
+  },
+  {
+   "metadata": {
+    "id": "IJFmtvDLn5R3"
+   },
+   "cell_type": "code",
+   "outputs": [],
+   "execution_count": null,
+   "source": "!pip install -qU langchain-vectorize"
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "Oj10Moznpz67"
+   },
+   "source": "### Download a PDF file"
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "eLbbTPytrgNw"
+   },
+   "outputs": [],
+   "source": "!wget \"https://raw.githubusercontent.com/vectorize-io/vectorize-clients/refs/tags/python-0.1.3/tests/python/tests/research.pdf\""
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "7g54J6awtshs"
+   },
+   "source": "### Initialize the vectorize client"
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "9Fr4yz5CrFWP"
+   },
+   "outputs": [],
+   "source": [
+    "import vectorize_client as v\n",
+    "\n",
+    "api = v.ApiClient(v.Configuration(access_token=VECTORIZE_API_TOKEN))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "wPDoeqETxJrS"
+   },
+   "source": "### Create a File Upload Source Connector"
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "9yEARIcFue5N"
+   },
+   "outputs": [],
+   "source": [
+    "import json\n",
+    "import os\n",
+    "\n",
+    "import urllib3\n",
+    "\n",
+    "connectors_api = v.ConnectorsApi(api)\n",
+    "response = connectors_api.create_source_connector(\n",
+    "    VECTORIZE_ORG_ID, [{\"type\": \"FILE_UPLOAD\", \"name\": \"From API\"}]\n",
+    ")\n",
+    "source_connector_id = response.connectors[0].id"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "yU3lS6dpxZnQ"
+   },
+   "source": "### Upload the PDF file"
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "OIiMIZ8ZxUYF"
+   },
+   "outputs": [],
+   "source": [
+    "file_path = \"research.pdf\"\n",
+    "\n",
+    "http = urllib3.PoolManager()\n",
+    "uploads_api = v.UploadsApi(api)\n",
+    "metadata = {\"created-from-api\": True}\n",
+    "\n",
+    "upload_response = uploads_api.start_file_upload_to_connector(\n",
+    "    VECTORIZE_ORG_ID,\n",
+    "    source_connector_id,\n",
+    "    v.StartFileUploadToConnectorRequest(\n",
+    "        name=file_path.split(\"/\")[-1],\n",
+    "        content_type=\"application/pdf\",\n",
+    "        # add additional metadata that will be stored along with each chunk in the vector database\n",
+    "        metadata=json.dumps(metadata),\n",
+    "    ),\n",
+    ")\n",
+    "\n",
+    "with open(file_path, \"rb\") as f:\n",
+    "    response = http.request(\n",
+    "        \"PUT\",\n",
+    "        upload_response.upload_url,\n",
+    "        body=f,\n",
+    "        headers={\n",
+    "            \"Content-Type\": \"application/pdf\",\n",
+    "            \"Content-Length\": str(os.path.getsize(file_path)),\n",
+    "        },\n",
+    "    )\n",
+    "\n",
+    "if response.status != 200:\n",
+    "    print(\"Upload failed: \", response.data)\n",
+    "else:\n",
+    "    print(\"Upload successful\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "PdJJfOhfxiIo"
+   },
+   "source": "### Connect to the AI Platform and Vector Database"
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "0ZSGhXJfxjBb"
+   },
+   "outputs": [],
+   "source": [
+    "ai_platforms = connectors_api.get_ai_platform_connectors(VECTORIZE_ORG_ID)\n",
+    "builtin_ai_platform = [\n",
+    "    c.id for c in ai_platforms.ai_platform_connectors if c.type == \"VECTORIZE\"\n",
+    "][0]\n",
+    "\n",
+    "vector_databases = connectors_api.get_destination_connectors(VECTORIZE_ORG_ID)\n",
+    "builtin_vector_db = [\n",
+    "    c.id for c in vector_databases.destination_connectors if c.type == \"VECTORIZE\"\n",
+    "][0]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "JWoL-kqQxs5H"
+   },
+   "source": "### Configure and Deploy the Pipeline"
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "hze9vJbQxvqA"
+   },
+   "outputs": [],
+   "source": [
+    "pipelines = v.PipelinesApi(api)\n",
+    "response = pipelines.create_pipeline(\n",
+    "    VECTORIZE_ORG_ID,\n",
+    "    v.PipelineConfigurationSchema(\n",
+    "        source_connectors=[\n",
+    "            v.SourceConnectorSchema(\n",
+    "                id=source_connector_id, type=\"FILE_UPLOAD\", config={}\n",
+    "            )\n",
+    "        ],\n",
+    "        destination_connector=v.DestinationConnectorSchema(\n",
+    "            id=builtin_vector_db, type=\"VECTORIZE\", config={}\n",
+    "        ),\n",
+    "        ai_platform=v.AIPlatformSchema(\n",
+    "            id=builtin_ai_platform, type=\"VECTORIZE\", config={}\n",
+    "        ),\n",
+    "        pipeline_name=\"My Pipeline From API\",\n",
+    "        schedule=v.ScheduleSchema(type=\"manual\"),\n",
+    "    ),\n",
+    ")\n",
+    "pipeline_id = response.data.id"
+   ]
+  },
+  {
+   "metadata": {},
+   "cell_type": "markdown",
+   "source": [
+    "### Configure tracing (optional)\n",
+    "\n",
+    "If you want to get automated tracing from individual queries, you can also set your [LangSmith](https://docs.smith.langchain.com/) API key by uncommenting below:"
+   ]
+  },
+  {
+   "metadata": {},
+   "cell_type": "code",
+   "outputs": [],
+   "execution_count": null,
+   "source": [
+    "# os.environ[\"LANGSMITH_API_KEY\"] = getpass.getpass(\"Enter your LangSmith API key: \")\n",
+    "# os.environ[\"LANGSMITH_TRACING\"] = \"true\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "5ULion9wyj6T"
+   },
+   "source": "## Instantiation"
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "9D-QfiW7yoe0"
+   },
+   "outputs": [],
+   "source": [
+    "from langchain_vectorize.retrievers import VectorizeRetriever\n",
+    "\n",
+    "retriever = VectorizeRetriever(\n",
+    "    api_token=VECTORIZE_API_TOKEN,\n",
+    "    organization=VECTORIZE_ORG_ID,\n",
+    "    pipeline_id=pipeline_id,\n",
+    ")"
+   ]
+  },
+  {
+   "metadata": {},
+   "cell_type": "markdown",
+   "source": [
+    "## Usage\n",
+    "\n"
+   ]
+  },
+  {
+   "metadata": {},
+   "cell_type": "code",
+   "outputs": [],
+   "execution_count": null,
+   "source": [
+    "query = \"Apple Shareholders equity\"\n",
+    "retriever.invoke(query, num_results=2)"
+   ]
+  },
+  {
+   "metadata": {},
+   "cell_type": "markdown",
+   "source": [
+    "## Use within a chain\n",
+    "\n",
+    "Like other retrievers, VectorizeRetriever can be incorporated into LLM applications via [chains](/docs/how_to/sequence/).\n",
+    "\n",
+    "We will need a LLM or chat model:\n",
+    "\n",
+    "import ChatModelTabs from \"@theme/ChatModelTabs\";\n",
+    "\n",
+    "<ChatModelTabs customVarName=\"llm\" />"
+   ]
+  },
+  {
+   "metadata": {},
+   "cell_type": "code",
+   "outputs": [],
+   "execution_count": null,
+   "source": [
+    "# | output: false\n",
+    "# | echo: false\n",
+    "\n",
+    "from langchain_openai import ChatOpenAI\n",
+    "\n",
+    "llm = ChatOpenAI(model=\"gpt-3.5-turbo-0125\", temperature=0)"
+   ]
+  },
+  {
+   "metadata": {},
+   "cell_type": "code",
+   "outputs": [],
+   "execution_count": null,
+   "source": [
+    "from langchain_core.output_parsers import StrOutputParser\n",
+    "from langchain_core.prompts import ChatPromptTemplate\n",
+    "from langchain_core.runnables import RunnablePassthrough\n",
+    "\n",
+    "prompt = ChatPromptTemplate.from_template(\n",
+    "    \"\"\"Answer the question based only on the context provided.\n",
+    "\n",
+    "Context: {context}\n",
+    "\n",
+    "Question: {question}\"\"\"\n",
+    ")\n",
+    "\n",
+    "\n",
+    "def format_docs(docs):\n",
+    "    return \"\\n\\n\".join(doc.page_content for doc in docs)\n",
+    "\n",
+    "\n",
+    "chain = (\n",
+    "    {\"context\": retriever | format_docs, \"question\": RunnablePassthrough()}\n",
+    "    | prompt\n",
+    "    | llm\n",
+    "    | StrOutputParser()\n",
+    ")"
+   ]
+  },
+  {
+   "metadata": {},
+   "cell_type": "code",
+   "outputs": [],
+   "execution_count": null,
+   "source": "chain.invoke(\"...\")"
+  },
+  {
+   "metadata": {},
+   "cell_type": "markdown",
+   "source": [
+    "## API reference\n",
+    "\n",
+    "For detailed documentation of all VectorizeRetriever features and configurations head to the [API reference](https://python.langchain.com/api_reference/vectorize/langchain_vectorize.retrievers.VectorizeRetriever.html)."
+   ]
+  }
+ ],
+ "metadata": {
+  "colab": {
+   "provenance": []
+  },
+  "kernelspec": {
+   "display_name": "Python 3",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}