docs: Add doc for Vectorize provider (#30436)

This pull request adds documentation and a tutorial for integrating the
[Vectorize](https://vectorize.io/) service with LangChain. The most
important changes include adding a new documentation page for Vectorize
and creating a Jupyter notebook that demonstrates how to use the
Vectorize retriever.

The source code for the langchain-vectorize package can be found
[here](https://github.com/vectorize-io/integrations-python/tree/main/langchain).

Previews:
*
https://langchain-git-fork-cbornet-vectorize-langchain.vercel.app/docs/integrations/providers/vectorize/
*
https://langchain-git-fork-cbornet-vectorize-langchain.vercel.app/docs/integrations/retrievers/vectorize/

Documentation updates:

*
[`docs/docs/integrations/providers/vectorize.mdx`](diffhunk://#diff-7e00d4ce4768f73b4d381a7c7b1f94d138f1b27ebd08e3666b942630a0285606R1-R40):
Added a new documentation page for Vectorize, including an overview of
its features, installation instructions, and a basic usage example.

Tutorial updates:

*
[`docs/docs/integrations/retrievers/vectorize.ipynb`](diffhunk://#diff-ba5bb9a1b4586db7740944b001bcfeadc88be357640ded0c82a329b11d8d6e29R1-R294):
Created a Jupyter notebook tutorial that shows how to set up the
Vectorize environment, create a RAG pipeline, and use the LangChain
Vectorize retriever. The notebook includes steps for account creation,
token generation, environment setup, and pipeline deployment.
This commit is contained in:
Christophe Bornet 2025-03-28 20:25:21 +01:00 committed by GitHub
parent 6f8735592b
commit 86beb64b50
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
2 changed files with 465 additions and 0 deletions

View File

@ -0,0 +1,40 @@
# Vectorize
> [Vectorize](https://vectorize.io/) helps you build AI apps faster and with less hassle.
> It automates data extraction, finds the best vectorization strategy using RAG evaluation,
> and lets you quickly deploy real-time RAG pipelines for your unstructured data.
> Your vector search indexes stay up-to-date, and it integrates with your existing vector database,
> so you maintain full control of your data.
> Vectorize handles the heavy lifting, freeing you to focus on building robust AI solutions without getting bogged down by data management.
# Installation and Setup
Install the following Python package:
```bash
pip install langchain-vectorize
```
Sign up for a free Vectorize account [here](https://platform.vectorize.io/)
Generate an access token in the [Access Token](https://docs.vectorize.io/rag-pipelines/retrieval-endpoint#access-tokens) section
Gather your organization ID. From the browser url, extract the UUID from the URL after /organization/
Set up the following variables:
```python
VECTORIZE_ORG_ID="your-organization-id"
VECTORIZE_API_TOKEN="your-api-token"
```
## Retriever
```python
from langchain_vectorize import VectorizeRetriever
retriever = VectorizeRetriever(
api_token=VECTORIZE_API_TOKEN,
organization=VECTORIZE_ORG_ID,
pipeline_id="...",
)
retriever.invoke("query")
```
Learn more in the [example notebook](/docs/integrations/retrievers/vectorize).

View File

@ -0,0 +1,425 @@
{
"cells": [
{
"metadata": {},
"cell_type": "raw",
"source": [
"---\n",
"sidebar_label: Vectorize\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "zvHrM3wa7IE1"
},
"source": [
"# VectorizeRetriever\n",
"\n",
"This notebook shows how to use the LangChain Vectorize retriever.\n",
"\n",
"> [Vectorize](https://vectorize.io/) helps you build AI apps faster and with less hassle.\n",
"> It automates data extraction, finds the best vectorization strategy using RAG evaluation,\n",
"> and lets you quickly deploy real-time RAG pipelines for your unstructured data.\n",
"> Your vector search indexes stay up-to-date, and it integrates with your existing vector database,\n",
"> so you maintain full control of your data.\n",
"> Vectorize handles the heavy lifting, freeing you to focus on building robust AI solutions without getting bogged down by data management.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "r-RswOO5o4K_"
},
"source": [
"## Setup\n",
"\n",
"In the following steps, we'll setup the Vectorize environment and create a RAG pipeline.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "FhvmvFKh4Rlh"
},
"source": [
"### Create a Vectorize Account & Get Your Access Token\n",
"\n",
"Sign up for a free Vectorize account [here](https://platform.vectorize.io/)\n",
"Generate an access token in the [Access Token](https://docs.vectorize.io/rag-pipelines/retrieval-endpoint#access-tokens) section\n",
"Gather your organization ID. From the browser url, extract the UUID from the URL after /organization/"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "L2SULMfWpWFX"
},
"source": [
"### Configure token and organization ID\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "BnF8KoDZpg2O"
},
"outputs": [],
"source": [
"import getpass\n",
"\n",
"VECTORIZE_ORG_ID = getpass.getpass(\"Enter Vectorize organization ID: \")\n",
"VECTORIZE_API_TOKEN = getpass.getpass(\"Enter Vectorize API Token: \")"
]
},
{
"metadata": {
"id": "JdZ5vlzjoDVr"
},
"cell_type": "markdown",
"source": [
"### Installation\n",
"\n",
"This retriever lives in the `langchain-vectorize` package:"
]
},
{
"metadata": {
"id": "IJFmtvDLn5R3"
},
"cell_type": "code",
"outputs": [],
"execution_count": null,
"source": "!pip install -qU langchain-vectorize"
},
{
"cell_type": "markdown",
"metadata": {
"id": "Oj10Moznpz67"
},
"source": "### Download a PDF file"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "eLbbTPytrgNw"
},
"outputs": [],
"source": "!wget \"https://raw.githubusercontent.com/vectorize-io/vectorize-clients/refs/tags/python-0.1.3/tests/python/tests/research.pdf\""
},
{
"cell_type": "markdown",
"metadata": {
"id": "7g54J6awtshs"
},
"source": "### Initialize the vectorize client"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "9Fr4yz5CrFWP"
},
"outputs": [],
"source": [
"import vectorize_client as v\n",
"\n",
"api = v.ApiClient(v.Configuration(access_token=VECTORIZE_API_TOKEN))"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "wPDoeqETxJrS"
},
"source": "### Create a File Upload Source Connector"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "9yEARIcFue5N"
},
"outputs": [],
"source": [
"import json\n",
"import os\n",
"\n",
"import urllib3\n",
"\n",
"connectors_api = v.ConnectorsApi(api)\n",
"response = connectors_api.create_source_connector(\n",
" VECTORIZE_ORG_ID, [{\"type\": \"FILE_UPLOAD\", \"name\": \"From API\"}]\n",
")\n",
"source_connector_id = response.connectors[0].id"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "yU3lS6dpxZnQ"
},
"source": "### Upload the PDF file"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "OIiMIZ8ZxUYF"
},
"outputs": [],
"source": [
"file_path = \"research.pdf\"\n",
"\n",
"http = urllib3.PoolManager()\n",
"uploads_api = v.UploadsApi(api)\n",
"metadata = {\"created-from-api\": True}\n",
"\n",
"upload_response = uploads_api.start_file_upload_to_connector(\n",
" VECTORIZE_ORG_ID,\n",
" source_connector_id,\n",
" v.StartFileUploadToConnectorRequest(\n",
" name=file_path.split(\"/\")[-1],\n",
" content_type=\"application/pdf\",\n",
" # add additional metadata that will be stored along with each chunk in the vector database\n",
" metadata=json.dumps(metadata),\n",
" ),\n",
")\n",
"\n",
"with open(file_path, \"rb\") as f:\n",
" response = http.request(\n",
" \"PUT\",\n",
" upload_response.upload_url,\n",
" body=f,\n",
" headers={\n",
" \"Content-Type\": \"application/pdf\",\n",
" \"Content-Length\": str(os.path.getsize(file_path)),\n",
" },\n",
" )\n",
"\n",
"if response.status != 200:\n",
" print(\"Upload failed: \", response.data)\n",
"else:\n",
" print(\"Upload successful\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "PdJJfOhfxiIo"
},
"source": "### Connect to the AI Platform and Vector Database"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "0ZSGhXJfxjBb"
},
"outputs": [],
"source": [
"ai_platforms = connectors_api.get_ai_platform_connectors(VECTORIZE_ORG_ID)\n",
"builtin_ai_platform = [\n",
" c.id for c in ai_platforms.ai_platform_connectors if c.type == \"VECTORIZE\"\n",
"][0]\n",
"\n",
"vector_databases = connectors_api.get_destination_connectors(VECTORIZE_ORG_ID)\n",
"builtin_vector_db = [\n",
" c.id for c in vector_databases.destination_connectors if c.type == \"VECTORIZE\"\n",
"][0]"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "JWoL-kqQxs5H"
},
"source": "### Configure and Deploy the Pipeline"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "hze9vJbQxvqA"
},
"outputs": [],
"source": [
"pipelines = v.PipelinesApi(api)\n",
"response = pipelines.create_pipeline(\n",
" VECTORIZE_ORG_ID,\n",
" v.PipelineConfigurationSchema(\n",
" source_connectors=[\n",
" v.SourceConnectorSchema(\n",
" id=source_connector_id, type=\"FILE_UPLOAD\", config={}\n",
" )\n",
" ],\n",
" destination_connector=v.DestinationConnectorSchema(\n",
" id=builtin_vector_db, type=\"VECTORIZE\", config={}\n",
" ),\n",
" ai_platform=v.AIPlatformSchema(\n",
" id=builtin_ai_platform, type=\"VECTORIZE\", config={}\n",
" ),\n",
" pipeline_name=\"My Pipeline From API\",\n",
" schedule=v.ScheduleSchema(type=\"manual\"),\n",
" ),\n",
")\n",
"pipeline_id = response.data.id"
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"### Configure tracing (optional)\n",
"\n",
"If you want to get automated tracing from individual queries, you can also set your [LangSmith](https://docs.smith.langchain.com/) API key by uncommenting below:"
]
},
{
"metadata": {},
"cell_type": "code",
"outputs": [],
"execution_count": null,
"source": [
"# os.environ[\"LANGSMITH_API_KEY\"] = getpass.getpass(\"Enter your LangSmith API key: \")\n",
"# os.environ[\"LANGSMITH_TRACING\"] = \"true\""
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "5ULion9wyj6T"
},
"source": "## Instantiation"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "9D-QfiW7yoe0"
},
"outputs": [],
"source": [
"from langchain_vectorize.retrievers import VectorizeRetriever\n",
"\n",
"retriever = VectorizeRetriever(\n",
" api_token=VECTORIZE_API_TOKEN,\n",
" organization=VECTORIZE_ORG_ID,\n",
" pipeline_id=pipeline_id,\n",
")"
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"## Usage\n",
"\n"
]
},
{
"metadata": {},
"cell_type": "code",
"outputs": [],
"execution_count": null,
"source": [
"query = \"Apple Shareholders equity\"\n",
"retriever.invoke(query, num_results=2)"
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"## Use within a chain\n",
"\n",
"Like other retrievers, VectorizeRetriever can be incorporated into LLM applications via [chains](/docs/how_to/sequence/).\n",
"\n",
"We will need a LLM or chat model:\n",
"\n",
"import ChatModelTabs from \"@theme/ChatModelTabs\";\n",
"\n",
"<ChatModelTabs customVarName=\"llm\" />"
]
},
{
"metadata": {},
"cell_type": "code",
"outputs": [],
"execution_count": null,
"source": [
"# | output: false\n",
"# | echo: false\n",
"\n",
"from langchain_openai import ChatOpenAI\n",
"\n",
"llm = ChatOpenAI(model=\"gpt-3.5-turbo-0125\", temperature=0)"
]
},
{
"metadata": {},
"cell_type": "code",
"outputs": [],
"execution_count": null,
"source": [
"from langchain_core.output_parsers import StrOutputParser\n",
"from langchain_core.prompts import ChatPromptTemplate\n",
"from langchain_core.runnables import RunnablePassthrough\n",
"\n",
"prompt = ChatPromptTemplate.from_template(\n",
" \"\"\"Answer the question based only on the context provided.\n",
"\n",
"Context: {context}\n",
"\n",
"Question: {question}\"\"\"\n",
")\n",
"\n",
"\n",
"def format_docs(docs):\n",
" return \"\\n\\n\".join(doc.page_content for doc in docs)\n",
"\n",
"\n",
"chain = (\n",
" {\"context\": retriever | format_docs, \"question\": RunnablePassthrough()}\n",
" | prompt\n",
" | llm\n",
" | StrOutputParser()\n",
")"
]
},
{
"metadata": {},
"cell_type": "code",
"outputs": [],
"execution_count": null,
"source": "chain.invoke(\"...\")"
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"## API reference\n",
"\n",
"For detailed documentation of all VectorizeRetriever features and configurations head to the [API reference](https://python.langchain.com/api_reference/vectorize/langchain_vectorize.retrievers.VectorizeRetriever.html)."
]
}
],
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 0
}