From a4ef8304807f929731b0fa12bad661e4a06283ef Mon Sep 17 00:00:00 2001
From: Eugene Yurtsev
Date: Tue, 13 Aug 2024 20:21:36 -0400
Subject: [PATCH] docs: update integration docs for openai embeddings (#25249)

Related issue: https://github.com/langchain-ai/langchain/issues/24856

```json
{
  "provider": "openai",
  "js": true,
  "local": false,
  "serializable": false,
  "async_native": true
}
```

---------

Co-authored-by: Isaac Francisco <78627776+isahers1@users.noreply.github.com>
Co-authored-by: isaac hershenson
---
 .../integrations/text_embedding/openai.ipynb  | 349 ++++++++----------
 1 file changed, 162 insertions(+), 187 deletions(-)

diff --git a/docs/docs/integrations/text_embedding/openai.ipynb b/docs/docs/integrations/text_embedding/openai.ipynb
index 7d71663e533..84c6ac1d75d 100644
--- a/docs/docs/integrations/text_embedding/openai.ipynb
+++ b/docs/docs/integrations/text_embedding/openai.ipynb
@@ -2,42 +2,88 @@
  "cells": [
   {
    "cell_type": "raw",
-   "id": "ae8077b8",
-   "metadata": {
-    "vscode": {
-     "languageId": "raw"
-    }
-   },
+   "id": "afaf8039",
+   "metadata": {},
    "source": [
     "---\n",
+    "sidebar_label: OpenAI\n",
     "keywords: [openaiembeddings]\n",
     "---"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "278b6c63",
+   "id": "9a3d6f34",
    "metadata": {},
    "source": [
-    "# OpenAI\n",
+    "# OpenAIEmbeddings\n",
     "\n",
-    "Let's load the OpenAI Embedding class."
+    "This will help you get started with OpenAI embedding models using LangChain. For detailed documentation on `OpenAIEmbeddings` features and configuration options, please refer to the [API reference](https://api.python.langchain.com/en/latest/embeddings/langchain_openai.embeddings.base.OpenAIEmbeddings.html).\n",
+    "\n",
+    "\n",
+    "## Overview\n",
+    "### Integration details\n",
+    "\n",
+    "import { ItemTable } from \"@theme/FeatureTables\";\n",
+    "\n",
+    "<ItemTable category=\"text_embedding\" item=\"OpenAI\" />\n",
+    "\n",
+    "## Setup\n",
+    "\n",
+    "To access OpenAI embedding models you'll need to create an OpenAI account, get an API key, and install the `langchain-openai` integration package.\n",
+    "\n",
+    "### Credentials\n",
+    "\n",
+    "Head to [platform.openai.com](https://platform.openai.com) to sign up to OpenAI and generate an API key. Once you’ve done this, set the OPENAI_API_KEY environment variable:"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 6,
    "id": "36521c2a",
    "metadata": {},
    "outputs": [],
    "source": [
     "import getpass\n",
     "import os\n",
     "\n",
     "if not os.getenv(\"OPENAI_API_KEY\"):\n",
     "    os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"Enter your OpenAI API key: \")"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "40ff98ff-58e9-4716-8788-227a5c3f473d",
+   "id": "c84fb993",
    "metadata": {},
    "source": [
-    "## Setup\n",
+    "If you want to get automated tracing of your model calls you can also set your [LangSmith](https://docs.smith.langchain.com/) API key by uncommenting below:"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 7,
    "id": "39a4953b",
    "metadata": {},
    "outputs": [],
    "source": [
     "# os.environ[\"LANGCHAIN_TRACING_V2\"] = \"true\"\n",
     "# os.environ[\"LANGCHAIN_API_KEY\"] = getpass.getpass(\"Enter your LangSmith API key: \")"
    ]
   },
   {
    "cell_type": "markdown",
    "id": "d9664366",
    "metadata": {},
    "source": [
     "### Installation\n",
     "\n",
-    "First we install langchain-openai and set the required env vars"
+    "The LangChain OpenAI integration lives in the `langchain-openai` package:"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "c66c4613-6c67-40ca-b3b1-c026750d1742",
+   "id": "64853226",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -45,171 +91,55 @@
    ]
   },
   {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "62e3710e-55a0-44fb-ba51-2f1d520dfc38",
+   "cell_type": "markdown",
+   "id": "45dd1724",
    "metadata": {},
-   "outputs": [],
    "source": [
-    "import getpass\n",
-    "import os\n",
+    "## Instantiation\n",
     "\n",
-    "os.environ[\"OPENAI_API_KEY\"] = getpass.getpass()"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 1,
-   "id": "0be1af71",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "from langchain_openai import OpenAIEmbeddings"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 5,
-   "id": "2c66e5da",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "embeddings = OpenAIEmbeddings(model=\"text-embedding-3-large\")"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 6,
-   "id": "01370375",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "text = \"This is a test document.\""
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "f012c222-3fa9-470a-935c-758b2048d9af",
-   "metadata": {},
-   "source": [
-    "## Usage\n",
-    "### Embed query"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 7,
-   "id": "bfb6142c",
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "Warning: model not found. Using cl100k_base encoding.\n"
-     ]
-    }
-   ],
-   "source": [
-    "query_result = embeddings.embed_query(text)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 8,
-   "id": "91bc875d-829b-4c3d-8e6f-fc2dda30a3bd",
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "[-0.014380056377383358,\n",
-       " -0.027191711627651764,\n",
-       " -0.020042716111860304,\n",
-       " 0.057301379620345545,\n",
-       " -0.022267658631828974]"
-      ]
-     },
-     "execution_count": 8,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "query_result[:5]"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "6b733391-1e23-438b-a6bc-0d77eed9426e",
-   "metadata": {},
-   "source": [
-    "## Embed documents"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 9,
-   "id": "0356c3b7",
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "Warning: model not found. Using cl100k_base encoding.\n"
-     ]
-    }
-   ],
-   "source": [
-    "doc_result = embeddings.embed_documents([text])"
+    "Now we can instantiate our model object and generate embeddings:"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 10,
-   "id": "a4b0d49e-0c73-44b6-aed5-5b426564e085",
+   "id": "9ea7a09b",
    "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "[-0.014380056377383358,\n",
-       " -0.027191711627651764,\n",
-       " -0.020042716111860304,\n",
-       " 0.057301379620345545,\n",
-       " -0.022267658631828974]"
-      ]
-     },
-     "execution_count": 10,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
+   "outputs": [],
    "source": [
-    "doc_result[0][:5]"
+    "from langchain_openai import OpenAIEmbeddings\n",
+    "\n",
+    "embeddings = OpenAIEmbeddings(\n",
+    "    model=\"text-embedding-3-large\",\n",
+    "    # With the `text-embedding-3` class\n",
+    "    # of models, you can specify the size\n",
+    "    # of the embeddings you want returned.\n",
+    "    # dimensions=1024\n",
+    ")"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "e7dc464a-6fa2-4cff-ab2e-49a0566d819b",
+   "id": "77d271b6",
    "metadata": {},
    "source": [
-    "## Specify dimensions\n",
+    "## Indexing and Retrieval\n",
     "\n",
-    "With the `text-embedding-3` class of models, you can specify the size of the embeddings you want returned. For example by default `text-embedding-3-large` returned embeddings of dimension 3072:"
+    "Embedding models are often used in retrieval-augmented generation (RAG) flows, both as part of indexing data as well as later retrieving it. For more detailed instructions, please see our RAG tutorials under the [working with external knowledge tutorials](/docs/tutorials/#working-with-external-knowledge).\n",
+    "\n",
+    "Below, see how to index and retrieve data using the `embeddings` object we initialized above. In this example, we will index and retrieve a sample document in the `InMemoryVectorStore`."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 11,
-   "id": "f7be1e7b-54c6-4893-b8ad-b872e6705735",
+   "id": "d817716b",
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
-       "3072"
+       "'LangChain is the framework for building context-aware reasoning applications'"
       ]
      },
      "execution_count": 11,
@@ -218,61 +148,111 @@
     }
    ],
    "source": [
-    "len(doc_result[0])"
+    "# Create a vector store with a sample text\n",
+    "from langchain_core.vectorstores import InMemoryVectorStore\n",
+    "\n",
+    "text = \"LangChain is the framework for building context-aware reasoning applications\"\n",
+    "\n",
+    "vectorstore = InMemoryVectorStore.from_texts(\n",
+    "    [text],\n",
+    "    embedding=embeddings,\n",
+    ")\n",
+    "\n",
+    "# Use the vectorstore as a retriever\n",
+    "retriever = vectorstore.as_retriever()\n",
+    "\n",
+    "# Retrieve the most similar text\n",
+    "retrieved_documents = retriever.invoke(\"What is LangChain?\")\n",
+    "\n",
+    "# show the retrieved document's content\n",
+    "retrieved_documents[0].page_content"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "33287142-0835-4958-962f-385ae4447431",
+   "id": "e02b9855",
    "metadata": {},
    "source": [
-    "But by passing in `dimensions=1024` we can reduce the size of our embeddings to 1024:"
+    "## Direct Usage\n",
+    "\n",
+    "Under the hood, the vectorstore and retriever implementations are calling `embeddings.embed_documents(...)` and `embeddings.embed_query(...)` to create embeddings for the text(s) used in `from_texts` and retrieval `invoke` operations, respectively.\n",
+    "\n",
+    "You can directly call these methods to get embeddings for your own use cases.\n",
+    "\n",
+    "### Embed single texts\n",
+    "\n",
+    "You can embed single texts or documents with `embed_query`:"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 15,
-   "id": "854ee772-2de9-4a83-84e0-908033d98e4e",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "embeddings_1024 = OpenAIEmbeddings(model=\"text-embedding-3-large\", dimensions=1024)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 16,
-   "id": "3b464396-8d94-478b-8329-849b56e1ae23",
+   "execution_count": 12,
+   "id": "0d2befcd",
    "metadata": {},
    "outputs": [
     {
-     "name": "stderr",
+     "name": "stdout",
      "output_type": "stream",
      "text": [
-      "Warning: model not found. Using cl100k_base encoding.\n"
+      "[-0.019276829436421394, 0.0037708976306021214, -0.03294256329536438, 0.0037671267054975033, 0.008175\n"
      ]
-    },
-    {
-     "data": {
-      "text/plain": [
-       "1024"
-      ]
-     },
-     "execution_count": 16,
-     "metadata": {},
-     "output_type": "execute_result"
     }
    ],
    "source": [
-    "len(embeddings_1024.embed_documents([text])[0])"
+    "single_vector = embeddings.embed_query(text)\n",
+    "print(str(single_vector)[:100]) # Show the first 100 characters of the vector"
    ]
   },
   {
    "cell_type": "markdown",
    "id": "1b5a7d03",
    "metadata": {},
    "source": [
     "### Embed multiple texts\n",
     "\n",
     "You can embed multiple texts with `embed_documents`:"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 13,
    "id": "2f4d6e97",
    "metadata": {},
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
       "[-0.019260549917817116, 0.0037612367887049913, -0.03291035071015358, 0.003757466096431017, 0.0082049\n",
       "[-0.010181212797760963, 0.023419594392180443, -0.04215526953339577, -0.001532090245746076, -0.023573\n"
      ]
     }
    ],
    "source": [
     "text2 = (\n",
     "    \"LangGraph is a library for building stateful, multi-actor applications with LLMs\"\n",
     ")\n",
     "two_vectors = embeddings.embed_documents([text, text2])\n",
     "for vector in two_vectors:\n",
     "    print(str(vector)[:100]) # Show the first 100 characters of the vector"
    ]
   },
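+  {
+   "cell_type": "markdown",
+   "id": "a9b4c7e2",
+   "metadata": {},
+   "source": [
+    "### Async usage\n",
+    "\n",
+    "The commit metadata above marks this integration as `async_native`, so the async `Embeddings` methods should be available as well. As a minimal sketch (assuming the standard `aembed_query`/`aembed_documents` counterparts from the base interface), you can await them directly in a notebook:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c3d8f5a1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Async counterparts of the sync methods used above;\n",
+    "# notebooks support top-level await.\n",
+    "single_vector = await embeddings.aembed_query(text)\n",
+    "two_vectors = await embeddings.aembed_documents([text, text2])"
+   ]
+  },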
  {
   "cell_type": "markdown",
   "id": "98785c12",
   "metadata": {},
   "source": [
    "## API Reference\n",
    "\n",
    "For detailed documentation on `OpenAIEmbeddings` features and configuration options, please refer to the [API reference](https://api.python.langchain.com/en/latest/embeddings/langchain_openai.embeddings.base.OpenAIEmbeddings.html).\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
-  "display_name": "poetry-venv",
+  "display_name": "Python 3 (ipykernel)",
   "language": "python",
-  "name": "poetry-venv"
+  "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-  "version": "3.9.1"
-  },
-  "vscode": {
-   "interpreter": {
-    "hash": "e971737741ff4ec9aff7dc6155a1060a59a8a6d52c757dbbe66bf8ee389494b1"
-   }
+  "version": "3.11.4"
  }
 },
 "nbformat": 4,