Mirror of https://github.com/hwchase17/langchain.git, synced 2025-06-24 15:43:54 +00:00
ai21[minor]: AI21 Labs Semantic Text Splitter support (#19510)
Description: Added support for the AI21 Labs Segmentation model as a text splitter.
Dependencies: ai21, langchain-text-splitters
Twitter handle: https://github.com/AI21Labs

---------

Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
This commit is contained in:
parent b2a11ce686
commit 55db737302
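Before the file-by-file diff, here is a minimal sketch of what this change enables. The class and methods come from the code below; the example string and the assumption that `AI21_API_KEY` is already set in the environment are illustrative only.

```python
from langchain_ai21 import AI21SemanticTextSplitter

# Splits on topic boundaries via the AI21 Segmentation API (requires AI21_API_KEY).
splitter = AI21SemanticTextSplitter()  # pass chunk_size=1000 to also merge small segments
chunks = splitter.split_text("A long report or contract to split by topic...")
for chunk in chunks:
    print(chunk)
```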
@@ -0,0 +1,466 @@ (new file: the AI21SemanticTextSplitter example notebook)

# AI21SemanticTextSplitter

This example goes over how to use AI21SemanticTextSplitter in LangChain.

## Installation

```python
pip install langchain-ai21
```

## Environment Setup

We'll need to get an AI21 API key and set the AI21_API_KEY environment variable:

```python
import os
from getpass import getpass

os.environ["AI21_API_KEY"] = getpass()
```

## Example Usages

### Splitting text by semantic meaning

This example shows how to use AI21SemanticTextSplitter to split a text into chunks based on semantic meaning.

```python
from langchain_ai21 import AI21SemanticTextSplitter

TEXT = (
    "We’ve all experienced reading long, tedious, and boring pieces of text - financial reports, "
    "legal documents, or terms and conditions (though, who actually reads those terms and conditions to be honest?).\n"
    "Imagine a company that employs hundreds of thousands of employees. In today's information "
    "overload age, nearly 30% of the workday is spent dealing with documents. There's no surprise "
    "here, given that some of these documents are long and convoluted on purpose (did you know that "
    "reading through all your privacy policies would take almost a quarter of a year?). Aside from "
    "inefficiency, workers may simply refrain from reading some documents (for example, Only 16% of "
    "Employees Read Their Employment Contracts Entirely Before Signing!).\nThis is where AI-driven summarization "
    "tools can be helpful: instead of reading entire documents, which is tedious and time-consuming, "
    "users can (ideally) quickly extract relevant information from a text. With large language models, "
    "the development of those tools is easier than ever, and you can offer your users a summary that is "
    "specifically tailored to their preferences.\nLarge language models naturally follow patterns in input "
    "(prompt), and provide coherent completion that follows the same patterns. For that, we want to feed "
    'them with several examples in the input ("few-shot prompt"), so they can follow through. '
    "The process of creating the correct prompt for your problem is called prompt engineering, "
    "and you can read more about it here."
)

semantic_text_splitter = AI21SemanticTextSplitter()
chunks = semantic_text_splitter.split_text(TEXT)

print(f"The text has been split into {len(chunks)} chunks.")
for chunk in chunks:
    print(chunk)
    print("====")
```

### Splitting text by semantic meaning with merge

This example shows how to use AI21SemanticTextSplitter to split a text into chunks based on semantic meaning, then merge the chunks according to `chunk_size`.

```python
from langchain_ai21 import AI21SemanticTextSplitter

TEXT = (
    "We’ve all experienced reading long, tedious, and boring pieces of text - financial reports, "
    "legal documents, or terms and conditions (though, who actually reads those terms and conditions to be honest?).\n"
    "Imagine a company that employs hundreds of thousands of employees. In today's information "
    "overload age, nearly 30% of the workday is spent dealing with documents. There's no surprise "
    "here, given that some of these documents are long and convoluted on purpose (did you know that "
    "reading through all your privacy policies would take almost a quarter of a year?). Aside from "
    "inefficiency, workers may simply refrain from reading some documents (for example, Only 16% of "
    "Employees Read Their Employment Contracts Entirely Before Signing!).\nThis is where AI-driven summarization "
    "tools can be helpful: instead of reading entire documents, which is tedious and time-consuming, "
    "users can (ideally) quickly extract relevant information from a text. With large language models, "
    "the development of those tools is easier than ever, and you can offer your users a summary that is "
    "specifically tailored to their preferences.\nLarge language models naturally follow patterns in input "
    "(prompt), and provide coherent completion that follows the same patterns. For that, we want to feed "
    'them with several examples in the input ("few-shot prompt"), so they can follow through. '
    "The process of creating the correct prompt for your problem is called prompt engineering, "
    "and you can read more about it here."
)

semantic_text_splitter_chunks = AI21SemanticTextSplitter(chunk_size=1000)
chunks = semantic_text_splitter_chunks.split_text(TEXT)

print(f"The text has been split into {len(chunks)} chunks.")
for chunk in chunks:
    print(chunk)
    print("====")
```

### Splitting text to documents

This example shows how to use AI21SemanticTextSplitter to split a text into Documents based on semantic meaning. The metadata will contain a type for each document.

```python
from langchain_ai21 import AI21SemanticTextSplitter

TEXT = (
    "We’ve all experienced reading long, tedious, and boring pieces of text - financial reports, "
    "legal documents, or terms and conditions (though, who actually reads those terms and conditions to be honest?).\n"
    "Imagine a company that employs hundreds of thousands of employees. In today's information "
    "overload age, nearly 30% of the workday is spent dealing with documents. There's no surprise "
    "here, given that some of these documents are long and convoluted on purpose (did you know that "
    "reading through all your privacy policies would take almost a quarter of a year?). Aside from "
    "inefficiency, workers may simply refrain from reading some documents (for example, Only 16% of "
    "Employees Read Their Employment Contracts Entirely Before Signing!).\nThis is where AI-driven summarization "
    "tools can be helpful: instead of reading entire documents, which is tedious and time-consuming, "
    "users can (ideally) quickly extract relevant information from a text. With large language models, "
    "the development of those tools is easier than ever, and you can offer your users a summary that is "
    "specifically tailored to their preferences.\nLarge language models naturally follow patterns in input "
    "(prompt), and provide coherent completion that follows the same patterns. For that, we want to feed "
    'them with several examples in the input ("few-shot prompt"), so they can follow through. '
    "The process of creating the correct prompt for your problem is called prompt engineering, "
    "and you can read more about it here."
)

semantic_text_splitter = AI21SemanticTextSplitter()
documents = semantic_text_splitter.split_text_to_documents(TEXT)

print(f"The text has been split into {len(documents)} Documents.")
for doc in documents:
    print(f"type: {doc.metadata['source_type']}")
    print(f"text: {doc.page_content}")
    print("====")
```

### Creating Documents with Metadata

This example shows how to use AI21SemanticTextSplitter to create Documents from texts and add custom metadata to each Document.

```python
from langchain_ai21 import AI21SemanticTextSplitter

TEXT = (
    "We’ve all experienced reading long, tedious, and boring pieces of text - financial reports, "
    "legal documents, or terms and conditions (though, who actually reads those terms and conditions to be honest?).\n"
    "Imagine a company that employs hundreds of thousands of employees. In today's information "
    "overload age, nearly 30% of the workday is spent dealing with documents. There's no surprise "
    "here, given that some of these documents are long and convoluted on purpose (did you know that "
    "reading through all your privacy policies would take almost a quarter of a year?). Aside from "
    "inefficiency, workers may simply refrain from reading some documents (for example, Only 16% of "
    "Employees Read Their Employment Contracts Entirely Before Signing!).\nThis is where AI-driven summarization "
    "tools can be helpful: instead of reading entire documents, which is tedious and time-consuming, "
    "users can (ideally) quickly extract relevant information from a text. With large language models, "
    "the development of those tools is easier than ever, and you can offer your users a summary that is "
    "specifically tailored to their preferences.\nLarge language models naturally follow patterns in input "
    "(prompt), and provide coherent completion that follows the same patterns. For that, we want to feed "
    'them with several examples in the input ("few-shot prompt"), so they can follow through. '
    "The process of creating the correct prompt for your problem is called prompt engineering, "
    "and you can read more about it here."
)

semantic_text_splitter = AI21SemanticTextSplitter()
texts = [TEXT]
documents = semantic_text_splitter.create_documents(
    texts=texts, metadatas=[{"pikachu": "pika pika"}]
)

print(f"The text has been split into {len(documents)} Documents.")
for doc in documents:
    print(f"metadata: {doc.metadata}")
    print(f"text: {doc.page_content}")
    print("====")
```

### Splitting text to documents with start index

This example shows how to use AI21SemanticTextSplitter to split a text into Documents based on semantic meaning. The metadata will contain a start index for each document.
**Note** that the start index provides an indication of the order of the chunks rather than the actual start index for each chunk.

```python
from langchain_ai21 import AI21SemanticTextSplitter

TEXT = (
    "We’ve all experienced reading long, tedious, and boring pieces of text - financial reports, "
    "legal documents, or terms and conditions (though, who actually reads those terms and conditions to be honest?).\n"
    "Imagine a company that employs hundreds of thousands of employees. In today's information "
    "overload age, nearly 30% of the workday is spent dealing with documents. There's no surprise "
    "here, given that some of these documents are long and convoluted on purpose (did you know that "
    "reading through all your privacy policies would take almost a quarter of a year?). Aside from "
    "inefficiency, workers may simply refrain from reading some documents (for example, Only 16% of "
    "Employees Read Their Employment Contracts Entirely Before Signing!).\nThis is where AI-driven summarization "
    "tools can be helpful: instead of reading entire documents, which is tedious and time-consuming, "
    "users can (ideally) quickly extract relevant information from a text. With large language models, "
    "the development of those tools is easier than ever, and you can offer your users a summary that is "
    "specifically tailored to their preferences.\nLarge language models naturally follow patterns in input "
    "(prompt), and provide coherent completion that follows the same patterns. For that, we want to feed "
    'them with several examples in the input ("few-shot prompt"), so they can follow through. '
    "The process of creating the correct prompt for your problem is called prompt engineering, "
    "and you can read more about it here."
)

semantic_text_splitter = AI21SemanticTextSplitter(add_start_index=True)
documents = semantic_text_splitter.create_documents(texts=[TEXT])
print(f"The text has been split into {len(documents)} Documents.")
for doc in documents:
    print(f"start_index: {doc.metadata['start_index']}")
    print(f"text: {doc.page_content}")
    print("====")
```

### Splitting documents

This example shows how to use AI21SemanticTextSplitter to split a list of Documents into chunks based on semantic meaning.

```python
from langchain_ai21 import AI21SemanticTextSplitter
from langchain_core.documents import Document

TEXT = (
    "We’ve all experienced reading long, tedious, and boring pieces of text - financial reports, "
    "legal documents, or terms and conditions (though, who actually reads those terms and conditions to be honest?).\n"
    "Imagine a company that employs hundreds of thousands of employees. In today's information "
    "overload age, nearly 30% of the workday is spent dealing with documents. There's no surprise "
    "here, given that some of these documents are long and convoluted on purpose (did you know that "
    "reading through all your privacy policies would take almost a quarter of a year?). Aside from "
    "inefficiency, workers may simply refrain from reading some documents (for example, Only 16% of "
    "Employees Read Their Employment Contracts Entirely Before Signing!).\nThis is where AI-driven summarization "
    "tools can be helpful: instead of reading entire documents, which is tedious and time-consuming, "
    "users can (ideally) quickly extract relevant information from a text. With large language models, "
    "the development of those tools is easier than ever, and you can offer your users a summary that is "
    "specifically tailored to their preferences.\nLarge language models naturally follow patterns in input "
    "(prompt), and provide coherent completion that follows the same patterns. For that, we want to feed "
    'them with several examples in the input ("few-shot prompt"), so they can follow through. '
    "The process of creating the correct prompt for your problem is called prompt engineering, "
    "and you can read more about it here."
)

semantic_text_splitter = AI21SemanticTextSplitter()
document = Document(page_content=TEXT, metadata={"hello": "goodbye"})
documents = semantic_text_splitter.split_documents([document])
print(f"The document list has been split into {len(documents)} Documents.")
for doc in documents:
    print(f"text: {doc.page_content}")
    print(f"metadata: {doc.metadata}")
    print("====")
```
@@ -44,6 +44,7 @@ LangChain offers many different types of text splitters. These all live in the `
 | Token | Tokens | | Splits text on tokens. There exist a few different ways to measure tokens. |
 | Character | A user defined character | | Splits text based on a user defined character. One of the simpler methods. |
 | [Experimental] Semantic Chunker | Sentences | | First splits on sentences. Then combines ones next to each other if they are semantically similar enough. Taken from [Greg Kamradt](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/5_Levels_Of_Text_Splitting.ipynb) |
+| [AI21 Semantic Text Splitter](/docs/integrations/document_transformers/ai21_semantic_text_splitter) | Semantics | ✅ | Identifies distinct topics that form coherent pieces of text and splits along those. |

 ## Evaluate text splitters
@@ -102,4 +102,20 @@ chain = tsm | StrOutputParser()
 response = chain.invoke(
     {"context": "Your context", "question": "Your question"},
 )
 ```
+
+## Text Splitters
+
+### Semantic Text Splitter
+
+You can use AI21's semantic text splitter to split a text into segments.
+Instead of merely using punctuation and newlines to divide the text, it identifies distinct topics that will work well together and form a coherent piece of text.
+
+For a list of examples, see [this page](https://github.com/langchain-ai/langchain/blob/master/docs/docs/modules/data_connection/document_transformers/semantic_text_splitter.ipynb).
+
+```python
+from langchain_ai21 import AI21SemanticTextSplitter
+
+splitter = AI21SemanticTextSplitter()
+response = splitter.split_text("Your text")
+```
@@ -2,10 +2,12 @@ from langchain_ai21.chat_models import ChatAI21
 from langchain_ai21.contextual_answers import AI21ContextualAnswers
 from langchain_ai21.embeddings import AI21Embeddings
 from langchain_ai21.llms import AI21LLM
+from langchain_ai21.semantic_text_splitter import AI21SemanticTextSplitter

 __all__ = [
     "AI21LLM",
     "ChatAI21",
     "AI21Embeddings",
     "AI21ContextualAnswers",
+    "AI21SemanticTextSplitter",
 ]
libs/partners/ai21/langchain_ai21/semantic_text_splitter.py (new file, 158 lines)
@@ -0,0 +1,158 @@

```python
import copy
import logging
import re
from typing import (
    Any,
    Iterable,
    List,
    Optional,
)

from ai21.models import DocumentType
from langchain_core.documents import Document
from langchain_core.pydantic_v1 import SecretStr
from langchain_text_splitters import TextSplitter

from langchain_ai21.ai21_base import AI21Base

logger = logging.getLogger(__name__)


class AI21SemanticTextSplitter(TextSplitter):
    """Splitting text into coherent and readable units,
    based on distinct topics and lines
    """

    def __init__(
        self,
        chunk_size: int = 0,
        chunk_overlap: int = 0,
        client: Optional[Any] = None,
        api_key: Optional[SecretStr] = None,
        api_host: Optional[str] = None,
        timeout_sec: Optional[float] = None,
        num_retries: Optional[int] = None,
        **kwargs: Any,
    ) -> None:
        """Create a new TextSplitter."""
        super().__init__(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            **kwargs,
        )

        self._segmentation = AI21Base(
            client=client,
            api_key=api_key,
            api_host=api_host,
            timeout_sec=timeout_sec,
            num_retries=num_retries,
        ).client.segmentation

    def split_text(self, source: str) -> List[str]:
        """Split text into multiple components.

        Args:
            source: Specifies the text input for text segmentation
        """
        response = self._segmentation.create(
            source=source, source_type=DocumentType.TEXT
        )

        segments = [segment.segment_text for segment in response.segments]

        if self._chunk_size > 0:
            return self._merge_splits_no_seperator(segments)

        return segments

    def split_text_to_documents(self, source: str) -> List[Document]:
        """Split text into multiple documents.

        Args:
            source: Specifies the text input for text segmentation
        """
        response = self._segmentation.create(
            source=source, source_type=DocumentType.TEXT
        )

        return [
            Document(
                page_content=segment.segment_text,
                metadata={"source_type": segment.segment_type},
            )
            for segment in response.segments
        ]

    def create_documents(
        self, texts: List[str], metadatas: Optional[List[dict]] = None
    ) -> List[Document]:
        """Create documents from a list of texts."""
        _metadatas = metadatas or [{}] * len(texts)
        documents = []

        for i, text in enumerate(texts):
            normalized_text = self._normalized_text(text)
            index = 0
            previous_chunk_len = 0

            for chunk in self.split_text_to_documents(text):
                # merge metadata from user (if exists) and from segmentation api
                metadata = copy.deepcopy(_metadatas[i])
                metadata.update(chunk.metadata)

                if self._add_start_index:
                    # find the start index of the chunk
                    offset = index + previous_chunk_len - self._chunk_overlap
                    normalized_chunk = self._normalized_text(chunk.page_content)
                    index = normalized_text.find(normalized_chunk, max(0, offset))
                    metadata["start_index"] = index
                    previous_chunk_len = len(normalized_chunk)

                documents.append(
                    Document(
                        page_content=chunk.page_content,
                        metadata=metadata,
                    )
                )

        return documents

    def _normalized_text(self, string: str) -> str:
        """Use a regular expression to strip all whitespace (including '\n') from the string."""
        return re.sub(r"\s+", "", string)

    def _merge_splits(self, splits: Iterable[str], separator: str) -> List[str]:
        """This method overrides the default implementation of TextSplitter"""
        return self._merge_splits_no_seperator(splits)

    def _merge_splits_no_seperator(self, splits: Iterable[str]) -> List[str]:
        """Merge splits into chunks.

        If the segment size is bigger than chunk_size,
        it will be left as is (won't be cut to match to chunk_size).
        If the segment size is smaller than chunk_size,
        it will be merged with the next segment until the chunk_size is reached.
        """
        chunks = []
        current_chunk = ""

        for split in splits:
            split_len = self._length_function(split)

            if split_len > self._chunk_size:
                logger.warning(
                    f"Split of length {split_len} "
                    f"exceeds chunk size {self._chunk_size}."
                )

            if self._length_function(current_chunk) + split_len > self._chunk_size:
                if current_chunk != "":
                    chunks.append(current_chunk)
                    current_chunk = ""

            current_chunk += split

        if current_chunk != "":
            chunks.append(current_chunk)

        return chunks
```
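To make the merging rule documented above concrete, here is a small self-contained sketch (no AI21 call involved) of the same greedy behavior: segments accumulate until adding the next one would exceed `chunk_size`, and a segment that is already longer than `chunk_size` is kept intact rather than cut. The segment strings, the chunk size, and the use of `len` as the length function are illustrative assumptions, not part of the commit.

```python
from typing import Iterable, List


def merge_segments(segments: Iterable[str], chunk_size: int) -> List[str]:
    """Greedy merge mirroring the rule in _merge_splits_no_seperator (length = len)."""
    chunks: List[str] = []
    current = ""
    for segment in segments:
        # Close the current chunk before it would overflow chunk_size.
        if current and len(current) + len(segment) > chunk_size:
            chunks.append(current)
            current = ""
        # An oversized segment still becomes its own chunk; it is never cut.
        current += segment
    if current:
        chunks.append(current)
    return chunks


# Hypothetical segments standing in for Segmentation API output:
print(merge_segments(["aaaa", "bb", "cccccccccc", "d"], chunk_size=6))
# -> ['aaaabb', 'cccccccccc', 'd']
```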
libs/partners/ai21/poetry.lock (generated, 46 lines changed)

@@ -300,7 +300,7 @@ files = [
 [[package]]
 name = "langchain-core"
-version = "0.1.30"
+version = "0.1.33"
 description = "Building applications with LLMs through composability"
 optional = false
 python-versions = ">=3.8.1,<4.0"

@@ -324,15 +324,32 @@ extended-testing = ["jinja2 (>=3,<4)"]
 type = "directory"
 url = "../../core"

+[[package]]
+name = "langchain-text-splitters"
+version = "0.0.1"
+description = "LangChain text splitting utilities"
+optional = false
+python-versions = ">=3.8.1,<4.0"
+files = [
+    {file = "langchain_text_splitters-0.0.1-py3-none-any.whl", hash = "sha256:f5b802f873f5ff6a8b9259ff34d53ed989666ef4e1582e6d1adb3b5520e3839a"},
+    {file = "langchain_text_splitters-0.0.1.tar.gz", hash = "sha256:ac459fa98799f5117ad5425a9330b21961321e30bc19a2a2f9f761ddadd62aa1"},
+]
+
+[package.dependencies]
+langchain-core = ">=0.1.28,<0.2.0"
+
+[package.extras]
+extended-testing = ["lxml (>=5.1.0,<6.0.0)"]
+
 [[package]]
 name = "langsmith"
-version = "0.1.10"
+version = "0.1.23"
 description = "Client library to connect to the LangSmith LLM Tracing and Evaluation Platform."
 optional = false
 python-versions = ">=3.8.1,<4.0"
 files = [
-    {file = "langsmith-0.1.10-py3-none-any.whl", hash = "sha256:2997a80aea60ed235d83502a7ccdc1f62ffb4dd6b3b7dd4218e8fa4de68a6725"},
-    {file = "langsmith-0.1.10.tar.gz", hash = "sha256:13e7e8b52e694aa4003370cefbb9e79cce3540c65dbf1517902bf7aa4dbbb653"},
+    {file = "langsmith-0.1.23-py3-none-any.whl", hash = "sha256:69984268b9867cb31b875965b3f86b6f56ba17dd5454d487d3a1a999bdaeea69"},
+    {file = "langsmith-0.1.23.tar.gz", hash = "sha256:327c66ec0de8c1bc57bfa47bbc70a29ef749e97c3e5571b9baf754d1e0644220"},
 ]

 [package.dependencies]

@@ -342,13 +359,13 @@ requests = ">=2,<3"
 [[package]]
 name = "marshmallow"
-version = "3.21.0"
+version = "3.21.1"
 description = "A lightweight library for converting complex datatypes to and from native Python datatypes."
 optional = false
 python-versions = ">=3.8"
 files = [
-    {file = "marshmallow-3.21.0-py3-none-any.whl", hash = "sha256:e7997f83571c7fd476042c2c188e4ee8a78900ca5e74bd9c8097afa56624e9bd"},
-    {file = "marshmallow-3.21.0.tar.gz", hash = "sha256:20f53be28c6e374a711a16165fb22a8dc6003e3f7cda1285e3ca777b9193885b"},
+    {file = "marshmallow-3.21.1-py3-none-any.whl", hash = "sha256:f085493f79efb0644f270a9bf2892843142d80d7174bbbd2f3713f2a589dc633"},
+    {file = "marshmallow-3.21.1.tar.gz", hash = "sha256:4e65e9e0d80fc9e609574b9983cf32579f305c718afb30d7233ab818571768c3"},
 ]

 [package.dependencies]

@@ -507,13 +524,13 @@ testing = ["pytest", "pytest-benchmark"]
 [[package]]
 name = "pydantic"
-version = "2.6.3"
+version = "2.6.4"
 description = "Data validation using Python type hints"
 optional = false
 python-versions = ">=3.8"
 files = [
-    {file = "pydantic-2.6.3-py3-none-any.whl", hash = "sha256:72c6034df47f46ccdf81869fddb81aade68056003900a8724a4f160700016a2a"},
-    {file = "pydantic-2.6.3.tar.gz", hash = "sha256:e07805c4c7f5c6826e33a1d4c9d47950d7eaf34868e2690f8594d2e30241f11f"},
+    {file = "pydantic-2.6.4-py3-none-any.whl", hash = "sha256:cc46fce86607580867bdc3361ad462bab9c222ef042d3da86f2fb333e1d916c5"},
+    {file = "pydantic-2.6.4.tar.gz", hash = "sha256:b1704e0847db01817624a6b86766967f552dd9dbf3afba4004409f908dcc84e6"},
 ]

 [package.dependencies]

@@ -689,13 +706,13 @@ watchdog = ">=2.0.0"
 [[package]]
 name = "python-dateutil"
-version = "2.8.2"
+version = "2.9.0.post0"
 description = "Extensions to the standard Python datetime module"
 optional = false
 python-versions = "!=3.0.*,!=3.1.*,!=3.2.*,>=2.7"
 files = [
-    {file = "python-dateutil-2.8.2.tar.gz", hash = "sha256:0123cacc1627ae19ddf3c27a5de5bd67ee4586fbdd6440d9748f8abb483d3e86"},
-    {file = "python_dateutil-2.8.2-py2.py3-none-any.whl", hash = "sha256:961d03dc3453ebbc59dbdea9e4e11c5651520a876d0f4db161e8674aae935da9"},
+    {file = "python-dateutil-2.9.0.post0.tar.gz", hash = "sha256:37dd54208da7e1cd875388217d5e00ebd4179249f90fb72437e91a35459a0ad3"},
+    {file = "python_dateutil-2.9.0.post0-py2.py3-none-any.whl", hash = "sha256:a8b2bc7bffae282281c8140a97d3aa9c14da0b136dfe83f850eea9a5f7470427"},
 ]

 [package.dependencies]

@@ -726,6 +743,7 @@ files = [
     {file = "PyYAML-6.0.1-cp311-cp311-win_amd64.whl", hash = "sha256:bf07ee2fef7014951eeb99f56f39c9bb4af143d8aa3c21b1677805985307da34"},
     {file = "PyYAML-6.0.1-cp312-cp312-macosx_10_9_x86_64.whl", hash = "sha256:855fb52b0dc35af121542a76b9a84f8d1cd886ea97c84703eaa6d88e37a2ad28"},
     {file = "PyYAML-6.0.1-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:40df9b996c2b73138957fe23a16a4f0ba614f4c0efce1e9406a184b6d07fa3a9"},
+    {file = "PyYAML-6.0.1-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:a08c6f0fe150303c1c6b71ebcd7213c2858041a7e01975da3a99aed1e7a378ef"},
     {file = "PyYAML-6.0.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:6c22bec3fbe2524cde73d7ada88f6566758a8f7227bfbf93a408a9d86bcc12a0"},
     {file = "PyYAML-6.0.1-cp312-cp312-musllinux_1_1_x86_64.whl", hash = "sha256:8d4e9c88387b0f5c7d5f281e55304de64cf7f9c0021a3525bd3b1c542da3b0e4"},
     {file = "PyYAML-6.0.1-cp312-cp312-win32.whl", hash = "sha256:d483d2cdf104e7c9fa60c544d92981f12ad66a457afae824d146093b8c294c54"},

@@ -1009,4 +1027,4 @@ watchmedo = ["PyYAML (>=3.10)"]
 [metadata]
 lock-version = "2.0"
 python-versions = ">=3.8.1,<4.0"
-content-hash = "3073522be06765f2acb7efea6ed1fcc49eaa05e82534d96fa914899dbbbb541f"
+content-hash = "6ba91e0cf81e177c01efe980cbeedc2fe5a267599ce91c15acbcf2cd34df33dc"
@@ -8,6 +8,7 @@ readme = "README.md"
 [tool.poetry.dependencies]
 python = ">=3.8.1,<4.0"
 langchain-core = "^0.1.22"
+langchain-text-splitters = "^0.0.1"
 ai21 = "^2.1.2"

 [tool.poetry.group.test]
@@ -0,0 +1,130 @@ (new file: tests exercising AI21SemanticTextSplitter against the AI21 API)

```python
from ai21 import AI21Client
from langchain_core.documents import Document

from langchain_ai21 import AI21SemanticTextSplitter

TEXT = (
    "The original full name of the franchise is Pocket Monsters (ポケットモンスター, "
    "Poketto Monsutā), which was abbreviated to "
    "Pokemon during development of the original games.\n"
    "When the franchise was released internationally, the short form of the title was "
    "used, with an acute accent (´) "
    "over the e to aid in pronunciation.\n"
    "Pokémon refers to both the franchise itself and the creatures within its "
    "fictional universe.\n"
    "As a noun, it is identical in both the singular and plural, as is every "
    "individual species name;[10] it is "
    'grammatically correct to say "one Pokémon" and "many Pokémon", as well '
    'as "one Pikachu" and "many Pikachu".\n'
    "In English, Pokémon may be pronounced either /'powkɛmon/ (poe-keh-mon) or "
    "/'powkɪmon/ (poe-key-mon).\n"
    "The Pokémon franchise is set in a world in which humans coexist with creatures "
    "known as Pokémon.\n"
    "Pokémon Red and Blue contain 151 Pokémon species, with new ones being introduced "
    "in subsequent games; as of December 2023, 1,025 Pokémon species have been "
    "introduced.\n[b] Most Pokémon are inspired by real-world animals;[12] for example,"
    "Pikachu are a yellow mouse-like species[13] with lightning bolt-shaped tails[14] "
    "that possess electrical abilities.[15]\nThe player character takes the role of a "
    "Pokémon Trainer.\nThe Trainer has three primary goals: travel and explore the "
    "Pokémon world; discover and catch each Pokémon species in order to complete their"
    "Pokédex; and train a team of up to six Pokémon at a time and have them engage "
    "in battles.\nMost Pokémon can be caught with spherical devices known as Poké "
    "Balls.\nOnce the opposing Pokémon is sufficiently weakened, the Trainer throws "
    "the Poké Ball against the Pokémon, which is then transformed into a form of "
    "energy and transported into the device.\nOnce the catch is successful, "
    "the Pokémon is tamed and is under the Trainer's command from then on.\n"
    "If the Poké Ball is thrown again, the Pokémon re-materializes into its "
    "original state.\nThe Trainer's Pokémon can engage in battles against opposing "
    "Pokémon, including those in the wild or owned by other Trainers.\nBecause the "
    "franchise is aimed at children, these battles are never presented as overtly "
    "violent and contain no blood or gore.[I]\nPokémon never die in battle, instead "
    "fainting upon being defeated.[20][21][22]\nAfter a Pokémon wins a battle, it "
    "gains experience and becomes stronger.[23] After gaining a certain amount of "
    "experience points, its level increases, as well as one or more of its "
    "statistics.\nAs its level increases, the Pokémon can learn new offensive "
    "and defensive moves to use in battle.[24][25] Furthermore, many species can "
    "undergo a form of spontaneous metamorphosis called Pokémon evolution, and "
    "transform into stronger forms.[26] Most Pokémon will evolve at a certain level, "
    "while others evolve through different means, such as exposure to a certain "
    "item.[27]\n"
)


def test_split_text_to_document() -> None:
    segmentation = AI21SemanticTextSplitter()
    segments = segmentation.split_text_to_documents(source=TEXT)
    assert len(segments) > 0
    for segment in segments:
        assert segment.page_content is not None
        assert segment.metadata is not None


def test_split_text() -> None:
    segmentation = AI21SemanticTextSplitter()
    segments = segmentation.split_text(source=TEXT)
    assert len(segments) > 0


def test_split_text__when_chunk_size_is_large__should_merge_segments() -> None:
    segmentation_no_merge = AI21SemanticTextSplitter()
    segments_no_merge = segmentation_no_merge.split_text(source=TEXT)
    segmentation_merge = AI21SemanticTextSplitter(chunk_size=1000)
    segments_merge = segmentation_merge.split_text(source=TEXT)
    # Assert that a merge did happen
    assert len(segments_no_merge) > len(segments_merge)
    reconstructed_text_merged = "".join(segments_merge)
    reconstructed_text_non_merged = "".join(segments_no_merge)
    # Assert that the merge did not change the content
    assert reconstructed_text_merged == reconstructed_text_non_merged


def test_split_text__chunk_size_is_too_small__should_return_non_merged_segments() -> (
    None
):
    segmentation_no_merge = AI21SemanticTextSplitter()
    segments_no_merge = segmentation_no_merge.split_text(source=TEXT)
    segmentation_merge = AI21SemanticTextSplitter(chunk_size=10)
    segments_merge = segmentation_merge.split_text(source=TEXT)
    # Assert that no merge happened (the chunk size is too small to combine segments)
    assert len(segments_no_merge) == len(segments_merge)
    reconstructed_text_merged = "".join(segments_merge)
    reconstructed_text_non_merged = "".join(segments_no_merge)
    # Assert that the content did not change
    assert reconstructed_text_merged == reconstructed_text_non_merged


def test_split_text__when_chunk_size_set_with_ai21_tokenizer() -> None:
    segmentation_no_merge = AI21SemanticTextSplitter(
        length_function=AI21Client().count_tokens
    )
    segments_no_merge = segmentation_no_merge.split_text(source=TEXT)
    segmentation_merge = AI21SemanticTextSplitter(
        chunk_size=1000, length_function=AI21Client().count_tokens
    )
    segments_merge = segmentation_merge.split_text(source=TEXT)
    # Assert that a merge did happen
    assert len(segments_no_merge) > len(segments_merge)
    reconstructed_text_merged = "".join(segments_merge)
    reconstructed_text_non_merged = "".join(segments_no_merge)
    # Assert that the merge did not change the content
    assert reconstructed_text_merged == reconstructed_text_non_merged


def test_create_documents() -> None:
    texts = [TEXT]
    segmentation = AI21SemanticTextSplitter()
    documents = segmentation.create_documents(texts=texts)
    assert len(documents) > 0
    for document in documents:
        assert document.page_content is not None
        assert document.metadata is not None


def test_split_documents() -> None:
    documents = [Document(page_content=TEXT, metadata={"foo": "bar"})]
    segmentation = AI21SemanticTextSplitter()
    segments = segmentation.split_documents(documents=documents)
    assert len(segments) > 0
    for segment in segments:
        assert segment.page_content is not None
        assert segment.metadata is not None
```
@@ -16,12 +16,13 @@ from ai21.models import (
     FinishReason,
     Penalty,
     RoleType,
+    SegmentationResponse,
 )
+from ai21.models.responses.segmentation_response import Segment
 from pytest_mock import MockerFixture

 DUMMY_API_KEY = "test_api_key"

 BASIC_EXAMPLE_LLM_PARAMETERS = {
     "num_results": 3,
     "max_tokens": 20,

@@ -38,6 +39,32 @@ BASIC_EXAMPLE_LLM_PARAMETERS = {
     ),
 }

+SEGMENTS = [
+    Segment(
+        segment_type="normal_text",
+        segment_text=(
+            "The original full name of the franchise is Pocket Monsters "
+            "(ポケットモンスター, Poketto Monsutā), which was abbreviated to "
+            "Pokemon during development of the original games.\n\nWhen the "
+            "franchise was released internationally, the short form of the "
+            "title was used, with an acute accent (´) over the e to aid "
+            "in pronunciation."
+        ),
+    ),
+    Segment(
+        segment_type="normal_text",
+        segment_text=(
+            "Pokémon refers to both the franchise itself and the creatures "
+            "within its fictional universe.\n\nAs a noun, it is identical in "
+            "both the singular and plural, as is every individual species "
+            'name;[10] it is grammatically correct to say "one Pokémon" '
+            'and "many Pokémon", as well as "one Pikachu" and "many '
+            'Pikachu".\n\nIn English, Pokémon may be pronounced either '
+            "/'powkɛmon/ (poe-keh-mon) or /'powkɪmon/ (poe-key-mon)."
+        ),
+    ),
+]
+
+
 BASIC_EXAMPLE_LLM_PARAMETERS_AS_DICT = {
     "num_results": 3,

@@ -125,3 +152,15 @@ def mock_client_with_contextual_answers(mocker: MockerFixture) -> Mock:
     )

     return mock_client
+
+
+@pytest.fixture
+def mock_client_with_semantic_text_splitter(mocker: MockerFixture) -> Mock:
+    mock_client = mocker.MagicMock(spec=AI21Client)
+    mock_client.segmentation = mocker.MagicMock()
+    mock_client.segmentation.create.return_value = SegmentationResponse(
+        id="12345",
+        segments=SEGMENTS,
+    )
+
+    return mock_client
@@ -5,6 +5,7 @@ EXPECTED_ALL = [
     "ChatAI21",
     "AI21Embeddings",
     "AI21ContextualAnswers",
+    "AI21SemanticTextSplitter",
 ]
@@ -0,0 +1,129 @@ (new file: unit tests for AI21SemanticTextSplitter using the mocked segmentation client)

```python
from unittest.mock import Mock

import pytest

from langchain_ai21 import AI21SemanticTextSplitter
from tests.unit_tests.conftest import SEGMENTS

TEXT = (
    "The original full name of the franchise is Pocket Monsters (ポケットモンスター, "
    "Poketto Monsutā), which was abbreviated to "
    "Pokemon during development of the original games.\n"
    "When the franchise was released internationally, the short form of the title was "
    "used, with an acute accent (´) "
    "over the e to aid in pronunciation.\n"
    "Pokémon refers to both the franchise itself and the creatures within its "
    "fictional universe.\n"
    "As a noun, it is identical in both the singular and plural, as is every "
    "individual species name;[10] it is "
    'grammatically correct to say "one Pokémon" and "many Pokémon", as well '
    'as "one Pikachu" and "many Pikachu".\n'
    "In English, Pokémon may be pronounced either /'powkɛmon/ (poe-keh-mon) or "
    "/'powkɪmon/ (poe-key-mon).\n"
    "The Pokémon franchise is set in a world in which humans coexist with creatures "
    "known as Pokémon.\n"
    "Pokémon Red and Blue contain 151 Pokémon species, with new ones being introduced "
    "in subsequent games; as of December 2023, 1,025 Pokémon species have been "
    "introduced.\n[b] Most Pokémon are inspired by real-world animals;[12] for example,"
    "Pikachu are a yellow mouse-like species[13] with lightning bolt-shaped tails[14] "
    "that possess electrical abilities.[15]"
)


@pytest.mark.parametrize(
    ids=[
        "when_chunk_size_is_zero",
        "when_chunk_size_is_large",
        "when_chunk_size_is_small",
    ],
    argnames=["chunk_size", "expected_segmentation_len"],
    argvalues=[
        (0, 2),
        (1000, 1),
        (10, 2),
    ],
)
def test_split_text__on_chunk_size(
    chunk_size: int,
    expected_segmentation_len: int,
    mock_client_with_semantic_text_splitter: Mock,
) -> None:
    sts = AI21SemanticTextSplitter(
        chunk_size=chunk_size,
        client=mock_client_with_semantic_text_splitter,
    )
    segments = sts.split_text("This is a test")
    assert len(segments) == expected_segmentation_len


def test_split_text__on_large_chunk_size__should_merge_chunks(
    mock_client_with_semantic_text_splitter: Mock,
) -> None:
    sts_no_merge = AI21SemanticTextSplitter(
        client=mock_client_with_semantic_text_splitter
    )
    sts_merge = AI21SemanticTextSplitter(
        client=mock_client_with_semantic_text_splitter,
        chunk_size=1000,
    )
    segments_no_merge = sts_no_merge.split_text("This is a test")
    segments_merge = sts_merge.split_text("This is a test")
    assert len(segments_merge) > 0
    assert len(segments_no_merge) > 0
    assert len(segments_no_merge) > len(segments_merge)


def test_split_text__on_small_chunk_size__should_not_merge_chunks(
    mock_client_with_semantic_text_splitter: Mock,
) -> None:
    sts_no_merge = AI21SemanticTextSplitter(
        client=mock_client_with_semantic_text_splitter
    )
    segments = sts_no_merge.split_text("This is a test")
    assert len(segments) == 2
    for index in range(2):
        assert segments[index] == SEGMENTS[index].segment_text


def test_create_documents__on_start_index__should_add_start_index(
    mock_client_with_semantic_text_splitter: Mock,
) -> None:
    sts = AI21SemanticTextSplitter(
        client=mock_client_with_semantic_text_splitter,
        add_start_index=True,
    )

    response = sts.create_documents(texts=[TEXT])
    assert len(response) > 0
    for segment in response:
        assert segment.page_content is not None
        assert segment.metadata is not None
        assert "start_index" in segment.metadata
        assert segment.metadata["start_index"] > -1


def test_create_documents__when_metadata_from_user__should_add_metadata(
    mock_client_with_semantic_text_splitter: Mock,
) -> None:
    sts = AI21SemanticTextSplitter(client=mock_client_with_semantic_text_splitter)
    metadatas = [{"hello": "world"}]
    response = sts.create_documents(texts=[TEXT], metadatas=metadatas)
    assert len(response) > 0
    for index in range(len(response)):
        assert response[index].page_content == SEGMENTS[index].segment_text
        assert len(response[index].metadata) == 2
        assert response[index].metadata["source_type"] == SEGMENTS[index].segment_type
        assert response[index].metadata["hello"] == "world"


def test_split_text_to_documents__when_metadata_not_passed__should_contain_source_type(
    mock_client_with_semantic_text_splitter: Mock,
) -> None:
    sts = AI21SemanticTextSplitter(client=mock_client_with_semantic_text_splitter)
    response = sts.split_text_to_documents(TEXT)
    assert len(response) > 0
    for segment in response:
        assert segment.page_content is not None
        assert segment.metadata is not None
        assert "source_type" in segment.metadata
        assert segment.metadata["source_type"] is not None
```