Files
langchain/docs/docs/how_to/character_text_splitter.ipynb

152 lines
6.4 KiB
Plaintext
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
{
"cells": [
{
"cell_type": "markdown",
"id": "c3ee8d00",
"metadata": {},
"source": [
"# How to split by character\n",
"\n",
"This is the simplest method. This splits based on a given character sequence, which defaults to `\"\\n\\n\"`. Chunk length is measured by number of characters.\n",
"\n",
"1. How the text is split: by single character separator.\n",
"2. How the chunk size is measured: by number of characters.\n",
"\n",
"To obtain the string content directly, use `.split_text`.\n",
"\n",
"To create LangChain [Document](https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html) objects (e.g., for use in downstream tasks), use `.create_documents`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bf8698ce-44b2-4944-b9a9-254344b537af",
"metadata": {},
"outputs": [],
"source": [
"%pip install -qU langchain-text-splitters"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "313fb032",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. \\n\\nLast year COVID-19 kept us apart. This year we are finally together again. \\n\\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \\n\\nWith a duty to one another to the American people to the Constitution. \\n\\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \\n\\nSix days ago, Russias Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \\n\\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \\n\\nHe met the Ukrainian people. \\n\\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.'\n"
]
}
],
"source": [
"from langchain_text_splitters import CharacterTextSplitter\n",
"\n",
"# Load an example document\n",
"with open(\"state_of_the_union.txt\") as f:\n",
" state_of_the_union = f.read()\n",
"\n",
"text_splitter = CharacterTextSplitter(\n",
" separator=\"\\n\\n\",\n",
" chunk_size=1000,\n",
" chunk_overlap=200,\n",
" length_function=len,\n",
" is_separator_regex=False,\n",
")\n",
"texts = text_splitter.create_documents([state_of_the_union])\n",
"print(texts[0])"
]
},
{
"cell_type": "markdown",
"id": "dadcb9d6",
"metadata": {},
"source": [
"Use `.create_documents` to propagate metadata associated with each document to the output chunks:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "1affda60",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. \\n\\nLast year COVID-19 kept us apart. This year we are finally together again. \\n\\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \\n\\nWith a duty to one another to the American people to the Constitution. \\n\\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \\n\\nSix days ago, Russias Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \\n\\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \\n\\nHe met the Ukrainian people. \\n\\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.' metadata={'document': 1}\n"
]
}
],
"source": [
"metadatas = [{\"document\": 1}, {\"document\": 2}]\n",
"documents = text_splitter.create_documents(\n",
" [state_of_the_union, state_of_the_union], metadatas=metadatas\n",
")\n",
"print(documents[0])"
]
},
{
"cell_type": "markdown",
"id": "ee080e12-6f44-4311-b1ef-302520a41d66",
"metadata": {},
"source": [
"Use `.split_text` to obtain the string content directly:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "2a830a9f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. \\n\\nLast year COVID-19 kept us apart. This year we are finally together again. \\n\\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \\n\\nWith a duty to one another to the American people to the Constitution. \\n\\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \\n\\nSix days ago, Russias Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \\n\\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \\n\\nHe met the Ukrainian people. \\n\\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.'"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"text_splitter.split_text(state_of_the_union)[0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a9a3b9cd",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}