mirror of
https://github.com/hwchase17/langchain.git
synced 2025-04-28 11:55:21 +00:00
297 lines
10 KiB
Plaintext
297 lines
10 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "db23d51760310705",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Writer Text Splitter\n",
|
|
"\n",
|
|
"This notebook provides a quick overview for getting started with Writer's [text splitter](/docs/concepts/text_splitters/).\n",
|
|
"\n",
|
|
"Writer's [context-aware splitting endpoint](https://dev.writer.com/api-guides/tools#context-aware-text-splitting) provides intelligent text splitting capabilities for long documents (up to 4000 words). Unlike simple character-based splitting, it preserves the semantic meaning and context between chunks, making it ideal for processing long-form content while maintaining coherence. In `langchain-writer`, we provide usage of Writer's context-aware splitting endpoint as a LangChain text splitter.\n",
|
|
"\n",
|
|
"## Overview\n",
|
|
"\n",
|
|
"### Integration details\n",
|
|
"| Class | Package | Local | Serializable | JS support | Package downloads | Package latest |\n",
|
|
"|:-----------------------------------------------------------------------------------------------------------------------------------------|:-----------------| :---: | :---: |:----------:|:------------------------------------------------------------------------------------------------:|:---------------------------------------------------------------------------------------------:|\n",
|
|
"| [WriterTextSplitter](https://github.com/writer/langchain-writer/blob/main/langchain_writer/text_splitter.py#L11) | [langchain-writer](https://pypi.org/project/langchain-writer/) | ❌ | ❌ | ❌ |  |  |"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "c5f08d23df5dc127",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Setup\n",
|
|
"\n",
|
|
"The `WriterTextSplitter` is available in the `langchain-writer` package:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"id": "a8d653f15b7ee32d",
|
|
"metadata": {},
|
|
"source": "%pip install --quiet -U langchain-writer",
|
|
"outputs": [],
|
|
"execution_count": null
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "3b9709c26797edf",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Credentials\n",
|
|
"\n",
|
|
"Sign up for [Writer AI Studio](https://app.writer.com/aistudio/signup?utm_campaign=devrel) to generate an API key (you can follow this [Quickstart](https://dev.writer.com/api-guides/quickstart)). Then, set the WRITER_API_KEY environment variable:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"id": "2983e19c9d555e58",
|
|
"metadata": {},
|
|
"source": [
|
|
"import getpass\n",
|
|
"import os\n",
|
|
"\n",
|
|
"if not os.getenv(\"WRITER_API_KEY\"):\n",
|
|
" os.environ[\"WRITER_API_KEY\"] = getpass.getpass(\"Enter your Writer API key: \")"
|
|
],
|
|
"outputs": [],
|
|
"execution_count": null
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "92a22c77f03d43dc",
|
|
"metadata": {},
|
|
"source": [
|
|
"It's also helpful (but not needed) to set up [LangSmith](https://smith.langchain.com/) for best-in-class observability. If you wish to do so, you can set the `LANGCHAIN_TRACING_V2` and `LANGCHAIN_API_KEY` environment variables:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"id": "98d8422ecee77403",
|
|
"metadata": {},
|
|
"source": [
|
|
"# os.environ[\"LANGCHAIN_TRACING_V2\"] = \"true\"\n",
|
|
"# os.environ[\"LANGCHAIN_API_KEY\"] = getpass.getpass()"
|
|
],
|
|
"outputs": [],
|
|
"execution_count": null
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "67ab78950a3da8ba",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Instantiation\n",
|
|
"\n",
|
|
"Instantiate an instance of `WriterTextSplitter` with the `strategy` parameter set to one of the following:\n",
|
|
"\n",
|
|
"- `llm_split`: Uses language model for precise semantic splitting\n",
|
|
"- `fast_split`: Uses heuristic-based approach for quick splitting\n",
|
|
"- `hybrid_split`: Combines both approaches\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"id": "787b3ba8af32533f",
|
|
"metadata": {},
|
|
"source": [
|
|
"from langchain_writer.text_splitter import WriterTextSplitter\n",
|
|
"\n",
|
|
"splitter = WriterTextSplitter(strategy=\"fast_split\")"
|
|
],
|
|
"outputs": [],
|
|
"execution_count": null
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "d91c6f752fd31cee",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Usage\n",
|
|
"The `WriterTextSplitter` can be used synchronously or asynchronously.\n",
|
|
"\n",
|
|
"### Synchronous usage\n",
|
|
"To use the `WriterTextSplitter` synchronously, call the `split_text` method with the text you want to split:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"id": "d1a24b81a8a96f09",
|
|
"metadata": {},
|
|
"source": [
|
|
"text = \"\"\"Reeeeeeeeeeeeeeeeeeeeeaally long text you want to divide into smaller chunks. For example you can add a poem multiple times:\n",
|
|
"Two roads diverged in a yellow wood,\n",
|
|
"And sorry I could not travel both\n",
|
|
"And be one traveler, long I stood\n",
|
|
"And looked down one as far as I could\n",
|
|
"To where it bent in the undergrowth;\n",
|
|
"\n",
|
|
"Then took the other, as just as fair,\n",
|
|
"And having perhaps the better claim,\n",
|
|
"Because it was grassy and wanted wear;\n",
|
|
"Though as for that the passing there\n",
|
|
"Had worn them really about the same,\n",
|
|
"\n",
|
|
"And both that morning equally lay\n",
|
|
"In leaves no step had trodden black.\n",
|
|
"Oh, I kept the first for another day!\n",
|
|
"Yet knowing how way leads on to way,\n",
|
|
"I doubted if I should ever come back.\n",
|
|
"\n",
|
|
"I shall be telling this with a sigh\n",
|
|
"Somewhere ages and ages hence:\n",
|
|
"Two roads diverged in a wood, and I—\n",
|
|
"I took the one less traveled by,\n",
|
|
"And that has made all the difference.\n",
|
|
"\n",
|
|
"Two roads diverged in a yellow wood,\n",
|
|
"And sorry I could not travel both\n",
|
|
"And be one traveler, long I stood\n",
|
|
"And looked down one as far as I could\n",
|
|
"To where it bent in the undergrowth;\n",
|
|
"\n",
|
|
"Then took the other, as just as fair,\n",
|
|
"And having perhaps the better claim,\n",
|
|
"Because it was grassy and wanted wear;\n",
|
|
"Though as for that the passing there\n",
|
|
"Had worn them really about the same,\n",
|
|
"\n",
|
|
"And both that morning equally lay\n",
|
|
"In leaves no step had trodden black.\n",
|
|
"Oh, I kept the first for another day!\n",
|
|
"Yet knowing how way leads on to way,\n",
|
|
"I doubted if I should ever come back.\n",
|
|
"\n",
|
|
"I shall be telling this with a sigh\n",
|
|
"Somewhere ages and ages hence:\n",
|
|
"Two roads diverged in a wood, and I—\n",
|
|
"I took the one less traveled by,\n",
|
|
"And that has made all the difference.\n",
|
|
"\n",
|
|
"Two roads diverged in a yellow wood,\n",
|
|
"And sorry I could not travel both\n",
|
|
"And be one traveler, long I stood\n",
|
|
"And looked down one as far as I could\n",
|
|
"To where it bent in the undergrowth;\n",
|
|
"\n",
|
|
"Then took the other, as just as fair,\n",
|
|
"And having perhaps the better claim,\n",
|
|
"Because it was grassy and wanted wear;\n",
|
|
"Though as for that the passing there\n",
|
|
"Had worn them really about the same,\n",
|
|
"\n",
|
|
"And both that morning equally lay\n",
|
|
"In leaves no step had trodden black.\n",
|
|
"Oh, I kept the first for another day!\n",
|
|
"Yet knowing how way leads on to way,\n",
|
|
"I doubted if I should ever come back.\n",
|
|
"\n",
|
|
"I shall be telling this with a sigh\n",
|
|
"Somewhere ages and ages hence:\n",
|
|
"Two roads diverged in a wood, and I—\n",
|
|
"I took the one less traveled by,\n",
|
|
"And that has made all the difference.\n",
|
|
"\"\"\"\n",
|
|
"\n",
|
|
"chunks = splitter.split_text(text)\n",
|
|
"chunks"
|
|
],
|
|
"outputs": [],
|
|
"execution_count": null
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "e6d09fcf",
|
|
"metadata": {},
|
|
"source": [
|
|
"You can print the length of the chunks to see how many chunks were created:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"id": "a470daa875d99006",
|
|
"metadata": {},
|
|
"source": [
|
|
"print(len(chunks))"
|
|
],
|
|
"outputs": [],
|
|
"execution_count": null
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "f89c048c7d23807a",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Asynchronous usage\n",
|
|
"To use the `WriterTextSplitter` asynchronously, call the `asplit_text` method with the text you want to split:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"id": "e2f7fd52b7188c6c",
|
|
"metadata": {},
|
|
"source": [
|
|
"async_chunks = await splitter.asplit_text(text)\n",
|
|
"async_chunks"
|
|
],
|
|
"outputs": [],
|
|
"execution_count": null
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "cb669ce7",
|
|
"metadata": {},
|
|
"source": [
|
|
"Print the length of the chunks to see how many chunks were created:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"id": "a1439db14e687fa4",
|
|
"metadata": {},
|
|
"source": [
|
|
"print(len(async_chunks))"
|
|
],
|
|
"outputs": [],
|
|
"execution_count": null
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "ab25a3bed8437a05",
|
|
"metadata": {},
|
|
"source": [
|
|
"## API reference\n",
|
|
"For detailed documentation of all `WriterTextSplitter` features and configurations head to the [API reference](https://python.langchain.com/api_reference/writer/text_splitter/langchain_writer.text_splitter.WriterTextSplitter.html#langchain_writer.text_splitter.WriterTextSplitter).\n",
|
|
"\n",
|
|
"## Additional resources\n",
|
|
"You can find information about Writer's models (including costs, context windows, and supported input types) and tools in the [Writer docs](https://dev.writer.com/home)."
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 2
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython2",
|
|
"version": "2.7.6"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 5
|
|
}
|