langchain/docs/docs/how_to/recursive_json_splitter.ipynb
Erick Friis 21d14549a9 docs: v0.2 docs in master (#21438)
2024-05-08 12:29:59 -07:00


{
"cells": [
{
"cell_type": "markdown",
"id": "a678d550",
"metadata": {},
"source": [
"# How to split JSON data\n",
"\n",
"This json splitter splits json data while allowing control over chunk sizes. It traverses the json data depth-first and builds smaller json chunks. It attempts to keep nested json objects whole, but will split them if needed to keep chunks between `min_chunk_size` and `max_chunk_size`.\n",
"\n",
"If a value is not nested json but a very large string, the string will not be split. If you need a hard cap on the chunk size, consider composing this splitter with a recursive text splitter (such as `RecursiveCharacterTextSplitter`) applied to those chunks. There is also an optional pre-processing step that splits lists by first converting them to json (dict) and then splitting them as such.\n",
"\n",
"1. How the text is split: by json value.\n",
"2. How the chunk size is measured: by number of characters."
]
},
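{
"cell_type": "markdown",
"id": "5d2e8f10-1a3b-4c6d-9e7f-2b4a6c8d0e1f",
"metadata": {},
"source": [
"To make the depth-first behaviour described above more concrete, here is a simplified sketch in plain Python. It is an illustration under stated assumptions, not the library's implementation: `json_size`, `set_nested`, and `split_depth_first` are hypothetical names, chunk size is assumed to be the character count of the serialized json, and `min_chunk_size` is ignored. Note how each resulting chunk repeats the keys on the path to its values."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8a1c3e57-2b4d-4f6a-8c9e-0d1f2a3b4c5d",
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"\n",
"def json_size(data):\n",
"    # Assumed size measure: number of characters in the serialized json.\n",
"    return len(json.dumps(data))\n",
"\n",
"\n",
"def set_nested(target, path, value):\n",
"    # Create the dicts along `path` as needed, then store `value` at the last key.\n",
"    node = target\n",
"    for key in path[:-1]:\n",
"        node = node.setdefault(key, {})\n",
"    node[path[-1]] = value\n",
"\n",
"\n",
"def split_depth_first(data, max_chunk_size, path=(), chunks=None):\n",
"    # `data` is expected to be a dict at the top level.\n",
"    chunks = [{}] if chunks is None else chunks\n",
"    if isinstance(data, dict):\n",
"        for key, value in data.items():\n",
"            new_path = (*path, key)\n",
"            remaining = max_chunk_size - json_size(chunks[-1])\n",
"            if json_size({key: value}) < remaining:\n",
"                # The whole value fits, so the nested object stays intact.\n",
"                set_nested(chunks[-1], new_path, value)\n",
"            else:\n",
"                # It does not fit: open a new chunk (unless the current one\n",
"                # is still empty) and descend into the value.\n",
"                if chunks[-1]:\n",
"                    chunks.append({})\n",
"                split_depth_first(value, max_chunk_size, new_path, chunks)\n",
"    else:\n",
"        # Leaves (strings, numbers, lists) are stored whole, even if oversized.\n",
"        set_nested(chunks[-1], path, data)\n",
"    return chunks\n",
"\n",
"\n",
"# Made-up sample data, small enough to trace by hand.\n",
"sample = {'a': {'x': '1111111111', 'y': '2222222222'}, 'b': '3333333333'}\n",
"sample_chunks = split_depth_first(sample, max_chunk_size=30)\n",
"for chunk in sample_chunks:\n",
"    print(chunk)"
]
},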
{
"cell_type": "code",
"execution_count": null,
"id": "3f335e05-e5ae-44cc-899d-749aa9031a58",
"metadata": {},
"outputs": [],
"source": [
"%pip install -qU langchain-text-splitters"
]
},
{
"cell_type": "markdown",
"id": "a2b3fe87-d230-4cbd-b3ae-01559c5351a3",
"metadata": {},
"source": [
"First we load some json data:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "3390ae1d",
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"import requests\n",
"\n",
"# This is a large nested json object and will be loaded as a python dict\n",
"json_data = requests.get(\"https://api.smith.langchain.com/openapi.json\").json()"
]
},
{
"cell_type": "markdown",
"id": "3cdc725d-f4b8-4725-9084-cb395d8ef48b",
"metadata": {},
"source": [
"## Basic usage\n",
"\n",
"Specify `max_chunk_size` to constrain chunk sizes:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "7bfe2c1e",
"metadata": {},
"outputs": [],
"source": [
"from langchain_text_splitters import RecursiveJsonSplitter\n",
"\n",
"splitter = RecursiveJsonSplitter(max_chunk_size=300)"
]
},
{
"cell_type": "markdown",
"id": "e03b79fb-b1c6-4324-a409-86cd3e40cb92",
"metadata": {},
"source": [
"To obtain json chunks, use the `.split_json` method:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "69250bc6-c0f5-40d0-b8ba-7a349236bfd2",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'openapi': '3.1.0', 'info': {'title': 'LangSmith', 'version': '0.1.0'}, 'servers': [{'url': 'https://api.smith.langchain.com', 'description': 'LangSmith API endpoint.'}]}\n",
"{'paths': {'/api/v1/sessions/{session_id}': {'get': {'tags': ['tracer-sessions'], 'summary': 'Read Tracer Session', 'description': 'Get a specific session.', 'operationId': 'read_tracer_session_api_v1_sessions__session_id__get'}}}}\n",
"{'paths': {'/api/v1/sessions/{session_id}': {'get': {'security': [{'API Key': []}, {'Tenant ID': []}, {'Bearer Auth': []}]}}}}\n"
]
}
],
"source": [
"# Recursively split the json data; use this when you need to access or manipulate the smaller json chunks as dicts\n",
"json_chunks = splitter.split_json(json_data=json_data)\n",
"\n",
"for chunk in json_chunks[:3]:\n",
"    print(chunk)"
]
},
{
"cell_type": "markdown",
"id": "3f05bc21-227e-4d2c-af51-16d69ad3cd7b",
"metadata": {},
"source": [
"To obtain LangChain [Document](https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html) objects, use the `.create_documents` method:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "0839f4f0",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"page_content='{\"openapi\": \"3.1.0\", \"info\": {\"title\": \"LangSmith\", \"version\": \"0.1.0\"}, \"servers\": [{\"url\": \"https://api.smith.langchain.com\", \"description\": \"LangSmith API endpoint.\"}]}'\n",
"page_content='{\"paths\": {\"/api/v1/sessions/{session_id}\": {\"get\": {\"tags\": [\"tracer-sessions\"], \"summary\": \"Read Tracer Session\", \"description\": \"Get a specific session.\", \"operationId\": \"read_tracer_session_api_v1_sessions__session_id__get\"}}}}'\n",
"page_content='{\"paths\": {\"/api/v1/sessions/{session_id}\": {\"get\": {\"security\": [{\"API Key\": []}, {\"Tenant ID\": []}, {\"Bearer Auth\": []}]}}}}'\n"
]
}
],
"source": [
"# The splitter can also output documents\n",
"docs = splitter.create_documents(texts=[json_data])\n",
"\n",
"for doc in docs[:3]:\n",
"    print(doc)"
]
},
{
"cell_type": "markdown",
"id": "677c3dd0-afc7-488a-a58d-b7943814f85d",
"metadata": {},
"source": [
"Or use `.split_text` to obtain string content directly:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "fa0a4d66-b470-404e-918b-6728df3b88b0",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{\"openapi\": \"3.1.0\", \"info\": {\"title\": \"LangSmith\", \"version\": \"0.1.0\"}, \"servers\": [{\"url\": \"https://api.smith.langchain.com\", \"description\": \"LangSmith API endpoint.\"}]}\n",
"{\"paths\": {\"/api/v1/sessions/{session_id}\": {\"get\": {\"tags\": [\"tracer-sessions\"], \"summary\": \"Read Tracer Session\", \"description\": \"Get a specific session.\", \"operationId\": \"read_tracer_session_api_v1_sessions__session_id__get\"}}}}\n"
]
}
],
"source": [
"texts = splitter.split_text(json_data=json_data)\n",
"\n",
"print(texts[0])\n",
"print(texts[1])"
]
},
{
"cell_type": "markdown",
"id": "7070bf45-b885-4949-b8e0-7d1ea5205d2a",
"metadata": {},
"source": [
"## How to manage chunk sizes from list content\n",
"\n",
"Note that one of the chunks in this example is larger than the specified `max_chunk_size` of 300. Inspecting that chunk, we see that it contains a list object:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "86ef3195-375b-4db2-9804-f3fa5a249417",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[171, 231, 126, 469, 210, 213, 237, 271, 191, 232]\n",
"\n",
"{\"paths\": {\"/api/v1/sessions/{session_id}\": {\"get\": {\"parameters\": [{\"name\": \"session_id\", \"in\": \"path\", \"required\": true, \"schema\": {\"type\": \"string\", \"format\": \"uuid\", \"title\": \"Session Id\"}}, {\"name\": \"include_stats\", \"in\": \"query\", \"required\": false, \"schema\": {\"type\": \"boolean\", \"default\": false, \"title\": \"Include Stats\"}}, {\"name\": \"accept\", \"in\": \"header\", \"required\": false, \"schema\": {\"anyOf\": [{\"type\": \"string\"}, {\"type\": \"null\"}], \"title\": \"Accept\"}}]}}}}\n"
]
}
],
"source": [
"print([len(text) for text in texts][:10])\n",
"print()\n",
"print(texts[3])"
]
},
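{
"cell_type": "markdown",
"id": "1f7b9c24-6e3a-4d58-b0a2-9c8e7d6f5a41",
"metadata": {},
"source": [
"To list every chunk over the limit rather than scanning the first ten lengths by eye, a small helper along these lines may be useful. This is a sketch: `oversized_chunks` is a hypothetical name and the sample strings are made up; in this notebook you would pass `texts` instead."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e2a4c6f8-0b1d-4e3f-a5b7-c9d0e1f2a3b4",
"metadata": {},
"outputs": [],
"source": [
"def oversized_chunks(texts, max_chunk_size):\n",
"    # Return (index, length) for every chunk longer than the limit.\n",
"    return [\n",
"        (index, len(text))\n",
"        for index, text in enumerate(texts)\n",
"        if len(text) > max_chunk_size\n",
"    ]\n",
"\n",
"\n",
"# Made-up stand-ins for the split output; in this notebook you would pass `texts`.\n",
"sample_texts = ['a' * 171, 'b' * 469, 'c' * 210]\n",
"over_limit = oversized_chunks(sample_texts, max_chunk_size=300)\n",
"print(over_limit)"
]
},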
{
"cell_type": "markdown",
"id": "ddc98a1d-05df-48ab-8d17-6e4ee0d9d0cb",
"metadata": {},
"source": [
"The json splitter by default does not split lists.\n",
"\n",
"Specify `convert_lists=True` to preprocess the json, converting list content to dicts with `index:item` as `key:val` pairs:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "992477c2",
"metadata": {},
"outputs": [],
"source": [
"texts = splitter.split_text(json_data=json_data, convert_lists=True)"
]
},
{
"cell_type": "markdown",
"id": "912c20c2-8d05-47a6-bc03-f5c866761dff",
"metadata": {},
"source": [
"Let's look at the chunk sizes. The first ten are now all under `max_chunk_size`:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "7abd43f6-78ab-4a73-853a-a777ab268efc",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[176, 236, 141, 203, 212, 221, 210, 213, 242, 291]\n"
]
}
],
"source": [
"print([len(text) for text in texts][:10])"
]
},
{
"cell_type": "markdown",
"id": "3e5753bf-cede-4751-a1c0-c42aca56b88a",
"metadata": {},
"source": [
"The list has been converted to a dict, but each chunk retains the needed contextual information (its enclosing keys) even if the list is split across many chunks:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "d2c2773e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{\"paths\": {\"/api/v1/sessions/{session_id}\": {\"get\": {\"tags\": {\"0\": \"tracer-sessions\"}, \"summary\": \"Read Tracer Session\", \"description\": \"Get a specific session.\", \"operationId\": \"read_tracer_session_api_v1_sessions__session_id__get\"}}}}\n"
]
}
],
"source": [
"print(texts[1])"
]
},
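{
"cell_type": "markdown",
"id": "9b3d5f71-8c2e-4a60-b4d6-e8f0a1c2d3e5",
"metadata": {},
"source": [
"The index-keyed dict in the output above can be mimicked in plain Python. The following is a rough sketch of that kind of preprocessing, not the library's implementation; `lists_to_dicts` is a hypothetical helper and the sample data is made up."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4c6e8a02-d1f3-4b57-96a8-b0c2d4e6f8a1",
"metadata": {},
"outputs": [],
"source": [
"def lists_to_dicts(data):\n",
"    # Recursively replace every list with a dict keyed by the item's index (as a string).\n",
"    if isinstance(data, dict):\n",
"        return {key: lists_to_dicts(value) for key, value in data.items()}\n",
"    if isinstance(data, list):\n",
"        return {str(index): lists_to_dicts(item) for index, item in enumerate(data)}\n",
"    return data\n",
"\n",
"\n",
"# Made-up sample shaped like a fragment of the OpenAPI document.\n",
"sample = {'tags': ['tracer-sessions'], 'security': [{'API Key': []}]}\n",
"converted = lists_to_dicts(sample)\n",
"print(converted)"
]
},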
{
"cell_type": "code",
"execution_count": 10,
"id": "8963b01a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Document(page_content='{\"paths\": {\"/api/v1/sessions/{session_id}\": {\"get\": {\"tags\": [\"tracer-sessions\"], \"summary\": \"Read Tracer Session\", \"description\": \"Get a specific session.\", \"operationId\": \"read_tracer_session_api_v1_sessions__session_id__get\"}}}}')"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# We can also look at the documents; `docs` was created earlier without convert_lists, so lists are left as-is\n",
"docs[1]"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}