langchain/docs/docs/how_to/recursive_json_splitter.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "a678d550",
   "metadata": {},
   "source": [
    "# How to split JSON data\n",
    "\n",
    "This json splitter splits json data while allowing control over chunk sizes. It traverses json data depth first and builds smaller json chunks. It attempts to keep nested json objects whole but will split them if needed to keep chunks between a min_chunk_size and the max_chunk_size.\n",
    "\n",
    "If the value is not a nested json, but rather a very large string the string will not be split. If you need a hard cap on the chunk size consider composing this with a Recursive Text splitter on those chunks. There is an optional pre-processing step to split lists, by first converting them to json (dict) and then splitting them as such.\n",
    "\n",
    "1. How the text is split: json value.\n",
    "2. How the chunk size is measured: by number of characters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3f335e05-e5ae-44cc-899d-749aa9031a58",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install -qU langchain-text-splitters"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a2b3fe87-d230-4cbd-b3ae-01559c5351a3",
   "metadata": {},
   "source": [
    "First we load some json data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "3390ae1d",
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "\n",
    "import requests\n",
    "\n",
    "# This is a large nested json object and will be loaded as a python dict\n",
    "json_data = requests.get(\"https://api.smith.langchain.com/openapi.json\").json()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3cdc725d-f4b8-4725-9084-cb395d8ef48b",
   "metadata": {},
   "source": [
    "## Basic usage\n",
    "\n",
    "Specify `max_chunk_size` to constrain chunk sizes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "7bfe2c1e",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_text_splitters import RecursiveJsonSplitter\n",
    "\n",
    "splitter = RecursiveJsonSplitter(max_chunk_size=300)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e03b79fb-b1c6-4324-a409-86cd3e40cb92",
   "metadata": {},
   "source": [
    "To obtain json chunks, use the `.split_json` method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "69250bc6-c0f5-40d0-b8ba-7a349236bfd2",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'openapi': '3.1.0', 'info': {'title': 'LangSmith', 'version': '0.1.0'}, 'servers': [{'url': 'https://api.smith.langchain.com', 'description': 'LangSmith API endpoint.'}]}\n",
      "{'paths': {'/api/v1/sessions/{session_id}': {'get': {'tags': ['tracer-sessions'], 'summary': 'Read Tracer Session', 'description': 'Get a specific session.', 'operationId': 'read_tracer_session_api_v1_sessions__session_id__get'}}}}\n",
      "{'paths': {'/api/v1/sessions/{session_id}': {'get': {'security': [{'API Key': []}, {'Tenant ID': []}, {'Bearer Auth': []}]}}}}\n"
     ]
    }
   ],
   "source": [
    "# Recursively split json data - If you need to access/manipulate the smaller json chunks\n",
    "json_chunks = splitter.split_json(json_data=json_data)\n",
    "\n",
    "for chunk in json_chunks[:3]:\n",
    "    print(chunk)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3f05bc21-227e-4d2c-af51-16d69ad3cd7b",
   "metadata": {},
   "source": [
    "To obtain LangChain [Document](https://python.langchain.com/v0.2/api_reference/core/documents/langchain_core.documents.base.Document.html) objects, use the `.create_documents` method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "0839f4f0",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "page_content='{\"openapi\": \"3.1.0\", \"info\": {\"title\": \"LangSmith\", \"version\": \"0.1.0\"}, \"servers\": [{\"url\": \"https://api.smith.langchain.com\", \"description\": \"LangSmith API endpoint.\"}]}'\n",
      "page_content='{\"paths\": {\"/api/v1/sessions/{session_id}\": {\"get\": {\"tags\": [\"tracer-sessions\"], \"summary\": \"Read Tracer Session\", \"description\": \"Get a specific session.\", \"operationId\": \"read_tracer_session_api_v1_sessions__session_id__get\"}}}}'\n",
      "page_content='{\"paths\": {\"/api/v1/sessions/{session_id}\": {\"get\": {\"security\": [{\"API Key\": []}, {\"Tenant ID\": []}, {\"Bearer Auth\": []}]}}}}'\n"
     ]
    }
   ],
   "source": [
    "# The splitter can also output documents\n",
    "docs = splitter.create_documents(texts=[json_data])\n",
    "\n",
    "for doc in docs[:3]:\n",
    "    print(doc)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "677c3dd0-afc7-488a-a58d-b7943814f85d",
   "metadata": {},
   "source": [
    "Or use `.split_text` to obtain string content directly:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "fa0a4d66-b470-404e-918b-6728df3b88b0",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\"openapi\": \"3.1.0\", \"info\": {\"title\": \"LangSmith\", \"version\": \"0.1.0\"}, \"servers\": [{\"url\": \"https://api.smith.langchain.com\", \"description\": \"LangSmith API endpoint.\"}]}\n",
      "{\"paths\": {\"/api/v1/sessions/{session_id}\": {\"get\": {\"tags\": [\"tracer-sessions\"], \"summary\": \"Read Tracer Session\", \"description\": \"Get a specific session.\", \"operationId\": \"read_tracer_session_api_v1_sessions__session_id__get\"}}}}\n"
     ]
    }
   ],
   "source": [
    "texts = splitter.split_text(json_data=json_data)\n",
    "\n",
    "print(texts[0])\n",
    "print(texts[1])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7070bf45-b885-4949-b8e0-7d1ea5205d2a",
   "metadata": {},
   "source": [
    "## How to manage chunk sizes from list content\n",
    "\n",
    "Note that one of the chunks in this example is larger than the specified `max_chunk_size` of 300. Reviewing one of these chunks that was bigger we see there is a list object there:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "86ef3195-375b-4db2-9804-f3fa5a249417",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[171, 231, 126, 469, 210, 213, 237, 271, 191, 232]\n",
      "\n",
      "{\"paths\": {\"/api/v1/sessions/{session_id}\": {\"get\": {\"parameters\": [{\"name\": \"session_id\", \"in\": \"path\", \"required\": true, \"schema\": {\"type\": \"string\", \"format\": \"uuid\", \"title\": \"Session Id\"}}, {\"name\": \"include_stats\", \"in\": \"query\", \"required\": false, \"schema\": {\"type\": \"boolean\", \"default\": false, \"title\": \"Include Stats\"}}, {\"name\": \"accept\", \"in\": \"header\", \"required\": false, \"schema\": {\"anyOf\": [{\"type\": \"string\"}, {\"type\": \"null\"}], \"title\": \"Accept\"}}]}}}}\n"
     ]
    }
   ],
   "source": [
    "print([len(text) for text in texts][:10])\n",
    "print()\n",
    "print(texts[3])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ddc98a1d-05df-48ab-8d17-6e4ee0d9d0cb",
   "metadata": {},
   "source": [
    "The json splitter by default does not split lists.\n",
    "\n",
    "Specify `convert_lists=True` to preprocess the json, converting list content to dicts with `index:item` as `key:val` pairs:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "992477c2",
   "metadata": {},
   "outputs": [],
   "source": [
    "texts = splitter.split_text(json_data=json_data, convert_lists=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "912c20c2-8d05-47a6-bc03-f5c866761dff",
   "metadata": {},
   "source": [
    "Let's look at the size of the chunks. Now they are all under the max"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "7abd43f6-78ab-4a73-853a-a777ab268efc",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[176, 236, 141, 203, 212, 221, 210, 213, 242, 291]\n"
     ]
    }
   ],
   "source": [
    "print([len(text) for text in texts][:10])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3e5753bf-cede-4751-a1c0-c42aca56b88a",
   "metadata": {},
   "source": [
    "The list has been converted to a dict, but retains all the needed contextual information even if split into many chunks:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "d2c2773e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\"paths\": {\"/api/v1/sessions/{session_id}\": {\"get\": {\"tags\": {\"0\": \"tracer-sessions\"}, \"summary\": \"Read Tracer Session\", \"description\": \"Get a specific session.\", \"operationId\": \"read_tracer_session_api_v1_sessions__session_id__get\"}}}}\n"
     ]
    }
   ],
   "source": [
    "print(texts[1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "8963b01a",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Document(page_content='{\"paths\": {\"/api/v1/sessions/{session_id}\": {\"get\": {\"tags\": [\"tracer-sessions\"], \"summary\": \"Read Tracer Session\", \"description\": \"Get a specific session.\", \"operationId\": \"read_tracer_session_api_v1_sessions__session_id__get\"}}}}')"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# We can also look at the documents\n",
    "docs[1]"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}