Add start index to metadata in TextSplitter (#5912)

#### Add start index to metadata in TextSplitter

- Modified method `create_documents` to track the start position of each chunk
- The `start_index` is included in the metadata if the `add_start_index` parameter in the class constructor is set to `True`

This enables referencing back to the original document, particularly useful when a specific chunk is retrieved.

#### Who can review?

Tag maintainers/contributors who might be interested: @eyurtsev @agola11
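For reviewers, the start-position tracking described above can be sketched roughly as follows. This is a minimal, library-free illustration, not the code in this commit: `split_fixed` and `create_documents_sketch` are hypothetical names, the splitter is a plain sliding window, and documents are modeled as dicts. It assumes each chunk is an exact substring of the source text, so its offset can be recovered by searching forward from the previous chunk's start.

```python
def split_fixed(text, chunk_size, chunk_overlap):
    """Hypothetical stand-in splitter: a fixed-size sliding window."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def create_documents_sketch(text, chunks, add_start_index=False):
    """Build dict-based documents, optionally recording each chunk's offset."""
    docs = []
    index = -1  # start of the previous chunk; -1 so the first search begins at 0
    for chunk in chunks:
        metadata = {}
        if add_start_index:
            # Search strictly after the previous start so repeated text
            # does not resolve to an earlier chunk's position.
            index = text.find(chunk, index + 1)
            metadata["start_index"] = index
        docs.append({"page_content": chunk, "metadata": metadata})
    return docs


text = "The quick brown fox jumps over the lazy dog."
chunks = split_fixed(text, chunk_size=20, chunk_overlap=5)
docs = create_documents_sketch(text, chunks, add_start_index=True)

# Map a retrieved chunk back to its span in the original text.
for doc in docs:
    start = doc["metadata"]["start_index"]
    assert text[start:start + len(doc["page_content"])] == doc["page_content"]
```

Searching from `index + 1` rather than from the end of the previous chunk keeps the lookup correct when consecutive chunks overlap, which is the case whenever `chunk_overlap` is greater than zero.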
2025-09-02 19:47:13 +00:00 · 2023-06-09 02:09:32 -04:00
parent a09a0e3511
commit 2791a753bf
3 changed files with 30 additions and 7 deletions
--- a/docs/modules/indexes/text_splitters/getting_started.ipynb
+++ b/docs/modules/indexes/text_splitters/getting_started.ipynb
@@ -12,7 +12,8 @@
    "\n",
    "- `length_function`: how the length of chunks is calculated. Defaults to just counting number of characters, but it's pretty common to pass a token counter here.\n",
    "- `chunk_size`: the maximum size of your chunks (as measured by the length function).\n",
-    "- `chunk_overlap`: the maximum overlap between chunks. It can be nice to have some overlap to maintain some continuity between chunks (eg do a sliding window)."
+    "- `chunk_overlap`: the maximum overlap between chunks. It can be nice to have some overlap to maintain some continuity between chunks (eg do a sliding window).\n",
+    "- `add_start_index`: whether to include the starting position of each chunk within the original document in the metadata. "
   ]
  },
  {
@@ -49,6 +50,7 @@
    "    chunk_size = 100,\n",
    "    chunk_overlap  = 20,\n",
    "    length_function = len,\n",
+    "    add_start_index = True,\n",
    ")"
   ]
  },
@@ -62,8 +64,8 @@
     "name": "stdout",
     "output_type": "stream",
     "text": [
-      "page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and' lookup_str='' metadata={} lookup_index=0\n",
-      "page_content='of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.' lookup_str='' metadata={} lookup_index=0\n"
+      "page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and' metadata={'start_index': 0}\n",
+      "page_content='of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.' metadata={'start_index': 82}\n"
     ]
    }
   ],
@@ -90,7 +92,7 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-   "version": "3.9.1"
+   "version": "3.9.16"
  },
  "vscode": {
   "interpreter": {