core[minor]: add new clean up strategy "scoped_full" to indexing (#28505)

~Note that this PR is now Draft, so I didn't add change to `aindex` function and didn't add test codes for my change. After we have an agreement on the direction, I will add commits.~ `batch_size` is very difficult to decide because setting a large number like >10000 will impact VectorDB and RecordManager, while setting a small number will delete records unnecessarily, leading to redundant work, as the `IMPORTANT` section says. On the other hand, we can't use `full` because the loader returns just a subset of the dataset in our use case. I guess many people are in the same situation as us. So, as one of the possible solutions for it, I would like to introduce a new argument, `scoped_full_cleanup`. This argument will be valid only when `claneup` is Full. If True, Full cleanup deletes all documents that haven't been updated AND that are associated with source ids that were seen during indexing. Default is False. This change keeps backward compatibility. --------- Co-authored-by: Eugene Yurtsev <eugene@langchain.dev> Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
2025-09-01 11:02:37 +00:00 · 2024-12-14 05:35:25 +09:00
parent 4802c31a53
commit 258b3be5ec
3 changed files with 384 additions and 33 deletions
--- a/docs/docs/how_to/indexing.ipynb
+++ b/docs/docs/how_to/indexing.ipynb
@@ -39,19 +39,20 @@
    "| None        | ✅                    | ✅            | ❌                               | ❌                                                 | -                  |\n",
    "| Incremental | ✅                    | ✅            | ❌                               | ✅                                                 | Continuously       |\n",
    "| Full        | ✅                    | ❌            | ✅                               | ✅                                                 | At end of indexing |\n",
+    "| Scoped_Full | ✅                    | ✅            | ❌                               | ✅                                                 | At end of indexing |\n",
    "\n",
    "\n",
    "`None` does not do any automatic clean up, allowing the user to manually do clean up of old content. \n",
    "\n",
-    "`incremental` and `full` offer the following automated clean up:\n",
+    "`incremental`, `full` and `scoped_full` offer the following automated clean up:\n",
    "\n",
-    "* If the content of the source document or derived documents has **changed**, both `incremental` or `full` modes will clean up (delete) previous versions of the content.\n",
-    "* If the source document has been **deleted** (meaning it is not included in the documents currently being indexed), the `full` cleanup mode will delete it from the vector store correctly, but the `incremental` mode will not.\n",
+    "* If the content of the source document or derived documents has **changed**, all 3 modes will clean up (delete) previous versions of the content.\n",
+    "* If the source document has been **deleted** (meaning it is not included in the documents currently being indexed), the `full` cleanup mode will delete it from the vector store correctly, but the `incremental` and `scoped_full` mode will not.\n",
    "\n",
    "When content is mutated (e.g., the source PDF file was revised) there will be a period of time during indexing when both the new and old versions may be returned to the user. This happens after the new content was written, but before the old version was deleted.\n",
    "\n",
    "* `incremental` indexing minimizes this period of time as it is able to do clean up continuously, as it writes.\n",
-    "* `full` mode does the clean up after all batches have been written.\n",
+    "* `full` and `scoped_full` mode does the clean up after all batches have been written.\n",
    "\n",
    "## Requirements\n",
    "\n",
@@ -64,7 +65,7 @@
    "  \n",
    "## Caution\n",
    "\n",
-    "The record manager relies on a time-based mechanism to determine what content can be cleaned up (when using `full` or `incremental` cleanup modes).\n",
+    "The record manager relies on a time-based mechanism to determine what content can be cleaned up (when using `full` or `incremental` or `scoped_full` cleanup modes).\n",
    "\n",
    "If two tasks run back-to-back, and the first task finishes before the clock time changes, then the second task may not be able to clean up content.\n",
    "\n",