docs[experimental]: Make docs clearer and add min_chunk_size (#26398)

Fixes #26171:

- added clarifying text for the keyword argument
`breakpoint_threshold_amount`
- added `min_chunk_size`: together with `breakpoint_threshold_amount`, it
helps avoid chunks that are too small or too large

Note: the langchain-experimental package was moved to a separate repo, so
only the doc change remains here.
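
For readers who want to see how the two keyword arguments interact, here is a minimal standard-library sketch. It is not the langchain-experimental implementation. The helper names (`percentile`, `breakpoint_threshold`, `split_indices`, `merge_small_chunks`) are hypothetical, and the threshold formulas (mean plus X standard deviations, mean plus X times the interquartile range) as well as the merge-until-long-enough behaviour of `min_chunk_size` are assumptions made for illustration. The gradient method is omitted.

```python
import statistics

# Defaults as stated in the updated docs.
DEFAULT_AMOUNTS = {
    "percentile": 95.0,
    "standard_deviation": 3.0,
    "interquartile": 1.5,
}


def percentile(values, q):
    """Linearly interpolated percentile of values, with q in [0.0, 100.0]."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def breakpoint_threshold(distances, threshold_type="percentile", amount=None):
    """Return the distance above which a split is made (assumed formulas)."""
    if threshold_type not in DEFAULT_AMOUNTS:
        raise ValueError(f"unsupported threshold type: {threshold_type}")
    if amount is None:
        amount = DEFAULT_AMOUNTS[threshold_type]
    if threshold_type == "percentile":
        return percentile(distances, amount)
    mean = statistics.fmean(distances)
    if threshold_type == "standard_deviation":
        # Population standard deviation; an assumption of this sketch.
        return mean + amount * statistics.pstdev(distances)
    # Interquartile: scale the spread between the 25th and 75th percentiles.
    iqr = percentile(distances, 75.0) - percentile(distances, 25.0)
    return mean + amount * iqr


def split_indices(distances, threshold):
    """Indices of sentence gaps whose distance exceeds the threshold."""
    return [i for i, d in enumerate(distances) if d > threshold]


def merge_small_chunks(chunks, min_chunk_size):
    """Join consecutive chunks until each has at least min_chunk_size characters."""
    merged, buffer = [], ""
    for chunk in chunks:
        buffer = f"{buffer} {chunk}".strip()
        if len(buffer) >= min_chunk_size:
            merged.append(buffer)
            buffer = ""
    if buffer:
        # Keep any trailing remainder, even if it is still short.
        merged.append(buffer)
    return merged


# Made-up distances between consecutive sentences.
distances = [0.1, 0.2, 0.15, 0.9, 0.12]
threshold = breakpoint_threshold(distances, "percentile", 95.0)
print(split_indices(distances, threshold))
```

In this sketch, lowering `amount` produces more split points and therefore smaller chunks, while a minimum size merges fragments back together, which is why the two settings are useful in combination.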
Tibor Reiss 2024-12-15 22:43:48 +01:00 committed by GitHub
parent d417e4b372
commit 690aa02c31

@ -125,9 +125,11 @@
"\n", "\n",
"There are a few ways to determine what that threshold is, which are controlled by the `breakpoint_threshold_type` kwarg.\n", "There are a few ways to determine what that threshold is, which are controlled by the `breakpoint_threshold_type` kwarg.\n",
"\n", "\n",
"Note: if the resulting chunk sizes are too small/big, the additional kwargs `breakpoint_threshold_amount` and `min_chunk_size` can be used for adjustments.\n",
"\n",
"### Percentile\n", "### Percentile\n",
"\n", "\n",
"The default way to split is based on percentile. In this method, all differences between sentences are calculated, and then any difference greater than the X percentile is split." "The default way to split is based on percentile. In this method, all differences between sentences are calculated, and then any difference greater than the X percentile is split. The default value for X is 95.0 and can be adjusted by the keyword argument `breakpoint_threshold_amount` which expects a number between 0.0 and 100.0."
] ]
}, },
{ {
@ -186,7 +188,7 @@
"source": [ "source": [
"### Standard Deviation\n", "### Standard Deviation\n",
"\n", "\n",
"In this method, any difference greater than X standard deviations is split." "In this method, any difference greater than X standard deviations is split. The default value for X is 3.0 and can be adjusted by the keyword argument `breakpoint_threshold_amount`."
] ]
}, },
{ {
@ -245,7 +247,7 @@
"source": [ "source": [
"### Interquartile\n", "### Interquartile\n",
"\n", "\n",
"In this method, the interquartile distance is used to split chunks." "In this method, the interquartile distance is used to split chunks. The interquartile range can be scaled by the keyword argument `breakpoint_threshold_amount`, the default value is 1.5."
] ]
}, },
{ {
@ -306,8 +308,8 @@
"source": [ "source": [
"### Gradient\n", "### Gradient\n",
"\n", "\n",
"In this method, the gradient of distance is used to split chunks along with the percentile method.\n", "In this method, the gradient of distance is used to split chunks along with the percentile method. This method is useful when chunks are highly correlated with each other or specific to a domain e.g. legal or medical. The idea is to apply anomaly detection on gradient array so that the distribution become wider and easy to identify boundaries in highly semantic data.\n",
"This method is useful when chunks are highly correlated with each other or specific to a domain e.g. legal or medical. The idea is to apply anomaly detection on gradient array so that the distribution become wider and easy to identify boundaries in highly semantic data." "Similar to the percentile method, the split can be adjusted by the keyword argument `breakpoint_threshold_amount` which expects a number between 0.0 and 100.0, the default value is 95.0."
] ]
}, },
{ {