langchain

mirror of https://github.com/hwchase17/langchain.git synced 2026-05-12 17:57:22 +00:00

Files

Raviraj 858ce264ef SemanticChunker : Feature Addition ("Semantic Splitting with gradient") (#22895)

```SemanticChunker``` currently provides three methods for splitting text semantically:
- percentile
- standard_deviation
- interquartile
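
For context, here is a rough sketch of how thresholds of these three kinds can be computed from an array of distances between consecutive sentences. The formulas (mean plus a multiple of the standard deviation or of the interquartile range) and the default multipliers are assumptions made for illustration and are not taken from this PR; the helper names are hypothetical.

```python
import statistics
from typing import List


def percentile(values: List[float], q: float) -> float:
    """Percentile with linear interpolation between closest ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def percentile_threshold(distances: List[float], q: float = 95.0) -> float:
    # Split wherever a distance exceeds the q-th percentile of all distances.
    return percentile(distances, q)


def stddev_threshold(distances: List[float], k: float = 3.0) -> float:
    # Assumed form: mean plus k population standard deviations.
    return statistics.fmean(distances) + k * statistics.pstdev(distances)


def iqr_threshold(distances: List[float], k: float = 1.5) -> float:
    # Assumed form: mean plus k times the interquartile range.
    q1 = percentile(distances, 25.0)
    q3 = percentile(distances, 75.0)
    return statistics.fmean(distances) + k * (q3 - q1)


# Hypothetical distances between consecutive sentence embeddings.
distances = [0.1, 0.2, 0.3, 0.4, 0.5]
thresholds = {
    "percentile": percentile_threshold(distances, 50.0),
    "standard_deviation": stddev_threshold(distances, 1.0),
    "interquartile": iqr_threshold(distances, 1.5),
}
```

In each case a chunk boundary would be placed wherever a distance exceeds the chosen threshold.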

I propose a new method, ```gradient```. It computes the gradient of the distance array and then applies the percentile method to that gradient, so it technically builds on ```percentile```. This method is useful when chunks are highly correlated with each other or specific to a domain, e.g. legal or medical. The idea is to apply anomaly detection to the gradient array so that the distribution becomes wider and boundaries are easier to identify in highly semantic data.
I have tested this change on a set of 10 domain-specific documents (mostly legal).
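
The idea can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the code merged in this PR: the function names, the sample distances, and the choice of the 90th percentile are all hypothetical. The gradient uses central differences in the interior and one-sided differences at the ends, with unit spacing.

```python
from typing import List, Tuple


def gradient(values: List[float]) -> List[float]:
    """Finite-difference gradient with unit spacing."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two distances")
    grads = [values[1] - values[0]]  # one-sided difference at the start
    for i in range(1, n - 1):
        grads.append((values[i + 1] - values[i - 1]) / 2.0)  # central difference
    grads.append(values[-1] - values[-2])  # one-sided difference at the end
    return grads


def percentile(values: List[float], q: float) -> float:
    """Percentile with linear interpolation between closest ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def gradient_breakpoints(
    distances: List[float], q: float = 95.0
) -> Tuple[float, List[int]]:
    """Return the threshold and the indices whose gradient exceeds it."""
    grads = gradient(distances)
    threshold = percentile(grads, q)  # percentile taken over the gradient array
    return threshold, [i for i, g in enumerate(grads) if g > threshold]


# Hypothetical distances: uniformly low except for one jump at index 4.
distances = [0.10, 0.11, 0.12, 0.11, 0.30, 0.12, 0.11, 0.10]
threshold, breakpoints = gradient_breakpoints(distances, q=90.0)
```

With these sample values the only gradient above the threshold is at index 3, the step leading into the jump, which shows how the gradient highlights a sharp change even when the raw distances are otherwise close together.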

Details:
    - **Issue:** Improvement
    - **Dependencies:** NA
    - **Twitter handle:** [x.com/prajapat_ravi](https://x.com/prajapat_ravi)


@hwchase17

---------

Co-authored-by: Raviraj Prajapat <raviraj.prajapat@sirionlabs.com>
Co-authored-by: isaac hershenson <ihershenson@hmc.edu>

2024-06-17 21:01:08 -07:00

cli

cli[minor]: remove redefined DEFAULT_GIT_REF (#21471)

2024-06-14 15:49:15 -07:00

community

LanceDB integration update (#22869)

2024-06-17 20:54:26 -07:00

core

Include "no escape" and "inverted section" mustache vars in Prompt.input_variables and Prompt.input_schema (#22981)

2024-06-17 19:24:13 -07:00

experimental

SemanticChunker : Feature Addition ("Semantic Splitting with gradient") (#22895)