mirror of https://github.com/hwchase17/langchain.git synced 2026-01-16 08:07:23 +00:00

Files

Giulio Zani 9f0b63dba0 experimental[patch]: Fixes issue #17060 (#17062)

As described in issue #17060, the following function fails when the text
contains only one sentence. Checking for that case and adding an early
return fixed the issue.

```python
    def split_text(self, text: str) -> List[str]:
        """Split text into multiple components."""
        # Splitting the essay on '.', '?', and '!'
        single_sentences_list = re.split(r"(?<=[.?!])\s+", text)
        sentences = [
            {"sentence": x, "index": i} for i, x in enumerate(single_sentences_list)
        ]
        sentences = combine_sentences(sentences)
        embeddings = self.embeddings.embed_documents(
            [x["combined_sentence"] for x in sentences]
        )
        for i, sentence in enumerate(sentences):
            sentence["combined_sentence_embedding"] = embeddings[i]
        distances, sentences = calculate_cosine_distances(sentences)
        start_index = 0

        # Create a list to hold the grouped sentences
        chunks = []
        breakpoint_percentile_threshold = 95
        breakpoint_distance_threshold = np.percentile(
            distances, breakpoint_percentile_threshold
        )  # If you want more chunks, lower the percentile cutoff

        indices_above_thresh = [
            i for i, x in enumerate(distances) if x > breakpoint_distance_threshold
        ]  # The indices of those breakpoints on your list

        # Iterate through the breakpoints to slice the sentences
        for index in indices_above_thresh:
            # The end index is the current breakpoint
            end_index = index

            # Slice the sentence_dicts from the current start index to the end index
            group = sentences[start_index : end_index + 1]
            combined_text = " ".join([d["sentence"] for d in group])
            chunks.append(combined_text)

            # Update the start index for the next group
            start_index = index + 1

        # The last group, if any sentences remain
        if start_index < len(sentences):
            combined_text = " ".join([d["sentence"] for d in sentences[start_index:]])
            chunks.append(combined_text)
        return chunks
```
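
The guard described in the commit message could look roughly like the sketch below. It is not the patch itself: `split_text_guarded` and the `chunk_fn` parameter are hypothetical names introduced here to keep the example self-contained, standing in for the embedding and percentile steps of `split_text`. With a single sentence there are no adjacent sentence pairs to compare, so the sketch returns that sentence as the only chunk before any distance thresholding is attempted.

```python
import re
from typing import Callable, List


def split_text_guarded(
    text: str, chunk_fn: Callable[[List[str]], List[str]]
) -> List[str]:
    """Split text into sentences, returning early for a single sentence.

    `chunk_fn` is a hypothetical stand-in for the embedding-based grouping
    (embeddings, cosine distances, percentile threshold) in `split_text`.
    """
    # Same sentence-splitting regex as in the function above.
    single_sentences_list = re.split(r"(?<=[.?!])\s+", text)

    # One sentence means no adjacent pairs and therefore no distances,
    # so skip the grouping step and return the sentence as the only chunk.
    if len(single_sentences_list) == 1:
        return single_sentences_list

    return chunk_fn(single_sentences_list)
```

A caller would pass the real grouping logic as `chunk_fn`; the early return keeps that logic from ever seeing a one-element list.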

Co-authored-by: Giulio Zani <salamanderxing@Giulios-MBP.homenet.telecomitalia.it>

2024-02-05 16:18:57 -08:00

langchain_experimental

experimental[patch]: Fixes issue #17060 (#17062)

2024-02-05 16:18:57 -08:00

scripts

infra: import checking bugfix (#14569)

2023-12-11 15:53:51 -08:00

tests

langchain[patch], experimental[patch]: update utilities imports (#15438)

2024-01-03 02:18:15 -05:00

_test_minimum_requirements.txt

infra: bump exp min test reqs (#16884)

2024-02-01 08:35:21 -08:00

LICENSE

Library Licenses (#13300)

2023-11-28 17:34:27 -08:00

Makefile

create mypy cache dir if it doesn't exist (#14579)

2023-12-12 15:34:50 -08:00

poetry.lock

infra: bump exp min test reqs (#16884)

2024-02-01 08:35:21 -08:00

poetry.toml

…

pyproject.toml

infra: bump exp min test reqs (#16884)

2024-02-01 08:35:21 -08:00

README.md

…

README.md

🦜️🧪 LangChain Experimental

This package holds experimental LangChain code, intended for research and experimental uses.

Warning

Portions of the code in this package may be dangerous if not properly deployed in a sandboxed environment. Please be wary of deploying experimental code to production unless you've taken appropriate precautions and have already discussed it with your security team.

Some of the code here may be marked with security notices. However, given the exploratory and experimental nature of the code in this package, the absence of a security notice on a piece of code does not mean that code is safe to use without additional security considerations.
