text-splitters: add pydocstyle linting (#28127)

As seen in #23188, turned on Google-style docstring checks by enabling
`pydocstyle` linting in the `text-splitters` package. Each resulting
linting error was handled in one of three ways: resolved, suppressed,
or ignored; missing docstrings were added where needed.
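
For illustration only, a minimal sketch (not taken from this diff) of two of the fix categories described above: adding a Google-style docstring where one was missing, and suppressing a rule inline. `ExampleSplitter` and its methods are hypothetical; `D107` is pydocstyle's "missing docstring in `__init__`" check:

```python
from typing import List


class ExampleSplitter:
    """Split text into fixed-size chunks.

    A Google-style docstring: a one-line summary, a blank line, then
    optional ``Args``/``Returns`` sections, which is the layout the
    enabled pydocstyle rules expect.

    Args:
        chunk_size: Maximum number of characters per chunk.
    """

    def __init__(self, chunk_size: int = 100) -> None:  # noqa: D107
        # D107 (missing docstring in __init__) suppressed inline --
        # the "suppressed" category of fixes mentioned above.
        self.chunk_size = chunk_size

    def split(self, text: str) -> List[str]:
        """Return ``text`` split into chunks of at most ``chunk_size`` characters."""
        step = self.chunk_size
        return [text[i : i + step] for i in range(0, len(text), step)]
```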

Fixes one of the checklist items from #25154, similar to #25939 in the
`core` package. Ran `make format`, `make lint`, and `make test` from the
root of the `text-splitters` package to confirm no issues were found.

---------

Co-authored-by: Erick Friis <erick@langchain.dev>
Authored by Ankit Dangi on 2024-12-08 22:01:03 -08:00; committed by GitHub.
parent b53f07bfb9 · commit 90f162efb6
9 changed files with 194 additions and 27 deletions

@@ -249,6 +249,21 @@ class TokenTextSplitter(TextSplitter):
         self._disallowed_special = disallowed_special
 
     def split_text(self, text: str) -> List[str]:
+        """Splits the input text into smaller chunks based on tokenization.
+
+        This method uses a custom tokenizer configuration to encode the input text
+        into tokens, processes the tokens in chunks of a specified size with overlap,
+        and decodes them back into text chunks. The splitting is performed using the
+        `split_text_on_tokens` function.
+
+        Args:
+            text (str): The input text to be split into smaller chunks.
+
+        Returns:
+            List[str]: A list of text chunks, where each chunk is derived from a portion
+                of the input text based on the tokenization and chunking rules.
+        """
+
        def _encode(_text: str) -> List[int]:
            return self._tokenizer.encode(
                _text,
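
For context, a hedged usage sketch of the method documented in this hunk. The parameter values and sample text are illustrative; it assumes the public `TokenTextSplitter` API from `langchain_text_splitters` (which requires `tiktoken` to be installed):

```python
from langchain_text_splitters import TokenTextSplitter

# chunk_size and chunk_overlap are measured in tokens, not characters
splitter = TokenTextSplitter(chunk_size=20, chunk_overlap=4)

chunks = splitter.split_text(
    "TokenTextSplitter encodes the input with a tokenizer, walks the "
    "token stream in overlapping windows, and decodes each window back "
    "into a text chunk."
)
for chunk in chunks:
    print(repr(chunk))
```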