github/langchain

mirror of https://github.com/hwchase17/langchain.git synced 2026-02-22 15:13:08 +00:00

Anthony Shaw fe88710ff8 docs: Embellish article on splitting by tokens with more examples and missing details (#18997)

**Description**

This PR adds details that were missing from the "Split by tokens" page in the
documentation. Specifically:

- The `.from_tiktoken_encoder()` class methods of both
`CharacterTextSplitter` and `RecursiveCharacterTextSplitter` default to
the old `gpt2` encoding. I've added a comment suggesting that callers
specify `model_name` or `encoding_name`.
- The docs didn't mention that the `.from_tiktoken_encoder()` class
method passes additional kwargs down to the constructor of the splitter.
I only discovered this by reading the source code.
- Added an example of using the `.from_tiktoken_encoder()` class method
with `RecursiveCharacterTextSplitter`, which is recommended over
`CharacterTextSplitter` for most scenarios.
- Added a warning that `TokenTextSplitter` can split a character that
spans multiple tokens (e.g. 猫 is 3 `cl100k_base` tokens) across
chunks, which creates malformed Unicode strings, so it should not be
used in these situations.
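
The last point can be illustrated without any tokenizer installed. The following is a minimal sketch, not the code added to the docs: it assumes each UTF-8 byte stands in for one token (a simplification of how a byte-level encoding such as `cl100k_base` can spread one character over several tokens), and `split_tokens` is a hypothetical helper rather than a LangChain function.

```python
def split_tokens(tokens, chunk_size):
    """Cut a token sequence into fixed-size chunks, ignoring character boundaries."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def decode(chunk):
    """Turn a chunk of byte-valued stand-in tokens back into text.

    Undecodable bytes become U+FFFD (the Unicode replacement character).
    """
    return bytes(chunk).decode("utf-8", errors="replace")


text = "猫猫"

# Assumption: one UTF-8 byte == one token. Each 猫 is 3 bytes, so 6 tokens total.
tokens = list(text.encode("utf-8"))

# A chunk size of 4 cuts through the middle of the second character.
broken = [decode(chunk) for chunk in split_tokens(tokens, chunk_size=4)]

# A chunk size of 3 happens to line up with the character boundaries.
clean = [decode(chunk) for chunk in split_tokens(tokens, chunk_size=3)]

print(broken)
print(clean)
```

With a chunk size of 4, both chunks contain replacement characters because the second 猫 is cut apart; with a chunk size of 3, the chunks decode cleanly. A splitter that measures length in tokens but cuts on text boundaries avoids the problem.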

Side note: I think the default argument of `gpt2` for
`.from_tiktoken_encoder()` should be updated?

**Twitter handle** anthonypjshaw

---------

Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>

2024-04-25 17:39:29 -07:00

community[patch], langchain[minor]: Add retriever self_query and score_threshold in DingoDB (#18106)

2024-04-25 17:39:07 -07:00

👥 Update LangChain people data (#18473)

2024-04-25 17:39:07 -07:00

docs: Embellish article on splitting by tokens with more examples and missing details (#18997)

2024-04-25 17:39:29 -07:00

docs[minor]ci[minor]: Add script & CI to check recurring links daily (#19100)

2024-04-25 17:39:12 -07:00

docs[patch]: properly load/use env vars (#18942)

2024-04-25 17:39:10 -07:00

docs: Add graph construction docs (#18904)

2024-04-25 17:39:11 -07:00

.gitignore

docs[minor]: Swap gtag for supabase (#18937)

2024-04-25 17:39:10 -07:00

.local_build.sh

docs: partner packages (#16960)

2024-02-02 15:12:21 -08:00

.yarnrc.yml

docs[minor]: Add thumbs up/down to all docs pages (#18526)

2024-04-25 17:39:07 -07:00

babel.config.js

…

code-block-loader.js

…

docusaurus.config.js

docs[patch]: properly load/use env vars (#18942)

2024-04-25 17:39:10 -07:00

package.json

docs[minor]ci[minor]: Add script & CI to check recurring links daily (#19100)

2024-04-25 17:39:12 -07:00

README.md

docs: developer docs (#14776)

2023-12-17 12:55:49 -08:00

settings.ini

…

sidebars.js

docs: Toolkits menu (#16217)

2024-02-08 14:52:26 -08:00

vercel_build.sh

docs: fix vercel build script (#19090)

2024-04-25 17:39:12 -07:00

vercel_requirements.txt

infra: docs build install community editable (#14739)

2023-12-14 16:13:09 -08:00

vercel.json

docs: providers update 4 (#18540)

2024-04-25 17:39:10 -07:00

yarn.lock

docs[minor]ci[minor]: Add script & CI to check recurring links daily (#19100)

2024-04-25 17:39:12 -07:00

README.md

LangChain Documentation

For more information on contributing to our documentation, see the Documentation Contributing Guide.