langchain/docs/modules/indexes/text_splitters/examples/huggingface_length_function.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "13dc0983",
   "metadata": {},
   "source": [
    "# Hugging Face Length Function\n",
    "Most LLMs are constrained by the number of tokens that you can pass in, which is not the same as the number of characters. In order to get a more accurate estimate, we can use Hugging Face tokenizers to count the text length.\n",
    "\n",
    "1. How the text is split: by character passed in\n",
    "2. How the chunk size is measured: by Hugging Face tokenizer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "a8ce51d5",
   "metadata": {},
   "outputs": [],
   "source": [
    "from transformers import GPT2TokenizerFast\n",
    "\n",
    "tokenizer = GPT2TokenizerFast.from_pretrained(\"gpt2\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "388369ed",
   "metadata": {},
   "outputs": [],
   "source": [
    "# This is a long document we can split up.\n",
    "with open('../../../state_of_the_union.txt') as f:\n",
    "    state_of_the_union = f.read()\n",
    "from langchain.text_splitter import CharacterTextSplitter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "ca5e72c0",
   "metadata": {},
   "outputs": [],
   "source": [
    "text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(tokenizer, chunk_size=100, chunk_overlap=0)\n",
    "texts = text_splitter.split_text(state_of_the_union)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "37cdfbeb",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \n",
      "\n",
      "Last year COVID-19 kept us apart. This year we are finally together again. \n",
      "\n",
      "Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n",
      "\n",
      "With a duty to one another to the American people to the Constitution.\n"
     ]
    }
   ],
   "source": [
    "print(texts[0])"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.1"
  },
  "vscode": {
   "interpreter": {
    "hash": "aee8b7b246df8f9039afb4144a1f6fd8d2ca17a180786b69acc140d282b71a49"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}