langchain/docs/docs/integrations/document_transformers/html2text.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "fe6e5c82",
   "metadata": {},
   "source": [
    "# HTML to text\n",
    "\n",
    ">[html2text](https://github.com/Alir3z4/html2text/) is a Python package that converts a page of `HTML` into clean, easy-to-read plain `ASCII text`.\n",
    "\n",
    "That text is also valid `Markdown` (a text-to-HTML format).\n",
    "\n",
    "This notebook loads two web pages with `AsyncHtmlLoader` and converts their HTML to text with `Html2TextTransformer`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ce77e0cb",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install --upgrade --quiet html2text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "8ca0974b",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "USER_AGENT environment variable not set, consider setting it to identify your requests.\n",
      "Fetching pages: 100%|##########| 2/2 [00:00<00:00, 14.74it/s]\n"
     ]
    }
   ],
   "source": [
    "from langchain_community.document_loaders import AsyncHtmlLoader\n",
    "\n",
    "# Fetch the raw HTML of each page; every URL becomes one Document.\n",
    "urls = [\"https://www.espn.com\", \"https://lilianweng.github.io/posts/2023-06-23-agent/\"]\n",
    "loader = AsyncHtmlLoader(urls)\n",
    "docs = loader.load()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "a95a928c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "## Fantasy\n",
      "\n",
      "  * Football\n",
      "\n",
      "  * Baseball\n",
      "\n",
      "  * Basketball\n",
      "\n",
      "  * Hockey\n",
      "\n",
      "## ESPN Sites\n",
      "\n",
      "  * ESPN Deportes\n",
      "\n",
      "  * Andscape\n",
      "\n",
      "  * espnW\n",
      "\n",
      "  * ESPNFC\n",
      "\n",
      "  * X Games\n",
      "\n",
      "  * SEC Network\n",
      "\n",
      "## ESPN Apps\n",
      "\n",
      "  * ESPN\n",
      "\n",
      "  * ESPN Fantasy\n",
      "\n",
      "  * Tournament Challenge\n",
      "\n",
      "## Follow ESPN\n",
      "\n",
      "  * Facebook\n",
      "\n",
      "  * X/Twitter\n",
      "\n",
      "  * Instagram\n",
      "\n",
      "  * Snapchat\n",
      "\n",
      "  * TikTok\n",
      "\n",
      "  * YouTube\n",
      "\n",
      "## Fresh updates to our NBA mock draft: Everything we're hearing hours before\n",
      "Round 1\n",
      "\n",
      "With hours until Round 1 begins (8 p.m. ET on ESPN and ABC), ESPN draft\n",
      "insiders Jonathan Givony and Jeremy Woo have new intel on lottery picks and\n",
      "more.\n",
      "\n",
      "2hJonathan Givony and Jeremy Woo\n",
      "\n",
      "Illustration by ESPN\n",
      "\n",
      "## From No. 1 to 100: Ranking the 2024 NBA draft prospects\n",
      "\n",
      "Who's No. 1? Where do the Kentucky, Duke and UConn players rank? Here's our\n",
      "final Top 100 Big Board.\n",
      "\n",
      "6hJonathan Givony and Jeremy Woo\n",
      "\n",
      "  * Full draft order: All 58 picks over two rounds\n",
      "  * Trade tracker: Details for all deals\n",
      "\n",
      "  * Betting buzz: Lakers favorites to draft Bronny\n",
      "  * Use our NBA draft simu\n",
      "ent system, LLM functions as the agent's brain,\n",
      "complemented by several key components:\n",
      "\n",
      "  * **Planning**\n",
      "    * Subgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.\n",
      "    * Reflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mistakes and refine them for future steps, thereby improving the quality of final results.\n",
      "  * **Memory**\n",
      "    * Short-term memory: I would consider all the in-context learning (See Prompt Engineering) as utilizing short-term memory of the model to learn.\n",
      "    * Long-term memory: This provides the agent with the capability to retain and recall (infinite) information over extended periods, often by leveraging an external vector store and fast retrieval.\n",
      "  * **Tool use**\n",
      "    * The agent learns to call external APIs for extra information that is missing from the model weights (often hard to change after pre-training), including \n"
     ]
    }
   ],
   "source": [
    "from langchain_community.document_transformers import Html2TextTransformer\n",
    "\n",
    "# Convert the HTML in `docs` (loaded in the previous cell) to Markdown-style text.\n",
    "html2text = Html2TextTransformer()\n",
    "docs_transformed = html2text.transform_documents(docs)\n",
    "\n",
    "# Print a 1,000-character slice of each converted page.\n",
    "print(docs_transformed[0].page_content[1000:2000])\n",
    "\n",
    "print(docs_transformed[1].page_content[1000:2000])"
   ]
  }
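  ,
  {
   "cell_type": "markdown",
   "id": "3f2a9c1e",
   "metadata": {},
   "source": [
    "The transformer can also be tried without fetching anything. The cell below is a minimal sketch, not a definitive recipe: it assumes `Document` is importable from `langchain_core.documents` and that `Html2TextTransformer` accepts an `ignore_links` keyword argument. The HTML snippet and the `inline-example` source label are made up for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7b4d2e6a",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_community.document_transformers import Html2TextTransformer\n",
    "from langchain_core.documents import Document\n",
    "\n",
    "# Hypothetical HTML snippet, used only for illustration.\n",
    "sample_html = '<h1>Agents</h1><p>See the <a href=\"https://example.com\">docs</a>.</p>'\n",
    "sample_docs = [Document(page_content=sample_html, metadata={\"source\": \"inline-example\"})]\n",
    "\n",
    "# Assumption: ignore_links=False keeps hyperlinks in the converted text.\n",
    "keep_links = Html2TextTransformer(ignore_links=False)\n",
    "print(keep_links.transform_documents(sample_docs)[0].page_content)"
   ]
  }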
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}