From dc7423e88f9c69c0107a690596ba522a63ecb8f7 Mon Sep 17 00:00:00 2001 From: Isaac Francisco <78627776+isahers1@users.noreply.github.com> Date: Thu, 8 Aug 2024 16:33:09 -0700 Subject: [PATCH] [docs]: standardizing document loader integration pages (#25002) --- .../document_loaders/firecrawl.ipynb | 235 ++++++++++-------- .../document_loaders/pypdfloader.ipynb | 182 ++++++++++++++ .../document_loaders/recursive_url.ipynb | 107 ++++---- .../document_loaders/unstructured_file.ipynb | 208 +++++++++------- .../document_loaders/web_base.ipynb | 225 ++++++++++------- .../docs/document_loaders.ipynb | 13 +- 6 files changed, 639 insertions(+), 331 deletions(-) create mode 100644 docs/docs/integrations/document_loaders/pypdfloader.ipynb diff --git a/docs/docs/integrations/document_loaders/firecrawl.ipynb b/docs/docs/integrations/document_loaders/firecrawl.ipynb index d2c8c588e41..f369b7c95c0 100644 --- a/docs/docs/integrations/document_loaders/firecrawl.ipynb +++ b/docs/docs/integrations/document_loaders/firecrawl.ipynb @@ -9,108 +9,68 @@ "[FireCrawl](https://firecrawl.dev/?ref=langchain) crawls and convert any website into LLM-ready data. It crawls all accessible subpages and give you clean markdown and metadata for each. No sitemap required.\n", "\n", "FireCrawl handles complex tasks such as reverse proxies, caching, rate limits, and content blocked by JavaScript. Built by the [mendable.ai](https://mendable.ai) team.\n", - "\n" + "\n", + "## Overview\n", + "### Integration details\n", + "\n", + "| Class | Package | Local | Serializable | [JS support](https://js.langchain.com/v0.2/docs/integrations/document_loaders/web_loaders/firecrawl/)|\n", + "| :--- | :--- | :---: | :---: | :---: |\n", + "| [FireCrawlLoader](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.firecrawl.FireCrawlLoader.html) | [langchain_community](https://api.python.langchain.com/en/latest/community_api_reference.html) | ✅ | ❌ | ✅ | \n", + "### Loader features\n", + "| Source | Document Lazy Loading | Native Async Support\n", + "| :---: | :---: | :---: | \n", + "| FireCrawlLoader | ✅ | ❌ | \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Setup" + "## Setup\n", + "\n", + "### Credentials \n", + "\n", + "You will need to get your own API key. Go to [this page](https://firecrawl.dev) to learn more." ] }, { "cell_type": "code", - "execution_count": 38, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Requirement already satisfied: firecrawl-py in /Users/nicolascamara/anaconda3/envs/langchain/lib/python3.9/site-packages (0.0.5)\n", - "Requirement already satisfied: requests in /Users/nicolascamara/anaconda3/envs/langchain/lib/python3.9/site-packages (from firecrawl-py) (2.31.0)\n", - "Requirement already satisfied: charset-normalizer<4,>=2 in /Users/nicolascamara/anaconda3/envs/langchain/lib/python3.9/site-packages (from requests->firecrawl-py) (3.3.2)\n", - "Requirement already satisfied: idna<4,>=2.5 in /Users/nicolascamara/anaconda3/envs/langchain/lib/python3.9/site-packages (from requests->firecrawl-py) (3.6)\n", - "Requirement already satisfied: urllib3<3,>=1.21.1 in /Users/nicolascamara/anaconda3/envs/langchain/lib/python3.9/site-packages (from requests->firecrawl-py) (2.0.7)\n", - "Requirement already satisfied: certifi>=2017.4.17 in /Users/nicolascamara/anaconda3/envs/langchain/lib/python3.9/site-packages (from requests->firecrawl-py) (2024.2.2)\n", - "Note: you may need to restart the kernel to use updated packages.\n" - ] - } - ], - "source": [ - "pip install firecrawl-py" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Usage" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You will need to get your own API key. See https://firecrawl.dev" - ] - }, - { - "cell_type": "code", - "execution_count": 1, + "execution_count": 2, "metadata": {}, "outputs": [], "source": [ - "from langchain_community.document_loaders import FireCrawlLoader" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [], - "source": [ - "loader = FireCrawlLoader(\n", - " api_key=\"YOUR_API_KEY\", url=\"https://firecrawl.dev\", mode=\"crawl\"\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": {}, - "outputs": [], - "source": [ - "docs = loader.load()" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "[Document(page_content='[Skip to content](#skip)\\n\\n[🔥 FireCrawl](/)\\n\\n[Playground](/playground)\\n[Pricing](/pricing)\\n\\n[Log In](/signin)\\n[Log In](/signin)\\n[Sign Up](/signin/signup)\\n\\n![Slack Logo](/images/slack_logo_icon.png)\\n\\nNew message in: #coach-gtm\\n==========================\\n\\n@CoachGTM: Your meeting prep for Pied Piper < > WindFlow Dynamics is ready! Meeting starts in 30 minutes\\n\\nTurn websites into \\n_LLM-ready_ data\\n=====================================\\n\\nCrawl and convert any website into clean markdown\\n\\nTry now (100 free credits)No credit card required\\n\\nA product by\\n\\n[![Mendable Logo](/images/mendable_logo_transparent.png)Mendable](https://mendable.ai)\\n\\n![Mendable Website Image](/mendable-hero-8.png)\\n\\nCrawl, Capture, Clean\\n---------------------\\n\\nWe crawl all accessible subpages and give you clean markdown for each. No sitemap required.\\n\\n \\n [\\\\\\n {\\\\\\n \"url\": \"https://www.mendable.ai/\",\\\\\\n \"markdown\": \"## Welcome to Mendable\\\\\\n Mendable empowers teams with AI-driven solutions - \\\\\\n streamlining sales and support.\"\\\\\\n },\\\\\\n {\\\\\\n \"url\": \"https://www.mendable.ai/features\",\\\\\\n \"markdown\": \"## Features\\\\\\n Discover how Mendable\\'s cutting-edge features can \\\\\\n transform your business operations.\"\\\\\\n },\\\\\\n {\\\\\\n \"url\": \"https://www.mendable.ai/pricing\",\\\\\\n \"markdown\": \"## Pricing Plans\\\\\\n Choose the perfect plan that fits your business needs.\"\\\\\\n },\\\\\\n {\\\\\\n \"url\": \"https://www.mendable.ai/about\",\\\\\\n \"markdown\": \"## About Us\\\\\\n \\\\\\n Learn more about Mendable\\'s mission and the \\\\\\n team behind our innovative platform.\"\\\\\\n },\\\\\\n {\\\\\\n \"url\": \"https://www.mendable.ai/contact\",\\\\\\n \"markdown\": \"## Contact Us\\\\\\n Get in touch with us for any queries or support.\"\\\\\\n },\\\\\\n {\\\\\\n \"url\": \"https://www.mendable.ai/blog\",\\\\\\n \"markdown\": \"## Blog\\\\\\n Stay updated with the latest news and insights from Mendable.\"\\\\\\n }\\\\\\n ]\\n \\n\\nNote: The markdown has been edited for display purposes.\\n\\nWe handle the hard stuff\\n------------------------\\n\\nReverse proxyies, caching, rate limits, js-blocked content and more...\\n\\n#### Crawling\\n\\nFireCrawl crawls all accessible subpages, even without a sitemap.\\n\\n#### Dynamic content\\n\\nFireCrawl gathers data even if a website uses javascript to render content.\\n\\n#### To Markdown\\n\\nFireCrawl returns clean, well formatted markdown - ready for use in LLM applications\\n\\n#### Continuous updates\\n\\nSchedule syncs with FireCrawl. No cron jobs or orchestration required.\\n\\n#### Caching\\n\\nFireCrawl caches content, so you don\\'t have to wait for a full scrape unless new content exists.\\n\\n#### Built for AI\\n\\nBuilt by LLM engineers, for LLM engineers. Giving you clean data the way you want it.\\n\\nPricing Plans\\n=============\\n\\nStarter\\n-------\\n\\n50k credits ($1.00/1k)\\n\\n$50/month\\n\\n* Scrape 50,000 pages\\n* Credits valid for 6 months\\n* 2 simultaneous scrapers\\\\*\\n\\nSubscribe\\n\\nStandard\\n--------\\n\\n500k credits ($0.75/1k)\\n\\n$375/month\\n\\n* Scrape 500,000 pages\\n* Credits valid for 6 months\\n* 4 simultaneous scrapers\\\\*\\n\\nSubscribe\\n\\nScale\\n-----\\n\\n12.5M credits ($0.30/1k)\\n\\n$1,250/month\\n\\n* Scrape 2,500,000 pages\\n* Credits valid for 6 months\\n* 10 simultaneous scrapes\\\\*\\n\\nSubscribe\\n\\n\\\\* a \"scraper\" refers to how many scraper jobs you can simultaneously submit.\\n\\nWhat sites work?\\n----------------\\n\\nFirecrawl is best suited for business websites, docs and help centers.\\n\\nBuisness websites\\n\\nGathering business intelligence or connecting company data to your AI\\n\\nBlogs, Documentation and Help centers\\n\\nGather content from documentation and other textual sources\\n\\nSocial Media\\n\\nComing soon\\n\\n![Feature 01](/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Fexample-business-2.b6c6b56a.png&w=1920&q=75)\\n\\n![Feature 02](/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Fexample-docs-sites.11eef02d.png&w=1920&q=75)\\n\\nComing Soon\\n-----------\\n\\n[But I want it now!](https://calendly.com/d/cp3d-rvx-58g/mendable-meeting)\\n\\\\* Schedule a meeting\\n\\n![Feature 04](/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Fexample-business-2.b6c6b56a.png&w=1920&q=75)\\n\\n![Slack Logo](/images/slack_logo_icon.png)\\n\\nNew message in: #coach-gtm\\n==========================\\n\\n@CoachGTM: Your meeting prep for Pied Piper < > WindFlow Dynamics is ready! Meeting starts in 30 minutes\\n\\n[🔥](/)\\n\\nReady to _Build?_\\n-----------------\\n\\n[Meet with us](https://calendly.com/d/cp3d-rvx-58g/mendable-meeting)\\n\\n[Try 100 queries free](/signin)\\n\\n[Discord](https://discord.gg/gSmWdAkdwd)\\n\\nFAQ\\n---\\n\\nFrequently asked questions about FireCrawl\\n\\nWhat is FireCrawl?\\n\\nFireCrawl is an advanced web crawling and data conversion tool designed to transform any website into clean, LLM-ready markdown. Ideal for AI developers and data scientists, it automates the collection, cleaning, and formatting of web data, streamlining the preparation process for Large Language Model (LLM) applications.\\n\\nHow does FireCrawl handle dynamic content on websites?\\n\\nUnlike traditional web scrapers, FireCrawl is equipped to handle dynamic content rendered with JavaScript. It ensures comprehensive data collection from all accessible subpages, making it a reliable tool for scraping websites that rely heavily on JS for content delivery.\\n\\nCan FireCrawl crawl websites without a sitemap?\\n\\nYes, FireCrawl can access and crawl all accessible subpages of a website, even in the absence of a sitemap. This feature enables users to gather data from a wide array of web sources with minimal setup.\\n\\nWhat formats can FireCrawl convert web data into?\\n\\nFireCrawl specializes in converting web data into clean, well-formatted markdown. This format is particularly suited for LLM applications, offering a structured yet flexible way to represent web content.\\n\\nHow does FireCrawl ensure the cleanliness of the data?\\n\\nFireCrawl employs advanced algorithms to clean and structure the scraped data, removing unnecessary elements and formatting the content into readable markdown. This process ensures that the data is ready for use in LLM applications without further preprocessing.\\n\\nIs FireCrawl suitable for large-scale data scraping projects?\\n\\nAbsolutely. FireCrawl offers various pricing plans, including a Scale plan that supports scraping of millions of pages. With features like caching and scheduled syncs, it\\'s designed to efficiently handle large-scale data scraping and continuous updates, making it ideal for enterprises and large projects.\\n\\nWhat measures does FireCrawl take to handle web scraping challenges like rate limits and caching?\\n\\nFireCrawl is built to navigate common web scraping challenges, including reverse proxies, rate limits, and caching. It smartly manages requests and employs caching techniques to minimize bandwidth usage and avoid triggering anti-scraping mechanisms, ensuring reliable data collection.\\n\\nHow can I try FireCrawl?\\n\\nYou can start with FireCrawl by trying our free trial, which includes 100 pages. This trial allows you to experience firsthand how FireCrawl can streamline your data collection and conversion processes. Sign up and begin transforming web content into LLM-ready data today!\\n\\nWho can benefit from using FireCrawl?\\n\\nFireCrawl is tailored for LLM engineers, data scientists, AI researchers, and developers looking to harness web data for training machine learning models, market research, content aggregation, and more. It simplifies the data preparation process, allowing professionals to focus on insights and model development.\\n\\n[🔥](/)\\n\\n© A product by Mendable.ai - All rights reserved.\\n\\n[Twitter](https://twitter.com/mendableai)\\n[GitHub](https://github.com/sideguide)\\n[Discord](https://discord.gg/gSmWdAkdwd)\\n\\nBacked by![Y Combinator Logo](/images/yc.svg)\\n\\n![SOC 2 Type II](/soc2type2badge.png)\\n\\n###### Company\\n\\n* [About us](#0)\\n \\n* [Diversity & Inclusion](#0)\\n \\n* [Blog](#0)\\n \\n* [Careers](#0)\\n \\n* [Financial statements](#0)\\n \\n\\n###### Resources\\n\\n* [Community](#0)\\n \\n* [Terms of service](#0)\\n \\n* [Collaboration features](#0)\\n \\n\\n###### Legals\\n\\n* [Refund policy](#0)\\n \\n* [Terms & Conditions](#0)\\n \\n* [Privacy policy](#0)\\n \\n* [Brand Kit](#0)', metadata={'title': 'Home - FireCrawl', 'description': 'FireCrawl crawls and converts any website into clean markdown.', 'language': None, 'sourceURL': 'https://firecrawl.dev/'}),\n", - " Document(page_content='[Skip to content](#skip)\\n\\n[🔥 FireCrawl](/)\\n\\n[Playground](/playground)\\n[Pricing](/pricing)\\n\\n[Log In](/signin)\\n[Log In](/signin)\\n[Sign Up](/signin/signup)\\n\\nPricing Plans\\n=============\\n\\nStarter\\n-------\\n\\n50k credits ($1.00/1k)\\n\\n$50/month\\n\\n* Scrape 50,000 pages\\n* Credits valid for 6 months\\n* 2 simultaneous scrapers\\\\*\\n\\nSubscribe\\n\\nStandard\\n--------\\n\\n500k credits ($0.75/1k)\\n\\n$375/month\\n\\n* Scrape 500,000 pages\\n* Credits valid for 6 months\\n* 4 simultaneous scrapers\\\\*\\n\\nSubscribe\\n\\nScale\\n-----\\n\\n12.5M credits ($0.30/1k)\\n\\n$1,250/month\\n\\n* Scrape 2,500,000 pages\\n* Credits valid for 6 months\\n* 10 simultaneous scrapes\\\\*\\n\\nSubscribe\\n\\n\\\\* a \"scraper\" refers to how many scraper jobs you can simultaneously submit.\\n\\n[🔥](/)\\n\\n© A product by Mendable.ai - All rights reserved.\\n\\n[Twitter](https://twitter.com/mendableai)\\n[GitHub](https://github.com/sideguide)\\n[Discord](https://discord.gg/gSmWdAkdwd)\\n\\nBacked by![Y Combinator Logo](/images/yc.svg)\\n\\n![SOC 2 Type II](/soc2type2badge.png)\\n\\n###### Company\\n\\n* [About us](#0)\\n \\n* [Diversity & Inclusion](#0)\\n \\n* [Blog](#0)\\n \\n* [Careers](#0)\\n \\n* [Financial statements](#0)\\n \\n\\n###### Resources\\n\\n* [Community](#0)\\n \\n* [Terms of service](#0)\\n \\n* [Collaboration features](#0)\\n \\n\\n###### Legals\\n\\n* [Refund policy](#0)\\n \\n* [Terms & Conditions](#0)\\n \\n* [Privacy policy](#0)\\n \\n* [Brand Kit](#0)', metadata={'title': 'FireCrawl', 'description': 'Turn any website into LLM-ready data.', 'language': None, 'sourceURL': 'https://firecrawl.dev/pricing'})]" - ] - }, - "execution_count": 10, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "docs" + "import getpass\n", + "import os\n", + "\n", + "if \"FIRECRAWL_API_KEY\" not in os.environ:\n", + " os.environ[\"FIRECRAWL_API_KEY\"] = getpass.getpass(\"Enter your Firecrawl API key: \")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Modes\n", + "### Installation\n", + "\n", + "You will need to install both the `langchain_community` and `firecrawl-py` pacakges:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%pip install -qU firecrawl-py langchain_community" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Initialization\n", + "\n", + "### Modes\n", "\n", "- `scrape`: Scrape single url and return the markdown.\n", "- `crawl`: Crawl the url and all accessible sub pages and return the markdown for each one." @@ -118,44 +78,104 @@ }, { "cell_type": "code", - "execution_count": 17, + "execution_count": 3, "metadata": {}, "outputs": [], "source": [ - "loader = FireCrawlLoader(\n", - " api_key=\"YOUR_API_KEY\",\n", - " url=\"https://firecrawl.dev\",\n", - " mode=\"scrape\",\n", - ")" + "from langchain_community.document_loaders import FireCrawlLoader\n", + "\n", + "loader = FireCrawlLoader(url=\"https://firecrawl.dev\", mode=\"crawl\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Load" ] }, { "cell_type": "code", - "execution_count": 18, - "metadata": {}, - "outputs": [], - "source": [ - "data = loader.load()" - ] - }, - { - "cell_type": "code", - "execution_count": 19, + "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "[Document(page_content='[Skip to content](#skip)\\n\\n[🔥 FireCrawl](/)\\n\\n[Playground](/playground)\\n[Pricing](/pricing)\\n\\n[Log In](/signin)\\n[Log In](/signin)\\n[Sign Up](/signin/signup)\\n\\n![Slack Logo](/images/slack_logo_icon.png)\\n\\nNew message in: #coach-gtm\\n==========================\\n\\n@CoachGTM: Your meeting prep for Pied Piper < > WindFlow Dynamics is ready! Meeting starts in 30 minutes\\n\\nTurn websites into \\n_LLM-ready_ data\\n=====================================\\n\\nCrawl and convert any website into clean markdown\\n\\nTry now (100 free credits)No credit card required\\n\\nA product by\\n\\n[![Mendable Logo](/images/mendable_logo_transparent.png)Mendable](https://mendable.ai)\\n\\n![Mendable Website Image](/mendable-hero-8.png)\\n\\nCrawl, Capture, Clean\\n---------------------\\n\\nWe crawl all accessible subpages and give you clean markdown for each. No sitemap required.\\n\\n \\n [\\\\\\n {\\\\\\n \"url\": \"https://www.mendable.ai/\",\\\\\\n \"markdown\": \"## Welcome to Mendable\\\\\\n Mendable empowers teams with AI-driven solutions - \\\\\\n streamlining sales and support.\"\\\\\\n },\\\\\\n {\\\\\\n \"url\": \"https://www.mendable.ai/features\",\\\\\\n \"markdown\": \"## Features\\\\\\n Discover how Mendable\\'s cutting-edge features can \\\\\\n transform your business operations.\"\\\\\\n },\\\\\\n {\\\\\\n \"url\": \"https://www.mendable.ai/pricing\",\\\\\\n \"markdown\": \"## Pricing Plans\\\\\\n Choose the perfect plan that fits your business needs.\"\\\\\\n },\\\\\\n {\\\\\\n \"url\": \"https://www.mendable.ai/about\",\\\\\\n \"markdown\": \"## About Us\\\\\\n \\\\\\n Learn more about Mendable\\'s mission and the \\\\\\n team behind our innovative platform.\"\\\\\\n },\\\\\\n {\\\\\\n \"url\": \"https://www.mendable.ai/contact\",\\\\\\n \"markdown\": \"## Contact Us\\\\\\n Get in touch with us for any queries or support.\"\\\\\\n },\\\\\\n {\\\\\\n \"url\": \"https://www.mendable.ai/blog\",\\\\\\n \"markdown\": \"## Blog\\\\\\n Stay updated with the latest news and insights from Mendable.\"\\\\\\n }\\\\\\n ]\\n \\n\\nNote: The markdown has been edited for display purposes.\\n\\nWe handle the hard stuff\\n------------------------\\n\\nReverse proxyies, caching, rate limits, js-blocked content and more...\\n\\n#### Crawling\\n\\nFireCrawl crawls all accessible subpages, even without a sitemap.\\n\\n#### Dynamic content\\n\\nFireCrawl gathers data even if a website uses javascript to render content.\\n\\n#### To Markdown\\n\\nFireCrawl returns clean, well formatted markdown - ready for use in LLM applications\\n\\n#### Continuous updates\\n\\nSchedule syncs with FireCrawl. No cron jobs or orchestration required.\\n\\n#### Caching\\n\\nFireCrawl caches content, so you don\\'t have to wait for a full scrape unless new content exists.\\n\\n#### Built for AI\\n\\nBuilt by LLM engineers, for LLM engineers. Giving you clean data the way you want it.\\n\\nPricing Plans\\n=============\\n\\nStarter\\n-------\\n\\n50k credits ($1.00/1k)\\n\\n$50/month\\n\\n* Scrape 50,000 pages\\n* Credits valid for 6 months\\n* 2 simultaneous scrapers\\\\*\\n\\nSubscribe\\n\\nStandard\\n--------\\n\\n500k credits ($0.75/1k)\\n\\n$375/month\\n\\n* Scrape 500,000 pages\\n* Credits valid for 6 months\\n* 4 simultaneous scrapers\\\\*\\n\\nSubscribe\\n\\nScale\\n-----\\n\\n12.5M credits ($0.30/1k)\\n\\n$1,250/month\\n\\n* Scrape 2,500,000 pages\\n* Credits valid for 6 months\\n* 10 simultaneous scrapes\\\\*\\n\\nSubscribe\\n\\n\\\\* a \"scraper\" refers to how many scraper jobs you can simultaneously submit.\\n\\nWhat sites work?\\n----------------\\n\\nFirecrawl is best suited for business websites, docs and help centers.\\n\\nBuisness websites\\n\\nGathering business intelligence or connecting company data to your AI\\n\\nBlogs, Documentation and Help centers\\n\\nGather content from documentation and other textual sources\\n\\nSocial Media\\n\\nComing soon\\n\\n![Feature 01](/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Fexample-business-2.b6c6b56a.png&w=1920&q=75)\\n\\n![Feature 02](/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Fexample-docs-sites.11eef02d.png&w=1920&q=75)\\n\\nComing Soon\\n-----------\\n\\n[But I want it now!](https://calendly.com/d/cp3d-rvx-58g/mendable-meeting)\\n\\\\* Schedule a meeting\\n\\n![Feature 04](/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Fexample-business-2.b6c6b56a.png&w=1920&q=75)\\n\\n![Slack Logo](/images/slack_logo_icon.png)\\n\\nNew message in: #coach-gtm\\n==========================\\n\\n@CoachGTM: Your meeting prep for Pied Piper < > WindFlow Dynamics is ready! Meeting starts in 30 minutes\\n\\n[🔥](/)\\n\\nReady to _Build?_\\n-----------------\\n\\n[Meet with us](https://calendly.com/d/cp3d-rvx-58g/mendable-meeting)\\n\\n[Try 100 queries free](/signin)\\n\\n[Discord](https://discord.gg/gSmWdAkdwd)\\n\\nFAQ\\n---\\n\\nFrequently asked questions about FireCrawl\\n\\nWhat is FireCrawl?\\n\\nFireCrawl is an advanced web crawling and data conversion tool designed to transform any website into clean, LLM-ready markdown. Ideal for AI developers and data scientists, it automates the collection, cleaning, and formatting of web data, streamlining the preparation process for Large Language Model (LLM) applications.\\n\\nHow does FireCrawl handle dynamic content on websites?\\n\\nUnlike traditional web scrapers, FireCrawl is equipped to handle dynamic content rendered with JavaScript. It ensures comprehensive data collection from all accessible subpages, making it a reliable tool for scraping websites that rely heavily on JS for content delivery.\\n\\nCan FireCrawl crawl websites without a sitemap?\\n\\nYes, FireCrawl can access and crawl all accessible subpages of a website, even in the absence of a sitemap. This feature enables users to gather data from a wide array of web sources with minimal setup.\\n\\nWhat formats can FireCrawl convert web data into?\\n\\nFireCrawl specializes in converting web data into clean, well-formatted markdown. This format is particularly suited for LLM applications, offering a structured yet flexible way to represent web content.\\n\\nHow does FireCrawl ensure the cleanliness of the data?\\n\\nFireCrawl employs advanced algorithms to clean and structure the scraped data, removing unnecessary elements and formatting the content into readable markdown. This process ensures that the data is ready for use in LLM applications without further preprocessing.\\n\\nIs FireCrawl suitable for large-scale data scraping projects?\\n\\nAbsolutely. FireCrawl offers various pricing plans, including a Scale plan that supports scraping of millions of pages. With features like caching and scheduled syncs, it\\'s designed to efficiently handle large-scale data scraping and continuous updates, making it ideal for enterprises and large projects.\\n\\nWhat measures does FireCrawl take to handle web scraping challenges like rate limits and caching?\\n\\nFireCrawl is built to navigate common web scraping challenges, including reverse proxies, rate limits, and caching. It smartly manages requests and employs caching techniques to minimize bandwidth usage and avoid triggering anti-scraping mechanisms, ensuring reliable data collection.\\n\\nHow can I try FireCrawl?\\n\\nYou can start with FireCrawl by trying our free trial, which includes 100 pages. This trial allows you to experience firsthand how FireCrawl can streamline your data collection and conversion processes. Sign up and begin transforming web content into LLM-ready data today!\\n\\nWho can benefit from using FireCrawl?\\n\\nFireCrawl is tailored for LLM engineers, data scientists, AI researchers, and developers looking to harness web data for training machine learning models, market research, content aggregation, and more. It simplifies the data preparation process, allowing professionals to focus on insights and model development.\\n\\n[🔥](/)\\n\\n© A product by Mendable.ai - All rights reserved.\\n\\n[Twitter](https://twitter.com/mendableai)\\n[GitHub](https://github.com/sideguide)\\n[Discord](https://discord.gg/gSmWdAkdwd)\\n\\nBacked by![Y Combinator Logo](/images/yc.svg)\\n\\n![SOC 2 Type II](/soc2type2badge.png)\\n\\n###### Company\\n\\n* [About us](#0)\\n \\n* [Diversity & Inclusion](#0)\\n \\n* [Blog](#0)\\n \\n* [Careers](#0)\\n \\n* [Financial statements](#0)\\n \\n\\n###### Resources\\n\\n* [Community](#0)\\n \\n* [Terms of service](#0)\\n \\n* [Collaboration features](#0)\\n \\n\\n###### Legals\\n\\n* [Refund policy](#0)\\n \\n* [Terms & Conditions](#0)\\n \\n* [Privacy policy](#0)\\n \\n* [Brand Kit](#0)', metadata={'title': 'Home - FireCrawl', 'description': 'FireCrawl crawls and converts any website into clean markdown.', 'language': None, 'sourceURL': 'https://firecrawl.dev'})]" + "Document(metadata={'ogUrl': 'https://www.firecrawl.dev/', 'title': 'Home - Firecrawl', 'robots': 'follow, index', 'ogImage': 'https://www.firecrawl.dev/og.png?123', 'ogTitle': 'Firecrawl', 'sitemap': {'lastmod': '2024-08-02T12:09:44.527Z', 'priority': 0.5, 'changefreq': 'weekly'}, 'keywords': 'Firecrawl,Markdown,Data,Mendable,Langchain', 'sourceURL': 'https://www.firecrawl.dev/', 'ogSiteName': 'Firecrawl', 'description': 'Firecrawl crawls and converts any website into clean markdown.', 'ogDescription': 'Turn any website into LLM-ready data.', 'pageStatusCode': 500, 'ogLocaleAlternate': []}, page_content='Introducing [Smart Crawl!](https://www.firecrawl.dev/smart-crawl)\\n Join the waitlist to turn any website into an API with AI\\n\\n\\n\\n[🔥 Firecrawl](/)\\n\\n* [Playground](/playground)\\n \\n* [Docs](https://docs.firecrawl.dev)\\n \\n* [Pricing](/pricing)\\n \\n* [Blog](/blog)\\n \\n* Beta Features\\n\\n[Log In](/signin)\\n[Log In](/signin)\\n[Sign Up](/signin/signup)\\n 8.6k\\n\\n[💥 Get 2 months free with yearly plan](/pricing)\\n\\nTurn websites into \\n_LLM-ready_ data\\n=====================================\\n\\nPower your AI apps with clean data crawled from any website. It\\'s also open-source.\\n\\nStart for free (500 credits)Start for free[Talk to us](https://calendly.com/d/cj83-ngq-knk/meet-firecrawl)\\n\\nA product by\\n\\n[![Mendable Logo](https://www.firecrawl.dev/images/mendable_logo_transparent.png)Mendable](https://mendable.ai)\\n\\n![Example Webpage](https://www.firecrawl.dev/multiple-websites.png)\\n\\nCrawl, Scrape, Clean\\n--------------------\\n\\nWe crawl all accessible subpages and give you clean markdown for each. No sitemap required.\\n\\n \\n [\\\\\\n {\\\\\\n \"url\": \"https://www.firecrawl.dev/\",\\\\\\n \"markdown\": \"## Welcome to Firecrawl\\\\\\n Firecrawl is a web scraper that allows you to extract the content of a webpage.\"\\\\\\n },\\\\\\n {\\\\\\n \"url\": \"https://www.firecrawl.dev/features\",\\\\\\n \"markdown\": \"## Features\\\\\\n Discover how Firecrawl\\'s cutting-edge features can \\\\\\n transform your data operations.\"\\\\\\n },\\\\\\n {\\\\\\n \"url\": \"https://www.firecrawl.dev/pricing\",\\\\\\n \"markdown\": \"## Pricing Plans\\\\\\n Choose the perfect plan that fits your needs.\"\\\\\\n },\\\\\\n {\\\\\\n \"url\": \"https://www.firecrawl.dev/about\",\\\\\\n \"markdown\": \"## About Us\\\\\\n Learn more about Firecrawl\\'s mission and the \\\\\\n team behind our innovative platform.\"\\\\\\n }\\\\\\n ]\\n \\n\\nNote: The markdown has been edited for display purposes.\\n\\nTrusted by Top Companies\\n------------------------\\n\\n[![Customer Logo](https://www.firecrawl.dev/logos/zapier.png)](https://www.zapier.com)\\n\\n[![Customer Logo](https://www.firecrawl.dev/logos/gamma.svg)](https://gamma.app)\\n\\n[![Customer Logo](https://www.firecrawl.dev/logos/nvidia-com.png)](https://www.nvidia.com)\\n\\n[![Customer Logo](https://www.firecrawl.dev/logos/teller-io.svg)](https://www.teller.io)\\n\\n[![Customer Logo](https://www.firecrawl.dev/logos/stackai.svg)](https://www.stack-ai.com)\\n\\n[![Customer Logo](https://www.firecrawl.dev/logos/palladiumdigital-co-uk.svg)](https://www.palladiumdigital.co.uk)\\n\\n[![Customer Logo](https://www.firecrawl.dev/logos/worldwide-casting-com.svg)](https://www.worldwide-casting.com)\\n\\n[![Customer Logo](https://www.firecrawl.dev/logos/open-gov-sg.png)](https://www.open.gov.sg)\\n\\n[![Customer Logo](https://www.firecrawl.dev/logos/bain-com.svg)](https://www.bain.com)\\n\\n[![Customer Logo](https://www.firecrawl.dev/logos/demand-io.svg)](https://www.demand.io)\\n\\n[![Customer Logo](https://www.firecrawl.dev/logos/nocodegarden-io.png)](https://www.nocodegarden.io)\\n\\n[![Customer Logo](https://www.firecrawl.dev/logos/cyberagent-co-jp.svg)](https://www.cyberagent.co.jp)\\n\\nIntegrate today\\n---------------\\n\\nEnhance your applications with top-tier web scraping and crawling capabilities.\\n\\n#### Scrape\\n\\nExtract markdown or structured data from websites quickly and efficiently.\\n\\n#### Crawling\\n\\nNavigate and retrieve data from all accessible subpages, even without a sitemap.\\n\\nNode.js\\n\\nPython\\n\\ncURL\\n\\n1\\n\\n2\\n\\n3\\n\\n4\\n\\n5\\n\\n6\\n\\n7\\n\\n8\\n\\n9\\n\\n10\\n\\n// npm install @mendable/firecrawl-js \\n \\nimport FirecrawlApp from \\'@mendable/firecrawl-js\\'; \\n \\nconst app \\\\= new FirecrawlApp({ apiKey: \"fc-YOUR\\\\_API\\\\_KEY\" }); \\n \\n// Scrape a website: \\nconst scrapeResult \\\\= await app.scrapeUrl(\\'firecrawl.dev\\'); \\n \\nconsole.log(scrapeResult.data.markdown)\\n\\n#### Use well-known tools\\n\\nAlready fully integrated with the greatest existing tools and workflows.\\n\\n[![LlamaIndex](https://www.firecrawl.dev/logos/llamaindex.svg)](https://docs.llamaindex.ai/en/stable/examples/data_connectors/WebPageDemo/#using-firecrawl-reader/)\\n[![Langchain](https://www.firecrawl.dev/integrations/langchain.png)](https://python.langchain.com/v0.2/docs/integrations/document_loaders/firecrawl/)\\n[![Dify](https://www.firecrawl.dev/logos/dify.png)](https://dify.ai/blog/dify-ai-blog-integrated-with-firecrawl/)\\n[![Dify](https://www.firecrawl.dev/integrations/langflow_2.png)](https://www.langflow.org/)\\n[![Flowise](https://www.firecrawl.dev/integrations/flowise.png)](https://flowiseai.com/)\\n[![CrewAI](https://www.firecrawl.dev/integrations/crewai.png)](https://crewai.com/)\\n\\n#### Start for free, scale easily\\n\\nKick off your journey for free and scale seamlessly as your project expands.\\n\\n[Try it out](/signin/signup)\\n\\n#### Open-source\\n\\nDeveloped transparently and collaboratively. Join our community of contributors.\\n\\n[Check out our repo](https://github.com/mendableai/firecrawl)\\n\\nWe handle the hard stuff\\n------------------------\\n\\nRotating proxies, caching, rate limits, js-blocked content and more\\n\\n#### Crawling\\n\\nFirecrawl crawls all accessible subpages, even without a sitemap.\\n\\n#### Dynamic content\\n\\nFirecrawl gathers data even if a website uses javascript to render content.\\n\\n#### To Markdown\\n\\nFirecrawl returns clean, well formatted markdown - ready for use in LLM applications\\n\\n#### Crawling Orchestration\\n\\nFirecrawl orchestrates the crawling process in parallel for the fastest results.\\n\\n#### Caching\\n\\nFirecrawl caches content, so you don\\'t have to wait for a full scrape unless new content exists.\\n\\n#### Built for AI\\n\\nBuilt by LLM engineers, for LLM engineers. Giving you clean data the way you want it.\\n\\nOur wall of love\\n\\nDon\\'t take our word for it\\n--------------------------\\n\\n![Greg Kamradt](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-02.0afeb750.jpg&w=96&q=75)\\n\\nGreg Kamradt\\n\\n[@GregKamradt](https://twitter.com/GregKamradt/status/1780300642197840307)\\n\\nLLM structured data via API, handling requests, cleaning, and crawling. Enjoyed the early preview.\\n\\n![Amit Naik](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-03.ff5dbe11.jpg&w=96&q=75)\\n\\nAmit Naik\\n\\n[@suprgeek](https://twitter.com/suprgeek/status/1780338213351035254)\\n\\n#llm success with RAG relies on Retrieval. Firecrawl by @mendableai structures web content for processing. 👏\\n\\n![Jerry Liu](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-04.76bef0df.jpg&w=96&q=75)\\n\\nJerry Liu\\n\\n[@jerryjliu0](https://twitter.com/jerryjliu0/status/1781122933349572772)\\n\\nFirecrawl is awesome 🔥 Turns web pages into structured markdown for LLM apps, thanks to @mendableai.\\n\\n![Bardia Pourvakil](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-01.025350bc.jpeg&w=96&q=75)\\n\\nBardia Pourvakil\\n\\n[@thepericulum](https://twitter.com/thepericulum/status/1781397799487078874)\\n\\nThese guys ship. I wanted types for their node SDK, and less than an hour later, I got them. Can\\'t recommend them enough.\\n\\n![latentsauce 🧘🏽](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-07.c2285d35.jpeg&w=96&q=75)\\n\\nlatentsauce 🧘🏽\\n\\n[@latentsauce](https://twitter.com/latentsauce/status/1781738253927735331)\\n\\nFirecrawl simplifies data preparation significantly, exactly what I was hoping for. Thank you for creating Firecrawl ❤️❤️❤️\\n\\n![Greg Kamradt](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-02.0afeb750.jpg&w=96&q=75)\\n\\nGreg Kamradt\\n\\n[@GregKamradt](https://twitter.com/GregKamradt/status/1780300642197840307)\\n\\nLLM structured data via API, handling requests, cleaning, and crawling. Enjoyed the early preview.\\n\\n![Amit Naik](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-03.ff5dbe11.jpg&w=96&q=75)\\n\\nAmit Naik\\n\\n[@suprgeek](https://twitter.com/suprgeek/status/1780338213351035254)\\n\\n#llm success with RAG relies on Retrieval. Firecrawl by @mendableai structures web content for processing. 👏\\n\\n![Jerry Liu](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-04.76bef0df.jpg&w=96&q=75)\\n\\nJerry Liu\\n\\n[@jerryjliu0](https://twitter.com/jerryjliu0/status/1781122933349572772)\\n\\nFirecrawl is awesome 🔥 Turns web pages into structured markdown for LLM apps, thanks to @mendableai.\\n\\n![Bardia Pourvakil](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-01.025350bc.jpeg&w=96&q=75)\\n\\nBardia Pourvakil\\n\\n[@thepericulum](https://twitter.com/thepericulum/status/1781397799487078874)\\n\\nThese guys ship. I wanted types for their node SDK, and less than an hour later, I got them. Can\\'t recommend them enough.\\n\\n![latentsauce 🧘🏽](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-07.c2285d35.jpeg&w=96&q=75)\\n\\nlatentsauce 🧘🏽\\n\\n[@latentsauce](https://twitter.com/latentsauce/status/1781738253927735331)\\n\\nFirecrawl simplifies data preparation significantly, exactly what I was hoping for. Thank you for creating Firecrawl ❤️❤️❤️\\n\\n![Michael Ning](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-05.76d7cd3e.png&w=96&q=75)\\n\\nMichael Ning\\n\\n[](#)\\n\\nFirecrawl is impressive, saving us 2/3 the tokens and allowing gpt3.5turbo use over gpt4. Major savings in time and money.\\n\\n![Alex Reibman 🖇️](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-06.4ee7cf5a.jpeg&w=96&q=75)\\n\\nAlex Reibman 🖇️\\n\\n[@AlexReibman](https://twitter.com/AlexReibman/status/1780299595484131836)\\n\\nMoved our internal agent\\'s web scraping tool from Apify to Firecrawl because it benchmarked 50x faster with AgentOps.\\n\\n![Michael](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-08.0bed40be.jpeg&w=96&q=75)\\n\\nMichael\\n\\n[@michael\\\\_chomsky](#)\\n\\nI really like some of the design decisions Firecrawl made, so I really want to share with others.\\n\\n![Paul Scott](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-09.d303b5b4.png&w=96&q=75)\\n\\nPaul Scott\\n\\n[@palebluepaul](https://twitter.com/palebluepaul)\\n\\nAppreciating your lean approach, Firecrawl ticks off everything on our list without the cost prohibitive overkill.\\n\\n![Michael Ning](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-05.76d7cd3e.png&w=96&q=75)\\n\\nMichael Ning\\n\\n[](#)\\n\\nFirecrawl is impressive, saving us 2/3 the tokens and allowing gpt3.5turbo use over gpt4. Major savings in time and money.\\n\\n![Alex Reibman 🖇️](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-06.4ee7cf5a.jpeg&w=96&q=75)\\n\\nAlex Reibman 🖇️\\n\\n[@AlexReibman](https://twitter.com/AlexReibman/status/1780299595484131836)\\n\\nMoved our internal agent\\'s web scraping tool from Apify to Firecrawl because it benchmarked 50x faster with AgentOps.\\n\\n![Michael](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-08.0bed40be.jpeg&w=96&q=75)\\n\\nMichael\\n\\n[@michael\\\\_chomsky](#)\\n\\nI really like some of the design decisions Firecrawl made, so I really want to share with others.\\n\\n![Paul Scott](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-09.d303b5b4.png&w=96&q=75)\\n\\nPaul Scott\\n\\n[@palebluepaul](https://twitter.com/palebluepaul)\\n\\nAppreciating your lean approach, Firecrawl ticks off everything on our list without the cost prohibitive overkill.\\n\\nFlexible Pricing\\n----------------\\n\\nStart for free, then scale as you grow\\n\\nYearly (17% off)Yearly (2 months free)Monthly\\n\\nFree Plan\\n---------\\n\\n500 credits\\n\\n$0/month\\n\\n* Scrape 500 pages\\n* 5 /scrape per min\\n* 1 /crawl per min\\n\\nGet Started\\n\\nHobby\\n-----\\n\\n3,000 credits\\n\\n$16/month\\n\\n* Scrape 3,000 pages\\n* 10 /scrape per min\\n* 3 /crawl per min\\n\\nSubscribe\\n\\nStandardMost Popular\\n--------------------\\n\\n100,000 credits\\n\\n$83/month\\n\\n* Scrape 100,000 pages\\n* 50 /scrape per min\\n* 10 /crawl per min\\n\\nSubscribe\\n\\nGrowth\\n------\\n\\n500,000 credits\\n\\n$333/month\\n\\n* Scrape 500,000 pages\\n* 500 /scrape per min\\n* 50 /crawl per min\\n* Priority Support\\n\\nSubscribe\\n\\nEnterprise Plan\\n---------------\\n\\nUnlimited credits. Custom RPMs.\\n\\nTalk to us\\n\\n* Top priority support\\n* Feature Acceleration\\n* SLAs\\n* Account Manager\\n* Custom rate limits volume\\n* Custom concurrency limits\\n* Beta features access\\n* CEO\\'s number\\n\\n\\\\* a /scrape refers to the [scrape](https://docs.firecrawl.dev/api-reference/endpoint/scrape)\\n API endpoint.\\n\\n\\\\* a /crawl refers to the [crawl](https://docs.firecrawl.dev/api-reference/endpoint/crawl)\\n API endpoint.\\n\\nScrape Credits\\n--------------\\n\\nScrape credits are consumed for each API request, varying by endpoint and feature.\\n\\n| Features | Credits per page |\\n| --- | --- |\\n| Scrape(/scrape) | 1 |\\n| Crawl(/crawl) | 1 |\\n| Search(/search) | 1 |\\n| Scrape + LLM extraction (/scrape) | 50 |\\n\\n[](/)\\n\\nReady to _Build?_\\n-----------------\\n\\nStart scraping web data for your AI apps today. \\nNo credit card needed.\\n\\n[Get Started](/signin)\\n\\n[Talk to us](https://calendly.com/d/cj83-ngq-knk/meet-firecrawl)\\n\\nFAQ\\n---\\n\\nFrequently asked questions about Firecrawl\\n\\n#### General\\n\\nWhat is Firecrawl?\\n\\nFirecrawl turns entire websites into clean, LLM-ready markdown or structured data. Scrape, crawl and extract the web with a single API. Ideal for AI companies looking to empower their LLM applications with web data.\\n\\nWhat sites work?\\n\\nFirecrawl is best suited for business websites, docs and help centers. We currently don\\'t support social media platforms.\\n\\nWho can benefit from using Firecrawl?\\n\\nFirecrawl is tailored for LLM engineers, data scientists, AI researchers, and developers looking to harness web data for training machine learning models, market research, content aggregation, and more. It simplifies the data preparation process, allowing professionals to focus on insights and model development.\\n\\nIs Firecrawl open-source?\\n\\nYes, it is. You can check out the repository on GitHub. Keep in mind that this repository is currently in its early stages of development. We are in the process of merging custom modules into this mono repository.\\n\\n#### Scraping & Crawling\\n\\nHow does Firecrawl handle dynamic content on websites?\\n\\nUnlike traditional web scrapers, Firecrawl is equipped to handle dynamic content rendered with JavaScript. It ensures comprehensive data collection from all accessible subpages, making it a reliable tool for scraping websites that rely heavily on JS for content delivery.\\n\\nWhy is it not crawling all the pages?\\n\\nThere are a few reasons why Firecrawl may not be able to crawl all the pages of a website. Some common reasons include rate limiting, and anti-scraping mechanisms, disallowing the crawler from accessing certain pages. If you\\'re experiencing issues with the crawler, please reach out to our support team at hello@firecrawl.com.\\n\\nCan Firecrawl crawl websites without a sitemap?\\n\\nYes, Firecrawl can access and crawl all accessible subpages of a website, even in the absence of a sitemap. This feature enables users to gather data from a wide array of web sources with minimal setup.\\n\\nWhat formats can Firecrawl convert web data into?\\n\\nFirecrawl specializes in converting web data into clean, well-formatted markdown. This format is particularly suited for LLM applications, offering a structured yet flexible way to represent web content.\\n\\nHow does Firecrawl ensure the cleanliness of the data?\\n\\nFirecrawl employs advanced algorithms to clean and structure the scraped data, removing unnecessary elements and formatting the content into readable markdown. This process ensures that the data is ready for use in LLM applications without further preprocessing.\\n\\nIs Firecrawl suitable for large-scale data scraping projects?\\n\\nAbsolutely. Firecrawl offers various pricing plans, including a Scale plan that supports scraping of millions of pages. With features like caching and scheduled syncs, it\\'s designed to efficiently handle large-scale data scraping and continuous updates, making it ideal for enterprises and large projects.\\n\\nDoes it respect robots.txt?\\n\\nYes, Firecrawl crawler respects the rules set in a website\\'s robots.txt file. If you notice any issues with the way Firecrawl interacts with your website, you can adjust the robots.txt file to control the crawler\\'s behavior. Firecrawl user agent name is \\'FirecrawlAgent\\'. If you notice any behavior that is not expected, please let us know at hello@firecrawl.com.\\n\\nWhat measures does Firecrawl take to handle web scraping challenges like rate limits and caching?\\n\\nFirecrawl is built to navigate common web scraping challenges, including reverse proxies, rate limits, and caching. It smartly manages requests and employs caching techniques to minimize bandwidth usage and avoid triggering anti-scraping mechanisms, ensuring reliable data collection.\\n\\nDoes Firecrawl handle captcha or authentication?\\n\\nFirecrawl avoids captcha by using stealth proxyies. When it encounters captcha, it attempts to solve it automatically, but this is not always possible. We are working to add support for more captcha solving methods. Firecrawl can handle authentication by providing auth headers to the API.\\n\\n#### API Related\\n\\nWhere can I find my API key?\\n\\nClick on the dashboard button on the top navigation menu when logged in and you will find your API key in the main screen and under API Keys.\\n\\n#### Billing\\n\\nIs Firecrawl free?\\n\\nFirecrawl is free for the first 500 scraped pages (500 free credits). After that, you can upgrade to our Standard or Scale plans for more credits.\\n\\nIs there a pay per use plan instead of monthly?\\n\\nNo we do not currently offer a pay per use plan, instead you can upgrade to our Standard or Growth plans for more credits and higher rate limits.\\n\\nHow many credit does scraping, crawling, and extraction cost?\\n\\nScraping costs 1 credit per page. Crawling costs 1 credit per page.\\n\\nDo you charge for failed requests (scrape, crawl, extract)?\\n\\nWe do not charge for any failed requests (scrape, crawl, extract). Please contact support at help@firecrawl.dev if you have any questions.\\n\\nWhat payment methods do you accept?\\n\\nWe accept payments through Stripe which accepts most major credit cards, debit cards, and PayPal.\\n\\n[🔥](/)\\n\\n© A product by Mendable.ai - All rights reserved.\\n\\n[StatusStatus](https://firecrawl.betteruptime.com)\\n[Terms of ServiceTerms of Service](/terms-of-service)\\n[Privacy PolicyPrivacy Policy](/privacy-policy)\\n\\n[Twitter](https://twitter.com/mendableai)\\n[GitHub](https://github.com/mendableai)\\n[Discord](https://discord.gg/gSmWdAkdwd)\\n\\n###### Helpful Links\\n\\n* [Status](https://firecrawl.betteruptime.com/)\\n \\n* [Pricing](/pricing)\\n \\n* [Blog](https://www.firecrawl.dev/blog)\\n \\n* [Docs](https://docs.firecrawl.dev)\\n \\n\\nBacked by![Y Combinator Logo](https://www.firecrawl.dev/images/yc.svg)\\n\\n![SOC 2 Type II](https://www.firecrawl.dev/soc2type2badge.png)\\n\\n###### Resources\\n\\n* [Community](#0)\\n \\n* [Terms of service](#0)\\n \\n* [Collaboration features](#0)\\n \\n\\n###### Legals\\n\\n* [Refund policy](#0)\\n \\n* [Terms & Conditions](#0)\\n \\n* [Privacy policy](#0)\\n \\n* [Brand Kit](#0)')" ] }, - "execution_count": 19, + "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "data" + "docs = loader.load()\n", + "\n", + "docs[0]" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'ogUrl': 'https://www.firecrawl.dev/', 'title': 'Home - Firecrawl', 'robots': 'follow, index', 'ogImage': 'https://www.firecrawl.dev/og.png?123', 'ogTitle': 'Firecrawl', 'sitemap': {'lastmod': '2024-08-02T12:09:44.527Z', 'priority': 0.5, 'changefreq': 'weekly'}, 'keywords': 'Firecrawl,Markdown,Data,Mendable,Langchain', 'sourceURL': 'https://www.firecrawl.dev/', 'ogSiteName': 'Firecrawl', 'description': 'Firecrawl crawls and converts any website into clean markdown.', 'ogDescription': 'Turn any website into LLM-ready data.', 'pageStatusCode': 500, 'ogLocaleAlternate': []}\n" + ] + } + ], + "source": [ + "print(docs[0].metadata)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Lazy Load\n", + "\n", + "You can use lazy loading to minimize memory requirements." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "page = []\n", + "for doc in loader.lazy_load():\n", + " page.append(doc)\n", + " if len(page) >= 10:\n", + " # do some paged operation, e.g.\n", + " # index.upsert(page)\n", + "\n", + " page = []" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "1" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "len(page)" ] }, { @@ -167,6 +187,15 @@ "You can also pass `params` to the loader. This is a dictionary of options to pass to the crawler. See the [FireCrawl API documentation](https://github.com/mendableai/firecrawl-py) for more information.\n", "\n" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## API reference\n", + "\n", + "For detailed documentation of all `FireCrawlLoader` features and configurations head to the API reference: https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.firecrawl.FireCrawlLoader.html" + ] } ], "metadata": { @@ -185,7 +214,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.9.19" + "version": "3.11.9" } }, "nbformat": 4, diff --git a/docs/docs/integrations/document_loaders/pypdfloader.ipynb b/docs/docs/integrations/document_loaders/pypdfloader.ipynb new file mode 100644 index 00000000000..a10d096121b --- /dev/null +++ b/docs/docs/integrations/document_loaders/pypdfloader.ipynb @@ -0,0 +1,182 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# PyPDFLoader\n", + "\n", + "This notebook provides a quick overview for getting started with `PyPDF` [document loader](https://python.langchain.com/v0.2/docs/concepts/#document-loaders). For detailed documentation of all __ModuleName__Loader features and configurations head to the [API reference](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.pdf.PyPDFLoader.html).\n", + "\n", + "\n", + "## Overview\n", + "### Integration details\n", + "\n", + "\n", + "| Class | Package | Local | Serializable | JS support|\n", + "| :--- | :--- | :---: | :---: | :---: |\n", + "| [PyPDFLoader](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.pdf.PyPDFLoader.html) | [langchain_community](https://api.python.langchain.com/en/latest/community_api_reference.html) | ✅ | ❌ | ❌ | \n", + "### Loader features\n", + "| Source | Document Lazy Loading | Native Async Support\n", + "| :---: | :---: | :---: | \n", + "| PyPDFLoader | ✅ | ❌ | \n", + "\n", + "## Setup\n", + "\n", + "### Credentials\n", + "\n", + "No credentials are required to use `PyPDFLoader`." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Installation\n", + "\n", + "To use `PyPDFLoader` you need to have the `langchain-community` python package downloaded:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%pip install -qU langchain_community" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Initialization\n", + "\n", + "Now we can instantiate our model object and load documents:" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "from langchain_community.document_loaders import PyPDFLoader\n", + "\n", + "loader = PyPDFLoader(\n", + " \"./example_data/layout-parser-paper.pdf\",\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Load" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'page': 0}, page_content='LayoutParser : A Unified Toolkit for Deep\\nLearning Based Document Image Analysis\\nZejiang Shen1( \\x00), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain\\nLee4, Jacob Carlson3, and Weining Li5\\n1Allen Institute for AI\\nshannons@allenai.org\\n2Brown University\\nruochen zhang@brown.edu\\n3Harvard University\\n{melissadell,jacob carlson }@fas.harvard.edu\\n4University of Washington\\nbcgl@cs.washington.edu\\n5University of Waterloo\\nw422li@uwaterloo.ca\\nAbstract. Recent advances in document image analysis (DIA) have been\\nprimarily driven by the application of neural networks. Ideally, research\\noutcomes could be easily deployed in production and extended for further\\ninvestigation. However, various factors like loosely organized codebases\\nand sophisticated model configurations complicate the easy reuse of im-\\nportant innovations by a wide audience. Though there have been on-going\\nefforts to improve reusability and simplify deep learning (DL) model\\ndevelopment in disciplines like natural language processing and computer\\nvision, none of them are optimized for challenges in the domain of DIA.\\nThis represents a major gap in the existing toolkit, as DIA is central to\\nacademic research across a wide range of disciplines in the social sciences\\nand humanities. This paper introduces LayoutParser , an open-source\\nlibrary for streamlining the usage of DL in DIA research and applica-\\ntions. The core LayoutParser library comes with a set of simple and\\nintuitive interfaces for applying and customizing DL models for layout de-\\ntection, character recognition, and many other document processing tasks.\\nTo promote extensibility, LayoutParser also incorporates a community\\nplatform for sharing both pre-trained models and full document digiti-\\nzation pipelines. We demonstrate that LayoutParser is helpful for both\\nlightweight and large-scale digitization pipelines in real-word use cases.\\nThe library is publicly available at https://layout-parser.github.io .\\nKeywords: Document Image Analysis ·Deep Learning ·Layout Analysis\\n·Character Recognition ·Open Source library ·Toolkit.\\n1 Introduction\\nDeep Learning(DL)-based approaches are the state-of-the-art for a wide range of\\ndocument image analysis (DIA) tasks including document image classification [ 11,arXiv:2103.15348v2 [cs.CV] 21 Jun 2021')" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "docs = loader.load()\n", + "docs[0]" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'source': './example_data/layout-parser-paper.pdf', 'page': 0}\n" + ] + } + ], + "source": [ + "print(docs[0].metadata)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Lazy Load\n" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "6" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "page = []\n", + "for doc in loader.lazy_load():\n", + " page.append(doc)\n", + " if len(page) >= 10:\n", + " # do some paged operation, e.g.\n", + " # index.upsert(page)\n", + "\n", + " page = []\n", + "len(page)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## API reference\n", + "\n", + "For detailed documentation of all `PyPDFLoader` features and configurations head to the API reference: https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.pdf.PyPDFLoader.html" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.9" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/docs/docs/integrations/document_loaders/recursive_url.ipynb b/docs/docs/integrations/document_loaders/recursive_url.ipynb index fd86ba68326..310fc33a5e0 100644 --- a/docs/docs/integrations/document_loaders/recursive_url.ipynb +++ b/docs/docs/integrations/document_loaders/recursive_url.ipynb @@ -7,7 +7,18 @@ "source": [ "# Recursive URL\n", "\n", - "The `RecursiveUrlLoader` lets you recursively scrape all child links from a root URL and parse them into Documents." + "The `RecursiveUrlLoader` lets you recursively scrape all child links from a root URL and parse them into Documents.\n", + "\n", + "## Overview\n", + "### Integration details\n", + "\n", + "| Class | Package | Local | Serializable | [JS support](https://js.langchain.com/v0.2/docs/integrations/document_loaders/web_loaders/recursive_url_loader/)|\n", + "| :--- | :--- | :---: | :---: | :---: |\n", + "| [RecursiveUrlLoader](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.recursive_url_loader.RecursiveUrlLoader.html) | [langchain_community](https://api.python.langchain.com/en/latest/community_api_reference.html) | ✅ | ❌ | ✅ | \n", + "### Loader features\n", + "| Source | Document Lazy Loading | Native Async Support\n", + "| :---: | :---: | :---: | \n", + "| RecursiveUrlLoader | ✅ | ❌ | \n" ] }, { @@ -17,6 +28,12 @@ "source": [ "## Setup\n", "\n", + "### Credentials\n", + "\n", + "No credentials are required to use the `RecursiveUrlLoader`.\n", + "\n", + "### Installation\n", + "\n", "The `RecursiveUrlLoader` lives in the `langchain-community` package. There's no other required packages, though you will get richer default Document metadata if you have ``beautifulsoup4` installed as well." ] }, @@ -186,6 +203,50 @@ "That certainly looks like HTML that comes from the url https://docs.python.org/3.9/, which is what we expected. Let's now look at some variations we can make to our basic example that can be helpful in different situations. " ] }, + { + "cell_type": "markdown", + "id": "b17b7202", + "metadata": {}, + "source": [ + "## Lazy loading\n", + "\n", + "If we're loading a large number of Documents and our downstream operations can be done over subsets of all loaded Documents, we can lazily load our Documents one at a time to minimize our memory footprint:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4b13e4d1", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/var/folders/4j/2rz3865x6qg07tx43146py8h0000gn/T/ipykernel_73962/2110507528.py:6: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features=\"xml\"` into the BeautifulSoup constructor.\n", + " soup = BeautifulSoup(html, \"lxml\")\n" + ] + } + ], + "source": [ + "page = []\n", + "for doc in loader.lazy_load():\n", + " page.append(doc)\n", + " if len(page) >= 10:\n", + " # do some paged operation, e.g.\n", + " # index.upsert(page)\n", + "\n", + " page = []" + ] + }, + { + "cell_type": "markdown", + "id": "fb039682", + "metadata": {}, + "source": [ + "In this example we never have more than 10 Documents loaded into memory at a time." + ] + }, { "cell_type": "markdown", "id": "8f41cc89", @@ -256,50 +317,6 @@ "You can similarly pass in a `metadata_extractor` to customize how Document metadata is extracted from the HTTP response. See the [API reference](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.recursive_url_loader.RecursiveUrlLoader.html) for more on this." ] }, - { - "cell_type": "markdown", - "id": "1dddbc94", - "metadata": {}, - "source": [ - "## Lazy loading\n", - "\n", - "If we're loading a large number of Documents and our downstream operations can be done over subsets of all loaded Documents, we can lazily load our Documents one at a time to minimize our memory footprint:" - ] - }, - { - "cell_type": "code", - "execution_count": 15, - "id": "7d0114fc", - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/var/folders/4j/2rz3865x6qg07tx43146py8h0000gn/T/ipykernel_73962/2110507528.py:6: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features=\"xml\"` into the BeautifulSoup constructor.\n", - " soup = BeautifulSoup(html, \"lxml\")\n" - ] - } - ], - "source": [ - "page = []\n", - "for doc in loader.lazy_load():\n", - " page.append(doc)\n", - " if len(page) >= 10:\n", - " # do some paged operation, e.g.\n", - " # index.upsert(page)\n", - "\n", - " page = []" - ] - }, - { - "cell_type": "markdown", - "id": "f88a7c2f-35df-4c3a-b238-f91be2674b96", - "metadata": {}, - "source": [ - "In this example we never have more than 10 Documents loaded into memory at a time." - ] - }, { "cell_type": "markdown", "id": "3e4d1c8f", diff --git a/docs/docs/integrations/document_loaders/unstructured_file.ipynb b/docs/docs/integrations/document_loaders/unstructured_file.ipynb index ffa69b392f1..572845e2fc1 100644 --- a/docs/docs/integrations/document_loaders/unstructured_file.ipynb +++ b/docs/docs/integrations/document_loaders/unstructured_file.ipynb @@ -7,20 +7,41 @@ "source": [ "# Unstructured\n", "\n", - "This notebook covers how to use `Unstructured` package to load files of many types. `Unstructured` currently supports loading of text files, powerpoints, html, pdfs, images, and more.\n", + "This notebook covers how to use `Unstructured` [document loader](https://python.langchain.com/v0.2/docs/concepts/#document-loaders) to load files of many types. `Unstructured` currently supports loading of text files, powerpoints, html, pdfs, images, and more.\n", "\n", - "Please see [this guide](/docs/integrations/providers/unstructured/) for more instructions on setting up Unstructured locally, including setting up required system dependencies." + "Please see [this guide](../../integrations/providers/unstructured.mdx) for more instructions on setting up Unstructured locally, including setting up required system dependencies.\n", + "\n", + "## Overview\n", + "### Integration details\n", + "\n", + "| Class | Package | Local | Serializable | [JS support](https://js.langchain.com/v0.2/docs/integrations/document_loaders/file_loaders/unstructured/)|\n", + "| :--- | :--- | :---: | :---: | :---: |\n", + "| [UnstructuredLoader](https://api.python.langchain.com/en/latest/document_loaders/langchain_unstructured.document_loaders.UnstructuredLoader.html) | [langchain_community](https://api.python.langchain.com/en/latest/unstructured_api_reference.html) | ✅ | ❌ | ✅ | \n", + "### Loader features\n", + "| Source | Document Lazy Loading | Native Async Support\n", + "| :---: | :---: | :---: | \n", + "| UnstructuredLoader | ✅ | ❌ | \n", + "\n", + "## Setup\n", + "\n", + "### Credentials\n", + "\n", + "By default, `langchain-unstructured` installs a smaller footprint that requires offloading of the partitioning logic to the Unstructured API, which requires an API key. If you use the local installation, you do not need an API key. To get your API key, head over to [this site](https://unstructured.io) and get an API key, and then set it in the cell below:" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 1, "id": "2886982e", "metadata": {}, "outputs": [], "source": [ - "# Install package, compatible with API partitioning\n", - "%pip install --upgrade --quiet \"langchain-unstructured\"" + "import getpass\n", + "import os\n", + "\n", + "os.environ[\"UNSTRUCTURED_API_KEY\"] = getpass.getpass(\n", + " \"Enter your Unstructured API key: \"\n", + ")" ] }, { @@ -28,15 +49,32 @@ "id": "e75e2a6d", "metadata": {}, "source": [ - "### Local Partitioning (Optional)\n", + "### Installation\n", "\n", - "By default, `langchain-unstructured` installs a smaller footprint that requires\n", - "offloading of the partitioning logic to the Unstructured API, which requires an `api_key`. For\n", - "partitioning using the API, refer to the Unstructured API section below.\n", + "#### Normal Installation\n", "\n", - "If you would like to run the partitioning logic locally, you will need to install\n", - "a combination of system dependencies, as outlined in the \n", - "[Unstructured documentation here](https://docs.unstructured.io/open-source/installation/full-installation).\n", + "The following packages are required to run the rest of this notebook." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d9de83b3", + "metadata": {}, + "outputs": [], + "source": [ + "# Install package, compatible with API partitioning\n", + "%pip install --upgrade --quiet langchain-unstructured unstructured-client unstructured \"unstructured[pdf]\" python-magic" + ] + }, + { + "cell_type": "markdown", + "id": "637eda35", + "metadata": {}, + "source": [ + "#### Installation for Local\n", + "\n", + "If you would like to run the partitioning logic locally, you will need to install a combination of system dependencies, as outlined in the [Unstructured documentation here](https://docs.unstructured.io/open-source/installation/full-installation).\n", "\n", "For example, on Macs you can install the required dependencies with:\n", "\n", @@ -48,7 +86,7 @@ "brew install libxml2 libxslt\n", "```\n", "\n", - "You can install the required `pip` dependencies with:\n", + "You can install the required `pip` dependencies needed for local with:\n", "\n", "```bash\n", "pip install \"langchain-unstructured[local]\"\n", @@ -60,120 +98,117 @@ "id": "a9c1c775", "metadata": {}, "source": [ - "### Quickstart\n", + "## Initialization\n", "\n", - "To simply load a file as a document, you can use the LangChain `DocumentLoader.load` \n", - "interface:" + "The `UnstructuredLoader` allows loading from a variety of different file types. To read all about the `unstructured` package please refer to their [documentation](https://docs.unstructured.io/open-source/introduction/overview)/. In this example, we show loading from both a text file and a PDF file." ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 1, "id": "79d3e549", "metadata": {}, "outputs": [], "source": [ "from langchain_unstructured import UnstructuredLoader\n", "\n", - "loader = UnstructuredLoader(\"./example_data/state_of_the_union.txt\")\n", + "file_paths = [\n", + " \"./example_data/layout-parser-paper.pdf\",\n", + " \"./example_data/state_of_the_union.txt\",\n", + "]\n", "\n", - "docs = loader.load()" + "\n", + "loader = UnstructuredLoader(file_paths)" ] }, { "cell_type": "markdown", - "id": "b4ab0a79", + "id": "8b68dcab", "metadata": {}, "source": [ - "### Load list of files" + "## Load" ] }, { "cell_type": "code", - "execution_count": 5, - "id": "092d9a0b", + "execution_count": 2, + "id": "8da59ef8", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "INFO: NumExpr defaulting to 12 threads.\n", + "INFO: pikepdf C++ to Python logger bridge initialized\n" + ] + }, + { + "data": { + "text/plain": [ + "Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 213.36), (16.34, 253.36), (36.34, 253.36), (36.34, 213.36)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-07-25T21:28:58', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': 'd3ce55f220dfb75891b4394a18bcb973'}, page_content='1 2 0 2')" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "docs = loader.load()\n", + "\n", + "docs[0]" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "97f7aa1f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "whatsapp_chat.txt : 1/22/23, 6:30 PM - User 1: Hi! Im interested in your bag. Im offering $50. Let me know if you are in\n", - "state_of_the_union.txt : May God bless you all. May God protect our troops.\n" + "{'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 213.36), (16.34, 253.36), (36.34, 253.36), (36.34, 213.36)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-07-25T21:28:58', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': 'd3ce55f220dfb75891b4394a18bcb973'}\n" ] } ], "source": [ - "file_paths = [\n", - " \"./example_data/whatsapp_chat.txt\",\n", - " \"./example_data/state_of_the_union.txt\",\n", - "]\n", - "\n", - "loader = UnstructuredLoader(file_paths)\n", - "\n", - "docs = loader.load()\n", - "\n", - "print(docs[0].metadata.get(\"filename\"), \": \", docs[0].page_content[:100])\n", - "print(docs[-1].metadata.get(\"filename\"), \": \", docs[-1].page_content[:100])" + "print(docs[0].metadata)" ] }, { "cell_type": "markdown", - "id": "8de9ef16", + "id": "0d7f991b", "metadata": {}, "source": [ - "## PDF Example\n", - "\n", - "Processing PDF documents works exactly the same way. Unstructured detects the file type and extracts the same types of elements." - ] - }, - { - "cell_type": "markdown", - "id": "672733fd", - "metadata": {}, - "source": [ - "### Define a Partitioning Strategy\n", - "\n", - "Unstructured document loader allow users to pass in a `strategy` parameter that lets Unstructured\n", - "know how to partition pdf and other OCR'd documents. Currently supported strategies are `\"auto\"`,\n", - "`\"hi_res\"`, `\"ocr_only\"`, and `\"fast\"`. Learn more about the different strategies\n", - "[here](https://docs.unstructured.io/open-source/core-functionality/partitioning#partition-pdf). \n", - "\n", - "Not all document types have separate hi res and fast partitioning strategies. For those document types, the `strategy` kwarg is\n", - "ignored. In some cases, the high res strategy will fallback to fast if there is a dependency missing\n", - "(i.e. a model for document partitioning). You can see how to apply a strategy to an\n", - "`UnstructuredLoader` below." + "## Lazy Load" ] }, { "cell_type": "code", - "execution_count": 6, - "id": "60685353", + "execution_count": 4, + "id": "b05604d2", "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "[Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 393.9), (16.34, 560.0), (36.34, 560.0), (36.34, 393.9)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'parent_id': '89565df026a24279aaea20dc08cedbec', 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': 'e9fa370aef7ee5c05744eb7bb7d9981b'}, page_content='2 v 8 4 3 5 1 . 3 0 1 2 : v i X r a'),\n", - " Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((157.62199999999999, 114.23496279999995), (157.62199999999999, 146.5141628), (457.7358962799999, 146.5141628), (457.7358962799999, 114.23496279999995)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'Title', 'element_id': 'bde0b230a1aa488e3ce837d33015181b'}, page_content='LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis'),\n", - " Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((134.809, 168.64029940800003), (134.809, 192.2517444), (480.5464199080001, 192.2517444), (480.5464199080001, 168.64029940800003)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': '54700f902899f0c8c90488fa8d825bce'}, page_content='Zejiang Shen1 ((cid:0)), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain Lee4, Jacob Carlson3, and Weining Li5'),\n", - " Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((207.23000000000002, 202.57205439999996), (207.23000000000002, 311.8195408), (408.12676, 311.8195408), (408.12676, 202.57205439999996)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': 'b650f5867bad9bb4e30384282c79bcfe'}, page_content='1 Allen Institute for AI shannons@allenai.org 2 Brown University ruochen zhang@brown.edu 3 Harvard University {melissadell,jacob carlson}@fas.harvard.edu 4 University of Washington bcgl@cs.washington.edu 5 University of Waterloo w422li@uwaterloo.ca'),\n", - " Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((162.779, 338.45008160000003), (162.779, 566.8455408), (454.0372021523199, 566.8455408), (454.0372021523199, 338.45008160000003)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'links': [{'text': ':// layout - parser . github . io', 'url': 'https://layout-parser.github.io', 'start_index': 1477}], 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'NarrativeText', 'element_id': 'cfc957c94fe63c8fd7c7f4bcb56e75a7'}, page_content='Abstract. Recent advances in document image analysis (DIA) have been primarily driven by the application of neural networks. Ideally, research outcomes could be easily deployed in production and extended for further investigation. However, various factors like loosely organized codebases and sophisticated model configurations complicate the easy reuse of im- portant innovations by a wide audience. Though there have been on-going efforts to improve reusability and simplify deep learning (DL) model development in disciplines like natural language processing and computer vision, none of them are optimized for challenges in the domain of DIA. This represents a major gap in the existing toolkit, as DIA is central to academic research across a wide range of disciplines in the social sciences and humanities. This paper introduces LayoutParser, an open-source library for streamlining the usage of DL in DIA research and applica- tions. The core LayoutParser library comes with a set of simple and intuitive interfaces for applying and customizing DL models for layout de- tection, character recognition, and many other document processing tasks. To promote extensibility, LayoutParser also incorporates a community platform for sharing both pre-trained models and full document digiti- zation pipelines. We demonstrate that LayoutParser is helpful for both lightweight and large-scale digitization pipelines in real-word use cases. The library is publicly available at https://layout-parser.github.io.')]" + "Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 213.36), (16.34, 253.36), (36.34, 253.36), (36.34, 213.36)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-07-25T21:28:58', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': 'd3ce55f220dfb75891b4394a18bcb973'}, page_content='1 2 0 2')" ] }, - "execution_count": 6, + "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "from langchain_unstructured import UnstructuredLoader\n", + "pages = []\n", + "for doc in loader.lazy_load():\n", + " pages.append(doc)\n", "\n", - "loader = UnstructuredLoader(\"./example_data/layout-parser-paper.pdf\", strategy=\"fast\")\n", - "\n", - "docs = loader.load()\n", - "\n", - "docs[5:10]" + "pages[0]" ] }, { @@ -242,23 +277,6 @@ "if you’d like to self-host the Unstructured API or run it locally." ] }, - { - "cell_type": "code", - "execution_count": null, - "id": "6e5fde16", - "metadata": {}, - "outputs": [], - "source": [ - "# Install package\n", - "%pip install \"langchain-unstructured\"\n", - "%pip install \"unstructured-client\"\n", - "\n", - "# Set API key\n", - "import os\n", - "\n", - "os.environ[\"UNSTRUCTURED_API_KEY\"] = \"FAKE_API_KEY\"" - ] - }, { "cell_type": "code", "execution_count": 9, @@ -496,6 +514,16 @@ "print(\"Number of LangChain documents:\", len(docs))\n", "print(\"Length of text in the document:\", len(docs[0].page_content))" ] + }, + { + "cell_type": "markdown", + "id": "ce01aa40", + "metadata": {}, + "source": [ + "## API reference\n", + "\n", + "For detailed documentation of all `UnstructuredLoader` features and configurations head to the API reference: https://api.python.langchain.com/en/latest/document_loaders/langchain_unstructured.document_loaders.UnstructuredLoader.html" + ] } ], "metadata": { @@ -514,7 +542,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.10.13" + "version": "3.11.9" } }, "nbformat": 4, diff --git a/docs/docs/integrations/document_loaders/web_base.ipynb b/docs/docs/integrations/document_loaders/web_base.ipynb index 3bd81baf7d0..6a233b30e5a 100644 --- a/docs/docs/integrations/document_loaders/web_base.ipynb +++ b/docs/docs/integrations/document_loaders/web_base.ipynb @@ -9,26 +9,63 @@ "\n", "This covers how to use `WebBaseLoader` to load all text from `HTML` webpages into a document format that we can use downstream. For more custom logic for loading webpages look at some child class examples such as `IMSDbLoader`, `AZLyricsLoader`, and `CollegeConfidentialLoader`. \n", "\n", - "If you don't want to worry about website crawling, bypassing JS-blocking sites, and data cleaning, consider using `FireCrawlLoader` or the faster option `SpiderLoader`.\n" + "If you don't want to worry about website crawling, bypassing JS-blocking sites, and data cleaning, consider using `FireCrawlLoader` or the faster option `SpiderLoader`.\n", + "\n", + "## Overview\n", + "### Integration details\n", + "\n", + "- TODO: Fill in table features.\n", + "- TODO: Remove JS support link if not relevant, otherwise ensure link is correct.\n", + "- TODO: Make sure API reference links are correct.\n", + "\n", + "| Class | Package | Local | Serializable | JS support|\n", + "| :--- | :--- | :---: | :---: | :---: |\n", + "| [WebBaseLoader](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.web_base.WebBaseLoader.html) | [langchain_community](https://api.python.langchain.com/en/latest/community_api_reference.html) | ✅ | ❌ | ❌ | \n", + "### Loader features\n", + "| Source | Document Lazy Loading | Native Async Support\n", + "| :---: | :---: | :---: | \n", + "| WebBaseLoader | ✅ | ✅ | \n", + "\n", + "## Setup\n", + "\n", + "### Credentials\n", + "\n", + "`WebBaseLoader` does not require any credentials.\n", + "\n", + "### Installation\n", + "\n", + "To use the `WebBaseLoader` you first need to install the `langchain-community` python package.\n" ] }, { "cell_type": "code", - "execution_count": 1, + "execution_count": null, + "id": "50a6c9b6", + "metadata": {}, + "outputs": [], + "source": [ + "%pip install -qU langchain_community" + ] + }, + { + "cell_type": "markdown", + "id": "25465b06", + "metadata": {}, + "source": [ + "## Initialization\n", + "\n", + "Now we can instantiate our model object and load documents:" + ] + }, + { + "cell_type": "code", + "execution_count": null, "id": "00b6de21", "metadata": {}, "outputs": [], "source": [ - "from langchain_community.document_loaders import WebBaseLoader" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "id": "0231df35", - "metadata": {}, - "outputs": [], - "source": [ + "from langchain_community.document_loaders import WebBaseLoader\n", + "\n", "loader = WebBaseLoader(\"https://www.espn.com/\")" ] }, @@ -39,7 +76,29 @@ "source": [ "To bypass SSL verification errors during fetching, you can set the \"verify\" option:\n", "\n", - "loader.requests_kwargs = {'verify':False}" + "loader.requests_kwargs = {'verify':False}\n", + "\n", + "### Initialization with multiple pages\n", + "\n", + "You can also pass in a list of pages to load from." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e86c5d40", + "metadata": {}, + "outputs": [], + "source": [ + "loader_multiple_pages = WebBaseLoader([\"https://www.espn.com/\", \"https://google.com\"])" + ] + }, + { + "cell_type": "markdown", + "id": "1f61081e", + "metadata": {}, + "source": [ + "## Load" ] }, { @@ -47,9 +106,22 @@ "execution_count": 3, "id": "f06bdc4e", "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/plain": [ + "Document(metadata={'source': 'https://www.espn.com/', 'title': 'ESPN - Serving Sports Fans. Anytime. Anywhere.', 'description': 'Visit ESPN for live scores, highlights and sports news. Stream exclusive games on ESPN+ and play fantasy sports.', 'language': 'en'}, page_content=\"\\n\\n\\n\\n\\n\\n\\n\\n\\nESPN - Serving Sports Fans. Anytime. Anywhere.\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n Skip to main content\\n \\n\\n Skip to navigation\\n \\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n<\\n\\n>\\n\\n\\n\\n\\n\\n\\n\\n\\n\\nMenuESPN\\n\\n\\n\\n\\n\\nscores\\n\\n\\n\\nNFLNBAMLBOlympicsSoccerWNBA…BoxingCFLNCAACricketF1GolfHorseLLWSMMANASCARNBA G LeagueNBA Summer LeagueNCAAFNCAAMNCAAWNHLNWSLPLLProfessional WrestlingRacingRN BBRN FBRugbySports BettingTennisX GamesUFLMore ESPNFantasyWatchESPN BETESPN+\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n \\n\\nSubscribe Now\\n\\n\\n\\n\\n\\nBoxing: Crawford vs. Madrimov (ESPN+ PPV)\\n\\n\\n\\n\\n\\n\\n\\nPFL Playoffs: Heavyweights & Women's Flyweights\\n\\n\\n\\n\\n\\n\\n\\nMLB\\n\\n\\n\\n\\n\\n\\n\\nLittle League Baseball: Regionals\\n\\n\\n\\n\\n\\n\\n\\nIn The Arena: Serena Williams\\n\\n\\n\\n\\n\\n\\n\\nThe 30 College Football Playoff Contenders\\n\\n\\nQuick Links\\n\\n\\n\\n\\n2024 Paris Olympics\\n\\n\\n\\n\\n\\n\\n\\nOlympics: Everything to Know\\n\\n\\n\\n\\n\\n\\n\\nMLB Standings\\n\\n\\n\\n\\n\\n\\n\\nSign up: Fantasy Football\\n\\n\\n\\n\\n\\n\\n\\nWNBA Rookie Tracker\\n\\n\\n\\n\\n\\n\\n\\nNBA Free Agency Buzz\\n\\n\\n\\n\\n\\n\\n\\nLittle League Baseball, Softball\\n\\n\\n\\n\\n\\n\\n\\nESPN Radio: Listen Live\\n\\n\\n\\n\\n\\n\\n\\nWatch Golf on ESPN\\n\\n\\n\\n\\n\\n\\nFavorites\\n\\n\\n\\n\\n\\n\\n Manage Favorites\\n \\n\\n\\n\\nCustomize ESPNCreate AccountLog InFantasy\\n\\n\\n\\n\\nFootball\\n\\n\\n\\n\\n\\n\\n\\nBaseball\\n\\n\\n\\n\\n\\n\\n\\nBasketball\\n\\n\\n\\n\\n\\n\\n\\nHockey\\n\\n\\nESPN Sites\\n\\n\\n\\n\\nESPN Deportes\\n\\n\\n\\n\\n\\n\\n\\nAndscape\\n\\n\\n\\n\\n\\n\\n\\nespnW\\n\\n\\n\\n\\n\\n\\n\\nESPNFC\\n\\n\\n\\n\\n\\n\\n\\nX Games\\n\\n\\n\\n\\n\\n\\n\\nSEC Network\\n\\n\\nESPN Apps\\n\\n\\n\\n\\nESPN\\n\\n\\n\\n\\n\\n\\n\\nESPN Fantasy\\n\\n\\n\\n\\n\\n\\n\\nTournament Challenge\\n\\n\\nFollow ESPN\\n\\n\\n\\n\\nFacebook\\n\\n\\n\\n\\n\\n\\n\\nX/Twitter\\n\\n\\n\\n\\n\\n\\n\\nInstagram\\n\\n\\n\\n\\n\\n\\n\\nSnapchat\\n\\n\\n\\n\\n\\n\\n\\nTikTok\\n\\n\\n\\n\\n\\n\\n\\nYouTube\\n\\n\\nCollege football's most entertaining conference? Why the 16-team Big 12 is wiiiiiide open this seasonLong known as one of the sport's most unpredictable conferences, the new-look Big 12 promises another dose of chaos.11hBill ConnellyScott Winters/Icon SportswireUSC, Oregon and the quest to bulk up for the Big TenTo improve on D, the Trojans wanted to bulk up for a new league. They're not the only team trying to do that.10hAdam RittenbergThe 30 teams that can reach the CFPConnelly's best games of the 2024 seasonTOP HEADLINESTeam USA sets world record in 4x400 mixed relayGermany beats France, undefeated in men's hoopsU.S. men's soccer exits Games after Morocco routHungary to protest Khelif's Olympic participationKobe's Staples Center locker sells for record $2.9MDjokovic, Alcaraz to meet again, this time for goldKerr: Team USA lineups based on players' rolesMarchand wins 200m IM; McEvoy takes 50 freeScouting Shedeur Sanders' NFL futureLATEST FROM PARISBreakout stars, best moments and what comes next: Breaking down the Games halfway throughAt the midpoint of the Olympics, we look back at some of the best moments and forward at what's still to come in the second week.35mESPNMustafa Yalcin/Anadolu via Getty ImagesThe numbers behind USA's world record in 4x400 mixed relay4h0:46Medal trackerFull resultsFull coverage of the OlympicsPRESEASON HAS BEGUN!New kickoff rules on display to start Hall of Fame Game19h0:41McAfee on NFL's new kickoff: It looks like a practice drill6h1:11NFL's new kickoff rules debut to mixed reviewsTexans-Bears attracts more bets than MLBTOP RANK BOXINGSATURDAY ON ESPN+ PPVWhy Terence Crawford is playing the long game for a chance to face CaneloCrawford is approaching Saturday's marquee matchup against Israil Madrimov with his sights set on landing a bigger one next: against Canelo Alvarez.10hMike CoppingerMark Robinson/Matchroom BoxingBradley's take: Crawford's power vs. Israil Madrimov's disciplined styleTimothy Bradley Jr. breaks down the junior middleweight title fight.2dTimothy Bradley Jr.Buy Crawford vs. Madrimov on ESPN+ PPVChance to impress for Madrimov -- and UzbekistanHOW FRIDAY WENTMORE FROM THE PARIS OLYMPICSGrant Fisher makes U.S. track history, Marchand wins 4th gold and more Friday at the Paris GamesSha'Carri Richardson made her long-awaited Olympic debut during the women's 100m preliminary round on Friday. Here's everything else you might have missed from Paris.27mESPNGetty ImagesU.S. men's loss to Morocco is a wake-up call before World CupOutclassed by Morocco, the U.S. men's Olympic team can take plenty of lessons with the World Cup on the horizon.4hSam BordenFull coverage of the OlympicsOLYMPIC MEN'S HOOPS SCOREBOARDFRIDAY'S GAMESOLYMPIC STANDOUTSWhy Simone Biles is now Stephen A.'s No. 1 Olympian7h0:58Alcaraz on the cusp of history after securing spot in gold medal match8h0:59Simone Biles' gymnastics titles: Olympics, Worlds, more statsOLYMPIC MEN'S SOCCER SCOREBOARDFRIDAY'S GAMESTRADE DEADLINE FALLOUTOlney: Eight big questions for traded MLB playersCan Jazz Chisholm Jr. handle New York? Is Jack Flaherty healthy? Will Jorge Soler's defense play? Key questions for players in new places.10hBuster OlneyWinslow Townson/Getty ImagesRanking the top prospects who changed teams at the MLB trade deadlineYou know the major leaguers who moved by now -- but what about the potential stars of tomorrow?1dKiley McDanielMLB Power RankingsSeries odds: Dodgers still on top; Phillies, Yanks behind them Top HeadlinesTeam USA sets world record in 4x400 mixed relayGermany beats France, undefeated in men's hoopsU.S. men's soccer exits Games after Morocco routHungary to protest Khelif's Olympic participationKobe's Staples Center locker sells for record $2.9MDjokovic, Alcaraz to meet again, this time for goldKerr: Team USA lineups based on players' rolesMarchand wins 200m IM; McEvoy takes 50 freeScouting Shedeur Sanders' NFL futureFavorites FantasyManage FavoritesFantasy HomeCustomize ESPNCreate AccountLog InICYMI0:47Nelson Palacio rips an incredible goal from outside the boxNelson Palacio scores an outside-of-the-box goal for Real Salt Lake in the 79th minute. \\n\\n\\nMedal Tracker\\n\\n\\n\\nCountries\\nAthletes\\n\\nOverall Medal Leaders43USA36FRA31CHNIndividual Medal LeadersGoldCHN 13FRA 11AUS 11SilverUSA 18FRA 12GBR 10BronzeUSA 16FRA 13CHN 9Overall Medal Leaders4MarchandMarchand3O'CallaghanO'Callaghan3McIntoshMcIntoshIndividual Medal LeadersGoldMarchand 4O'Callaghan 3McIntosh 2SilverSmith 3Huske 2Walsh 2BronzeYufei 3Bhaker 2Haughey 2\\n\\n\\nFull Medal Tracker\\n\\n\\nBest of ESPN+ESPNCollege Football Playoff 2024: 30 teams can reach postseasonHeather Dinich analyzes the teams with at least a 10% chance to make the CFP.AP Photo/Ross D. FranklinNFL Hall of Fame predictions: Who will make the next 10 classes?When will Richard Sherman and Marshawn Lynch make it? Who could join Aaron Donald in 2029? Let's map out each Hall of Fame class until 2034.Thearon W. Henderson/Getty ImagesMLB trade deadline 2024: Ranking prospects who changed teamsYou know the major leaguers who moved by now -- but what about the potential stars of tomorrow who went the other way in those deals? Trending NowIllustration by ESPNRanking the top 100 professional athletes since 2000Who tops our list of the top athletes since 2000? We're unveiling the top 25, including our voters' pick for the No. 1 spot.Photo by Kevin C. Cox/Getty Images2024 NFL offseason recap: Signings, coach moves, new rulesThink you missed something in the NFL offseason? We've got you covered with everything important that has happened since February.Stacy Revere/Getty ImagesTop 25 college football stadiums: Rose Bowl, Michigan and moreFourteen of ESPN's college football writers rank the 25 best stadiums in the sport. Who's No. 1, who missed the cut and what makes these stadiums so special?ESPNInside Nate Robinson's silent battle -- and his fight to liveFor nearly 20 years Nate Robinson has been fighting a silent battle -- one he didn't realize until recently could end his life. Sign up to play the #1 Fantasy game!Create A LeagueJoin Public LeagueReactivate A LeagueMock Draft NowSign up for FREE!Create A LeagueJoin a Public LeagueReactivate a LeaguePractice With a Mock DraftSign up for FREE!Create A LeagueJoin a Public LeagueReactivate a LeaguePractice with a Mock DraftGet a custom ESPN experienceEnjoy the benefits of a personalized accountSelect your favorite leagues, teams and players and get the latest scores, news and updates that matter most to you. \\n\\nESPN+\\n\\n\\n\\n\\nBoxing: Crawford vs. Madrimov (ESPN+ PPV)\\n\\n\\n\\n\\n\\n\\n\\nPFL Playoffs: Heavyweights & Women's Flyweights\\n\\n\\n\\n\\n\\n\\n\\nMLB\\n\\n\\n\\n\\n\\n\\n\\nLittle League Baseball: Regionals\\n\\n\\n\\n\\n\\n\\n\\nIn The Arena: Serena Williams\\n\\n\\n\\n\\n\\n\\n\\nThe 30 College Football Playoff Contenders\\n\\n\\nQuick Links\\n\\n\\n\\n\\n2024 Paris Olympics\\n\\n\\n\\n\\n\\n\\n\\nOlympics: Everything to Know\\n\\n\\n\\n\\n\\n\\n\\nMLB Standings\\n\\n\\n\\n\\n\\n\\n\\nSign up: Fantasy Football\\n\\n\\n\\n\\n\\n\\n\\nWNBA Rookie Tracker\\n\\n\\n\\n\\n\\n\\n\\nNBA Free Agency Buzz\\n\\n\\n\\n\\n\\n\\n\\nLittle League Baseball, Softball\\n\\n\\n\\n\\n\\n\\n\\nESPN Radio: Listen Live\\n\\n\\n\\n\\n\\n\\n\\nWatch Golf on ESPN\\n\\n\\nFantasy\\n\\n\\n\\n\\nFootball\\n\\n\\n\\n\\n\\n\\n\\nBaseball\\n\\n\\n\\n\\n\\n\\n\\nBasketball\\n\\n\\n\\n\\n\\n\\n\\nHockey\\n\\n\\nESPN Sites\\n\\n\\n\\n\\nESPN Deportes\\n\\n\\n\\n\\n\\n\\n\\nAndscape\\n\\n\\n\\n\\n\\n\\n\\nespnW\\n\\n\\n\\n\\n\\n\\n\\nESPNFC\\n\\n\\n\\n\\n\\n\\n\\nX Games\\n\\n\\n\\n\\n\\n\\n\\nSEC Network\\n\\n\\nESPN Apps\\n\\n\\n\\n\\nESPN\\n\\n\\n\\n\\n\\n\\n\\nESPN Fantasy\\n\\n\\n\\n\\n\\n\\n\\nTournament Challenge\\n\\n\\nFollow ESPN\\n\\n\\n\\n\\nFacebook\\n\\n\\n\\n\\n\\n\\n\\nX/Twitter\\n\\n\\n\\n\\n\\n\\n\\nInstagram\\n\\n\\n\\n\\n\\n\\n\\nSnapchat\\n\\n\\n\\n\\n\\n\\n\\nTikTok\\n\\n\\n\\n\\n\\n\\n\\nYouTube\\n\\n\\nTerms of UsePrivacy PolicyYour US State Privacy RightsChildren's Online Privacy PolicyInterest-Based AdsAbout Nielsen MeasurementDo Not Sell or Share My Personal InformationContact UsDisney Ad Sales SiteWork for ESPNCorrectionsESPN BET is owned and operated by PENN Entertainment, Inc. and its subsidiaries ('PENN'). ESPN BET is available in states where PENN is licensed to offer sports wagering. Must be 21+ to wager. If you or someone you know has a gambling problem and wants help, call 1-800-GAMBLER.Copyright: © 2024 ESPN Enterprises, Inc. All rights reserved.\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\")" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ - "data = loader.load()" + "docs = loader.load()\n", + "\n", + "docs[0]" ] }, { @@ -59,75 +131,15 @@ "metadata": {}, "outputs": [ { - "data": { - "text/plain": [ - "[Document(page_content=\"\\n\\n\\n\\n\\n\\n\\n\\n\\nESPN - Serving Sports Fans. Anytime. Anywhere.\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n Skip to main content\\n \\n\\n Skip to navigation\\n \\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n<\\n\\n>\\n\\n\\n\\n\\n\\n\\n\\n\\n\\nMenuESPN\\n\\n\\nSearch\\n\\n\\n\\nscores\\n\\n\\n\\nNFLNBANCAAMNCAAWNHLSoccer…MLBNCAAFGolfTennisSports BettingBoxingCFLNCAACricketF1HorseLLWSMMANASCARNBA G LeagueOlympic SportsRacingRN BBRN FBRugbyWNBAWorld Baseball ClassicWWEX GamesXFLMore ESPNFantasyListenWatchESPN+\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n \\n\\nSUBSCRIBE NOW\\n\\n\\n\\n\\n\\nNHL: Select Games\\n\\n\\n\\n\\n\\n\\n\\nXFL\\n\\n\\n\\n\\n\\n\\n\\nMLB: Select Games\\n\\n\\n\\n\\n\\n\\n\\nNCAA Baseball\\n\\n\\n\\n\\n\\n\\n\\nNCAA Softball\\n\\n\\n\\n\\n\\n\\n\\nCricket: Select Matches\\n\\n\\n\\n\\n\\n\\n\\nMel Kiper's NFL Mock Draft 3.0\\n\\n\\nQuick Links\\n\\n\\n\\n\\nMen's Tournament Challenge\\n\\n\\n\\n\\n\\n\\n\\nWomen's Tournament Challenge\\n\\n\\n\\n\\n\\n\\n\\nNFL Draft Order\\n\\n\\n\\n\\n\\n\\n\\nHow To Watch NHL Games\\n\\n\\n\\n\\n\\n\\n\\nFantasy Baseball: Sign Up\\n\\n\\n\\n\\n\\n\\n\\nHow To Watch PGA TOUR\\n\\n\\n\\n\\n\\n\\nFavorites\\n\\n\\n\\n\\n\\n\\n Manage Favorites\\n \\n\\n\\n\\nCustomize ESPNSign UpLog InESPN Sites\\n\\n\\n\\n\\nESPN Deportes\\n\\n\\n\\n\\n\\n\\n\\nAndscape\\n\\n\\n\\n\\n\\n\\n\\nespnW\\n\\n\\n\\n\\n\\n\\n\\nESPNFC\\n\\n\\n\\n\\n\\n\\n\\nX Games\\n\\n\\n\\n\\n\\n\\n\\nSEC Network\\n\\n\\nESPN Apps\\n\\n\\n\\n\\nESPN\\n\\n\\n\\n\\n\\n\\n\\nESPN Fantasy\\n\\n\\nFollow ESPN\\n\\n\\n\\n\\nFacebook\\n\\n\\n\\n\\n\\n\\n\\nTwitter\\n\\n\\n\\n\\n\\n\\n\\nInstagram\\n\\n\\n\\n\\n\\n\\n\\nSnapchat\\n\\n\\n\\n\\n\\n\\n\\nYouTube\\n\\n\\n\\n\\n\\n\\n\\nThe ESPN Daily Podcast\\n\\n\\nAre you ready for Opening Day? Here's your guide to MLB's offseason chaosWait, Jacob deGrom is on the Rangers now? Xander Bogaerts and Trea Turner signed where? And what about Carlos Correa? Yeah, you're going to need to read up before Opening Day.12hESPNIllustration by ESPNEverything you missed in the MLB offseason3h2:33World Series odds, win totals, props for every teamPlay fantasy baseball for free!TOP HEADLINESQB Jackson has requested trade from RavensSources: Texas hiring Terry as full-time coachJets GM: No rush on Rodgers; Lamar not optionLove to leave North Carolina, enter transfer portalBelichick to angsty Pats fans: See last 25 yearsEmbiid out, Harden due back vs. Jokic, NuggetsLynch: Purdy 'earned the right' to start for NinersMan Utd, Wrexham plan July friendly in San DiegoOn paper, Padres overtake DodgersLAMAR WANTS OUT OF BALTIMOREMarcus Spears identifies the two teams that need Lamar Jackson the most8h2:00Would Lamar sit out? Will Ravens draft a QB? Jackson trade request insightsLamar Jackson has asked Baltimore to trade him, but Ravens coach John Harbaugh hopes the QB will be back.3hJamison HensleyBallard, Colts will consider trading for QB JacksonJackson to Indy? Washington? Barnwell ranks the QB's trade fitsSNYDER'S TUMULTUOUS 24-YEAR RUNHow Washington’s NFL franchise sank on and off the field under owner Dan SnyderSnyder purchased one of the NFL's marquee franchises in 1999. Twenty-four years later, and with the team up for sale, he leaves a legacy of on-field futility and off-field scandal.13hJohn KeimESPNIOWA STAR STEPS UP AGAINJ-Will: Caitlin Clark is the biggest brand in college sports right now8h0:47'The better the opponent, the better she plays': Clark draws comparisons to TaurasiCaitlin Clark's performance on Sunday had longtime observers going back decades to find comparisons.16hKevin PeltonWOMEN'S ELITE EIGHT SCOREBOARDMONDAY'S GAMESCheck your bracket!NBA DRAFTHow top prospects fared on the road to the Final FourThe 2023 NCAA tournament is down to four teams, and ESPN's Jonathan Givony recaps the players who saw their NBA draft stock change.11hJonathan GivonyAndy Lyons/Getty ImagesTALKING BASKETBALLWhy AD needs to be more assertive with LeBron on the court10h1:33Why Perk won't blame Kyrie for Mavs' woes8h1:48WHERE EVERY TEAM STANDSNew NFL Power Rankings: Post-free-agency 1-32 poll, plus underrated offseason movesThe free agent frenzy has come and gone. Which teams have improved their 2023 outlook, and which teams have taken a hit?12hNFL Nation reportersIllustration by ESPNTHE BUCK STOPS WITH BELICHICKBruschi: Fair to criticize Bill Belichick for Patriots' struggles10h1:27 Top HeadlinesQB Jackson has requested trade from RavensSources: Texas hiring Terry as full-time coachJets GM: No rush on Rodgers; Lamar not optionLove to leave North Carolina, enter transfer portalBelichick to angsty Pats fans: See last 25 yearsEmbiid out, Harden due back vs. Jokic, NuggetsLynch: Purdy 'earned the right' to start for NinersMan Utd, Wrexham plan July friendly in San DiegoOn paper, Padres overtake DodgersFavorites FantasyManage FavoritesFantasy HomeCustomize ESPNSign UpLog InMarch Madness LiveESPNMarch Madness LiveWatch every men's NCAA tournament game live! ICYMI1:42Austin Peay's coach, pitcher and catcher all ejected after retaliation pitchAustin Peay's pitcher, catcher and coach were all ejected after a pitch was thrown at Liberty's Nathan Keeter, who earlier in the game hit a home run and celebrated while running down the third-base line. Men's Tournament ChallengeIllustration by ESPNMen's Tournament ChallengeCheck your bracket(s) in the 2023 Men's Tournament Challenge, which you can follow throughout the Big Dance. Women's Tournament ChallengeIllustration by ESPNWomen's Tournament ChallengeCheck your bracket(s) in the 2023 Women's Tournament Challenge, which you can follow throughout the Big Dance. Best of ESPN+AP Photo/Lynne SladkyFantasy Baseball ESPN+ Cheat Sheet: Sleepers, busts, rookies and closersYou've read their names all preseason long, it'd be a shame to forget them on draft day. The ESPN+ Cheat Sheet is one way to make sure that doesn't happen.Steph Chambers/Getty ImagesPassan's 2023 MLB season preview: Bold predictions and moreOpening Day is just over a week away -- and Jeff Passan has everything you need to know covered from every possible angle.Photo by Bob Kupbens/Icon Sportswire2023 NFL free agency: Best team fits for unsigned playersWhere could Ezekiel Elliott land? Let's match remaining free agents to teams and find fits for two trade candidates.Illustration by ESPN2023 NFL mock draft: Mel Kiper's first-round pick predictionsMel Kiper Jr. makes his predictions for Round 1 of the NFL draft, including projecting a trade in the top five. Trending NowAnne-Marie Sorvin-USA TODAY SBoston Bruins record tracker: Wins, points, milestonesThe B's are on pace for NHL records in wins and points, along with some individual superlatives as well. Follow along here with our updated tracker.Mandatory Credit: William Purnell-USA TODAY Sports2023 NFL full draft order: AFC, NFC team picks for all roundsStarting with the Carolina Panthers at No. 1 overall, here's the entire 2023 NFL draft broken down round by round. How to Watch on ESPN+Gregory Fisher/Icon Sportswire2023 NCAA men's hockey: Results, bracket, how to watchThe matchups in Tampa promise to be thrillers, featuring plenty of star power, high-octane offense and stellar defense.(AP Photo/Koji Sasahara, File)How to watch the PGA Tour, Masters, PGA Championship and FedEx Cup playoffs on ESPN, ESPN+Here's everything you need to know about how to watch the PGA Tour, Masters, PGA Championship and FedEx Cup playoffs on ESPN and ESPN+.Hailie Lynch/XFLHow to watch the XFL: 2023 schedule, teams, players, news, moreEvery XFL game will be streamed on ESPN+. Find out when and where else you can watch the eight teams compete. Sign up to play the #1 Fantasy Baseball GameReactivate A LeagueCreate A LeagueJoin a Public LeaguePractice With a Mock DraftSports BettingAP Photo/Mike KropfMarch Madness betting 2023: Bracket odds, lines, tips, moreThe 2023 NCAA tournament brackets have finally been released, and we have everything you need to know to make a bet on all of the March Madness games. Sign up to play the #1 Fantasy game!Create A LeagueJoin Public LeagueReactivateMock Draft Now\\n\\nESPN+\\n\\n\\n\\n\\nNHL: Select Games\\n\\n\\n\\n\\n\\n\\n\\nXFL\\n\\n\\n\\n\\n\\n\\n\\nMLB: Select Games\\n\\n\\n\\n\\n\\n\\n\\nNCAA Baseball\\n\\n\\n\\n\\n\\n\\n\\nNCAA Softball\\n\\n\\n\\n\\n\\n\\n\\nCricket: Select Matches\\n\\n\\n\\n\\n\\n\\n\\nMel Kiper's NFL Mock Draft 3.0\\n\\n\\nQuick Links\\n\\n\\n\\n\\nMen's Tournament Challenge\\n\\n\\n\\n\\n\\n\\n\\nWomen's Tournament Challenge\\n\\n\\n\\n\\n\\n\\n\\nNFL Draft Order\\n\\n\\n\\n\\n\\n\\n\\nHow To Watch NHL Games\\n\\n\\n\\n\\n\\n\\n\\nFantasy Baseball: Sign Up\\n\\n\\n\\n\\n\\n\\n\\nHow To Watch PGA TOUR\\n\\n\\nESPN Sites\\n\\n\\n\\n\\nESPN Deportes\\n\\n\\n\\n\\n\\n\\n\\nAndscape\\n\\n\\n\\n\\n\\n\\n\\nespnW\\n\\n\\n\\n\\n\\n\\n\\nESPNFC\\n\\n\\n\\n\\n\\n\\n\\nX Games\\n\\n\\n\\n\\n\\n\\n\\nSEC Network\\n\\n\\nESPN Apps\\n\\n\\n\\n\\nESPN\\n\\n\\n\\n\\n\\n\\n\\nESPN Fantasy\\n\\n\\nFollow ESPN\\n\\n\\n\\n\\nFacebook\\n\\n\\n\\n\\n\\n\\n\\nTwitter\\n\\n\\n\\n\\n\\n\\n\\nInstagram\\n\\n\\n\\n\\n\\n\\n\\nSnapchat\\n\\n\\n\\n\\n\\n\\n\\nYouTube\\n\\n\\n\\n\\n\\n\\n\\nThe ESPN Daily Podcast\\n\\n\\nTerms of UsePrivacy PolicyYour US State Privacy RightsChildren's Online Privacy PolicyInterest-Based AdsAbout Nielsen MeasurementDo Not Sell or Share My Personal InformationContact UsDisney Ad Sales SiteWork for ESPNCopyright: © ESPN Enterprises, Inc. All rights reserved.\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\", lookup_str='', metadata={'source': 'https://www.espn.com/'}, lookup_index=0)]" - ] - }, - "execution_count": 4, - "metadata": {}, - "output_type": "execute_result" + "name": "stdout", + "output_type": "stream", + "text": [ + "{'source': 'https://www.espn.com/', 'title': 'ESPN - Serving Sports Fans. Anytime. Anywhere.', 'description': 'Visit ESPN for live scores, highlights and sports news. Stream exclusive games on ESPN+ and play fantasy sports.', 'language': 'en'}\n" + ] } ], "source": [ - "data" - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "id": "878179f7", - "metadata": {}, - "outputs": [], - "source": [ - "\"\"\"\n", - "# Use this piece of code for testing new custom BeautifulSoup parsers\n", - "\n", - "import requests\n", - "from bs4 import BeautifulSoup\n", - "\n", - "html_doc = requests.get(\"{INSERT_NEW_URL_HERE}\")\n", - "soup = BeautifulSoup(html_doc.text, 'html.parser')\n", - "\n", - "# Beautiful soup logic to be exported to langchain_community.document_loaders.webpage.py\n", - "# Example: transcript = soup.select_one(\"td[class='scrtext']\").text\n", - "# BS4 documentation can be found here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/\n", - "\n", - "\"\"\"" - ] - }, - { - "cell_type": "markdown", - "id": "150988e6", - "metadata": {}, - "source": [ - "## Loading multiple webpages\n", - "\n", - "You can also load multiple webpages at once by passing in a list of urls to the loader. This will return a list of documents in the same order as the urls passed in." - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "id": "e25bbd3b", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "[Document(page_content=\"\\n\\n\\n\\n\\n\\n\\n\\n\\nESPN - Serving Sports Fans. Anytime. Anywhere.\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n Skip to main content\\n \\n\\n Skip to navigation\\n \\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n<\\n\\n>\\n\\n\\n\\n\\n\\n\\n\\n\\n\\nMenuESPN\\n\\n\\nSearch\\n\\n\\n\\nscores\\n\\n\\n\\nNFLNBANCAAMNCAAWNHLSoccer…MLBNCAAFGolfTennisSports BettingBoxingCFLNCAACricketF1HorseLLWSMMANASCARNBA G LeagueOlympic SportsRacingRN BBRN FBRugbyWNBAWorld Baseball ClassicWWEX GamesXFLMore ESPNFantasyListenWatchESPN+\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n \\n\\nSUBSCRIBE NOW\\n\\n\\n\\n\\n\\nNHL: Select Games\\n\\n\\n\\n\\n\\n\\n\\nXFL\\n\\n\\n\\n\\n\\n\\n\\nMLB: Select Games\\n\\n\\n\\n\\n\\n\\n\\nNCAA Baseball\\n\\n\\n\\n\\n\\n\\n\\nNCAA Softball\\n\\n\\n\\n\\n\\n\\n\\nCricket: Select Matches\\n\\n\\n\\n\\n\\n\\n\\nMel Kiper's NFL Mock Draft 3.0\\n\\n\\nQuick Links\\n\\n\\n\\n\\nMen's Tournament Challenge\\n\\n\\n\\n\\n\\n\\n\\nWomen's Tournament Challenge\\n\\n\\n\\n\\n\\n\\n\\nNFL Draft Order\\n\\n\\n\\n\\n\\n\\n\\nHow To Watch NHL Games\\n\\n\\n\\n\\n\\n\\n\\nFantasy Baseball: Sign Up\\n\\n\\n\\n\\n\\n\\n\\nHow To Watch PGA TOUR\\n\\n\\n\\n\\n\\n\\nFavorites\\n\\n\\n\\n\\n\\n\\n Manage Favorites\\n \\n\\n\\n\\nCustomize ESPNSign UpLog InESPN Sites\\n\\n\\n\\n\\nESPN Deportes\\n\\n\\n\\n\\n\\n\\n\\nAndscape\\n\\n\\n\\n\\n\\n\\n\\nespnW\\n\\n\\n\\n\\n\\n\\n\\nESPNFC\\n\\n\\n\\n\\n\\n\\n\\nX Games\\n\\n\\n\\n\\n\\n\\n\\nSEC Network\\n\\n\\nESPN Apps\\n\\n\\n\\n\\nESPN\\n\\n\\n\\n\\n\\n\\n\\nESPN Fantasy\\n\\n\\nFollow ESPN\\n\\n\\n\\n\\nFacebook\\n\\n\\n\\n\\n\\n\\n\\nTwitter\\n\\n\\n\\n\\n\\n\\n\\nInstagram\\n\\n\\n\\n\\n\\n\\n\\nSnapchat\\n\\n\\n\\n\\n\\n\\n\\nYouTube\\n\\n\\n\\n\\n\\n\\n\\nThe ESPN Daily Podcast\\n\\n\\nAre you ready for Opening Day? Here's your guide to MLB's offseason chaosWait, Jacob deGrom is on the Rangers now? Xander Bogaerts and Trea Turner signed where? And what about Carlos Correa? Yeah, you're going to need to read up before Opening Day.12hESPNIllustration by ESPNEverything you missed in the MLB offseason3h2:33World Series odds, win totals, props for every teamPlay fantasy baseball for free!TOP HEADLINESQB Jackson has requested trade from RavensSources: Texas hiring Terry as full-time coachJets GM: No rush on Rodgers; Lamar not optionLove to leave North Carolina, enter transfer portalBelichick to angsty Pats fans: See last 25 yearsEmbiid out, Harden due back vs. Jokic, NuggetsLynch: Purdy 'earned the right' to start for NinersMan Utd, Wrexham plan July friendly in San DiegoOn paper, Padres overtake DodgersLAMAR WANTS OUT OF BALTIMOREMarcus Spears identifies the two teams that need Lamar Jackson the most7h2:00Would Lamar sit out? Will Ravens draft a QB? Jackson trade request insightsLamar Jackson has asked Baltimore to trade him, but Ravens coach John Harbaugh hopes the QB will be back.3hJamison HensleyBallard, Colts will consider trading for QB JacksonJackson to Indy? Washington? Barnwell ranks the QB's trade fitsSNYDER'S TUMULTUOUS 24-YEAR RUNHow Washington’s NFL franchise sank on and off the field under owner Dan SnyderSnyder purchased one of the NFL's marquee franchises in 1999. Twenty-four years later, and with the team up for sale, he leaves a legacy of on-field futility and off-field scandal.13hJohn KeimESPNIOWA STAR STEPS UP AGAINJ-Will: Caitlin Clark is the biggest brand in college sports right now8h0:47'The better the opponent, the better she plays': Clark draws comparisons to TaurasiCaitlin Clark's performance on Sunday had longtime observers going back decades to find comparisons.16hKevin PeltonWOMEN'S ELITE EIGHT SCOREBOARDMONDAY'S GAMESCheck your bracket!NBA DRAFTHow top prospects fared on the road to the Final FourThe 2023 NCAA tournament is down to four teams, and ESPN's Jonathan Givony recaps the players who saw their NBA draft stock change.11hJonathan GivonyAndy Lyons/Getty ImagesTALKING BASKETBALLWhy AD needs to be more assertive with LeBron on the court9h1:33Why Perk won't blame Kyrie for Mavs' woes8h1:48WHERE EVERY TEAM STANDSNew NFL Power Rankings: Post-free-agency 1-32 poll, plus underrated offseason movesThe free agent frenzy has come and gone. Which teams have improved their 2023 outlook, and which teams have taken a hit?12hNFL Nation reportersIllustration by ESPNTHE BUCK STOPS WITH BELICHICKBruschi: Fair to criticize Bill Belichick for Patriots' struggles10h1:27 Top HeadlinesQB Jackson has requested trade from RavensSources: Texas hiring Terry as full-time coachJets GM: No rush on Rodgers; Lamar not optionLove to leave North Carolina, enter transfer portalBelichick to angsty Pats fans: See last 25 yearsEmbiid out, Harden due back vs. Jokic, NuggetsLynch: Purdy 'earned the right' to start for NinersMan Utd, Wrexham plan July friendly in San DiegoOn paper, Padres overtake DodgersFavorites FantasyManage FavoritesFantasy HomeCustomize ESPNSign UpLog InMarch Madness LiveESPNMarch Madness LiveWatch every men's NCAA tournament game live! ICYMI1:42Austin Peay's coach, pitcher and catcher all ejected after retaliation pitchAustin Peay's pitcher, catcher and coach were all ejected after a pitch was thrown at Liberty's Nathan Keeter, who earlier in the game hit a home run and celebrated while running down the third-base line. Men's Tournament ChallengeIllustration by ESPNMen's Tournament ChallengeCheck your bracket(s) in the 2023 Men's Tournament Challenge, which you can follow throughout the Big Dance. Women's Tournament ChallengeIllustration by ESPNWomen's Tournament ChallengeCheck your bracket(s) in the 2023 Women's Tournament Challenge, which you can follow throughout the Big Dance. Best of ESPN+AP Photo/Lynne SladkyFantasy Baseball ESPN+ Cheat Sheet: Sleepers, busts, rookies and closersYou've read their names all preseason long, it'd be a shame to forget them on draft day. The ESPN+ Cheat Sheet is one way to make sure that doesn't happen.Steph Chambers/Getty ImagesPassan's 2023 MLB season preview: Bold predictions and moreOpening Day is just over a week away -- and Jeff Passan has everything you need to know covered from every possible angle.Photo by Bob Kupbens/Icon Sportswire2023 NFL free agency: Best team fits for unsigned playersWhere could Ezekiel Elliott land? Let's match remaining free agents to teams and find fits for two trade candidates.Illustration by ESPN2023 NFL mock draft: Mel Kiper's first-round pick predictionsMel Kiper Jr. makes his predictions for Round 1 of the NFL draft, including projecting a trade in the top five. Trending NowAnne-Marie Sorvin-USA TODAY SBoston Bruins record tracker: Wins, points, milestonesThe B's are on pace for NHL records in wins and points, along with some individual superlatives as well. Follow along here with our updated tracker.Mandatory Credit: William Purnell-USA TODAY Sports2023 NFL full draft order: AFC, NFC team picks for all roundsStarting with the Carolina Panthers at No. 1 overall, here's the entire 2023 NFL draft broken down round by round. How to Watch on ESPN+Gregory Fisher/Icon Sportswire2023 NCAA men's hockey: Results, bracket, how to watchThe matchups in Tampa promise to be thrillers, featuring plenty of star power, high-octane offense and stellar defense.(AP Photo/Koji Sasahara, File)How to watch the PGA Tour, Masters, PGA Championship and FedEx Cup playoffs on ESPN, ESPN+Here's everything you need to know about how to watch the PGA Tour, Masters, PGA Championship and FedEx Cup playoffs on ESPN and ESPN+.Hailie Lynch/XFLHow to watch the XFL: 2023 schedule, teams, players, news, moreEvery XFL game will be streamed on ESPN+. Find out when and where else you can watch the eight teams compete. Sign up to play the #1 Fantasy Baseball GameReactivate A LeagueCreate A LeagueJoin a Public LeaguePractice With a Mock DraftSports BettingAP Photo/Mike KropfMarch Madness betting 2023: Bracket odds, lines, tips, moreThe 2023 NCAA tournament brackets have finally been released, and we have everything you need to know to make a bet on all of the March Madness games. Sign up to play the #1 Fantasy game!Create A LeagueJoin Public LeagueReactivateMock Draft Now\\n\\nESPN+\\n\\n\\n\\n\\nNHL: Select Games\\n\\n\\n\\n\\n\\n\\n\\nXFL\\n\\n\\n\\n\\n\\n\\n\\nMLB: Select Games\\n\\n\\n\\n\\n\\n\\n\\nNCAA Baseball\\n\\n\\n\\n\\n\\n\\n\\nNCAA Softball\\n\\n\\n\\n\\n\\n\\n\\nCricket: Select Matches\\n\\n\\n\\n\\n\\n\\n\\nMel Kiper's NFL Mock Draft 3.0\\n\\n\\nQuick Links\\n\\n\\n\\n\\nMen's Tournament Challenge\\n\\n\\n\\n\\n\\n\\n\\nWomen's Tournament Challenge\\n\\n\\n\\n\\n\\n\\n\\nNFL Draft Order\\n\\n\\n\\n\\n\\n\\n\\nHow To Watch NHL Games\\n\\n\\n\\n\\n\\n\\n\\nFantasy Baseball: Sign Up\\n\\n\\n\\n\\n\\n\\n\\nHow To Watch PGA TOUR\\n\\n\\nESPN Sites\\n\\n\\n\\n\\nESPN Deportes\\n\\n\\n\\n\\n\\n\\n\\nAndscape\\n\\n\\n\\n\\n\\n\\n\\nespnW\\n\\n\\n\\n\\n\\n\\n\\nESPNFC\\n\\n\\n\\n\\n\\n\\n\\nX Games\\n\\n\\n\\n\\n\\n\\n\\nSEC Network\\n\\n\\nESPN Apps\\n\\n\\n\\n\\nESPN\\n\\n\\n\\n\\n\\n\\n\\nESPN Fantasy\\n\\n\\nFollow ESPN\\n\\n\\n\\n\\nFacebook\\n\\n\\n\\n\\n\\n\\n\\nTwitter\\n\\n\\n\\n\\n\\n\\n\\nInstagram\\n\\n\\n\\n\\n\\n\\n\\nSnapchat\\n\\n\\n\\n\\n\\n\\n\\nYouTube\\n\\n\\n\\n\\n\\n\\n\\nThe ESPN Daily Podcast\\n\\n\\nTerms of UsePrivacy PolicyYour US State Privacy RightsChildren's Online Privacy PolicyInterest-Based AdsAbout Nielsen MeasurementDo Not Sell or Share My Personal InformationContact UsDisney Ad Sales SiteWork for ESPNCopyright: © ESPN Enterprises, Inc. All rights reserved.\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\", lookup_str='', metadata={'source': 'https://www.espn.com/'}, lookup_index=0),\n", - " Document(page_content='GoogleSearch Images Maps Play YouTube News Gmail Drive More »Web History | Settings | Sign in\\xa0Advanced searchAdvertisingBusiness SolutionsAbout Google© 2023 - Privacy - Terms ', lookup_str='', metadata={'source': 'https://google.com'}, lookup_index=0)]" - ] - }, - "execution_count": 4, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "loader = WebBaseLoader([\"https://www.espn.com/\", \"https://google.com\"])\n", - "docs = loader.load()\n", - "docs" + "print(docs[0].metadata)" ] }, { @@ -157,7 +169,7 @@ } ], "source": [ - "%pip install --upgrade --quiet nest_asyncio\n", + "%pip install -qU nest_asyncio\n", "\n", "# fixes a bug with asyncio and jupyter\n", "import nest_asyncio\n", @@ -195,7 +207,7 @@ "id": "e337b130", "metadata": {}, "source": [ - "## Loading a xml file, or using a different BeautifulSoup parser\n", + "### Loading a xml file, or using a different BeautifulSoup parser\n", "\n", "You can also look at `SitemapLoader` for an example of how to load a sitemap file, which is an example of using this feature." ] @@ -226,6 +238,41 @@ "docs" ] }, + { + "cell_type": "markdown", + "id": "5fa40bb1", + "metadata": {}, + "source": [ + "## Lazy Load\n", + "\n", + "You can use lazy loading to only load one page at a time in order to minimize memory requirements." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "303d2f4e", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Document(metadata={'source': 'https://www.espn.com/', 'title': 'ESPN - Serving Sports Fans. Anytime. Anywhere.', 'description': 'Visit ESPN for live scores, highlights and sports news. Stream exclusive games on ESPN+ and play fantasy sports.', 'language': 'en'}, page_content=\"\\n\\n\\n\\n\\n\\n\\n\\n\\nESPN - Serving Sports Fans. Anytime. Anywhere.\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n Skip to main content\\n \\n\\n Skip to navigation\\n \\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n<\\n\\n>\\n\\n\\n\\n\\n\\n\\n\\n\\n\\nMenuESPN\\n\\n\\n\\n\\n\\nscores\\n\\n\\n\\nNFLNBAMLBOlympicsSoccerWNBA…BoxingCFLNCAACricketF1GolfHorseLLWSMMANASCARNBA G LeagueNBA Summer LeagueNCAAFNCAAMNCAAWNHLNWSLPLLProfessional WrestlingRacingRN BBRN FBRugbySports BettingTennisX GamesUFLMore ESPNFantasyWatchESPN BETESPN+\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n \\n\\nSubscribe Now\\n\\n\\n\\n\\n\\nBoxing: Crawford vs. Madrimov (ESPN+ PPV)\\n\\n\\n\\n\\n\\n\\n\\nPFL Playoffs: Heavyweights & Women's Flyweights\\n\\n\\n\\n\\n\\n\\n\\nMLB\\n\\n\\n\\n\\n\\n\\n\\nLittle League Baseball: Regionals\\n\\n\\n\\n\\n\\n\\n\\nIn The Arena: Serena Williams\\n\\n\\n\\n\\n\\n\\n\\nThe 30 College Football Playoff Contenders\\n\\n\\nQuick Links\\n\\n\\n\\n\\n2024 Paris Olympics\\n\\n\\n\\n\\n\\n\\n\\nOlympics: Everything to Know\\n\\n\\n\\n\\n\\n\\n\\nMLB Standings\\n\\n\\n\\n\\n\\n\\n\\nSign up: Fantasy Football\\n\\n\\n\\n\\n\\n\\n\\nWNBA Rookie Tracker\\n\\n\\n\\n\\n\\n\\n\\nNBA Free Agency Buzz\\n\\n\\n\\n\\n\\n\\n\\nLittle League Baseball, Softball\\n\\n\\n\\n\\n\\n\\n\\nESPN Radio: Listen Live\\n\\n\\n\\n\\n\\n\\n\\nWatch Golf on ESPN\\n\\n\\n\\n\\n\\n\\nFavorites\\n\\n\\n\\n\\n\\n\\n Manage Favorites\\n \\n\\n\\n\\nCustomize ESPNCreate AccountLog InFantasy\\n\\n\\n\\n\\nFootball\\n\\n\\n\\n\\n\\n\\n\\nBaseball\\n\\n\\n\\n\\n\\n\\n\\nBasketball\\n\\n\\n\\n\\n\\n\\n\\nHockey\\n\\n\\nESPN Sites\\n\\n\\n\\n\\nESPN Deportes\\n\\n\\n\\n\\n\\n\\n\\nAndscape\\n\\n\\n\\n\\n\\n\\n\\nespnW\\n\\n\\n\\n\\n\\n\\n\\nESPNFC\\n\\n\\n\\n\\n\\n\\n\\nX Games\\n\\n\\n\\n\\n\\n\\n\\nSEC Network\\n\\n\\nESPN Apps\\n\\n\\n\\n\\nESPN\\n\\n\\n\\n\\n\\n\\n\\nESPN Fantasy\\n\\n\\n\\n\\n\\n\\n\\nTournament Challenge\\n\\n\\nFollow ESPN\\n\\n\\n\\n\\nFacebook\\n\\n\\n\\n\\n\\n\\n\\nX/Twitter\\n\\n\\n\\n\\n\\n\\n\\nInstagram\\n\\n\\n\\n\\n\\n\\n\\nSnapchat\\n\\n\\n\\n\\n\\n\\n\\nTikTok\\n\\n\\n\\n\\n\\n\\n\\nYouTube\\n\\n\\nCollege football's most entertaining conference? Why the 16-team Big 12 is wiiiiiide open this seasonLong known as one of the sport's most unpredictable conferences, the new-look Big 12 promises another dose of chaos.11hBill ConnellyScott Winters/Icon SportswireUSC, Oregon and the quest to bulk up for the Big TenTo improve on D, the Trojans wanted to bulk up for a new league. They're not the only team trying to do that.10hAdam RittenbergThe 30 teams that can reach the CFPConnelly's best games of the 2024 seasonTOP HEADLINESTeam USA sets world record in 4x400 mixed relayGermany beats France, undefeated in men's hoopsU.S. men's soccer exits Games after Morocco routHungary to protest Khelif's Olympic participationKobe's Staples Center locker sells for record $2.9MDjokovic, Alcaraz to meet again, this time for goldKerr: Team USA lineups based on players' rolesMarchand wins 200m IM; McEvoy takes 50 freeScouting Shedeur Sanders' NFL futureLATEST FROM PARISBreakout stars, best moments and what comes next: Breaking down the Games halfway throughAt the midpoint of the Olympics, we look back at some of the best moments and forward at what's still to come in the second week.35mESPNMustafa Yalcin/Anadolu via Getty ImagesThe numbers behind USA's world record in 4x400 mixed relay4h0:46Medal trackerFull resultsFull coverage of the OlympicsPRESEASON HAS BEGUN!New kickoff rules on display to start Hall of Fame Game19h0:41McAfee on NFL's new kickoff: It looks like a practice drill6h1:11NFL's new kickoff rules debut to mixed reviewsTexans-Bears attracts more bets than MLBTOP RANK BOXINGSATURDAY ON ESPN+ PPVWhy Terence Crawford is playing the long game for a chance to face CaneloCrawford is approaching Saturday's marquee matchup against Israil Madrimov with his sights set on landing a bigger one next: against Canelo Alvarez.10hMike CoppingerMark Robinson/Matchroom BoxingBradley's take: Crawford's power vs. Israil Madrimov's disciplined styleTimothy Bradley Jr. breaks down the junior middleweight title fight.2dTimothy Bradley Jr.Buy Crawford vs. Madrimov on ESPN+ PPVChance to impress for Madrimov -- and UzbekistanHOW FRIDAY WENTMORE FROM THE PARIS OLYMPICSGrant Fisher makes U.S. track history, Marchand wins 4th gold and more Friday at the Paris GamesSha'Carri Richardson made her long-awaited Olympic debut during the women's 100m preliminary round on Friday. Here's everything else you might have missed from Paris.31mESPNGetty ImagesU.S. men's loss to Morocco is a wake-up call before World CupOutclassed by Morocco, the U.S. men's Olympic team can take plenty of lessons with the World Cup on the horizon.4hSam BordenFull coverage of the OlympicsOLYMPIC MEN'S HOOPS SCOREBOARDFRIDAY'S GAMESOLYMPIC STANDOUTSWhy Simone Biles is now Stephen A.'s No. 1 Olympian7h0:58Alcaraz on the cusp of history after securing spot in gold medal match8h0:59Simone Biles' gymnastics titles: Olympics, Worlds, more statsOLYMPIC MEN'S SOCCER SCOREBOARDFRIDAY'S GAMESTRADE DEADLINE FALLOUTOlney: Eight big questions for traded MLB playersCan Jazz Chisholm Jr. handle New York? Is Jack Flaherty healthy? Will Jorge Soler's defense play? Key questions for players in new places.10hBuster OlneyWinslow Townson/Getty ImagesRanking the top prospects who changed teams at the MLB trade deadlineYou know the major leaguers who moved by now -- but what about the potential stars of tomorrow?1dKiley McDanielMLB Power RankingsSeries odds: Dodgers still on top; Phillies, Yanks behind them Top HeadlinesTeam USA sets world record in 4x400 mixed relayGermany beats France, undefeated in men's hoopsU.S. men's soccer exits Games after Morocco routHungary to protest Khelif's Olympic participationKobe's Staples Center locker sells for record $2.9MDjokovic, Alcaraz to meet again, this time for goldKerr: Team USA lineups based on players' rolesMarchand wins 200m IM; McEvoy takes 50 freeScouting Shedeur Sanders' NFL futureFavorites FantasyManage FavoritesFantasy HomeCustomize ESPNCreate AccountLog InICYMI0:47Nelson Palacio rips an incredible goal from outside the boxNelson Palacio scores an outside-of-the-box goal for Real Salt Lake in the 79th minute. \\n\\n\\nMedal Tracker\\n\\n\\n\\nCountries\\nAthletes\\n\\nOverall Medal Leaders43USA36FRA31CHNIndividual Medal LeadersGoldCHN 13FRA 11AUS 11SilverUSA 18FRA 12GBR 10BronzeUSA 16FRA 13CHN 9Overall Medal Leaders4MarchandMarchand3O'CallaghanO'Callaghan3McIntoshMcIntoshIndividual Medal LeadersGoldMarchand 4O'Callaghan 3McIntosh 2SilverSmith 3Huske 2Walsh 2BronzeYufei 3Bhaker 2Haughey 2\\n\\n\\nFull Medal Tracker\\n\\n\\nBest of ESPN+ESPNCollege Football Playoff 2024: 30 teams can reach postseasonHeather Dinich analyzes the teams with at least a 10% chance to make the CFP.AP Photo/Ross D. FranklinNFL Hall of Fame predictions: Who will make the next 10 classes?When will Richard Sherman and Marshawn Lynch make it? Who could join Aaron Donald in 2029? Let's map out each Hall of Fame class until 2034.Thearon W. Henderson/Getty ImagesMLB trade deadline 2024: Ranking prospects who changed teamsYou know the major leaguers who moved by now -- but what about the potential stars of tomorrow who went the other way in those deals? Trending NowIllustration by ESPNRanking the top 100 professional athletes since 2000Who tops our list of the top athletes since 2000? We're unveiling the top 25, including our voters' pick for the No. 1 spot.Photo by Kevin C. Cox/Getty Images2024 NFL offseason recap: Signings, coach moves, new rulesThink you missed something in the NFL offseason? We've got you covered with everything important that has happened since February.Stacy Revere/Getty ImagesTop 25 college football stadiums: Rose Bowl, Michigan and moreFourteen of ESPN's college football writers rank the 25 best stadiums in the sport. Who's No. 1, who missed the cut and what makes these stadiums so special?ESPNInside Nate Robinson's silent battle -- and his fight to liveFor nearly 20 years Nate Robinson has been fighting a silent battle -- one he didn't realize until recently could end his life. Sign up to play the #1 Fantasy game!Create A LeagueJoin Public LeagueReactivate A LeagueMock Draft NowSign up for FREE!Create A LeagueJoin a Public LeagueReactivate a LeaguePractice With a Mock DraftSign up for FREE!Create A LeagueJoin a Public LeagueReactivate a LeaguePractice with a Mock DraftGet a custom ESPN experienceEnjoy the benefits of a personalized accountSelect your favorite leagues, teams and players and get the latest scores, news and updates that matter most to you. \\n\\nESPN+\\n\\n\\n\\n\\nBoxing: Crawford vs. Madrimov (ESPN+ PPV)\\n\\n\\n\\n\\n\\n\\n\\nPFL Playoffs: Heavyweights & Women's Flyweights\\n\\n\\n\\n\\n\\n\\n\\nMLB\\n\\n\\n\\n\\n\\n\\n\\nLittle League Baseball: Regionals\\n\\n\\n\\n\\n\\n\\n\\nIn The Arena: Serena Williams\\n\\n\\n\\n\\n\\n\\n\\nThe 30 College Football Playoff Contenders\\n\\n\\nQuick Links\\n\\n\\n\\n\\n2024 Paris Olympics\\n\\n\\n\\n\\n\\n\\n\\nOlympics: Everything to Know\\n\\n\\n\\n\\n\\n\\n\\nMLB Standings\\n\\n\\n\\n\\n\\n\\n\\nSign up: Fantasy Football\\n\\n\\n\\n\\n\\n\\n\\nWNBA Rookie Tracker\\n\\n\\n\\n\\n\\n\\n\\nNBA Free Agency Buzz\\n\\n\\n\\n\\n\\n\\n\\nLittle League Baseball, Softball\\n\\n\\n\\n\\n\\n\\n\\nESPN Radio: Listen Live\\n\\n\\n\\n\\n\\n\\n\\nWatch Golf on ESPN\\n\\n\\nFantasy\\n\\n\\n\\n\\nFootball\\n\\n\\n\\n\\n\\n\\n\\nBaseball\\n\\n\\n\\n\\n\\n\\n\\nBasketball\\n\\n\\n\\n\\n\\n\\n\\nHockey\\n\\n\\nESPN Sites\\n\\n\\n\\n\\nESPN Deportes\\n\\n\\n\\n\\n\\n\\n\\nAndscape\\n\\n\\n\\n\\n\\n\\n\\nespnW\\n\\n\\n\\n\\n\\n\\n\\nESPNFC\\n\\n\\n\\n\\n\\n\\n\\nX Games\\n\\n\\n\\n\\n\\n\\n\\nSEC Network\\n\\n\\nESPN Apps\\n\\n\\n\\n\\nESPN\\n\\n\\n\\n\\n\\n\\n\\nESPN Fantasy\\n\\n\\n\\n\\n\\n\\n\\nTournament Challenge\\n\\n\\nFollow ESPN\\n\\n\\n\\n\\nFacebook\\n\\n\\n\\n\\n\\n\\n\\nX/Twitter\\n\\n\\n\\n\\n\\n\\n\\nInstagram\\n\\n\\n\\n\\n\\n\\n\\nSnapchat\\n\\n\\n\\n\\n\\n\\n\\nTikTok\\n\\n\\n\\n\\n\\n\\n\\nYouTube\\n\\n\\nTerms of UsePrivacy PolicyYour US State Privacy RightsChildren's Online Privacy PolicyInterest-Based AdsAbout Nielsen MeasurementDo Not Sell or Share My Personal InformationContact UsDisney Ad Sales SiteWork for ESPNCorrectionsESPN BET is owned and operated by PENN Entertainment, Inc. and its subsidiaries ('PENN'). ESPN BET is available in states where PENN is licensed to offer sports wagering. Must be 21+ to wager. If you or someone you know has a gambling problem and wants help, call 1-800-GAMBLER.Copyright: © 2024 ESPN Enterprises, Inc. All rights reserved.\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\")" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "page = []\n", + "for doc in loader.lazy_load():\n", + " page.append(doc)\n", + "\n", + "page[0]" + ] + }, { "cell_type": "markdown", "id": "672264ad", @@ -256,6 +303,16 @@ ")\n", "docs = loader.load()" ] + }, + { + "cell_type": "markdown", + "id": "c8bdc93c", + "metadata": {}, + "source": [ + "## API reference\n", + "\n", + "For detailed documentation of all `WebBaseLoader` features and configurations head to the API reference: https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.web_base.WebBaseLoader.html" + ] } ], "metadata": { @@ -274,7 +331,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.10.6" + "version": "3.11.9" } }, "nbformat": 4, diff --git a/libs/cli/langchain_cli/integration_template/docs/document_loaders.ipynb b/libs/cli/langchain_cli/integration_template/docs/document_loaders.ipynb index b883e363f2c..67456ef0a3d 100644 --- a/libs/cli/langchain_cli/integration_template/docs/document_loaders.ipynb +++ b/libs/cli/langchain_cli/integration_template/docs/document_loaders.ipynb @@ -17,7 +17,7 @@ "\n", "- TODO: Make sure API reference link is correct.\n", "\n", - "This notebook provides a quick overview for getting started with __ModuleName__ [document loader](/docs/integrations/document_loaders/). For detailed documentation of all __ModuleName__Loader features and configurations head to the [API reference](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.__module_name___loader.__ModuleName__Loader.html).\n", + "This notebook provides a quick overview for getting started with __ModuleName__ [document loader](https://python.langchain.com/v0.2/docs/concepts/#document-loaders). For detailed documentation of all __ModuleName__Loader features and configurations head to the [API reference](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.__module_name___loader.__ModuleName__Loader.html).\n", "\n", "- TODO: Add any other relevant links, like information about underlying API, etc.\n", "\n", @@ -32,7 +32,7 @@ "| :--- | :--- | :---: | :---: | :---: |\n", "| [__ModuleName__Loader](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.__module_name__loader.__ModuleName__Loader.html) | [langchain_community](https://api.python.langchain.com/en/latest/community_api_reference.html) | ✅/❌ | beta/❌ | ✅/❌ | \n", "### Loader features\n", - "| Source | Document Lazy Loading | Async Support\n", + "| Source | Document Lazy Loading | Native Async Support\n", "| :---: | :---: | :---: | \n", "| __ModuleName__Loader | ✅/❌ | ✅/❌ | \n", "\n", @@ -65,7 +65,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "If you want to get automated tracing of your model calls you can also set your [LangSmith](https://docs.smith.langchain.com/) API key by uncommenting below:" + "If you want to get automated best in-class tracing of your model calls you can also set your [LangSmith](https://docs.smith.langchain.com/) API key by uncommenting below:" ] }, { @@ -102,7 +102,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Instantiation\n", + "## Initialization\n", "\n", "Now we can instantiate our model object and load documents:\n", "\n", @@ -193,11 +193,6 @@ "\n", "For detailed documentation of all __ModuleName__Loader features and configurations head to the API reference: https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.__module_name___loader.__ModuleName__Loader.html" ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [] } ], "metadata": {