langchain/docs/docs/integrations/document_loaders/async_chromium.ipynb
Tridib Roy Arjo d667b1ea8f
docs: Update async_chromium.ipynb (#19514)
In Jupyter, asyncio would throw an error before `.load()` unless
`nest_asyncio` is applied (Issue #8494 mentioned this)

+Minor typo fixes..
2024-03-26 00:02:50 +00:00

125 lines
3.3 KiB
Plaintext
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

{
"cells": [
{
"cell_type": "markdown",
"id": "ad553e51",
"metadata": {},
"source": [
"# Async Chromium\n",
"\n",
"Chromium is one of the browsers supported by Playwright, a library used to control browser automation. \n",
"\n",
"By running `p.chromium.launch(headless=True)`, we are launching a headless instance of Chromium. \n",
"\n",
"Headless mode means that the browser is running without a graphical user interface.\n",
"\n",
"`AsyncChromiumLoader` loads the page, and then we use `Html2TextTransformer` to transform to text."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1c3a4c19",
"metadata": {},
"outputs": [],
"source": [
"%pip install --upgrade --quiet playwright beautifulsoup4\n",
"!playwright install"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "dd2cdea7",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'<!DOCTYPE html><html lang=\"en\"><head><script src=\"https://s0.2mdn.net/instream/video/client.js\" asyn'"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain_community.document_loaders import AsyncChromiumLoader\n",
"\n",
"urls = [\"https://www.wsj.com\"]\n",
"loader = AsyncChromiumLoader(urls)\n",
"docs = loader.load()\n",
"docs[0].page_content[0:100]"
]
},
{
"cell_type": "markdown",
"id": "c64e7df9",
"metadata": {},
"source": [
"If you are using Jupyter notebooks, you might need to apply `nest_asyncio` before loading the documents."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5f2fe3c0",
"metadata": {},
"outputs": [],
"source": [
"!pip install nest-asyncio\n",
"import nest_asyncio\n",
"\n",
"nest_asyncio.apply()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "013caa7e",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\"Skip to Main ContentSkip to SearchSkip to... Select * Top News * What's News *\\nFeatured Stories * Retirement * Life & Arts * Hip-Hop * Sports * Video *\\nEconomy * Real Estate * Sports * CMO * CIO * CFO * Risk & Compliance *\\nLogistics Report * Sustainable Business * Heard on the Street * Barrons *\\nMarketWatch * Mansion Global * Penta * Opinion * Journal Reports * Sponsored\\nOffers Explore Our Brands * WSJ * * * * * Barron's * * * * * MarketWatch * * *\\n* * IBD # The Wall Street Journal SubscribeSig\""
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain_community.document_transformers import Html2TextTransformer\n",
"\n",
"html2text = Html2TextTransformer()\n",
"docs_transformed = html2text.transform_documents(docs)\n",
"docs_transformed[0].page_content[0:500]"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
}
},
"nbformat": 4,
"nbformat_minor": 5
}