langchain/docs/versioned_docs/version-0.2.x/integrations/document_loaders/async_chromium.ipynb
Jacob Lee aff771923a Jacob/new docs (#20570)
2024-04-18 11:10:55 -07:00

{
"cells": [
{
"cell_type": "markdown",
"id": "ad553e51",
"metadata": {},
"source": [
"# Async Chromium\n",
"\n",
"Chromium is one of the browsers supported by Playwright, a library for automating browsers.\n",
"\n",
"Running `p.chromium.launch(headless=True)` launches a headless instance of Chromium.\n",
"\n",
"Headless mode means that the browser runs without a graphical user interface.\n",
"\n",
"`AsyncChromiumLoader` loads the page, and then we use `Html2TextTransformer` to convert the HTML to text."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1c3a4c19",
"metadata": {},
"outputs": [],
"source": [
"%pip install --upgrade --quiet playwright beautifulsoup4 html2text\n",
"!playwright install"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "dd2cdea7",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'<!DOCTYPE html><html lang=\"en\"><head><script src=\"https://s0.2mdn.net/instream/video/client.js\" asyn'"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain_community.document_loaders import AsyncChromiumLoader\n",
"\n",
"urls = [\"https://www.wsj.com\"]\n",
"loader = AsyncChromiumLoader(urls)\n",
"docs = loader.load()\n",
"docs[0].page_content[0:100]"
]
},
{
"cell_type": "markdown",
"id": "c64e7df9",
"metadata": {},
"source": [
"Jupyter notebooks already run an asyncio event loop, so you might need to apply `nest_asyncio` before loading the documents."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5f2fe3c0",
"metadata": {},
"outputs": [],
"source": [
"%pip install --upgrade --quiet nest-asyncio\n",
"import nest_asyncio\n",
"\n",
"nest_asyncio.apply()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "013caa7e",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\"Skip to Main ContentSkip to SearchSkip to... Select * Top News * What's News *\\nFeatured Stories * Retirement * Life & Arts * Hip-Hop * Sports * Video *\\nEconomy * Real Estate * Sports * CMO * CIO * CFO * Risk & Compliance *\\nLogistics Report * Sustainable Business * Heard on the Street * Barrons *\\nMarketWatch * Mansion Global * Penta * Opinion * Journal Reports * Sponsored\\nOffers Explore Our Brands * WSJ * * * * * Barron's * * * * * MarketWatch * * *\\n* * IBD # The Wall Street Journal SubscribeSig\""
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain_community.document_transformers import Html2TextTransformer\n",
"\n",
"html2text = Html2TextTransformer()\n",
"docs_transformed = html2text.transform_documents(docs)\n",
"docs_transformed[0].page_content[0:500]"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
}
},
"nbformat": 4,
"nbformat_minor": 5
}