langchain/docs/extras/modules/data_connection/document_loaders/integrations
Pau Ramon Revilla 87802c86d9
Added a MHTML document loader (#6311)
MHTML is an interesting format because it is used both for emails and
for archived web pages. Some scraping projects store pages on disk to
process them later, and MHTML is a good fit for that use case.

This is heavily inspired by the BeautifulSoup HTML loader, but it
extracts the HTML part from the MHTML file.
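
As a rough illustration of the idea described above (not the loader's actual code), an MHTML file is a MIME multipart message, so the HTML part can be located with Python's standard `email` package and then reduced to text. This sketch uses only the standard library instead of BeautifulSoup; the names `extract_html_part`, `mhtml_to_text`, `_TextExtractor`, and the `SAMPLE` data are hypothetical and exist only for this example.

```python
import email
from email import policy
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the <title> and visible text, skipping script/style bodies."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.title = ""
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._skip_depth:
            return
        if self._in_title:
            self.title += data
        elif data.strip():
            self.chunks.append(data.strip())


def extract_html_part(mhtml_bytes):
    """Return the first text/html part of an MHTML message, or None."""
    message = email.message_from_bytes(mhtml_bytes, policy=policy.default)
    for part in message.walk():
        if part.get_content_type() == "text/html":
            # get_content() undoes the transfer encoding and the charset.
            return part.get_content()
    return None


def mhtml_to_text(mhtml_bytes):
    """Hypothetical helper: MHTML bytes -> {'title': ..., 'text': ...}."""
    html = extract_html_part(mhtml_bytes)
    if html is None:
        return {"title": "", "text": ""}
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), "text": "\n".join(parser.chunks)}


# Made-up sample data: a tiny MHTML document with one quoted-printable part.
SAMPLE = (
    b"Subject: Example\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/related; type="text/html"; boundary="----b1"\r\n'
    b"\r\n"
    b"------b1\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"<html><head><title>Hi</title></head>\r\n"
    b"<body><p>Hello =3D world</p></body></html>\r\n"
    b"------b1--\r\n"
)

if __name__ == "__main__":
    print(mhtml_to_text(SAMPLE))
```

A real loader would likely also record the source path in the document metadata and handle files with several HTML parts; both are left out here to keep the sketch short.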

---------

Co-authored-by: rlm <pexpresss31@gmail.com>
2023-06-25 13:12:08 -07:00
example_data feat: Add UnstructuredRSTLoader (#6594) 2023-06-25 12:41:57 -07:00
acreom.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
airbyte_json.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
airtable.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
alibaba_cloud_maxcompute.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
apify_dataset.ipynb docs/fix links (#6498) 2023-06-20 14:06:50 -07:00
arxiv.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
aws_s3_directory.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
aws_s3_file.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
azlyrics.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
azure_blob_storage_container.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
azure_blob_storage_file.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
bibtex.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
bilibili.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
blackboard.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
blockchain.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
chatgpt_loader.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
college_confidential.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
confluence.ipynb fix titles in documentation 2023-06-17 11:09:11 -07:00
conll-u.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
copypaste.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
csv.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
diffbot.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
discord.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
docugami.ipynb docs/fix links (#6498) 2023-06-20 14:06:50 -07:00
duckdb.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
email.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
embaas.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
epub.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
evernote.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
excel.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
facebook_chat.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
fauna.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
figma.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
git.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
gitbook.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
github.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
google_bigquery.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
google_cloud_storage_directory.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
google_cloud_storage_file.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
google_drive.ipynb Harrison/gdrive enhancements (#6375) 2023-06-18 11:07:23 -07:00
gutenberg.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
hacker_news.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
hugging_face_dataset.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
ifixit.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
image_captions.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
image.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
imsdb.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
iugu.ipynb Minor Grammar Fixes in Docs and Comments (#6536) 2023-06-21 09:53:31 -07:00
joplin.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
jupyter_notebook.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
mastodon.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
mediawikidump.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
merge_doc_loader.ipynb Create merge loader that combines documents from a set of loaders (#6659) 2023-06-23 13:02:48 -07:00
mhtml.ipynb Added a MHTML document loader (#6311) 2023-06-25 13:12:08 -07:00
microsoft_onedrive.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
microsoft_powerpoint.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
microsoft_word.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
modern_treasury.ipynb Minor Grammar Fixes in Docs and Comments (#6536) 2023-06-21 09:53:31 -07:00
notion.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
notiondb.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
obsidian.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
odt.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
open_city_data.ipynb Loader for OpenCityData and minor cleanups to Pandas, Airtable loaders (#6301) 2023-06-22 22:20:42 -07:00
pandas_dataframe.ipynb Loader for OpenCityData and minor cleanups to Pandas, Airtable loaders (#6301) 2023-06-22 22:20:42 -07:00
psychic.ipynb docs/fix links (#6498) 2023-06-20 14:06:50 -07:00
pyspark_dataframe.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
readthedocs_documentation.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
recursive_url_loader.ipynb Recursive URL loader (#6455) 2023-06-23 13:09:00 -07:00
reddit.ipynb docs/fix links (#6498) 2023-06-20 14:06:50 -07:00
roam.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
rst.ipynb feat: Add UnstructuredRSTLoader (#6594) 2023-06-25 12:41:57 -07:00
sitemap.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
slack.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
snowflake.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
spreedly.ipynb Minor Grammar Fixes in Docs and Comments (#6536) 2023-06-21 09:53:31 -07:00
stripe.ipynb Minor Grammar Fixes in Docs and Comments (#6536) 2023-06-21 09:53:31 -07:00
subtitle.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
telegram.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
tomarkdown.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
toml.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
trello.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
twitter.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
unstructured_file.ipynb Harrison/unstructured page number (#6464) 2023-06-19 22:31:43 -07:00
url.ipynb Add markdown to specify important arguments (#6246) 2023-06-18 17:47:00 -07:00
weather.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
web_base.ipynb Update web_base.ipynb (#6430) 2023-06-19 21:43:35 -07:00
whatsapp_chat.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
wikipedia.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
xml.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
youtube_audio.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
youtube_transcript.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00