mirror of
https://github.com/hwchase17/langchain.git
synced 2025-09-08 22:42:05 +00:00
Added a MHTML document loader (#6311)
MHTML is a very interesting format since it's used both for emails but also for archived webpages. Some scraping projects want to store pages in disk to process them later, mhtml is perfect for that use case. This is heavily inspired from the beautifulsoup html loader, but extracting the html part from the mhtml file. --------- Co-authored-by: rlm <pexpresss31@gmail.com>
This commit is contained in:
committed by
GitHub
parent
05eec99269
commit
87802c86d9
108
tests/integration_tests/examples/example.mht
Normal file
108
tests/integration_tests/examples/example.mht
Normal file
@@ -0,0 +1,108 @@
|
||||
From: <Saved by Blink>
|
||||
Snapshot-Content-Location: https://langchain.com/
|
||||
Subject:
|
||||
Date: Fri, 16 Jun 2023 19:32:59 -0000
|
||||
MIME-Version: 1.0
|
||||
Content-Type: multipart/related;
|
||||
type="text/html";
|
||||
boundary="----MultipartBoundary--dYaUgeoeP18TqraaeOwkeZyu1vI09OtkFwH2rcnJMt----"
|
||||
|
||||
|
||||
------MultipartBoundary--dYaUgeoeP18TqraaeOwkeZyu1vI09OtkFwH2rcnJMt----
|
||||
Content-Type: text/html
|
||||
Content-ID: <frame-2F1DB31BBD26C55A7F1EEC7561350515@mhtml.blink>
|
||||
Content-Transfer-Encoding: quoted-printable
|
||||
Content-Location: https://langchain.com/
|
||||
|
||||
<html><head><title>LangChain</title><meta http-equiv=3D"Content-Type" content=3D"text/html; charset=
|
||||
=3DUTF-8"><link rel=3D"stylesheet" type=3D"text/css" href=3D"cid:css-c9ac93=
|
||||
be-2ab2-46d8-8690-80da3a6d1832@mhtml.blink" /></head><body data-new-gr-c-s-=
|
||||
check-loaded=3D"14.1112.0" data-gr-ext-installed=3D""><p align=3D"center">
|
||||
<b><font size=3D"6">L</font><font size=3D"4">ANG </font><font size=3D"6">C=
|
||||
</font><font size=3D"4">HAIN </font><font size=3D"2">=F0=9F=A6=9C=EF=B8=8F=
|
||||
=F0=9F=94=97</font><br>Official Home Page</b><font size=3D"1"> </font>=
|
||||
</p>
|
||||
|
||||
<hr>
|
||||
<center>
|
||||
<table border=3D"0" cellspacing=3D"0" width=3D"90%">
|
||||
<tbody>
|
||||
<tr>
|
||||
<td height=3D"55" valign=3D"top" width=3D"50%">
|
||||
<ul>
|
||||
<li><a href=3D"https://langchain.com/integrations.html">Integration=
|
||||
s</a>=20
|
||||
</li></ul></td>
|
||||
<td height=3D"45" valign=3D"top" width=3D"50%">
|
||||
<ul>
|
||||
<li><a href=3D"https://langchain.com/features.html">Features</a>=20
|
||||
</li></ul></td></tr>
|
||||
<tr>
|
||||
<td height=3D"55" valign=3D"top" width=3D"50%">
|
||||
<ul>
|
||||
<li><a href=3D"https://blog.langchain.dev/">Blog</a>=20
|
||||
</li></ul></td>
|
||||
<td height=3D"45" valign=3D"top" width=3D"50%">
|
||||
<ul>
|
||||
<li><a href=3D"https://docs.langchain.com/docs/">Conceptual Guide</=
|
||||
a>=20
|
||||
</li></ul></td></tr>
|
||||
|
||||
<tr>
|
||||
<td height=3D"45" valign=3D"top" width=3D"50%">
|
||||
<ul>
|
||||
<li><a href=3D"https://github.com/hwchase17/langchain">Python Repo<=
|
||||
/a></li></ul></td>
|
||||
<td height=3D"45" valign=3D"top" width=3D"50%">
|
||||
<ul>
|
||||
<li><a href=3D"https://github.com/hwchase17/langchainjs">JavaScript=
|
||||
Repo</a></li></ul></td></tr>
|
||||
=20
|
||||
=09
|
||||
<tr>
|
||||
<td height=3D"45" valign=3D"top" width=3D"50%">
|
||||
<ul>
|
||||
<li><a href=3D"https://python.langchain.com/en/latest/">Python Docu=
|
||||
mentation</a> </li></ul></td>
|
||||
<td height=3D"45" valign=3D"top" width=3D"50%">
|
||||
<ul>
|
||||
<li><a href=3D"https://js.langchain.com/docs/">JavaScript Document=
|
||||
ation</a>
|
||||
</li></ul></td></tr>
|
||||
<tr>
|
||||
<td height=3D"45" valign=3D"top" width=3D"50%">
|
||||
<ul>
|
||||
<li><a href=3D"https://github.com/hwchase17/chat-langchain">Python =
|
||||
ChatLangChain</a> </li></ul></td>
|
||||
<td height=3D"45" valign=3D"top" width=3D"50%">
|
||||
<ul>
|
||||
<li><a href=3D"https://github.com/sullivan-sean/chat-langchainjs">=
|
||||
JavaScript ChatLangChain</a>
|
||||
</li></ul></td></tr>
|
||||
<tr>
|
||||
<td height=3D"45" valign=3D"top" width=3D"50%">
|
||||
<ul>
|
||||
<li><a href=3D"https://discord.gg/6adMQxSpJS">Discord</a> </li></ul=
|
||||
></td>
|
||||
<td height=3D"55" valign=3D"top" width=3D"50%">
|
||||
<ul>
|
||||
<li><a href=3D"https://twitter.com/langchainai">Twitter</a>
|
||||
</li></ul></td></tr>
|
||||
=09
|
||||
|
||||
|
||||
</tbody></table></center>
|
||||
<hr>
|
||||
<font size=3D"2">
|
||||
<p>If you have any comments about our WEB page, you can=20
|
||||
write us at the address shown above. However, due to=20
|
||||
the limited number of personnel in our corporate office, we are unable to=
|
||||
=20
|
||||
provide a direct response.</p></font>
|
||||
<hr>
|
||||
<p align=3D"left"><font size=3D"2">Copyright =C2=A9 2023-2023<b> LangChain =
|
||||
Inc.</b></font><font size=3D"2">=20
|
||||
</font></p>
|
||||
</body></html>
|
||||
|
||||
------MultipartBoundary--dYaUgeoeP18TqraaeOwkeZyu1vI09OtkFwH2rcnJMt------
|
Reference in New Issue
Block a user