langchain/docs/modules/indexes/document_loaders/examples
Chetanya Rastogi aead062a70
Add an example tutorial for using PDFMinerPDFasHTMLLoader (#2960)
Last week I added the `PDFMinerPDFasHTMLLoader`. This change adds
example code to the notebook as a tutorial on using that loader to
split a PDF into snippets grouped by section. The other loaders only
provide `Document` objects segmented by page, which is a coarse split
given how much other metadata can be extracted.

With the new loader, one can use the font size of the text to decide
when a new section starts and segment the text more semantically, as
shown in the tutorial notebook. The cell shows that we can retrieve
the content of the entire section under **Related Work** for the example
PDF, which is spread across 2 pages and hence is stored as two separate
documents by other loaders.
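
The font-size idea described above can be sketched roughly as follows. This is not the notebook's code: it is a minimal standard-library illustration that assumes the loader's HTML output carries each text run in a `<span>` whose `style` attribute includes a `font-size:Npx` declaration. The names `SpanCollector` and `split_into_sections`, the sample HTML, and the heading threshold of 12 are all hypothetical.

```python
import re
from html.parser import HTMLParser

# Assumption: font sizes appear as "font-size:12px" inside a span's style attribute.
FONT_SIZE_RE = re.compile(r"font-size:\s*(\d+)px")


class SpanCollector(HTMLParser):
    """Collect (font_size, text) pairs from styled <span> elements."""

    def __init__(self):
        super().__init__()
        self.snippets = []
        self._size = None  # font size of the span currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            match = FONT_SIZE_RE.search(dict(attrs).get("style") or "")
            self._size = int(match.group(1)) if match else None

    def handle_endtag(self, tag):
        if tag == "span":
            # Simplification: nested spans are not tracked.
            self._size = None

    def handle_data(self, data):
        if self._size is not None and data.strip():
            self.snippets.append((self._size, data.strip()))


def split_into_sections(snippets, heading_min_size):
    """Start a new section at every snippet at least as large as heading_min_size."""
    sections = []
    for size, text in snippets:
        if size >= heading_min_size:
            sections.append({"heading": text, "content": ""})
        elif sections:
            # Body text is appended to the most recent heading;
            # text before the first heading is dropped in this sketch.
            sections[-1]["content"] = (sections[-1]["content"] + " " + text).strip()
    return sections


# Hypothetical HTML: one section whose body continues across a page break.
sample_html = (
    '<div><span style="font-family: Times; font-size:12px">Related Work</span></div>'
    '<div><span style="font-family: Times; font-size:9px">First part on page one.</span></div>'
    '<div><span style="font-family: Times; font-size:9px">Second part on page two.</span></div>'
    '<div><span style="font-family: Times; font-size:12px">Method</span></div>'
    '<div><span style="font-family: Times; font-size:9px">Details.</span></div>'
)

collector = SpanCollector()
collector.feed(sample_html)
sections = split_into_sections(collector.snippets, heading_min_size=12)
for section in sections:
    print(section["heading"], "->", section["content"])
```

Because the grouping depends only on font size and not on page boundaries, both body snippets end up under the same heading. In practice the threshold would have to be chosen per document, for example by inspecting which font sizes occur in it.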
2023-04-16 08:34:39 -07:00
example_data Add file filter param to Git loader (#2904) 2023-04-14 10:45:54 -07:00
airbyte_json.ipynb big docs refactor (#1978) 2023-03-26 19:49:46 -07:00
apify_dataset.ipynb Harrison/apify (#2215) 2023-03-30 20:58:14 -07:00
azlyrics.ipynb big docs refactor (#1978) 2023-03-26 19:49:46 -07:00
azure_blob_storage_container.ipynb Harrison/site map (#2061) 2023-03-27 16:28:08 -07:00
azure_blob_storage_file.ipynb Harrison/site map (#2061) 2023-03-27 16:28:08 -07:00
bigquery.ipynb Harrison/big query (#2100) 2023-03-28 08:17:22 -07:00
bilibili.ipynb Added bilibili loader (#2673) (#2724) 2023-04-11 10:40:32 -07:00
blackboard.ipynb big docs refactor (#1978) 2023-03-26 19:49:46 -07:00
college_confidential.ipynb Harrison/site map (#2061) 2023-03-27 16:28:08 -07:00
CoNLL-U.ipynb big docs refactor (#1978) 2023-03-26 19:49:46 -07:00
copypaste.ipynb big docs refactor (#1978) 2023-03-26 19:49:46 -07:00
csv.ipynb [Docs] minor fixes to loaders links and rst warnings (#2846) 2023-04-13 10:54:40 -07:00
dataframe.ipynb Harrison/document cleanup (#2062) 2023-03-27 16:32:55 -07:00
directory_loader.ipynb big docs refactor (#1978) 2023-03-26 19:49:46 -07:00
duckdb.ipynb Harrison/duckdb (#2064) 2023-03-27 19:51:34 -07:00
email.ipynb Harrison/msg files (#2375) 2023-04-04 06:48:34 -07:00
epub.ipynb bump version to 128 (#2236) 2023-03-31 11:16:21 -07:00
evernote.ipynb big docs refactor (#1978) 2023-03-26 19:49:46 -07:00
facebook_chat.ipynb big docs refactor (#1978) 2023-03-26 19:49:46 -07:00
figma.ipynb [Documents] Updated Figma docs and added example (#2172) 2023-03-29 22:11:45 -07:00
gcs_directory.ipynb big docs refactor (#1978) 2023-03-26 19:49:46 -07:00
gcs_file.ipynb big docs refactor (#1978) 2023-03-26 19:49:46 -07:00
git.ipynb Add file filter param to Git loader (#2904) 2023-04-14 10:45:54 -07:00
gitbook.ipynb big docs refactor (#1978) 2023-03-26 19:49:46 -07:00
googledrive.ipynb big docs refactor (#1978) 2023-03-26 19:49:46 -07:00
gutenberg.ipynb big docs refactor (#1978) 2023-03-26 19:49:46 -07:00
hn.ipynb big docs refactor (#1978) 2023-03-26 19:49:46 -07:00
html.ipynb big docs refactor (#1978) 2023-03-26 19:49:46 -07:00
ifixit.ipynb big docs refactor (#1978) 2023-03-26 19:49:46 -07:00
image.ipynb big docs refactor (#1978) 2023-03-26 19:49:46 -07:00
imsdb.ipynb big docs refactor (#1978) 2023-03-26 19:49:46 -07:00
markdown.ipynb big docs refactor (#1978) 2023-03-26 19:49:46 -07:00
notebook.ipynb big docs refactor (#1978) 2023-03-26 19:49:46 -07:00
notion.ipynb big docs refactor (#1978) 2023-03-26 19:49:46 -07:00
notiondb.ipynb feat: Add Notion database document loader (#2056) 2023-03-28 08:07:09 -07:00
obsidian.ipynb big docs refactor (#1978) 2023-03-26 19:49:46 -07:00
pdf.ipynb Add an example tutorial for using PDFMinerPDFasHTMLLoader (#2960) 2023-04-16 08:34:39 -07:00
powerpoint.ipynb big docs refactor (#1978) 2023-03-26 19:49:46 -07:00
readthedocs_documentation.ipynb big docs refactor (#1978) 2023-03-26 19:49:46 -07:00
roam.ipynb big docs refactor (#1978) 2023-03-26 19:49:46 -07:00
s3_directory.ipynb big docs refactor (#1978) 2023-03-26 19:49:46 -07:00
s3_file.ipynb big docs refactor (#1978) 2023-03-26 19:49:46 -07:00
sitemap.ipynb docs: tiny fix on docs verbiage (#2124) 2023-03-28 22:56:29 -07:00
slack_directory.ipynb Add Slack Directory Loader (#2841) 2023-04-13 21:31:59 -07:00
srt.ipynb big docs refactor (#1978) 2023-03-26 19:49:46 -07:00
telegram.ipynb big docs refactor (#1978) 2023-03-26 19:49:46 -07:00
unstructured_file.ipynb feat: document loader for epublications (#2202) 2023-03-30 20:45:31 -07:00
url.ipynb Harrison/playwright (#2871) 2023-04-13 22:15:03 -07:00
web_base.ipynb [Docs] minor fixes to loaders links and rst warnings (#2846) 2023-04-13 10:54:40 -07:00
whatsapp_chat.ipynb Harrison/whatsapp loader (#2085) 2023-03-27 23:43:45 -07:00
word_document.ipynb big docs refactor (#1978) 2023-03-26 19:49:46 -07:00
youtube.ipynb big docs refactor (#1978) 2023-03-26 19:49:46 -07:00