Add Mathpix pdf loader (#3727)

Inspiration:
https://twitter.com/danielgross/status/1651695062307274754?s=46&t=1zHLap5WG4I_kQPPjfW9fA

Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
Davis Chase
2023-04-28 20:11:22 -07:00
committed by GitHub
parent 37ed6f2177
commit 220a7076ac
4 changed files with 167 additions and 9 deletions

@@ -155,6 +155,46 @@
" print(str(doc.metadata[\"page\"]) + \":\", doc.page_content)"
]
},
{
"cell_type": "markdown",
"id": "6d5c9879",
"metadata": {},
"source": [
"## Using MathPix\n",
"\n",
"Inspired by Daniel Gross's [https://gist.github.com/danielgross/3ab4104e14faccc12b49200843adab21](https://gist.github.com/danielgross/3ab4104e14faccc12b49200843adab21)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "950eb58f",
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders import MathpixPDFLoader"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cb6fd473",
"metadata": {},
"outputs": [],
"source": [
"loader = MathpixPDFLoader(\"example_data/layout-parser-paper.pdf\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a1d41e1a",
"metadata": {},
"outputs": [],
"source": [
"data = loader.load()"
]
},
{
"cell_type": "markdown",
"id": "09d64998",
@@ -568,7 +608,7 @@
},
{
"cell_type": "markdown",
"id": "f0048206",
"id": "15b57eab",
"metadata": {},
"source": [
"## PyPDF Directory\n",
@@ -579,7 +619,7 @@
{
"cell_type": "code",
"execution_count": 1,
"id": "ecd0cb16",
"id": "b9e521d9",
"metadata": {},
"outputs": [],
"source": [
@@ -589,7 +629,7 @@
{
"cell_type": "code",
"execution_count": 2,
"id": "96592167",
"id": "4b20590f",
"metadata": {},
"outputs": [],
"source": [
@@ -599,7 +639,7 @@
{
"cell_type": "code",
"execution_count": 3,
"id": "c750454c",
"id": "e5ead943",
"metadata": {},
"outputs": [],
"source": [
@@ -609,7 +649,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "ab7f8fdb",
"id": "ea25b03c",
"metadata": {},
"outputs": [],
"source": []