mirror of
				https://github.com/hwchase17/langchain.git
				synced 2025-10-31 16:08:59 +00:00 
			
		
		
		
	# Bibtex integration Wrap bibtexparser to retrieve a list of docs from a bibtex file. * Get the metadata from the bibtex entries * `page_content` get from the local pdf referenced in the `file` field of the bibtex entry using `pymupdf` * If no valid pdf file, `page_content` set to the `abstract` field of the bibtex entry * Support Zotero flavour using regex to get the file path * Added usage example in `docs/modules/indexes/document_loaders/examples/bibtex.ipynb` --------- Co-authored-by: Sébastien M. Popoff <sebastien.popoff@espci.fr> Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>
		
			
				
	
	
		
			191 lines
		
	
	
		
			6.3 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
			
		
		
	
	
			191 lines
		
	
	
		
			6.3 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
| {
 | ||
|  "cells": [
 | ||
|   {
 | ||
|    "cell_type": "markdown",
 | ||
|    "id": "bda1f3f5",
 | ||
|    "metadata": {},
 | ||
|    "source": [
 | ||
|     "# BibTeX\n",
 | ||
|     "\n",
 | ||
|     "> BibTeX is a file format and reference management system commonly used in conjunction with LaTeX typesetting. It serves as a way to organize and store bibliographic information for academic and research documents.\n",
 | ||
|     "\n",
 | ||
|     "BibTeX files have a .bib extension and consist of plain text entries representing references to various publications, such as books, articles, conference papers, theses, and more. Each BibTeX entry follows a specific structure and contains fields for different bibliographic details like author names, publication title, journal or book title, year of publication, page numbers, and more.\n",
 | ||
|     "\n",
 | ||
|     "Bibtex files can also store the path to documents, such as `.pdf` files that can be retrieved."
 | ||
|    ]
 | ||
|   },
 | ||
|   {
 | ||
|    "cell_type": "markdown",
 | ||
|    "id": "1b7a1eef-7bf7-4e7d-8bfc-c4e27c9488cb",
 | ||
|    "metadata": {},
 | ||
|    "source": [
 | ||
|     "## Installation\n",
 | ||
|     "First, you need to install `bibtexparser` and `PyMuPDF`."
 | ||
|    ]
 | ||
|   },
 | ||
|   {
 | ||
|    "cell_type": "code",
 | ||
|    "execution_count": 26,
 | ||
|    "id": "b674aaea-ed3a-4541-8414-260a8f67f623",
 | ||
|    "metadata": {
 | ||
|     "tags": []
 | ||
|    },
 | ||
|    "outputs": [],
 | ||
|    "source": [
 | ||
|     "#!pip install bibtexparser pymupdf"
 | ||
|    ]
 | ||
|   },
 | ||
|   {
 | ||
|    "cell_type": "markdown",
 | ||
|    "id": "95f05e1c-195e-4e2b-ae8e-8d6637f15be6",
 | ||
|    "metadata": {},
 | ||
|    "source": [
 | ||
|     "## Examples"
 | ||
|    ]
 | ||
|   },
 | ||
|   {
 | ||
|    "cell_type": "markdown",
 | ||
|    "id": "e29b954c-1407-4797-ae21-6ba8937156be",
 | ||
|    "metadata": {},
 | ||
|    "source": [
 | ||
|     "`BibtexLoader` has these arguments:\n",
 | ||
|     "- `file_path`: the path the the `.bib` bibtex file\n",
 | ||
|     "- optional `max_docs`: default=None, i.e. not limit. Use it to limit number of retrieved documents.\n",
 | ||
|     "- optional `max_content_chars`: default=4000. Use it to limit the number of characters in a single document.\n",
 | ||
|     "- optional `load_extra_meta`: default=False. By default only the most important fields from the bibtex entries: `Published` (publication year), `Title`, `Authors`, `Summary`, `Journal`, `Keywords`, and `URL`. If True, it will also try to load return `entry_id`, `note`, `doi`, and `links` fields. \n",
 | ||
|     "- optional `file_pattern`: default=`r'[^:]+\\.pdf'`. Regex pattern to find files in the `file` entry. Default pattern supports `Zotero` flavour bibtex style and bare file path."
 | ||
|    ]
 | ||
|   },
 | ||
|   {
 | ||
|    "cell_type": "code",
 | ||
|    "execution_count": 27,
 | ||
|    "id": "9bfd5e46",
 | ||
|    "metadata": {
 | ||
|     "tags": []
 | ||
|    },
 | ||
|    "outputs": [],
 | ||
|    "source": [
 | ||
|     "from langchain.document_loaders import BibtexLoader"
 | ||
|    ]
 | ||
|   },
 | ||
|   {
 | ||
|    "cell_type": "code",
 | ||
|    "execution_count": 28,
 | ||
|    "id": "01971b53",
 | ||
|    "metadata": {},
 | ||
|    "outputs": [],
 | ||
|    "source": [
 | ||
|     "# Create a dummy bibtex file and download a pdf.\n",
 | ||
|     "import urllib.request\n",
 | ||
|     "\n",
 | ||
|     "urllib.request.urlretrieve(\"https://www.fourmilab.ch/etexts/einstein/specrel/specrel.pdf\", \"einstein1905.pdf\")\n",
 | ||
|     "\n",
 | ||
|     "bibtex_text = \"\"\"\n",
 | ||
|     "    @article{einstein1915,\n",
 | ||
|     "        title={Die Feldgleichungen der Gravitation},\n",
 | ||
|     "        abstract={Die Grundgleichungen der Gravitation, die ich hier entwickeln werde, wurden von mir in einer Abhandlung: ,,Die formale Grundlage der allgemeinen Relativit{\\\"a}tstheorie`` in den Sitzungsberichten der Preu{\\ss}ischen Akademie der Wissenschaften 1915 ver{\\\"o}ffentlicht.},\n",
 | ||
|     "        author={Einstein, Albert},\n",
 | ||
|     "        journal={Sitzungsberichte der K{\\\"o}niglich Preu{\\ss}ischen Akademie der Wissenschaften},\n",
 | ||
|     "        volume={1915},\n",
 | ||
|     "        number={1},\n",
 | ||
|     "        pages={844--847},\n",
 | ||
|     "        year={1915},\n",
 | ||
|     "        doi={10.1002/andp.19163540702},\n",
 | ||
|     "        link={https://onlinelibrary.wiley.com/doi/abs/10.1002/andp.19163540702},\n",
 | ||
|     "        file={einstein1905.pdf}\n",
 | ||
|     "    }\n",
 | ||
|     "    \"\"\"\n",
 | ||
|     "# save bibtex_text to biblio.bib file\n",
 | ||
|     "with open(\"./biblio.bib\", \"w\") as file:\n",
 | ||
|     "    file.write(bibtex_text)"
 | ||
|    ]
 | ||
|   },
 | ||
|   {
 | ||
|    "cell_type": "code",
 | ||
|    "execution_count": 29,
 | ||
|    "id": "2631f46b",
 | ||
|    "metadata": {},
 | ||
|    "outputs": [],
 | ||
|    "source": [
 | ||
|     "docs = BibtexLoader(\"./biblio.bib\").load()"
 | ||
|    ]
 | ||
|   },
 | ||
|   {
 | ||
|    "cell_type": "code",
 | ||
|    "execution_count": 30,
 | ||
|    "id": "33ef1fb2",
 | ||
|    "metadata": {},
 | ||
|    "outputs": [
 | ||
|     {
 | ||
|      "data": {
 | ||
|       "text/plain": [
 | ||
|        "{'id': 'einstein1915',\n",
 | ||
|        " 'published_year': '1915',\n",
 | ||
|        " 'title': 'Die Feldgleichungen der Gravitation',\n",
 | ||
|        " 'publication': 'Sitzungsberichte der K{\"o}niglich Preu{\\\\ss}ischen Akademie der Wissenschaften',\n",
 | ||
|        " 'authors': 'Einstein, Albert',\n",
 | ||
|        " 'abstract': 'Die Grundgleichungen der Gravitation, die ich hier entwickeln werde, wurden von mir in einer Abhandlung: ,,Die formale Grundlage der allgemeinen Relativit{\"a}tstheorie`` in den Sitzungsberichten der Preu{\\\\ss}ischen Akademie der Wissenschaften 1915 ver{\"o}ffentlicht.',\n",
 | ||
|        " 'url': 'https://doi.org/10.1002/andp.19163540702'}"
 | ||
|       ]
 | ||
|      },
 | ||
|      "execution_count": 30,
 | ||
|      "metadata": {},
 | ||
|      "output_type": "execute_result"
 | ||
|     }
 | ||
|    ],
 | ||
|    "source": [
 | ||
|     "docs[0].metadata"
 | ||
|    ]
 | ||
|   },
 | ||
|   {
 | ||
|    "cell_type": "code",
 | ||
|    "execution_count": 31,
 | ||
|    "id": "46969806-45a9-4c4d-a61b-cfb9658fc9de",
 | ||
|    "metadata": {
 | ||
|     "tags": []
 | ||
|    },
 | ||
|    "outputs": [
 | ||
|     {
 | ||
|      "name": "stdout",
 | ||
|      "output_type": "stream",
 | ||
|      "text": [
 | ||
|       "ON THE ELECTRODYNAMICS OF MOVING\n",
 | ||
|       "BODIES\n",
 | ||
|       "By A. EINSTEIN\n",
 | ||
|       "June 30, 1905\n",
 | ||
|       "It is known that Maxwell’s electrodynamics—as usually understood at the\n",
 | ||
|       "present time—when applied to moving bodies, leads to asymmetries which do\n",
 | ||
|       "not appear to be inherent in the phenomena. Take, for example, the recipro-\n",
 | ||
|       "cal electrodynamic action of a magnet and a conductor. The observable phe-\n",
 | ||
|       "nomenon here depends only on the r\n"
 | ||
|      ]
 | ||
|     }
 | ||
|    ],
 | ||
|    "source": [
 | ||
|     "print(docs[0].page_content[:400])  # all pages of the pdf content"
 | ||
|    ]
 | ||
|   }
 | ||
|  ],
 | ||
|  "metadata": {
 | ||
|   "kernelspec": {
 | ||
|    "display_name": "Python 3 (ipykernel)",
 | ||
|    "language": "python",
 | ||
|    "name": "python3"
 | ||
|   },
 | ||
|   "language_info": {
 | ||
|    "codemirror_mode": {
 | ||
|     "name": "ipython",
 | ||
|     "version": 3
 | ||
|    },
 | ||
|    "file_extension": ".py",
 | ||
|    "mimetype": "text/x-python",
 | ||
|    "name": "python",
 | ||
|    "nbconvert_exporter": "python",
 | ||
|    "pygments_lexer": "ipython3",
 | ||
|    "version": "3.11.3"
 | ||
|   }
 | ||
|  },
 | ||
|  "nbformat": 4,
 | ||
|  "nbformat_minor": 5
 | ||
| }
 |