Harrison/format agent instructions (#973)

Co-authored-by: Andrew White <white.d.andrew@gmail.com>
Co-authored-by: Harrison Chase <harrisonchase@Harrisons-MBP.attlocal.net>
Co-authored-by: Peng Qu <82029664+pengqu123@users.noreply.github.com>
2025-09-05 21:12:48 +00:00 · 2023-02-10 10:07:26 -08:00
parent 5469d898a9
commit c64f98e2bb
10 changed files with 441 additions and 187 deletions
--- a/docs/modules/document_loaders/examples/pdf.ipynb
+++ b/docs/modules/document_loaders/examples/pdf.ipynb
@@ -10,6 +10,133 @@
    "This covers how to load pdfs into a document format that we can use downstream."
   ]
  },
+  {
+   "cell_type": "markdown",
+   "id": "743f9413",
+   "metadata": {},
+   "source": [
+    "## Using PyPDF\n",
+    "\n",
+    "This loader also records the page number in each document's metadata."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "c428b0c5",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain.document_loaders import PagedPDFSplitter\n",
+    "\n",
+    "loader = PagedPDFSplitter(\"example_data/layout-parser-paper.pdf\")\n",
+    "pages = loader.load_and_split()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ebd895e4",
+   "metadata": {},
+   "source": [
+    "An advantage of this approach is that documents can be retrieved with page numbers."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "87fa7b3a",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "9: 10 Z. Shen et al.\n",
+      "Fig. 4: Illustration of (a) the original historical Japanese document with layout\n",
+      "detection results and (b) a recreated version of the document image that achieves\n",
+      "much better character recognition recall. The reorganization algorithm rearranges\n",
+      "the tokens based on the their detected bounding boxes given a maximum allowed\n",
+      "height.\n",
+      "4LayoutParser Community Platform\n",
+      "Another focus of LayoutParser is promoting the reusability of layout detection\n",
+      "models and full digitization pipelines. Similar to many existing deep learning\n",
+      "libraries, LayoutParser comes with a community model hub for distributing\n",
+      "layout models. End-users can upload their self-trained models to the model hub,\n",
+      "and these models can be loaded into a similar interface as the currently available\n",
+      "LayoutParser pre-trained models. For example, the model trained on the News\n",
+      "Navigator dataset [17] has been incorporated in the model hub.\n",
+      "Beyond DL models, LayoutParser also promotes the sharing of entire doc-\n",
+      "ument digitization pipelines. For example, sometimes the pipeline requires the\n",
+      "combination of multiple DL models to achieve better accuracy. Currently, pipelines\n",
+      "are mainly described in academic papers and implementations are often not pub-\n",
+      "licly available. To this end, the LayoutParser community platform also enables\n",
+      "the sharing of layout pipelines to promote the discussion and reuse of techniques.\n",
+      "For each shared pipeline, it has a dedicated project page, with links to the source\n",
+      "code, documentation, and an outline of the approaches. A discussion panel is\n",
+      "provided for exchanging ideas. Combined with the core LayoutParser library,\n",
+      "users can easily build reusable components based on the shared pipelines and\n",
+      "apply them to solve their unique problems.\n",
+      "5 Use Cases\n",
+      "The core objective of LayoutParser is to make it easier to create both large-scale\n",
+      "and light-weight document digitization pipelines. Large-scale document processing\n",
+      "3: 4 Z. Shen et al.\n",
+      "Efficient Data AnnotationC u s t o m i z e d  M o d e l  T r a i n i n gModel Cust omizationDI A Model HubDI A Pipeline SharingCommunity PlatformLa y out Detection ModelsDocument Images \n",
+      "T h e  C o r e  L a y o u t P a r s e r  L i b r a r yOCR ModuleSt or age & VisualizationLa y out Data Structur e\n",
+      "Fig. 1: The overall architecture of LayoutParser . For an input document image,\n",
+      "the core LayoutParser library provides a set of o\u000b",
+      "-the-shelf tools for layout\n",
+      "detection, OCR, visualization, and storage, backed by a carefully designed layout\n",
+      "data structure. LayoutParser also supports high level customization via e\u000ecient\n",
+      "layout annotation and model training functions. These improve model accuracy\n",
+      "on the target samples. The community platform enables the easy sharing of DIA\n",
+      "models and whole digitization pipelines to promote reusability and reproducibility.\n",
+      "A collection of detailed documentation, tutorials and exemplar projects make\n",
+      "LayoutParser easy to learn and use.\n",
+      "AllenNLP [ 8] and transformers [ 34] have provided the community with complete\n",
+      "DL-based support for developing and deploying models for general computer\n",
+      "vision and natural language processing problems. LayoutParser , on the other\n",
+      "hand, specializes speci\f",
+      "cally in DIA tasks. LayoutParser is also equipped with a\n",
+      "community platform inspired by established model hubs such as Torch Hub [23]\n",
+      "andTensorFlow Hub [1]. It enables the sharing of pretrained models as well as\n",
+      "full document processing pipelines that are unique to DIA tasks.\n",
+      "There have been a variety of document data collections to facilitate the\n",
+      "development of DL models. Some examples include PRImA [ 3](magazine layouts),\n",
+      "PubLayNet [ 38](academic paper layouts), Table Bank [ 18](tables in academic\n",
+      "papers), Newspaper Navigator Dataset [ 16,17](newspaper \f",
+      "gure layouts) and\n",
+      "HJDataset [31](historical Japanese document layouts). A spectrum of models\n",
+      "trained on these datasets are currently available in the LayoutParser model zoo\n",
+      "to support di\u000b",
+      "erent use cases.\n",
+      "3 The Core LayoutParser Library\n",
+      "At the core of LayoutParser is an o\u000b",
+      "-the-shelf toolkit that streamlines DL-\n",
+      "based document image analysis. Five components support a simple interface\n",
+      "with comprehensive functionalities: 1) The layout detection models enable using\n",
+      "pre-trained or self-trained DL models for layout detection with just four lines\n",
+      "of code. 2) The detected layout information is stored in carefully engineered\n"
+     ]
+    }
+   ],
+   "source": [
+    "from langchain.vectorstores import FAISS\n",
+    "from langchain.embeddings.openai import OpenAIEmbeddings\n",
+    "\n",
+    "faiss_index = FAISS.from_documents(pages, OpenAIEmbeddings())\n",
+    "docs = faiss_index.similarity_search(\"How will the community be engaged?\", k=2)\n",
+    "for doc in docs:\n",
+    "    print(str(doc.metadata[\"page\"]) + \":\", doc.page_content)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "09d64998",
+   "metadata": {},
+   "source": [
+    "## Using Unstructured"
+   ]
+  },
  {
   "cell_type": "code",
   "execution_count": 1,
@@ -65,7 +192,7 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-   "version": "3.10.9"
+   "version": "3.9.1"
  }
 },
 "nbformat": 4,
--- a/docs/modules/document_loaders/how_to_guides.rst
+++ b/docs/modules/document_loaders/how_to_guides.rst
@@ -27,6 +27,8 @@ There are a lot of different document loaders that LangChain supports. Below are

 `Roam <./examples/roam.html>`_: A walkthrough of how to load data from a Roam file export.

+`Evernote <./examples/everynote.html>`_: A walkthrough of how to load data from an Evernote (``.enex``) file.
+
 `YouTube <./examples/youtube.html>`_: A walkthrough of how to load the transcript from a YouTube video.

 `s3 File <./examples/s3_file.html>`_: A walkthrough of how to load a file from s3.