feat: document loader for epublications (#2202)

### Summary

Adds a new document loader for processing e-publications. Works with `unstructured>=0.5.4`. You need to have [`pandoc`](https://pandoc.org/installing.html) installed for this loader to work.

### Testing

```python
from langchain.document_loaders import UnstructuredEPubLoader

loader = UnstructuredEPubLoader("winter-sports.epub", mode="elements")
data = loader.load()
data[0]
```
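
Since the loader depends on `pandoc` being installed, it may help to confirm the binary is reachable before running the test above. The following is only a minimal sketch: the helper name `pandoc_available` is hypothetical and not part of this PR, and it assumes `pandoc` is expected on the `PATH`.

```python
import shutil


def pandoc_available(executable: str = "pandoc") -> bool:
    """Return True if the named executable can be found on PATH.

    Hypothetical helper for illustration; not part of langchain or unstructured.
    """
    return shutil.which(executable) is not None


if not pandoc_available():
    print("pandoc not found on PATH; install it before loading EPUB files")
```

This only checks that an executable with that name exists; it does not verify the pandoc version or that EPUB conversion will succeed.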
2025-09-05 04:55:14 +00:00 · 2023-03-30 23:45:31 -04:00
parent a4a1ee6b5d
commit 3dfe1cf60e
5 changed files with 154 additions and 5 deletions
--- a/docs/ecosystem/unstructured.md
+++ b/docs/ecosystem/unstructured.md
@@ -13,10 +13,11 @@ This page is broken into two parts: installation and setup, and then references
 - Install the Python SDK with `pip install "unstructured[local-inference]"`
 - Install the following system dependencies if they are not already available on your system.
  Depending on what document types you're parsing, you may not need all of these.
-    - `libmagic-dev`
-    - `poppler-utils`
-    - `tesseract-ocr`
-    - `libreoffice`
+    - `libmagic-dev` (filetype detection)
+    - `poppler-utils` (images and PDFs)
+    - `tesseract-ocr` (images and PDFs)
+    - `libreoffice` (MS Office docs)
+    - `pandoc` (EPUBs)
 - If you are parsing PDFs using the `"hi_res"` strategy, run the following to install the `detectron2` model, which
  `unstructured` uses for layout detection:
    - `pip install "detectron2@git+https://github.com/facebookresearch/detectron2.git@v0.6#egg=detectron2"`