Mirror of https://github.com/hwchase17/langchain.git, synced 2025-09-11 16:01:33 +00:00
feat: Add UnstructuredRSTLoader (#6594)
### Summary

Adds an `UnstructuredRSTLoader` for loading [reStructuredText](https://en.wikipedia.org/wiki/ReStructuredText) files.

### Testing

```python
from langchain.document_loaders import UnstructuredRSTLoader

loader = UnstructuredRSTLoader(
    file_path="example_data/README.rst", mode="elements"
)
docs = loader.load()
print(docs[0])
```

### Reviewers

- @hwchase17
- @rlancemartin
- @eyurtsev
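The testing snippet passes `mode="elements"`, while the integration test in this commit uses the default mode and asserts `len(docs) == 1`. The sketch below illustrates how those two modes could differ. It is a simplified stand-in, not the loader's actual code: `Doc` and `to_docs` are hypothetical names, the default mode is assumed to be `"single"`, and joining elements with blank lines is an assumption about how the combined text is built.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in for a loaded document (not the LangChain class)."""

    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def to_docs(elements: List[str], source: str, mode: str = "single") -> List[Doc]:
    """Group already-parsed text elements into documents according to `mode`."""
    if mode == "elements":
        # One document per parsed element (title, paragraph, list item, ...).
        return [Doc(text, {"source": source}) for text in elements]
    if mode == "single":
        # Assumption: all elements are joined into one document.
        return [Doc("\n\n".join(elements), {"source": source})]
    raise ValueError(f"unsupported mode: {mode!r}")


# Illustrative elements taken from the README.rst fixture in this commit.
elements = [
    "Example Docs",
    "The sample docs directory contains the following files:",
    "XBRL 10-K",
]
single = to_docs(elements, "README.rst")
per_element = to_docs(elements, "README.rst", mode="elements")
print(len(single), len(per_element))
```

Under these assumptions, the default mode yields a single document regardless of how many elements the parser finds, which is consistent with the assertion in the test.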
15
tests/integration_tests/document_loaders/test_rst.py
Normal file
@@ -0,0 +1,15 @@
import os
from pathlib import Path

from langchain.document_loaders import UnstructuredRSTLoader

EXAMPLE_DIRECTORY = file_path = Path(__file__).parent.parent / "examples"


def test_unstructured_rst_loader() -> None:
    """Test unstructured loader."""
    file_path = os.path.join(EXAMPLE_DIRECTORY, "README.rst")
    loader = UnstructuredRSTLoader(str(file_path))
    docs = loader.load()

    assert len(docs) == 1
28
tests/integration_tests/examples/README.rst
Normal file
@@ -0,0 +1,28 @@
Example Docs
------------

The sample docs directory contains the following files:

- ``example-10k.html`` - A 10-K SEC filing in HTML format
- ``layout-parser-paper.pdf`` - A PDF copy of the layout parser paper
- ``factbook.xml``/``factbook.xsl`` - Example XML/XLS files that you
  can use to test stylesheets

These documents can be used to test out the parsers in the library. In
addition, here are instructions for pulling in some sample docs that are
too big to store in the repo.

XBRL 10-K
^^^^^^^^^

You can get an example 10-K in inline XBRL format using the following
``curl``. Note, you need to have the user agent set in the header or the
SEC site will reject your request.

.. code:: bash

   curl -O \
     -A '${organization} ${email}'
     https://www.sec.gov/Archives/edgar/data/311094/000117184321001344/0001171843-21-001344.txt

You can parse this document using the HTML parser.