feat: add UnstructuredExcelLoader for .xlsx and .xls files (#5617)

# Unstructured Excel Loader

Adds an `UnstructuredExcelLoader` class for `.xlsx` and `.xls` files.
Works with `unstructured>=0.6.7`. A plain-text representation of the
Excel file is available under the `page_content` attribute of each
returned document. In `"elements"` mode, an HTML representation of the
Excel file is available under the `text_as_html` metadata key. Each
sheet in the Excel file becomes its own document.
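As a rough illustration of how a caller might consume these two representations, the sketch below prefers the HTML under `text_as_html` when it is present and falls back to `page_content` otherwise. `FakeDocument` and `html_or_text` are hypothetical names introduced here so the snippet runs without LangChain or `unstructured` installed; they are not part of this change, and the sample strings are made up.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List


@dataclass
class FakeDocument:
    """Hypothetical stand-in for a loaded document (not a LangChain class)."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def html_or_text(docs: Iterable[Any]) -> List[str]:
    """Return each doc's HTML representation if present, else its plain text."""
    return [doc.metadata.get("text_as_html", doc.page_content) for doc in docs]


# Made-up sample data: one doc shaped like "elements" mode output, one without HTML.
sample_docs = [
    FakeDocument(
        "Team Location Stanley Cups",
        {"text_as_html": "<table><tr><td>Team</td></tr></table>"},
    ),
    FakeDocument("plain text only"),
]

rendered = html_or_text(sample_docs)
```

The same helper should work on the list returned by `loader.load()`, assuming the returned documents expose `page_content` and a dict-like `metadata` as described above.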

## Testing

```python
from langchain.document_loaders import UnstructuredExcelLoader

loader = UnstructuredExcelLoader(
    "example_data/stanley-cups.xlsx",
    mode="elements"
)
docs = loader.load()
```

## Who can review?

@hwchase17
@eyurtsev
Matt Robinson
2023-06-03 15:44:12 -04:00
committed by GitHub
parent 9a7488a5ce
commit a97e4252e3
6 changed files with 118 additions and 0 deletions


@@ -0,0 +1,15 @@
import os
from pathlib import Path

from langchain.document_loaders import UnstructuredExcelLoader

EXAMPLE_DIRECTORY = file_path = Path(__file__).parent.parent / "examples"


def test_unstructured_excel_loader() -> None:
    """Test unstructured loader."""
    file_path = os.path.join(EXAMPLE_DIRECTORY, "stanley-cups.xlsx")
    loader = UnstructuredExcelLoader(str(file_path))
    docs = loader.load()

    assert len(docs) == 1

Binary file not shown.