langchain/tests/unit_tests/document_loader
Eugene Yurtsev 3c490b5ba3
Docugami DataLoader (#4727)
### Adds a document loader for Docugami

Specifically:

1. Adds a data loader that talks to the [Docugami](http://docugami.com)
API to download processed documents as semantic XML
2. Parses the semantic XML into chunks, with additional metadata
capturing chunk semantics
3. Adds a detailed notebook showing how you can use additional metadata
returned by Docugami for techniques like the [self-querying
retriever](https://python.langchain.com/en/latest/modules/indexes/retrievers/examples/self_query_retriever.html)
4. Adds an integration test and related documentation
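
As a rough illustration of item 2, the sketch below splits a small semantic XML document into text chunks and records each chunk's tag and path as metadata. It is not the loader added in this PR: the sample XML, its tag names, the `Chunk` class, and `chunk_semantic_xml` are all hypothetical and use only the Python standard library.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical semantic XML; the tag names are invented for illustration
# and are not taken from real Docugami output.
SAMPLE_XML = """
<Lease>
  <Parties>
    <Landlord>Acme Properties LLC</Landlord>
    <Tenant>Birch Street Bakery</Tenant>
  </Parties>
  <Term>
    <StartDate>2023-01-01</StartDate>
  </Term>
</Lease>
"""


@dataclass
class Chunk:
    """A piece of text plus metadata describing where it came from."""

    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


def chunk_semantic_xml(xml_text: str, source: str) -> List[Chunk]:
    """Emit one chunk per leaf element that contains non-empty text."""
    root = ET.fromstring(xml_text.strip())
    chunks: List[Chunk] = []

    def walk(node: ET.Element, parent_path: str) -> None:
        path = f"{parent_path}/{node.tag}"
        text = (node.text or "").strip()
        # A leaf element has no child elements; its tag names the chunk's meaning.
        if len(node) == 0 and text:
            chunks.append(
                Chunk(
                    text=text,
                    metadata={"source": source, "tag": node.tag, "xpath": path},
                )
            )
        for child in node:
            walk(child, path)

    walk(root, "")
    return chunks


if __name__ == "__main__":
    for chunk in chunk_semantic_xml(SAMPLE_XML, source="lease.xml"):
        print(chunk.metadata["xpath"], "->", chunk.text)
```

Metadata like the `tag` field here is the kind of information a metadata-aware retriever (item 3) could filter on, for example to restrict a query to chunks tagged `Landlord`.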

Here is an example of a result that is not possible without the
capabilities added by Docugami (from the notebook):

<img width="1585" alt="image"
src="https://github.com/hwchase17/langchain/assets/749277/bb6c1ce3-13dc-4349-a53b-de16681fdd5b">

---------

Co-authored-by: Taqi Jaffri <tjaffri@docugami.com>
Co-authored-by: Taqi Jaffri <tjaffri@gmail.com>
2023-05-15 10:53:00 -04:00
| Name | Last commit | Date |
|---|---|---|
| blob_loaders | Add progress bar to filesystemblob loader, update pytest config for unit tests (#4212) | 2023-05-08 16:15:09 -04:00 |
| loaders | Docugami DataLoader (#4727) | 2023-05-15 10:53:00 -04:00 |
| parsers | Feature: pdfplumber PDF loader with BaseBlobParser (#4552) | 2023-05-15 09:47:02 -04:00 |
| test_docs/csv | Fix #4087 by setting the correct csv dialect (#4103) | 2023-05-13 20:35:01 -07:00 |
| \_\_init\_\_.py | Harrison/youtube loader (#1545) | 2023-03-08 20:53:27 -08:00 |
| test_base.py | Add BlobParser abstraction (#3979) | 2023-05-05 21:43:38 -04:00 |
| test_csv_loader.py | Fix #4087 by setting the correct csv dialect (#4103) | 2023-05-13 20:35:01 -07:00 |
| test_json_loader.py | Harrison/json loader fix (#4686) | 2023-05-14 18:25:59 -07:00 |
| test_web_base.py | Respect User-Specified User-Agent in WebBaseLoader (#4579) | 2023-05-14 23:09:27 -04:00 |
| test_youtube.py | Improve video_id extraction in YoutubeLoader (#4452) | 2023-05-15 10:45:19 -04:00 |