langchain/docs
Alexander Golodkov 2a70a07aad
community[minor]: added new document loaders based on dedoc library (#24303)
### Description
This pull request added new document loaders that load documents of
various formats using [Dedoc](https://github.com/ispras/dedoc):
  - `DedocFileLoader` (determines the file type automatically and parses
    the file)
  - `DedocPDFLoader` (parses `PDF` files and images)
  - `DedocAPIFileLoader` (determines the file type automatically and parses
    the file using the Dedoc API, without installing the library)
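
A minimal sketch of how a caller might choose between the three loaders
listed above. The helper names `pick_loader_name` and `load_documents` are
hypothetical and not part of this PR. The sketch also assumes that each
loader is exported from `langchain_community.document_loaders` and takes
the file path as its first argument, as other LangChain loaders do.
Routing only `.pdf` files to `DedocPDFLoader` is a simplification, since
that loader also handles images.

```python
from pathlib import Path


def pick_loader_name(file_path: str, use_api: bool = False) -> str:
    """Return the name of the Dedoc loader class to use (hypothetical helper)."""
    if use_api:
        # Parse through a running Dedoc service, no local dedoc install needed.
        return "DedocAPIFileLoader"
    if Path(file_path).suffix.lower() == ".pdf":
        # Simplification: only PDFs are routed here, images are not.
        return "DedocPDFLoader"
    # Let Dedoc detect the file type on its own.
    return "DedocFileLoader"


def load_documents(file_path: str, use_api: bool = False):
    """Load a file with the chosen loader (assumes a file-path-first constructor)."""
    # Imported lazily so the selection logic works without the optional dependency.
    import langchain_community.document_loaders as loaders

    loader_cls = getattr(loaders, pick_loader_name(file_path, use_api))
    return loader_cls(file_path).load()
```

Calling `load_documents("report.pdf")` would then require the optional
`dedoc` dependency, while `use_api=True` would instead require a reachable
Dedoc service.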

[Dedoc](https://dedoc.readthedocs.io) is an open-source library/service
that extracts texts, tables, attached files and document structure
(e.g., titles, list items, etc.) from files of various formats. The
library is actively developed and maintained by a group of developers.

`Dedoc` supports `DOCX`, `XLSX`, `PPTX`, `EML`, `HTML`, `PDF`, images
and more.
The full list of supported formats can be found
[here](https://dedoc.readthedocs.io/en/latest/#id1).
For `PDF` documents, `Dedoc` can check whether the textual layer is
correct and split the document into paragraphs.


### Issue
This pull request extends the variety of document loaders supported by
`langchain_community`, allowing users to choose the most suitable option
for parsing raw documents.

### Dependencies
The PR added a new optional dependency, `dedoc>=2.2.5` ([library
documentation](https://dedoc.readthedocs.io)), to
`extended_testing_deps.txt`.

### Twitter handle
None

### Add tests and docs
1. Test for the integration:
`libs/community/tests/integration_tests/document_loaders/test_dedoc.py`
2. Example notebook:
`docs/docs/integrations/document_loaders/dedoc.ipynb`
3. Information about the library:
`docs/docs/integrations/providers/dedoc.mdx`

### Lint and test

Done locally:

  - `make format`
  - `make lint`
  - `make integration_tests`
  - `make docs_build` (from the project root)

---------

Co-authored-by: Nasty <bogatenkova.anastasiya@mail.ru>
2024-07-23 02:04:53 +00:00
api_reference docs: readthedocs deprecation fix (#24321) 2024-07-16 20:32:51 +00:00
data 👥 Update LangChain people data (#23697) 2024-07-01 17:42:55 +00:00
docs community[minor]: added new document loaders based on dedoc library (#24303) 2024-07-23 02:04:53 +00:00
scripts docs: advanced feature note (#24456) 2024-07-19 20:05:59 +00:00
src docs: chain migration guide (#23844) 2024-07-05 16:37:34 -07:00
static docs[patch]: Update intro diagram (#24290) 2024-07-15 22:04:42 -07:00
.gitignore infra: cleanup docs build (#21134) 2024-05-01 17:34:05 -07:00
.yarnrc.yml
babel.config.js
docusaurus.config.js docs: rm discord (#23985) 2024-07-08 14:27:58 -07:00
ignore-step.sh infra: docs ignore step in script (#24090) 2024-07-10 15:18:00 -07:00
Makefile docs: remove couchbase from docs linking (#24277) 2024-07-15 17:34:41 +00:00
package.json docs[patch]: Adds feedback input after thumbs up/down (#23141) 2024-06-18 16:08:22 -07:00
README.md
sidebars.js docs[minor]: Hide langserve pages (#23618) 2024-06-28 08:25:08 -07:00
vercel_build.sh infra: use nbconvert for docs build (#21135) 2024-05-07 12:30:17 -07:00
vercel_requirements.txt package: security update urllib3 to @1.26.19 (#23366) 2024-06-24 19:44:39 +00:00
vercel.json infra: docs ignore step in script (#24090) 2024-07-10 15:18:00 -07:00
yarn.lock docs[patch]: Adds feedback input after thumbs up/down (#23141) 2024-06-18 16:08:22 -07:00

LangChain Documentation

For more information on contributing to our documentation, see the Documentation Contributing Guide.