langchain/libs/community/langchain_community/document_loaders/parsers
Erik 4e0a6ebe7d
community: Add warning when page_content is empty (#25955)
Page content is sometimes empty when PyMuPDF cannot find text on a page.
For example, this can happen when the text of the PDF cannot be copied
"by hand". In that case an OCR solution is needed, which is not integrated here.

This warning tells the user that the content of some pages is lost
during this process.
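
The behavior described above can be sketched roughly as follows. This is a minimal illustration and not the code merged in the PR: `iter_page_texts`, its parameters, and the warning text are hypothetical names chosen for the example, and plain strings stand in for the text that PyMuPDF would extract from each page.

```python
import warnings
from typing import Iterable, Iterator


def iter_page_texts(
    page_texts: Iterable[str], source: str = "<unknown>"
) -> Iterator[str]:
    """Yield each page's text, warning when a page has no extractable text.

    Hypothetical helper: `page_texts` stands in for per-page text
    extracted by a PDF library.
    """
    for page_number, text in enumerate(page_texts, start=1):
        # A page with only whitespace is treated as empty, e.g. a scanned
        # page whose text exists only as an image.
        if not text.strip():
            warnings.warn(
                f"Empty content on page {page_number} of {source}; "
                "the page may need OCR."
            )
        # The page is still yielded so that page numbering stays intact.
        yield text


if __name__ == "__main__":
    for page in iter_page_texts(["First page text", "", "Third page"], "scan.pdf"):
        print(repr(page))
```

Yielding the empty page rather than skipping it is a design choice of this sketch; a caller that wants to drop such pages can filter them after seeing the warning.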

---------

Co-authored-by: Erick Friis <erick@langchain.dev>
2024-09-19 05:22:09 +00:00
html docs: community docstring updates (#21040) 2024-04-29 17:40:23 -04:00
language community[patch]: Add missing annotations (#24890) 2024-07-31 18:13:44 +00:00
__init__.py community[patch]: Fix remaining __inits__ in community (#22037) 2024-05-22 17:42:17 +00:00
audio.py [Community]: Fix - Open AI Whisper client.audio.transcriptions returning Text Object which raises error (#25271) 2024-08-19 09:36:42 -04:00
doc_intelligence.py
docai.py multiple: update removal targets (#25361) 2024-08-14 09:50:39 -04:00
generic.py docs: fix mimetype parser docstring (#25463) 2024-08-15 16:16:52 -07:00
grobid.py Update grobid.py (#23399) 2024-06-26 09:11:02 -04:00
msword.py
pdf.py community: Add warning when page_content is empty (#25955) 2024-09-19 05:22:09 +00:00
registry.py infra: update mypy 1.10, ruff 0.5 (#23721) 2024-07-03 10:33:27 -07:00
txt.py infra: update mypy 1.10, ruff 0.5 (#23721) 2024-07-03 10:33:27 -07:00
vsdx.py