Commit Graph

3 Commits

Author SHA1 Message Date
Ryan Parker
12111cb922
community: fallback on core async atransform_documents method for MarkdownifyTransformer (#27866)
# Description
Implements the `atransform_documents` method for
`MarkdownifyTransformer` using the `asyncio` built-in library for
concurrency.

Note that this is mainly for API completeness when working with async
frameworks rather than for performance, since the `markdownify` function
is not I/O bound because it works with `Document` objects already in
memory.

# Issue
Fixes #27865

# Dependencies
No new dependencies added, but
[`markdownify`](https://github.com/matthewwithanm/python-markdownify) is
required since this PR updates the `markdownify` integration.

# Tests and docs
- Tests added
- I did not modify the docstrings since they already described the basic
functionality, and [the API docs also already included a
description](https://python.langchain.com/api_reference/community/document_transformers/langchain_community.document_transformers.markdownify.MarkdownifyTransformer.html#langchain_community.document_transformers.markdownify.MarkdownifyTransformer.atransform_documents).
If it would be helpful, I would be happy to update the docstrings and/or
the API docs.

# Lint and test
- [x] format
- [x] lint
- [x] test

I ran formatting with `make format`, linting with `make lint`, and
confirmed that tests pass using `make test`. Note that some unit tests
pass in CI but may fail when running `make_test`. Those unit tests are:
- `test_extract_html` (and `test_extract_html_async`)
- `test_strip_tags` (and `test_strip_tags_async`)
- `test_convert_tags` (and `test_convert_tags_async`)

The reason for the difference is that there are trailing spaces when the
tests are run in the CI checks, and no trailing spaces when run with
`make test`. I ensured that the tests pass in CI, but they may fail with
`make test` due to the addition of trailing spaces.

---------

Co-authored-by: Erick Friis <erick@langchain.dev>
2024-12-13 22:32:22 +00:00
Bagatur
e81ddb32a6
docs: fix kwargs docstring (#25010)
Fix:
![Screenshot 2024-08-02 at 5 33 37
PM](https://github.com/user-attachments/assets/7c56cdeb-ee81-454c-b3eb-86aa8a9bdc8d)
2024-08-02 19:54:54 -07:00
Sokolov Fedor
f4ddf64faa
community: Add MarkdownifyTransformer to langchain_community.document_transformers (#21247)
- Added new document_transformer: MarkdonifyTransformer, that uses
`markdonify` package with customizable options to convert HTML to
Markdown. It's similar to Html2TextTransformer, but has more flexible
options and also I've noticed that sometimes MarkdownifyTransformer
performs better than html2text one, so that's why I use markdownify on
my project.
- Added docs and tests

- Usage:
```python
from langchain_community.document_transformers import MarkdownifyTransformer

markdownify = MarkdownifyTransformer()
docs_transform = markdownify.transform_documents(docs)
```

- Example of better performance on simple task, that I've noticed:
```
<html>
<head><title>Reports on product movement</title></head>
<body>
<p data-block-key="2wst7">The reports on product movement will be useful for forming supplier orders and controlling outcomes.</p>
</body>
```
**Html2TextTransformer**: 
```python
[Document(page_content='The reports on product movement will be useful for forming supplier orders and\ncontrolling outcomes.\n\n')]
# Here we can see 'and\ncontrolling', which has extra '\n' in it
```
**MarkdownifyTranformer**:
```python
[Document(page_content='Reports on product movement\n\nThe reports on product movement will be useful for forming supplier orders and controlling outcomes.')]
```

---------

Co-authored-by: Sokolov Fedor <f.sokolov@sokolov-macbook.bbrouter>
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
Co-authored-by: Sokolov Fedor <f.sokolov@sokolov-macbook.local>
Co-authored-by: Sokolov Fedor <f.sokolov@192.168.1.6>
2024-05-08 14:45:13 -07:00