Commit Graph

59 Commits

Author SHA1 Message Date
Đỗ Quang Minh
cd198ac9ed
community: add custom model for OpenAIWhisperParser (#29831)
Add `model` properties for OpenAIWhisperParser. Defaulted to `whisper-1`
(previous value).
Please help me update the docs and other related components of this
repo.
2025-02-16 21:26:07 -05:00
Philippe PRADOS
beb75b2150
community[minor]: 05 - Refactoring PyPDFium2 parser (#29625)
This is one part of a larger Pull Request (PR) that is too large to be
submitted all at once. This specific part focuses on updating the
PyPDFium2 parser.

For more details, see
https://github.com/langchain-ai/langchain/pull/28970.
2025-02-07 21:31:12 -05:00
Christophe Bornet
723031d548
community: Bump ruff version to 0.9 (#29206)
Co-authored-by: Erick Friis <erick@langchain.dev>
2025-02-08 01:21:10 +00:00
Philippe PRADOS
6ff0d5c807
community[minor]: 04 - Refactoring PDFMiner parser (#29526)
This is one part of a larger Pull Request (PR) that is too large to be
submitted all at once. This specific part focuses on updating the XXX
parser.

For more details, see [PR
28970](https://github.com/langchain-ai/langchain/pull/28970).

---------

Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
2025-02-05 21:08:27 -05:00
Philippe PRADOS
5771e561fb
[Bugfix langchain_community] Fix PyMuPDFLoader (#29550)
- **Description:**  add legacy properties
    - **Issue:** #29470
    - **Twitter handle:** pprados
2025-02-04 09:24:40 -05:00
Philippe PRADOS
ceda8bc050
community[minor]: 03 - Refactoring PyPDF parser (#29330)
This is one part of a larger Pull Request (PR) that is too large to be
submitted all at once.
This specific part focuses on updating the PyPDF parser.

For more details, see [PR
28970](https://github.com/langchain-ai/langchain/pull/28970).
2025-01-31 10:05:07 -05:00
Adrián Panella
1551d9750c
community(doc_loaders): allow any credential type in AzureAIDocumentI… (#29289)
allow any credential type in AzureAIDocumentInteligence, not only
`api_key`.
This allows to use any of the credentials types integrated with AD.

---------

Co-authored-by: Chester Curme <chester.curme@gmail.com>
2025-01-27 20:56:30 +00:00
江同学呀
a1e62070d0
community: Fix the problem of error reporting when OCR extracts text from PDF. (#29378)
- **Description:** The issue has been fixed where images could not be
recognized from ```xObject[obj]["/Filter"]``` (whose value can be either
a string or a list of strings) in the ```_extract_images_from_page()```
method. It also resolves the bug where vectorization by Faiss fails due
to the failure of image extraction from a PDF containing only
images```IndexError: list index out of range```.

![69a60f3f6bd474641b9126d74bb18f7e](https://github.com/user-attachments/assets/dc9e098d-2862-49f7-93b0-00f1056727dc)

- **Issue:** 
    Fix the following issues:
[#15227 ](https://github.com/langchain-ai/langchain/issues/15227)
[#22892 ](https://github.com/langchain-ai/langchain/issues/22892)
[#26652 ](https://github.com/langchain-ai/langchain/issues/26652)
[#27153 ](https://github.com/langchain-ai/langchain/issues/27153)
    Related issues:
[#7067 ](https://github.com/langchain-ai/langchain/issues/7067)

- **Dependencies:** None
- **Twitter handle:** None

---------

Co-authored-by: Chester Curme <chester.curme@gmail.com>
2025-01-23 15:01:52 +00:00
Philippe PRADOS
4efc5093c1
community[minor]: Refactoring PyMuPDF parser, loader and add image blob parsers (#29063)
* Adds BlobParsers for images. These implementations can take an image
and produce one or more documents per image. This interface can be used
for exposing OCR capabilities.
* Update PyMuPDFParser and Loader to standardize metadata, handle
images, improve table extraction etc.

- **Twitter handle:** pprados

This is one part of a larger Pull Request (PR) that is too large to be
submitted all at once.
This specific part focuses to prepare the update of all parsers.

For more details, see [PR
28970](https://github.com/langchain-ai/langchain/pull/28970).

---------

Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
2025-01-20 15:15:43 -05:00
Nadeem Sajjad
eaf2fb287f
community(pypdfloader): added page_label in metadata for pypdf loader (#29225)
# Description

## Summary
This PR adds support for handling multi-labeled page numbers in the
**PyPDFLoader**. Some PDFs use complex page numbering systems where the
actual content may begin after multiple introductory pages. The
page_label field helps accurately reflect the document’s page structure,
making it easier to handle such cases during document parsing.

## Motivation
This feature improves document parsing accuracy by allowing users to
access the actual page labels instead of relying only on the physical
page numbers. This is particularly useful for documents where the first
few pages have roman numerals or other non-standard page labels.

## Use Case
This feature is especially useful for **Retrieval-Augmented Generation**
(RAG) systems where users may reference page numbers when asking
questions. Some PDFs have both labeled page numbers (like roman numerals
for introductory sections) and index-based page numbers.

For example, a user might ask:

	"What is mentioned on page 5?"

The system can now check both:
	•	**Index-based page number** (page)
	•	**Labeled page number** (page_label)

This dual-check helps improve retrieval accuracy. Additionally, the
results can be validated with an **agent or tool** to ensure the
retrieved pages match the user’s query contextually.

## Code Changes

- Added a page_label field to the metadata of the Document class in
**PyPDFLoader**.
- Implemented support for retrieving page_label from the
pdf_reader.page_labels.
- Created a test case (test_pypdf_loader_with_multi_label_page_numbers)
with a sample PDF containing multi-labeled pages
(geotopo-komprimiert.pdf) [[Source of
pdf](https://github.com/py-pdf/sample-files/blob/main/009-pdflatex-geotopo/GeoTopo-komprimiert.pdf)].
- Updated existing tests to ensure compatibility and verify page_label
extraction.

## Tests Added

- Added a new test case for a PDF with multi-labeled pages.
- Verified both page and page_label metadata fields are correctly
extracted.

## Screenshots

<img width="549" alt="image"
src="https://github.com/user-attachments/assets/65db9f5c-032e-4592-926f-824777c28f33"
/>
2025-01-15 14:18:07 -05:00
Mohammad Mohtashim
21eb39dff0
[Community]: AzureOpenAIWhisperParser Authenication Fix (#29135)
- **Description:** `AzureOpenAIWhisperParser` authentication fix as
stated in the issue.
- **Issue:** #29133
2025-01-15 09:44:53 -05:00
Philippe PRADOS
2921597c71
community[patch]: Refactoring PDF loaders: 01 prepare (#29062)
- **Refactoring PDF loaders step 1**: "community: Refactoring PDF
loaders to standardize approaches"

- **Description:** Declare CloudBlobLoader in __init__.py. file_path is
Union[str, PurePath] anywhere
- **Twitter handle:** pprados

This is one part of a larger Pull Request (PR) that is too large to be
submitted all at once.
This specific part focuses to prepare the update of all parsers.

For more details, see [PR
28970](https://github.com/langchain-ai/langchain/pull/28970).

@eyurtsev it's the start of a PR series.
2025-01-07 11:00:04 -05:00
Mohammad Mohtashim
49a26c1fca
(Community): Fix Keyword argument for AzureAIDocumentIntelligenceParser (#28959)
- **Description:** Fix the `body` keyword argument for
AzureAIDocumentIntelligenceParser`
- **Issue:** #28948
2025-01-02 11:27:12 -05:00
Anusha Karkhanis
26bdf40072
Langchain_Community: SQL LanguageParser (#28430)
## Description
(This PR has contributions from @khushiDesai, @ashvini8, and
@ssumaiyaahmed).

This PR addresses **Issue #11229** which addresses the need for SQL
support in document parsing. This is integrated into the generic
TreeSitter parsing library, allowing LangChain users to easily load
codebases in SQL into smaller, manageable "documents."

This pull request adds a new ```SQLSegmenter``` class, which provides
the SQL integration.

## Issue
**Issue #11229**: Add support for a variety of languages to
LanguageParser

## Testing
We created a file ```test_sql.py``` with several tests to ensure the
```SQLSegmenter``` is functional. Below are the tests we added:

- ```def test_is_valid```: Checks SQL validity.
- ```def test_extract_functions_classes```: Extracts individual SQL
statements.
- ```def test_simplify_code```: Simplifies SQL code with comments.

---------

Co-authored-by: Syeda Sumaiya Ahmed <114104419+ssumaiyaahmed@users.noreply.github.com>
Co-authored-by: ashvini hunagund <97271381+ashvini8@users.noreply.github.com>
Co-authored-by: Khushi Desai <khushi.desai@advantawitty.com>
Co-authored-by: Khushi Desai <59741309+khushiDesai@users.noreply.github.com>
Co-authored-by: ccurme <chester.curme@gmail.com>
2024-12-19 20:30:57 +00:00
Martin Triska
e6b41d081d
community: DocumentLoaderAsParser wrapper (#27749)
## Description

This pull request introduces the `DocumentLoaderAsParser` class, which
acts as an adapter to transform document loaders into parsers within the
LangChain framework. The class enables document loaders that accept a
`file_path` parameter to be utilized as blob parsers. This is
particularly useful for integrating various document loading
capabilities seamlessly into the LangChain ecosystem.

When merged in together with PR
https://github.com/langchain-ai/langchain/pull/27716 It opens options
for `SharePointLoader` / `OneDriveLoader` to process any filetype that
has a document loader.

### Features

- **Flexible Parsing**: The `DocumentLoaderAsParser` class can adapt any
document loader that meets the criteria of accepting a `file_path`
argument, allowing for lazy parsing of documents.
- **Compatibility**: The class has been designed to work with various
document loaders, making it versatile for different use cases.

### Usage Example

To use the `DocumentLoaderAsParser`, you would initialize it with a
suitable document loader class and any required parameters. Here’s an
example of how to do this with the `UnstructuredExcelLoader`:

```python
from langchain_community.document_loaders.blob_loaders import Blob
from langchain_community.document_loaders.parsers.documentloader_adapter import DocumentLoaderAsParser
from langchain_community.document_loaders.excel import UnstructuredExcelLoader

# Initialize the parser adapter with UnstructuredExcelLoader
xlsx_parser = DocumentLoaderAsParser(UnstructuredExcelLoader, mode="paged")

# Use parser, for ex. pass it to MimeTypeBasedParser
MimeTypeBasedParser(
    handlers={
        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": xlsx_parser
    }
)
```


- **Dependencies:** None
- **Twitter handle:** @martintriska1

If no one reviews your PR within a few days, please @-mention one of
baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17.

---------

Co-authored-by: Chester Curme <chester.curme@gmail.com>
2024-12-18 12:47:08 -05:00
Mohammad Mohtashim
d49df4871d
[Community]: Image Extraction Fixed for PDFPlumberParser (#28491)
- **Description:** One-Bit Images was raising error which has been fixed
in this PR for `PDFPlumberParser`
 - **Issue:** #28480

---------

Co-authored-by: Chester Curme <chester.curme@gmail.com>
2024-12-18 11:45:48 -05:00
Vimpas
337fed80a5
community: 🐛 PDF Filter Type Error (#27154)
Thank you for contributing to LangChain!

 **PR title**: "community: fix  PDF Filter Type Error"


  - **Description:** fix  PDF Filter Type Error"
  - **Issue:** the issue #27153 it fixes,
  - **Dependencies:** no
- **Twitter handle:** if your PR gets announced, and you'd like a
mention, we'll gladly shout you out!



- [x] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified. See contribution
guidelines for more: https://python.langchain.com/docs/contributing/

Additional guidelines:
- Make sure optional dependencies are imported within a function.
- Please do not add dependencies to pyproject.toml files (even optional
ones) unless they are required for unit tests.
- Most PRs should not touch more than one package.
- Changes should be backwards compatible.
- If you are adding something to community, do not re-import it in
langchain.

If no one reviews your PR within a few days, please @-mention one of
baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17.

---------

Co-authored-by: Erick Friis <erick@langchain.dev>
2024-12-13 23:30:29 +00:00
Johannes Mohren
c1d348e95d
doc-loader: retain Azure Doc Intelligence API metadata in Document parser (#28382)
**Description**:
This PR modifies the doc_intelligence.py parser in the community package
to include all metadata returned by the Azure Doc Intelligence API in
the Document object. Previously, only the parsed content (markdown) was
retained, while other important metadata such as bounding boxes (bboxes)
for images and tables was discarded. These image bboxes are crucial for
supporting use cases like multi-modal RAG workflows when using Azure Doc
Intelligence.

The change ensures that all information returned by the Azure Doc
Intelligence API is preserved by setting the metadata attribute of the
Document object to the entire result returned by the API, rather than an
empty dictionary. This extends the parser's utility for complex use
cases without breaking existing functionality.

**Issue**:
This change does not address a specific issue number, but it resolves a
critical limitation in supporting multimodal workflows when using the
LangChain wrapper for the Azure API.

**Dependencies**:
No additional dependencies are required for this change.

---------

Co-authored-by: jmohren <johannes.mohren@aol.de>
2024-12-10 11:22:58 -05:00
Aksel Joonas Reedi
2cb39270ec
community: bytes as a source to AzureAIDocumentIntelligenceLoader (#26618)
- **Description:** This PR adds functionality to pass in in-memory bytes
as a source to `AzureAIDocumentIntelligenceLoader`.
- **Issue:** I needed the functionality, so I added it.
- **Dependencies:** NA
- **Twitter handle:** @akseljoonas if this is a big enough change :)

---------

Co-authored-by: Aksel Joonas Reedi <aksel@klippa.com>
Co-authored-by: Erick Friis <erick@langchain.dev>
2024-11-07 03:40:21 +00:00
ccurme
0172d938b4
community: add AzureOpenAIWhisperParser (#27796)
Commandeered from https://github.com/langchain-ai/langchain/pull/26757.

---------

Co-authored-by: Sheepsta300 <128811766+Sheepsta300@users.noreply.github.com>
2024-10-31 12:37:41 -04:00
Erick Friis
600b7bdd61
all: test 3.13 ci (#27197)
Co-authored-by: Bagatur <baskaryan@gmail.com>
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
2024-10-25 12:56:58 -07:00
Erik
4e0a6ebe7d
community: Add warning when page_content is empty (#25955)
Page content sometimes is empty when PyMuPDF can not find text on pages.
For example, this can happen when the text of the PDF is not copyable
"by hand". Then an OCR solution is need - which is not integrated here.

This warning should accurately warn the user that some pages are lost
during this process.

Thank you for contributing to LangChain!

- [ ] **PR title**: "package: description"
- Where "package" is whichever of langchain, community, core,
experimental, etc. is being modified. Use "docs: ..." for purely docs
changes, "templates: ..." for template changes, "infra: ..." for CI
changes.
  - Example: "community: add foobar LLM"


- [ ] **PR message**: ***Delete this entire checklist*** and replace
with
    - **Description:** a description of the change
    - **Issue:** the issue # it fixes, if applicable
    - **Dependencies:** any dependencies required for this change
- **Twitter handle:** if your PR gets announced, and you'd like a
mention, we'll gladly shout you out!


- [ ] **Add tests and docs**: If you're adding a new integration, please
include
1. a test for the integration, preferably unit tests that do not rely on
network access,
2. an example notebook showing its use. It lives in
`docs/docs/integrations` directory.


- [ ] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified. See contribution
guidelines for more: https://python.langchain.com/docs/contributing/

Additional guidelines:
- Make sure optional dependencies are imported within a function.
- Please do not add dependencies to pyproject.toml files (even optional
ones) unless they are required for unit tests.
- Most PRs should not touch more than one package.
- Changes should be backwards compatible.
- If you are adding something to community, do not re-import it in
langchain.

If no one reviews your PR within a few days, please @-mention one of
baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17.

---------

Co-authored-by: Erick Friis <erick@langchain.dev>
2024-09-19 05:22:09 +00:00
Mohammad Mohtashim
75c3c81b8c
[Community]: Fix - Open AI Whisper client.audio.transcriptions returning Text Object which raises error (#25271)
- **Description:** The following
[line](fd546196ef/libs/community/langchain_community/document_loaders/parsers/audio.py (L117))
in `OpenAIWhisperParser` returns a text object for some odd reason
despite the official documentation saying it should return `Transcript`
Instance which should have the text attribute. But for the example given
in the issue and even when I tried running on my own, I was directly
getting the text. The small PR accounts for that.
 - **Issue:** : #25218
 

I was able to replicate the error even without the GenericLoader as
shown below and the issue was with `OpenAIWhisperParser`

```python
parser = OpenAIWhisperParser(api_key="sk-fxxxxxxxxx",
                                            response_format="srt",
                                            temperature=0)

list(parser.lazy_parse(Blob.from_path('path_to_file.m4a')))
```
2024-08-19 09:36:42 -04:00
Bagatur
253ceca76a
docs: fix mimetype parser docstring (#25463) 2024-08-15 16:16:52 -07:00
ccurme
27690506d0
multiple: update removal targets (#25361) 2024-08-14 09:50:39 -04:00
Eugene Yurtsev
d24b82357f
community[patch]: Add missing annotations (#24890)
This PR adds annotations in comunity package.

Annotations are only strictly needed in subclasses of BaseModel for
pydantic 2 compatibility.

This PR adds some unnecessary annotations, but they're not bad to have
regardless for documentation pages.
2024-07-31 18:13:44 +00:00
Brice Fotzo
034a8c7c1b
community: support advanced text extraction options for pdf documents (#20265)
**Description:** 
- Updated constructors in PyPDFParser and PyPDFLoader to handle
`extraction_mode` and additional kwargs, aligning with the capabilities
of `PageObject.extract_text()` from pypdf.

- Added `test_pypdf_loader_with_layout` along with a corresponding
example text file to validate layout extraction from PDFs.

**Issue:** fixes #19735 

**Dependencies:** This change requires updating the pypdf dependency
from version 3.4.0 to at least 4.0.0.

Additional changes include the addition of a new test
test_pypdf_loader_with_layout and an example text file to ensure the
functionality of layout extraction from PDFs aligns with the new
capabilities.

---------

Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Co-authored-by: Erick Friis <erick@langchain.dev>
2024-07-17 20:47:09 +00:00
Bagatur
a0c2281540
infra: update mypy 1.10, ruff 0.5 (#23721)
```python
"""python scripts/update_mypy_ruff.py"""
import glob
import tomllib
from pathlib import Path

import toml
import subprocess
import re

ROOT_DIR = Path(__file__).parents[1]


def main():
    for path in glob.glob(str(ROOT_DIR / "libs/**/pyproject.toml"), recursive=True):
        print(path)
        with open(path, "rb") as f:
            pyproject = tomllib.load(f)
        try:
            pyproject["tool"]["poetry"]["group"]["typing"]["dependencies"]["mypy"] = (
                "^1.10"
            )
            pyproject["tool"]["poetry"]["group"]["lint"]["dependencies"]["ruff"] = (
                "^0.5"
            )
        except KeyError:
            continue
        with open(path, "w") as f:
            toml.dump(pyproject, f)
        cwd = "/".join(path.split("/")[:-1])
        completed = subprocess.run(
            "poetry lock --no-update; poetry install --with typing; poetry run mypy . --no-color",
            cwd=cwd,
            shell=True,
            capture_output=True,
            text=True,
        )
        logs = completed.stdout.split("\n")

        to_ignore = {}
        for l in logs:
            if re.match("^(.*)\:(\d+)\: error:.*\[(.*)\]", l):
                path, line_no, error_type = re.match(
                    "^(.*)\:(\d+)\: error:.*\[(.*)\]", l
                ).groups()
                if (path, line_no) in to_ignore:
                    to_ignore[(path, line_no)].append(error_type)
                else:
                    to_ignore[(path, line_no)] = [error_type]
        print(len(to_ignore))
        for (error_path, line_no), error_types in to_ignore.items():
            all_errors = ", ".join(error_types)
            full_path = f"{cwd}/{error_path}"
            try:
                with open(full_path, "r") as f:
                    file_lines = f.readlines()
            except FileNotFoundError:
                continue
            file_lines[int(line_no) - 1] = (
                file_lines[int(line_no) - 1][:-1] + f"  # type: ignore[{all_errors}]\n"
            )
            with open(full_path, "w") as f:
                f.write("".join(file_lines))

        subprocess.run(
            "poetry run ruff format .; poetry run ruff --select I --fix .",
            cwd=cwd,
            shell=True,
            capture_output=True,
            text=True,
        )


if __name__ == "__main__":
    main()

```
2024-07-03 10:33:27 -07:00
Alireza Kashani
c39521b70d
Update grobid.py (#23399)
fixed potential `IndexError: list index out of range` in case there is
no title

Thank you for contributing to LangChain!

- [ ] **PR title**: "package: description"
- Where "package" is whichever of langchain, community, core,
experimental, etc. is being modified. Use "docs: ..." for purely docs
changes, "templates: ..." for template changes, "infra: ..." for CI
changes.
  - Example: "community: add foobar LLM"


- [ ] **PR message**: ***Delete this entire checklist*** and replace
with
    - **Description:** a description of the change
    - **Issue:** the issue # it fixes, if applicable
    - **Dependencies:** any dependencies required for this change
- **Twitter handle:** if your PR gets announced, and you'd like a
mention, we'll gladly shout you out!


- [ ] **Add tests and docs**: If you're adding a new integration, please
include
1. a test for the integration, preferably unit tests that do not rely on
network access,
2. an example notebook showing its use. It lives in
`docs/docs/integrations` directory.


- [ ] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified. See contribution
guidelines for more: https://python.langchain.com/docs/contributing/

Additional guidelines:
- Make sure optional dependencies are imported within a function.
- Please do not add dependencies to pyproject.toml files (even optional
ones) unless they are required for unit tests.
- Most PRs should not touch more than one package.
- Changes should be backwards compatible.
- If you are adding something to community, do not re-import it in
langchain.

If no one reviews your PR within a few days, please @-mention one of
baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17.
2024-06-26 09:11:02 -04:00
Zheng Robert Jia
a349fce880
docs[minor],community[patch]: Minor tutorial docs improvement, minor import error quick fix. (#22725)
minor changes to module import error handling and minor issues in
tutorial documents.

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
Co-authored-by: Eugene Yurtsev <eugene@langchain.dev>
2024-06-20 15:36:49 -04:00
Max Mulatz
058a64c563
Community[minor]: Add language parser for Elixir (#22742)
Hi 👋 

First off, thanks a ton for your work on this 💚 Really appreciate what
you're providing here for the community.

## Description

This PR adds a basic language parser for the
[Elixir](https://elixir-lang.org/) programming language. The parser code
is based upon the approach outlined in
https://github.com/langchain-ai/langchain/pull/13318: it's using
`tree-sitter` under the hood and aligns with all the other `tree-sitter`
based parses added that PR.

The `CHUNK_QUERY` I'm using here is probably not the most sophisticated
one, but it worked for my application. It's a starting point to provide
"core" parsing support for Elixir in LangChain. It enables people to use
the language parser out in real world applications which may then lead
to further tweaking of the queries. I consider this PR just the ground
work.

- **Dependencies:** requires `tree-sitter` and `tree-sitter-languages`
from the extended dependencies
- **Twitter handle:**`@bitcrowd`

## Checklist

- [x] **PR title**: "package: description"
- [x] **Add tests and docs**
- [x] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified.

<!-- If no one reviews your PR within a few days, please @-mention one
of baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17. -->
2024-06-10 15:56:57 +00:00
Eugene Yurtsev
36813d2f00
community[patch]: Fix remaining __inits__ in community (#22037)
Fixes the __init__ files in community to use __all__ which is statically
defined.
2024-05-22 17:42:17 +00:00
roiperlman
9992beaff9
community: Add arguments to whisper parser (#20378)
**Description:** Added a few additional arguments to the whisper parser,
which can be consumed by the underlying API.
The prompt is especially important to fine-tune transcriptions.

---------

Co-authored-by: Roi Perlman <roi@fivesigmalabs.com>
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
2024-05-08 17:53:13 -07:00
ccurme
6da3d92b42
(all): update removal in deprecation warnings from 0.2 to 0.3 (#21265)
We are pushing out the removal of these to 0.3.

`find . -type f -name "*.py" -exec sed -i ''
's/removal="0\.2/removal="0.3/g' {} +`
2024-05-03 14:29:36 -04:00
Leonid Ganeline
85094cbb3a
docs: community docstring updates (#21040)
Added missed docstrings. Updated docstrings to consistent format.
2024-04-29 17:40:23 -04:00
hulitaitai
7d0a008744
community[minor]: Add audio-parser "faster-whisper" in audio.py (#20012)
faster-whisper is a reimplementation of OpenAI's Whisper model using
CTranslate2, which is up to 4 times faster than enai/whisper for the
same accuracy while using less memory. The efficiency can be further
improved with 8-bit quantization on both CPU and GPU.

It can automatically detect the following 14 languages and transcribe
the text into their respective languages: en, zh, fr, de, ja, ko, ru,
es, th, it, pt, vi, ar, tr.

The gitbub repository for faster-whisper is :
    https://github.com/SYSTRAN/faster-whisper

---------

Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
2024-04-18 20:50:59 +00:00
Leonid Ganeline
4cb5f4c353
community[patch]: import flattening fix (#20110)
This PR should make it easier for linters to do type checking and for IDEs to jump to definition of code.

See #20050 as a template for this PR.
- As a byproduct: Added 3 missed `test_imports`.
- Added missed `SolarChat` in to __init___.py Added it into test_import
ut.
- Added `# type: ignore` to fix linting. It is not clear, why linting
errors appear after ^ changes.

---------

Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
2024-04-10 13:01:19 -04:00
david02871
e1a24d09c5
community: Add PHP language parser to document_loaders (#19850)
**Description:**
Added a PHP language parser to document_loaders
**Issue:** N/A
**Dependencies:** N/A
**Twitter handle:** N/A

---------

Co-authored-by: Chester Curme <chester.curme@gmail.com>
2024-04-08 11:30:28 -04:00
Leonid Kuligin
eb0521064e
deprecating integrations moved to langchain_google_community (#19841)
Thank you for contributing to LangChain!

- [ ] **PR title**: "community: deprecating integrations moved to
langchain_google_community"

- [ ] **PR message**: deprecating integrations moved to
langchain_google_community

---------

Co-authored-by: ccurme <chester.curme@gmail.com>
2024-04-02 17:06:07 -04:00
Zijian Han
8e976545f3
community[patch]: support OpenAI whisper base url (#17695)
**Description:** The base URL for OpenAI is retrieved from the
environment variable "OPENAI_BASE_URL", whereas for langchain it is
obtained from "OPENAI_API_BASE". By adding `base_url =
os.environ.get("OPENAI_API_BASE")`, the OpenAI proxy can execute
correctly.

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
2024-03-29 00:35:27 +00:00
Fabrizio Ruocco
f12cb0bea4
community[patch]: Microsoft Azure Document Intelligence updates (#16932)
- **Description:** Update Azure Document Intelligence implementation by
Microsoft team and RAG cookbook with Azure AI Search

---------

Co-authored-by: Lu Zhang (AI) <luzhan@microsoft.com>
Co-authored-by: Yateng Hong <yatengh@microsoft.com>
Co-authored-by: teethache <hongyateng2006@126.com>
Co-authored-by: Lu Zhang <44625949+luzhang06@users.noreply.github.com>
Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
2024-03-26 23:36:59 -07:00
Leonid Ganeline
11195cfa42
community[patch]: speed up import times in the community package (#18928)
This PR speeds up import times in the community package
2024-03-11 16:37:36 -04:00
Luis Antonio Vieira Junior
67c880af74
community[patch]: adding linearization config to AmazonTextractPDFLoader (#17489)
- **Description:** Adding an optional parameter `linearization_config`
to the `AmazonTextractPDFLoader` so the caller can define how the output
will be linearized, instead of forcing a predefined set of linearization
configs. It will still have a default configuration as this will be an
optional parameter.
- **Issue:** #17457
- **Dependencies:** The same ones that already exist for
`AmazonTextractPDFLoader`
- **Twitter handle:** [@lvieirajr19](https://twitter.com/lvieirajr19)

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
2024-03-08 17:25:22 -08:00
Bagatur
5efb5c099f
text-splitters[minor], langchain[minor], community[patch], templates, docs: langchain-text-splitters 0.0.1 (#18346) 2024-02-29 18:33:21 -08:00
Leo Diegues
b15fccbb99
community[patch]: Skip OpenAIWhisperParser extremely small audio chunks to avoid api error (#11450)
**Description**
This PR addresses a rare issue in `OpenAIWhisperParser` that causes it
to crash when processing an audio file with a duration very close to the
class's chunk size threshold of 20 minutes.

**Issue**
#11449

**Dependencies**
None

**Tag maintainer**
@agola11 @eyurtsev 

**Twitter handle**
leonardodiegues

---------

Co-authored-by: Leonardo Diegues <leonardo.diegues@grupofolha.com.br>
Co-authored-by: Bagatur <baskaryan@gmail.com>
2024-02-22 17:02:43 -08:00
Rakib Hosen
5ce1827d31
community[patch]: fix import in language parser (#17538)
- **Description:** Resolving import error in language_parser.py during
"from langchain.langchain.text_splitter import Language - **Issue:** the
issue #17536
- **Dependencies:** NO
- **Twitter handle:** @iRakibHosen

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
2024-02-14 11:11:23 -08:00
Ian Gregory
e5472b5eb8
Framework for supporting more languages in LanguageParser (#13318)
## Description

I am submitting this for a school project as part of a team of 5. Other
team members are @LeilaChr, @maazh10, @Megabear137, @jelalalamy. This PR
also has contributions from community members @Harrolee and @Mario928.

Initial context is in the issue we opened (#11229).

This pull request adds:

- Generic framework for expanding the languages that `LanguageParser`
can handle, using the
[tree-sitter](https://github.com/tree-sitter/py-tree-sitter#py-tree-sitter)
parsing library and existing language-specific parsers written for it
- Support for the following additional languages in `LanguageParser`:
  - C
  - C++
  - C#
  - Go
- Java (contributed by @Mario928
https://github.com/ThatsJustCheesy/langchain/pull/2)
  - Kotlin
  - Lua
  - Perl
  - Ruby
  - Rust
  - Scala
- TypeScript (contributed by @Harrolee
https://github.com/ThatsJustCheesy/langchain/pull/1)

Here is the [design
document](https://docs.google.com/document/d/17dB14cKCWAaiTeSeBtxHpoVPGKrsPye8W0o_WClz2kk)
if curious, but no need to read it.

## Issues

- Closes #11229
- Closes #10996
- Closes #8405

## Dependencies

`tree_sitter` and `tree_sitter_languages` on PyPI. We have tried to add
these as optional dependencies.

## Documentation

We have updated the list of supported languages, and also added a
section to `source_code.ipynb` detailing how to add support for
additional languages using our framework.

## Maintainer

- @hwchase17 (previously reviewed
https://github.com/langchain-ai/langchain/pull/6486)

Thanks!!

## Git commits

We will gladly squash any/all of our commits (esp merge commits) if
necessary. Let us know if this is desirable, or if you will be
squash-merging anyway.

<!-- Thank you for contributing to LangChain!

Replace this entire comment with:
  - **Description:** a description of the change, 
  - **Issue:** the issue # it fixes (if applicable),
  - **Dependencies:** any dependencies required for this change,
- **Tag maintainer:** for a quicker response, tag the relevant
maintainer (see below),
- **Twitter handle:** we announce bigger features on Twitter. If your PR
gets announced, and you'd like a mention, we'll gladly shout you out!

Please make sure your PR is passing linting and testing before
submitting. Run `make format`, `make lint` and `make test` to check this
locally.

See contribution guidelines for more information on how to write/run
tests, lint, etc:

https://github.com/langchain-ai/langchain/blob/master/.github/CONTRIBUTING.md

If you're adding a new integration, please include:
1. a test for the integration, preferably unit tests that do not rely on
network access,
2. an example notebook showing its use. It lives in `docs/extras`
directory.

If no one reviews your PR within a few days, please @-mention one of
@baskaryan, @eyurtsev, @hwchase17.
 -->

---------

Co-authored-by: Maaz Hashmi <mhashmi373@gmail.com>
Co-authored-by: LeilaChr <87657694+LeilaChr@users.noreply.github.com>
Co-authored-by: Jeremy La <jeremylai511@gmail.com>
Co-authored-by: Megabear137 <zubair.alnoor27@gmail.com>
Co-authored-by: Lee Harrold <lhharrold@sep.com>
Co-authored-by: Mario928 <88029051+Mario928@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
2024-02-13 08:45:49 -08:00
Robby
ece4b43a81
community[patch]: doc loaders mypy fixes (#17368)
**Description:** Fixed `type: ignore`'s for mypy for some
document_loaders.
**Issue:** [Remove "type: ignore" comments #17048
](https://github.com/langchain-ai/langchain/issues/17048)

---------

Co-authored-by: Robby <h0rv@users.noreply.github.com>
Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
2024-02-12 16:51:06 -08:00
Erick Friis
3a2eb6e12b
infra: add print rule to ruff (#16221)
Added noqa for existing prints. Can slowly remove / will prevent more
being intro'd
2024-02-09 16:13:30 -08:00
Leonid Ganeline
932c52c333
community[patch]: docstrings (#16810)
- added missed docstrings
- formated docstrings to the consistent form
2024-02-09 12:48:57 -08:00