Commit Graph

82 Commits

Author SHA1 Message Date
Philippe PRADOS
beb75b2150
community[minor]: 05 - Refactoring PyPDFium2 parser (#29625)
This is one part of a larger Pull Request (PR) that is too large to be
submitted all at once. This specific part focuses on updating the
PyPDFium2 parser.

For more details, see
https://github.com/langchain-ai/langchain/pull/28970.
2025-02-07 21:31:12 -05:00
Christophe Bornet
723031d548
community: Bump ruff version to 0.9 (#29206)
Co-authored-by: Erick Friis <erick@langchain.dev>
2025-02-08 01:21:10 +00:00
ccurme
d172984c91
infra: migrate to uv (#29566) 2025-02-06 13:36:26 -05:00
Philippe PRADOS
6ff0d5c807
community[minor]: 04 - Refactoring PDFMiner parser (#29526)
This is one part of a larger Pull Request (PR) that is too large to be
submitted all at once. This specific part focuses on updating the XXX
parser.

For more details, see [PR
28970](https://github.com/langchain-ai/langchain/pull/28970).

---------

Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
2025-02-05 21:08:27 -05:00
Philippe PRADOS
ceda8bc050
community[minor]: 03 - Refactoring PyPDF parser (#29330)
This is one part of a larger Pull Request (PR) that is too large to be
submitted all at once.
This specific part focuses on updating the PyPDF parser.

For more details, see [PR
28970](https://github.com/langchain-ai/langchain/pull/28970).
2025-01-31 10:05:07 -05:00
Philippe PRADOS
4efc5093c1
community[minor]: Refactoring PyMuPDF parser, loader and add image blob parsers (#29063)
* Adds BlobParsers for images. These implementations can take an image
and produce one or more documents per image. This interface can be used
for exposing OCR capabilities.
* Update PyMuPDFParser and Loader to standardize metadata, handle
images, improve table extraction etc.

- **Twitter handle:** pprados

This is one part of a larger Pull Request (PR) that is too large to be
submitted all at once.
This specific part focuses to prepare the update of all parsers.

For more details, see [PR
28970](https://github.com/langchain-ai/langchain/pull/28970).

---------

Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
2025-01-20 15:15:43 -05:00
Nadeem Sajjad
eaf2fb287f
community(pypdfloader): added page_label in metadata for pypdf loader (#29225)
# Description

## Summary
This PR adds support for handling multi-labeled page numbers in the
**PyPDFLoader**. Some PDFs use complex page numbering systems where the
actual content may begin after multiple introductory pages. The
page_label field helps accurately reflect the document’s page structure,
making it easier to handle such cases during document parsing.

## Motivation
This feature improves document parsing accuracy by allowing users to
access the actual page labels instead of relying only on the physical
page numbers. This is particularly useful for documents where the first
few pages have roman numerals or other non-standard page labels.

## Use Case
This feature is especially useful for **Retrieval-Augmented Generation**
(RAG) systems where users may reference page numbers when asking
questions. Some PDFs have both labeled page numbers (like roman numerals
for introductory sections) and index-based page numbers.

For example, a user might ask:

	"What is mentioned on page 5?"

The system can now check both:
	•	**Index-based page number** (page)
	•	**Labeled page number** (page_label)

This dual-check helps improve retrieval accuracy. Additionally, the
results can be validated with an **agent or tool** to ensure the
retrieved pages match the user’s query contextually.

## Code Changes

- Added a page_label field to the metadata of the Document class in
**PyPDFLoader**.
- Implemented support for retrieving page_label from the
pdf_reader.page_labels.
- Created a test case (test_pypdf_loader_with_multi_label_page_numbers)
with a sample PDF containing multi-labeled pages
(geotopo-komprimiert.pdf) [[Source of
pdf](https://github.com/py-pdf/sample-files/blob/main/009-pdflatex-geotopo/GeoTopo-komprimiert.pdf)].
- Updated existing tests to ensure compatibility and verify page_label
extraction.

## Tests Added

- Added a new test case for a PDF with multi-labeled pages.
- Verified both page and page_label metadata fields are correctly
extracted.

## Screenshots

<img width="549" alt="image"
src="https://github.com/user-attachments/assets/65db9f5c-032e-4592-926f-824777c28f33"
/>
2025-01-15 14:18:07 -05:00
Mohammad Mohtashim
21eb39dff0
[Community]: AzureOpenAIWhisperParser Authenication Fix (#29135)
- **Description:** `AzureOpenAIWhisperParser` authentication fix as
stated in the issue.
- **Issue:** #29133
2025-01-15 09:44:53 -05:00
Philippe PRADOS
2921597c71
community[patch]: Refactoring PDF loaders: 01 prepare (#29062)
- **Refactoring PDF loaders step 1**: "community: Refactoring PDF
loaders to standardize approaches"

- **Description:** Declare CloudBlobLoader in __init__.py. file_path is
Union[str, PurePath] anywhere
- **Twitter handle:** pprados

This is one part of a larger Pull Request (PR) that is too large to be
submitted all at once.
This specific part focuses to prepare the update of all parsers.

For more details, see [PR
28970](https://github.com/langchain-ai/langchain/pull/28970).

@eyurtsev it's the start of a PR series.
2025-01-07 11:00:04 -05:00
yeounhak
f38fc89f35
community: Corrected aload func to be asynchronous from webBaseLoader (#28337)
- **Description:** The aload function, contrary to its name, is not an
asynchronous function, so it cannot work concurrently with other
asynchronous functions.

- **Issue:** #28336 

- **Test: **: Done

- **Docs: **
[here](e0a95e5646/docs/docs/integrations/document_loaders/web_base.ipynb (L201))

- **Lint: ** All checks passed

If no one reviews your PR within a few days, please @-mention one of
baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17.

---------

Co-authored-by: Chester Curme <chester.curme@gmail.com>
2024-12-20 14:42:52 -05:00
Anusha Karkhanis
26bdf40072
Langchain_Community: SQL LanguageParser (#28430)
## Description
(This PR has contributions from @khushiDesai, @ashvini8, and
@ssumaiyaahmed).

This PR addresses **Issue #11229** which addresses the need for SQL
support in document parsing. This is integrated into the generic
TreeSitter parsing library, allowing LangChain users to easily load
codebases in SQL into smaller, manageable "documents."

This pull request adds a new ```SQLSegmenter``` class, which provides
the SQL integration.

## Issue
**Issue #11229**: Add support for a variety of languages to
LanguageParser

## Testing
We created a file ```test_sql.py``` with several tests to ensure the
```SQLSegmenter``` is functional. Below are the tests we added:

- ```def test_is_valid```: Checks SQL validity.
- ```def test_extract_functions_classes```: Extracts individual SQL
statements.
- ```def test_simplify_code```: Simplifies SQL code with comments.

---------

Co-authored-by: Syeda Sumaiya Ahmed <114104419+ssumaiyaahmed@users.noreply.github.com>
Co-authored-by: ashvini hunagund <97271381+ashvini8@users.noreply.github.com>
Co-authored-by: Khushi Desai <khushi.desai@advantawitty.com>
Co-authored-by: Khushi Desai <59741309+khushiDesai@users.noreply.github.com>
Co-authored-by: ccurme <chester.curme@gmail.com>
2024-12-19 20:30:57 +00:00
Frank Dai
e81433497b
community: support Confluence cookies (#28760)
**Description**: Some confluence instances don't support personal access
token, then cookie is a convenient way to authenticate. This PR adds
support for Confluence cookies.

**Twitter handle**: soulmachine
2024-12-17 12:16:36 -05:00
Hiros
8f5e72de05
community: Correctly handle multi-element rich text (#25762)
**Description:**

- Add _concatenate_rich_text method to combine all elements in rich text
arrays
- Update load_page method to use _concatenate_rich_text for rich text
properties
- Ensure all text content is captured, including inline code and
formatted text
- Add unit tests to verify correct handling of multi-element rich text
This fix prevents truncation of content after backticks or other
formatting elements.

 **Issue:**

Using Notion DB Loader, the text for `richtext` and `title` is truncated
after 1st element was loaded as Notion Loader only read the first
element.

**Dependencies:** any dependencies required for this change
None.

---------

Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Erick Friis <erick@langchain.dev>
2024-12-16 20:20:27 +00:00
Botong Zhu
13c3c4a210
community: fixes json loader not getting texts with json standard (#27327)
This PR fixes JSONLoader._get_text not converting objects to json string
correctly.
If an object is serializable and is not a dict, JSONLoader will use
python built-in str() method to convert it to string. This may cause
object converted to strings not following json standard. For example, a
list will be converted to string with single quotes, and if json.loads
try to load this string, it will cause error.

---------

Co-authored-by: Erick Friis <erick@langchain.dev>
2024-12-12 19:33:45 +00:00
Naka Masato
ce3b69aa05
community: add include_labels option to ConfluenceLoader (#28259)
## **Description:**

Enable `ConfluenceLoader` to include labels with `include_labels` option
(`false` by default for backward compatibility). and the labels are set
to `metadata` in the `Document`. e.g. `{"labels": ["l1", "l2"]}`

## Notes

Confluence API supports to get labels by providing `metadata.labels` to
`expand` query parameter

All of the following functions support `expand` in the same way:
- confluence.get_page_by_id
- confluence.get_all_pages_by_label
- confluence.get_all_pages_from_space
- cql (internally using
[/api/content/search](https://developer.atlassian.com/cloud/confluence/rest/v1/api-group-content/#api-wiki-rest-api-content-search-get))

## **Issue:**

No issue related to this PR.

## **Dependencies:** 

No changes.

## **Twitter handle:** 

[@gymnstcs](https://x.com/gymnstcs)


- [x] **Add tests and docs**: If you're adding a new integration, please
include
1. a test for the integration, preferably unit tests that do not rely on
network access,
2. an example notebook showing its use. It lives in
`docs/docs/integrations` directory.

- [x] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified. See contribution
guidelines for more: https://python.langchain.com/docs/contributing/

---------

Co-authored-by: Erick Friis <erick@langchain.dev>
2024-12-09 19:35:01 +00:00
Jan Heimes
ef365543cb
community: add Needle retriever and document loader integration (#28157)
- [x] **PR title**: "community: add Needle retriever and document loader
integration"
- Where "package" is whichever of langchain, community, core, etc. is
being modified. Use "docs: ..." for purely docs changes, "infra: ..."
for CI changes.
  - Example: "community: add foobar LLM"

- [x] **PR message**: ***Delete this entire checklist*** and replace
with
- **Description:** This PR adds a new integration for Needle, which
includes:
- **NeedleRetriever**: A retriever for fetching documents from Needle
collections.
- **NeedleLoader**: A document loader for managing and loading documents
into Needle collections.
      - Example notebooks demonstrating usage have been added in:
        - `docs/docs/integrations/retrievers/needle.ipynb`
        - `docs/docs/integrations/document_loaders/needle.ipynb`.
- **Dependencies:** The `needle-python` package is required as an
external dependency for accessing Needle's API. It has been added to the
extended testing dependencies list.
- **Twitter handle:** Feel free to mention me if this PR gets announced:
[needlexai](https://x.com/NeedlexAI).

- [x] **Add tests and docs**: If you're adding a new integration, please
include
1. Unit tests have been added for both `NeedleRetriever` and
`NeedleLoader` in `libs/community/tests/unit_tests`. These tests mock
API calls to avoid relying on network access.
2. Example notebooks have been added to `docs/docs/integrations/`,
showcasing both retriever and loader functionality.

- [x] **Lint and test**: Run `make format`, `make lint`, and `make test`
from the root of the package(s) you've modified. See contribution
guidelines for more: https://python.langchain.com/docs/contributing/
  - `make format`: Passed
  - `make lint`: Passed
- `make test`: Passed (requires `needle-python` to be installed locally;
this package is not added to LangChain dependencies).

Additional guidelines:
- [x] Optional dependencies are imported only within functions.
- [x] No dependencies have been added to pyproject.toml files except for
those required for unit tests.
- [x] The PR does not touch more than one package.
- [x] Changes are fully backwards compatible.
- [x] Community additions are not re-imported into LangChain core.

If no one reviews your PR within a few days, please @-mention one of
baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17.

---------

Co-authored-by: Erick Friis <erick@langchain.dev>
2024-12-03 22:06:25 +00:00
ccurme
0172d938b4
community: add AzureOpenAIWhisperParser (#27796)
Commandeered from https://github.com/langchain-ai/langchain/pull/26757.

---------

Co-authored-by: Sheepsta300 <128811766+Sheepsta300@users.noreply.github.com>
2024-10-31 12:37:41 -04:00
ccurme
f2285376a5
community[patch]: add web loader tests (#26728) 2024-09-20 18:29:54 -04:00
ccurme
f91bdd12d2
community[patch]: add to pypdf tests and run in CI (#26663) 2024-09-19 14:45:49 +00:00
Rajendra Kadam
60dc19da30
[community] Added PebbloTextLoader for loading text data in PebbloSafeLoader (#26582)
- **Description:** Added PebbloTextLoader for loading text in
PebbloSafeLoader.
- Since PebbloSafeLoader wraps document loaders, this new loader enables
direct loading of text into Documents using PebbloSafeLoader.
- **Issue:** NA
- **Dependencies:** NA
- [x] **Tests**: Added/Updated tests
2024-09-19 09:59:04 -04:00
RUO
0a177ec2cc
community: Enhance MongoDBLoader with flexible metadata and optimized field extraction (#23376)
### Description:
This pull request significantly enhances the MongodbLoader class in the
LangChain community package by adding robust metadata customization and
improved field extraction capabilities. The updated class now allows
users to specify additional metadata fields through the metadata_names
parameter, enabling the extraction of both top-level and deeply nested
document attributes as metadata. This flexibility is crucial for users
who need to include detailed contextual information without altering the
database schema.

Moreover, the include_db_collection_in_metadata flag offers optional
inclusion of database and collection names in the metadata, allowing for
even greater customization depending on the user's needs.

The loader's field extraction logic has been refined to handle missing
or nested fields more gracefully. It now employs a safe access mechanism
that avoids the KeyError previously encountered when a specified nested
field was absent in a document. This update ensures that the loader can
handle diverse and complex data structures without failure, making it
more resilient and user-friendly.

### Issue:
This pull request addresses a critical issue where the MongodbLoader
class in the LangChain community package could throw a KeyError when
attempting to access nested fields that may not exist in some documents.
The previous implementation did not handle the absence of specified
nested fields gracefully, leading to runtime errors and interruptions in
data processing workflows.

This enhancement ensures robust error handling by safely accessing
nested document fields, using default values for missing data, thus
preventing KeyError and ensuring smoother operation across various data
structures in MongoDB. This improvement is crucial for users working
with diverse and complex data sets, ensuring the loader can adapt to
documents with varying structures without failing.

### Dependencies: 
Requires motor for asynchronous MongoDB interaction.

### Twitter handle: 
N/A

### Add tests and docs
Tests: Unit tests have been added to verify that the metadata inclusion
toggle works as expected and that the field extraction correctly handles
nested fields.
Docs: An example notebook demonstrating the use of the enhanced
MongodbLoader is included in the docs/docs/integrations directory. This
notebook includes setup instructions, example usage, and outputs.
(Here is the notebook link : [colab
link](https://colab.research.google.com/drive/1tp7nyUnzZa3dxEFF4Kc3KS7ACuNF6jzH?usp=sharing))
Lint and test
Before submitting, I ran make format, make lint, and make test as per
the contribution guidelines. All tests pass, and the code style adheres
to the LangChain standards.

```python
import unittest
from unittest.mock import patch, MagicMock
import asyncio
from langchain_community.document_loaders.mongodb import MongodbLoader

class TestMongodbLoader(unittest.TestCase):
    def setUp(self):
        """Setup the MongodbLoader test environment by mocking the motor client 
        and database collection interactions."""
        # Mocking the AsyncIOMotorClient
        self.mock_client = MagicMock()
        self.mock_db = MagicMock()
        self.mock_collection = MagicMock()

        self.mock_client.get_database.return_value = self.mock_db
        self.mock_db.get_collection.return_value = self.mock_collection

        # Initialize the MongodbLoader with test data
        self.loader = MongodbLoader(
            connection_string="mongodb://localhost:27017",
            db_name="testdb",
            collection_name="testcol"
        )

    @patch('langchain_community.document_loaders.mongodb.AsyncIOMotorClient', return_value=MagicMock())
    def test_constructor(self, mock_motor_client):
        """Test if the constructor properly initializes with the correct database and collection names."""
        loader = MongodbLoader(
            connection_string="mongodb://localhost:27017",
            db_name="testdb",
            collection_name="testcol"
        )
        self.assertEqual(loader.db_name, "testdb")
        self.assertEqual(loader.collection_name, "testcol")

    def test_aload(self):
        """Test the aload method to ensure it correctly queries and processes documents."""
        # Setup mock data and responses for the database operations
        self.mock_collection.count_documents.return_value = asyncio.Future()
        self.mock_collection.count_documents.return_value.set_result(1)
        self.mock_collection.find.return_value = [
            {"_id": "1", "content": "Test document content"}
        ]

        # Run the aload method and check responses
        loop = asyncio.get_event_loop()
        results = loop.run_until_complete(self.loader.aload())
        self.assertEqual(len(results), 1)
        self.assertEqual(results[0].page_content, "Test document content")

    def test_construct_projection(self):
        """Verify that the projection dictionary is constructed correctly based on field names."""
        self.loader.field_names = ['content', 'author']
        self.loader.metadata_names = ['timestamp']
        expected_projection = {'content': 1, 'author': 1, 'timestamp': 1}
        projection = self.loader._construct_projection()
        self.assertEqual(projection, expected_projection)

if __name__ == '__main__':
    unittest.main()
```


### Additional Example for Documentation
Sample Data:

```json
[
    {
        "_id": "1",
        "title": "Artificial Intelligence in Medicine",
        "content": "AI is transforming the medical industry by providing personalized medicine solutions.",
        "author": {
            "name": "John Doe",
            "email": "john.doe@example.com"
        },
        "tags": ["AI", "Healthcare", "Innovation"]
    },
    {
        "_id": "2",
        "title": "Data Science in Sports",
        "content": "Data science provides insights into player performance and strategic planning in sports.",
        "author": {
            "name": "Jane Smith",
            "email": "jane.smith@example.com"
        },
        "tags": ["Data Science", "Sports", "Analytics"]
    }
]
```
Example Code:

```python
loader = MongodbLoader(
    connection_string="mongodb://localhost:27017",
    db_name="example_db",
    collection_name="articles",
    filter_criteria={"tags": "AI"},
    field_names=["title", "content"],
    metadata_names=["author.name", "author.email"],
    include_db_collection_in_metadata=True
)

documents = loader.load()

for doc in documents:
    print("Page Content:", doc.page_content)
    print("Metadata:", doc.metadata)
```
Expected Output:

```
Page Content: Artificial Intelligence in Medicine AI is transforming the medical industry by providing personalized medicine solutions.
Metadata: {'author_name': 'John Doe', 'author_email': 'john.doe@example.com', 'database': 'example_db', 'collection': 'articles'}
```

Thank you.

---

Additional guidelines:
- Make sure optional dependencies are imported within a function.
- Please do not add dependencies to pyproject.toml files (even optional
ones) unless they are required for unit tests.
- Most PRs should not touch more than one package.
- Changes should be backwards compatible.
- If you are adding something to community, do not re-import it in
langchain.

If no one reviews your PR within a few days, please @-mention one of
baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17.

---------

Co-authored-by: ccurme <chester.curme@gmail.com>
2024-09-17 10:23:17 -04:00
Erick Friis
c2a3021bb0
multiple: pydantic 2 compatibility, v0.3 (#26443)
Signed-off-by: ChengZi <chen.zhang@zilliz.com>
Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Dan O'Donovan <dan.odonovan@gmail.com>
Co-authored-by: Tom Daniel Grande <tomdgrande@gmail.com>
Co-authored-by: Grande <Tom.Daniel.Grande@statsbygg.no>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Co-authored-by: ccurme <chester.curme@gmail.com>
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
Co-authored-by: Tomaz Bratanic <bratanic.tomaz@gmail.com>
Co-authored-by: ZhangShenao <15201440436@163.com>
Co-authored-by: Friso H. Kingma <fhkingma@gmail.com>
Co-authored-by: ChengZi <chen.zhang@zilliz.com>
Co-authored-by: Nuno Campos <nuno@langchain.dev>
Co-authored-by: Morgante Pell <morgantep@google.com>
2024-09-13 14:38:45 -07:00
Alexander KIRILOV
6a8f8a56ac
community[patch]: added content_columns option to CSVLoader (#23809)
**Description:** 
Adding a new option to the CSVLoader that allows us to implicitly
specify the columns that are used for generating the Document content.
Currently these are implicitly set as "all fields not part of the
metadata_columns".

In some cases however it is useful to have a field both as a metadata
and as part of the document content.
2024-09-02 20:25:53 +00:00
Rajendra Kadam
1f1679e960
community: Refactor PebbloSafeLoader (#25582)
**Refactor PebbloSafeLoader**
  - Created `APIWrapper` and moved API logic into it.
  - Moved helper functions to the utility file.
  - Created smaller functions and methods for better readability.
  - Properly read environment variables.
  - Removed unused code.

**Issue:** NA
**Dependencies:** NA
**tests**:  Updated
2024-08-22 11:46:52 -04:00
Eugene Yurtsev
d24b82357f
community[patch]: Add missing annotations (#24890)
This PR adds annotations in comunity package.

Annotations are only strictly needed in subclasses of BaseModel for
pydantic 2 compatibility.

This PR adds some unnecessary annotations, but they're not bad to have
regardless for documentation pages.
2024-07-31 18:13:44 +00:00
Rajendra Kadam
a6add89bd4
community[minor]: [PebbloSafeLoader] Implement content-size-based batching (#24871)
- **Title:** [PebbloSafeLoader] Implement content-size-based batching in
the classification flow(loader/doc API)
- **Description:** 
- Implemented content-size-based batching in the loader/doc API, set to
100KB with no external configuration option, intentionally hard-coded to
prevent timeouts.
    - Remove unused field(pb_id) from doc_metadata
- **Issue:** NA
- **Dependencies:** NA
- **Add tests and docs:** Updated
2024-07-31 09:10:28 -04:00
Oleg Kulyk
4b1b7959a2
community[minor]: Add ScrapingAnt Loader Community Integration (#24514)
Added [ScrapingAnt](https://scrapingant.com/) Web Loader integration.
ScrapingAnt is a web scraping API that allows extracting web page data
into accessible and well-formatted markdown.

Description: Added ScrapingAnt web loader for retrieving web page data
as markdown
Dependencies: scrapingant-client
Twitter: @WeRunTheWorld3

---------

Co-authored-by: Oleg Kulyk <oleg@scrapingant.com>
2024-07-24 21:11:43 -04:00
Alexander Golodkov
2a70a07aad
community[minor]: added new document loaders based on dedoc library (#24303)
### Description
This pull request added new document loaders to load documents of
various formats using [Dedoc](https://github.com/ispras/dedoc):
  - `DedocFileLoader` (determine file types automatically and parse)
  - `DedocPDFLoader` (for `PDF` and images parsing)
- `DedocAPIFileLoader` (determine file types automatically and parse
using Dedoc API without library installation)

[Dedoc](https://dedoc.readthedocs.io) is an open-source library/service
that extracts texts, tables, attached files and document structure
(e.g., titles, list items, etc.) from files of various formats. The
library is actively developed and maintained by a group of developers.

`Dedoc` supports `DOCX`, `XLSX`, `PPTX`, `EML`, `HTML`, `PDF`, images
and more.
Full list of supported formats can be found
[here](https://dedoc.readthedocs.io/en/latest/#id1).
For `PDF` documents, `Dedoc` allows to determine textual layer
correctness and split the document into paragraphs.


### Issue
This pull request extends variety of document loaders supported by
`langchain_community` allowing users to choose the most suitable option
for raw documents parsing.

### Dependencies
The PR added a new (optional) dependency `dedoc>=2.2.5` ([library
documentation](https://dedoc.readthedocs.io)) to the
`extended_testing_deps.txt`

### Twitter handle
None

### Add tests and docs
1. Test for the integration:
`libs/community/tests/integration_tests/document_loaders/test_dedoc.py`
2. Example notebook:
`docs/docs/integrations/document_loaders/dedoc.ipynb`
3. Information about the library:
`docs/docs/integrations/providers/dedoc.mdx`

### Lint and test

Done locally:

  - `make format`
  - `make lint`
  - `make integration_tests`
  - `make docs_build` (from the project root)

---------

Co-authored-by: Nasty <bogatenkova.anastasiya@mail.ru>
2024-07-23 02:04:53 +00:00
Dristy Srivastava
020cc1cf3e
Community[minor]: Added checksum in while send data to pebblo-cloud (#23968)
- **Description:** 
            - Updated checksum in doc metadata
- Sending checksum and removing actual content, while sending data to
`pebblo-cloud` if `classifier-location `is `pebblo-cloud` in
`/loader/doc` API
            - Adding `pb_id` i.e. pebblo id to doc metadata
            - Refactoring as needed.
- Sending `content-checksum` and removing actual content, while sending
data to `pebblo-cloud` if `classifier-location `is `pebblo-cloud` in
`prmopt` API
- **Issue:** NA
- **Dependencies:** NA
- **Tests:** Updated
- **Docs** NA

---------

Co-authored-by: dristy.cd <dristy@clouddefense.io>
2024-07-19 13:52:54 -04:00
Rajendra Kadam
ee8aa54f53
community[patch]: Fix source path mismatch in PebbloSafeLoader (#23857)
**Description:** Fix for source path mismatch in PebbloSafeLoader. The
fix involves storing the full path in the doc metadata in VectorDB
**Issue:** NA, caught in internal testing
**Dependencies:** NA
**Add tests**:  Updated tests
2024-07-05 15:24:17 -04:00
Bagatur
a0c2281540
infra: update mypy 1.10, ruff 0.5 (#23721)
```python
"""python scripts/update_mypy_ruff.py"""
import glob
import tomllib
from pathlib import Path

import toml
import subprocess
import re

ROOT_DIR = Path(__file__).parents[1]


def main():
    for path in glob.glob(str(ROOT_DIR / "libs/**/pyproject.toml"), recursive=True):
        print(path)
        with open(path, "rb") as f:
            pyproject = tomllib.load(f)
        try:
            pyproject["tool"]["poetry"]["group"]["typing"]["dependencies"]["mypy"] = (
                "^1.10"
            )
            pyproject["tool"]["poetry"]["group"]["lint"]["dependencies"]["ruff"] = (
                "^0.5"
            )
        except KeyError:
            continue
        with open(path, "w") as f:
            toml.dump(pyproject, f)
        cwd = "/".join(path.split("/")[:-1])
        completed = subprocess.run(
            "poetry lock --no-update; poetry install --with typing; poetry run mypy . --no-color",
            cwd=cwd,
            shell=True,
            capture_output=True,
            text=True,
        )
        logs = completed.stdout.split("\n")

        to_ignore = {}
        for l in logs:
            if re.match("^(.*)\:(\d+)\: error:.*\[(.*)\]", l):
                path, line_no, error_type = re.match(
                    "^(.*)\:(\d+)\: error:.*\[(.*)\]", l
                ).groups()
                if (path, line_no) in to_ignore:
                    to_ignore[(path, line_no)].append(error_type)
                else:
                    to_ignore[(path, line_no)] = [error_type]
        print(len(to_ignore))
        for (error_path, line_no), error_types in to_ignore.items():
            all_errors = ", ".join(error_types)
            full_path = f"{cwd}/{error_path}"
            try:
                with open(full_path, "r") as f:
                    file_lines = f.readlines()
            except FileNotFoundError:
                continue
            file_lines[int(line_no) - 1] = (
                file_lines[int(line_no) - 1][:-1] + f"  # type: ignore[{all_errors}]\n"
            )
            with open(full_path, "w") as f:
                f.write("".join(file_lines))

        subprocess.run(
            "poetry run ruff format .; poetry run ruff --select I --fix .",
            cwd=cwd,
            shell=True,
            capture_output=True,
            text=True,
        )


if __name__ == "__main__":
    main()

```
2024-07-03 10:33:27 -07:00
Finlay Macklon
616d06d7fe
community: glob multiple patterns when using DirectoryLoader (#22852)
- **Description:** Updated
*community.langchain_community.document_loaders.directory.py* to enable
the use of multiple glob patterns in the `DirectoryLoader` class. Now,
the glob parameter is of type `list[str] | str` and still defaults to
the same value as before. I updated the docstring of the class to
reflect this, and added a unit test to
*community.tests.unit_tests.document_loaders.test_directory.py* named
`test_directory_loader_glob_multiple`. This test also shows an example
of how to use the new functionality.
- ~~Issue:~~**Discussion Thread:**
https://github.com/langchain-ai/langchain/discussions/18559
- **Dependencies:** None
- **Twitter handle:** N/a

- [x] **Add tests and docs**
    - Added test (described above)
    - Updated class docstring

- [x] **Lint and test**

---------

Co-authored-by: isaac hershenson <ihershenson@hmc.edu>
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
Co-authored-by: Isaac Francisco <78627776+isahers1@users.noreply.github.com>
2024-06-18 09:24:50 -07:00
Mr. Lance E Sloan «UMich»
84dc2dd059
community[patch]: Load YouTube transcripts (captions) as fixed-duration chunks with start times (#21710)
- **Description:** Add a new format, `CHUNKS`, to
`langchain_community.document_loaders.youtube.YoutubeLoader` which
creates multiple `Document` objects from YouTube video transcripts
(captions), each of a fixed duration. The metadata of each chunk
`Document` includes the start time of each one and a URL to that time in
the video on the YouTube website.
  
I had implemented this for UMich (@umich-its-ai) in a local module, but
it makes sense to contribute this to LangChain community for all to
benefit and to simplify maintenance.

- **Issue:** N/A
- **Dependencies:** N/A
- **Twitter:** lsloan_umich
- **Mastodon:**
[lsloan@mastodon.social](https://mastodon.social/@lsloan)

With regards to **tests and documentation**, most existing features of
the `YoutubeLoader` class are not tested. Only the
`YoutubeLoader.extract_video_id()` static method had a test. However,
while I was waiting for this PR to be reviewed and merged, I had time to
add a test for the chunking feature I've proposed in this PR.

I have added an example of using chunking to the
`docs/docs/integrations/document_loaders/youtube_transcript.ipynb`
notebook.

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
2024-06-11 17:44:36 +00:00
Max Mulatz
058a64c563
Community[minor]: Add language parser for Elixir (#22742)
Hi 👋 

First off, thanks a ton for your work on this 💚 Really appreciate what
you're providing here for the community.

## Description

This PR adds a basic language parser for the
[Elixir](https://elixir-lang.org/) programming language. The parser code
is based upon the approach outlined in
https://github.com/langchain-ai/langchain/pull/13318: it's using
`tree-sitter` under the hood and aligns with all the other `tree-sitter`
based parses added that PR.

The `CHUNK_QUERY` I'm using here is probably not the most sophisticated
one, but it worked for my application. It's a starting point to provide
"core" parsing support for Elixir in LangChain. It enables people to use
the language parser out in real world applications which may then lead
to further tweaking of the queries. I consider this PR just the ground
work.

- **Dependencies:** requires `tree-sitter` and `tree-sitter-languages`
from the extended dependencies
- **Twitter handle:**`@bitcrowd`

## Checklist

- [x] **PR title**: "package: description"
- [x] **Add tests and docs**
- [x] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified.

<!-- If no one reviews your PR within a few days, please @-mention one
of baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17. -->
2024-06-10 15:56:57 +00:00
Isaac Francisco
ba3e219d83
community[patch]: recursive url loader fix and unit tests (#22521)
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
2024-06-05 17:56:20 -07:00
Philippe PRADOS
6dd621d636
community[minor]: Add CloudBlobLoader that supports loading data from cloud buckets (#21957)
Thank you for contributing to LangChain!

- [ ] **PR title**: "Add CloudBlobLoader"
  - community: Add CloudBlobLoader

- [ ] **PR message**: Add cloud blob loader
    - **Description:** 
 Langchain provides several approaches to read different file formats:

Specific loaders (`CVSLoader`) or blob-compatible loaders
(`FileSystemBlobLoader`). The only implementation proposed for
BlobLoader is `FileSystemBlobLoader`.
      
Many projects retrieve files from cloud storage. We propose a new
implementation of `BlobLoader` to read files from the three cloud
storage systems. The interface is strictly identical to
`FileSystemBlobLoader`. The only difference is the constructor, which
takes a cloud "url" object such as `s3://my-bucket`, `az://my-bucket`,
or `gs://my-bucket`.
      
By streamlining the process, this novel implementation eliminates the
requirement to pre-download files from cloud storage to local temporary
files (which are seldom removed).
      
The code relies on the
[CloudPathLib](https://cloudpathlib.drivendata.org/stable/) library to
interpret cloud URLs. This has been added as an optional dependency.

```Python
loader = CloudBlobLoader("s3://mybucket/id")
for blob in loader.yield_blobs():
    print(blob)
```

- [X] **Dependencies:** CloudPathLib
- [X] **Twitter handle:** pprados


- [X] **Add tests and docs**: Add unit test, but it's easy to convert to
integration test, with some files in a cloud storage (see
`test_cloud_blob_loader.py`)

- [X] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified.

Hello from Paris @hwchase17. Can you review this PR?

---------

Co-authored-by: Eugene Yurtsev <eugene@langchain.dev>
2024-05-23 10:59:55 -04:00
Bagatur
50186da0a1
infra: rm unused # noqa violations (#22049)
Updating #21137
2024-05-22 15:21:08 -07:00
Mazen Ramadan
3c1d77dd64
community[minor]: Add Scrapfly Loader community integration (#22036)
Added [Scrapfly](https://scrapfly.io/) Web Loader integration. Scrapfly
is a web scraping API that allows extracting web page data into
accessible markdown or text datasets.

- __Description__: Added Scrapfly web loader for retrieving web page
data as markdown or text.
- Dependencies: scrapfly-sdk
- Twitter: @thealchemi1st

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
2024-05-22 21:29:13 +00:00
maang-h
fc93bed8c4
community: Fix CSVLoader columns is None (#20701)
- **Bug code**: In
langchain_community/document_loaders/csv_loader.py:100

- **Description**: currently, when 'CSVLoader' reads the column as None
in the 'csv' file, it will report an error because the 'CSVLoader' does
not verify whether the column is of str type and does not consider how
to handle the corresponding 'row_data' when the column is' None 'in the
csv. This pr provides a solution.

- **Issue:**  Fix #20699 

- **thinking:**

1. Refer to the processing method for
'langchain_community/document_loaders/csv_loader.py:100' when **'v'**
equals'None', and apply the same method to '**k**'.
(Reference`csv.DictReader` ,**'k'** will only be None when `
len(columns) < len(number_row_data)` is established)
2. **‘k’** equals None only holds when it is the last column, and its
corresponding **'v'** type is a list. Therefore, I referred to the data
format in 'Document' and used ',' to concatenated the elements in the
list.(But I'm not sure if you accept this form, if you have any other
ideas, communicate)

---------

Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
2024-05-22 12:57:46 -07:00
Eugene Yurtsev
25fbe356b4
community[patch]: upgrade to recent version of mypy (#21616)
This PR upgrades community to a recent version of mypy. It inserts type:
ignore on all existing failures.
2024-05-13 14:55:07 -04:00
Rohan Aggarwal
8021d2a2ab
community[minor]: Oraclevs integration (#21123)
Thank you for contributing to LangChain!

- Oracle AI Vector Search 
Oracle AI Vector Search is designed for Artificial Intelligence (AI)
workloads that allows you to query data based on semantics, rather than
keywords. One of the biggest benefit of Oracle AI Vector Search is that
semantic search on unstructured data can be combined with relational
search on business data in one single system. This is not only powerful
but also significantly more effective because you don't need to add a
specialized vector database, eliminating the pain of data fragmentation
between multiple systems.


- Oracle AI Vector Search is designed for Artificial Intelligence (AI)
workloads that allows you to query data based on semantics, rather than
keywords. One of the biggest benefit of Oracle AI Vector Search is that
semantic search on unstructured data can be combined with relational
search on business data in one single system. This is not only powerful
but also significantly more effective because you don't need to add a
specialized vector database, eliminating the pain of data fragmentation
between multiple systems.
This Pull Requests Adds the following functionalities
Oracle AI Vector Search : Vector Store
Oracle AI Vector Search : Document Loader
Oracle AI Vector Search : Document Splitter
Oracle AI Vector Search : Summary
Oracle AI Vector Search : Oracle Embeddings


- We have added unit tests and have our own local unit test suite which
verifies all the code is correct. We have made sure to add guides for
each of the components and one end to end guide that shows how the
entire thing runs.


- We have made sure that make format and make lint run clean.

Additional guidelines:
- Make sure optional dependencies are imported within a function.
- Please do not add dependencies to pyproject.toml files (even optional
ones) unless they are required for unit tests.
- Most PRs should not touch more than one package.
- Changes should be backwards compatible.
- If you are adding something to community, do not re-import it in
langchain.

If no one reviews your PR within a few days, please @-mention one of
baskaryan, efriis, eyurtsev, hwchase17.

---------

Co-authored-by: skmishraoracle <shailendra.mishra@oracle.com>
Co-authored-by: hroyofc <harichandan.roy@oracle.com>
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
2024-05-04 03:15:35 +00:00
Leonid Ganeline
dc7c06bc07
community[minor]: import fix (#20995)
Issue: When the third-party package is not installed, whenever we need
to `pip install <package>` the ImportError is raised.
But sometimes, the `ValueError` or `ModuleNotFoundError` is raised. It
is bad for consistency.
Change: replaced the `ValueError` or `ModuleNotFoundError` with
`ImportError` when we raise an error with the `pip install <package>`
message.
Note: Ideally, we replace all `try: import... except... raise ... `with
helper functions like `import_aim` or just use the existing
[langchain_core.utils.utils.guard_import](https://api.python.langchain.com/en/latest/utils/langchain_core.utils.utils.guard_import.html#langchain_core.utils.utils.guard_import)
But it would be much bigger refactoring. @baskaryan Please, advice on
this.
2024-04-29 10:32:50 -04:00
WilliamEspegren
804390ba4b
community: Spider integration (#20937)
Added the [Spider.cloud](https://spider.cloud) document loader.
[Spider](https://github.com/spider-rs/spider) is the
[fastest](https://github.com/spider-rs/spider/blob/main/benches/BENCHMARKS.md)
and cheapest crawler that returns LLM-ready data.

```
- **Description:** Adds Spider data loader
- **Dependencies:** spider-client
- **Twitter handle:** @WilliamEspegren 
```

---------

Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Co-authored-by: = <=>
Co-authored-by: Chester Curme <chester.curme@gmail.com>
2024-04-27 21:45:03 +00:00
Chip Davis
e818c75f8a
infra: test directory loader multithreaded (#20281)
This is a unit test for #20230 which was a fix for using multithreaded
mode with directory loader @eyurtsev
2024-04-26 19:16:47 -07:00
am-kinetica
b54b19ba1c
community[minor]: Implemented Kinetica Document Loader and added notebooks (#20002)
- [ ] **Kinetica Document Loader**: "community: a class to load
Documents from Kinetica"



- [ ] **Kinetica Document Loader**: 
- **Description:** implemented KineticaLoader in `kinetica_loader.py`
- **Dependencies:** install the Kinetica API using `pip install
gpudb==7.2.0.1 `
2024-04-25 13:39:00 -07:00
Mish Ushakov
6ccecf2363
community[minor]: added Browserbase loader (#20478) 2024-04-25 01:11:03 +00:00
Ravindu Somawansa
5acc7ba622
community[minor]: Add glue catalog loader (#20220)
Add Glue Catalog loader
2024-04-16 11:39:23 -04:00
Nicolas
ad04585e30
community[minor]: Firecrawl.dev integration (#20364)
Added the [FireCrawl](https://firecrawl.dev) document loader. Firecrawl
crawls and convert any website into LLM-ready data. It crawls all
accessible subpages and give you clean markdown for each.

    - **Description:** Adds FireCrawl data loader
    - **Dependencies:** firecrawl-py
    - **Twitter handle:** @mendableai 

ccing contributors: (@ericciarla @nickscamara)

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
2024-04-12 19:13:48 +00:00
Leonid Ganeline
4cb5f4c353
community[patch]: import flattening fix (#20110)
This PR should make it easier for linters to do type checking and for IDEs to jump to definition of code.

See #20050 as a template for this PR.
- As a byproduct: Added 3 missed `test_imports`.
- Added missed `SolarChat` in to __init___.py Added it into test_import
ut.
- Added `# type: ignore` to fix linting. It is not clear, why linting
errors appear after ^ changes.

---------

Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
2024-04-10 13:01:19 -04:00
david02871
e1a24d09c5
community: Add PHP language parser to document_loaders (#19850)
**Description:**
Added a PHP language parser to document_loaders
**Issue:** N/A
**Dependencies:** N/A
**Twitter handle:** N/A

---------

Co-authored-by: Chester Curme <chester.curme@gmail.com>
2024-04-08 11:30:28 -04:00