Commit Graph

247 Commits

Author SHA1 Message Date
Bagatur
aa9f247803 core[patch]: manually coerce ToolMessage args (#26283) 2024-09-10 15:57:57 -07:00
Bagatur
f2f9187919 community[patch]: fix community warnings 1 (#26239) 2024-09-09 17:27:00 -07:00
Eugene Yurtsev
e02b093d81 community[patch]: Fix more issues (#26116)
This PR resolves more type checking issues and fixes some bugs.
2024-09-05 16:31:21 -04:00
Eugene Yurtsev
0cc6584889 community[patch]: Resolve more linting issues (#26115)
Resolve a bunch of errors caught with mypy
2024-09-05 15:59:30 -04:00
Eugene Yurtsev
6e1b0d0228 community[patch]: Skip unit test that depends on langchain-aws and fix pydantic settings (#26111)
* Skip unit test that depends on langchain-aws
* fix pydantic settings
2024-09-05 15:08:34 -04:00
Harrison Chase
8516a03a02 langchain-community[major]: Upgrade community to pydantic 2 (#26011)
This PR upgrades langchain-community to pydantic 2.


* Most of this PR was auto-generated using code mods with gritql
(https://github.com/eyurtsev/migrate-pydantic/tree/main)
* Subsequently, some code was fixed manually to accommodate
differences between pydantic 1 and 2

Breaking Changes:

- Use TEXTEMBED_API_KEY and TEXTEMBED_API_URL as the env variables for the
TextEmbed integration:
cbea780492

Other changes:

- Added pydantic_settings as a required dependency for community. This
may be removed if we have enough time to convert the dependency into an
optional one.

---------

Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
2024-09-05 14:07:10 -04:00
Alexander KIRILOV
6a8f8a56ac community[patch]: added content_columns option to CSVLoader (#23809)
**Description:** 
Adding a new option to the CSVLoader that allows us to explicitly
specify the columns used for generating the Document content.
Currently these are implicitly set as "all fields not part of the
metadata_columns".

In some cases, however, it is useful to have a field both as metadata
and as part of the document content.
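A minimal usage sketch of the new option (the file name and column names are placeholders, not taken from the PR):

```python
from langchain_community.document_loaders.csv_loader import CSVLoader

# Hypothetical CSV with "title", "author" and "body" columns: "title" is kept
# both as metadata and as part of the page_content.
loader = CSVLoader(
    file_path="data.csv",
    metadata_columns=["title", "author"],
    content_columns=["title", "body"],
)
docs = loader.load()
```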
2024-09-02 20:25:53 +00:00
mehdiosa
c6f00e6bdc community: Fix branch not being considered when using GithubFileLoader (#20075)
- **Description:** Added a `ref` query parameter so data can be loaded from
any specified branch rather than only the default branch
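A rough sketch of how the fix is exercised, assuming the loader's `branch` field is what gets forwarded as the `ref` query parameter (repository, token, and branch below are placeholders):

```python
from langchain_community.document_loaders.github import GithubFileLoader

loader = GithubFileLoader(
    repo="owner/repo",          # placeholder repository
    access_token="ghp_...",     # placeholder token
    branch="dev",               # now passed to the API as the `ref` query parameter
    file_filter=lambda path: path.endswith(".md"),
)
docs = loader.load()
```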

---------

Co-authored-by: Osama Mehdi <mehdi@hm.edu>
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Erick Friis <erick@langchain.dev>
2024-08-30 21:47:11 +00:00
Isaac Francisco
d5ddaac1fc docs minor fix (#25794) 2024-08-28 04:14:36 +00:00
Dristy Srivastava
7205057c3e [Community][minor]: Added langchain_version while calling discover API (#24428)
- **Description:** Added langchain version while calling discover API
during both ingestion and retrieval
- **Issue:** NA
- **Dependencies:** NA
- **Tests:** NA
- **Docs** NA

---------

Co-authored-by: dristy.cd <dristy@clouddefense.io>
2024-08-26 08:47:48 -04:00
Dristy Srivastava
fbb4761199 [Community][minor]: Updating source path, and file path for SharePoint loader in PebbloSafeLoader (#25592)
- **Description:** Updating source path and file path in Pebblo safe
loader for SharePoint apps during loading
- **Issue:** NA
- **Dependencies:** NA
- **Tests:** NA
- **Docs** NA

---------

Co-authored-by: dristy.cd <dristy@clouddefense.io>
2024-08-26 08:38:40 -04:00
clement.l
642f9530cd community: add supported blockchains to Blockchain Document Loader (#25428)
- Remove deprecated chains.
- Add more supported chains.

---------

Co-authored-by: Chester Curme <chester.curme@gmail.com>
2024-08-23 14:39:42 +00:00
Kevin Engelke
3c7f12cbf5 community[minor]: Fix missing 'keep_newlines' parameter forward-pass to 'process_pages' function in confluence loader (#20086) (#20087)
- **Description:** Fixed missing `keep_newlines` parameter forward-pass
in confluence-loader
- **Issue:** #20086 
- **Dependencies:** None
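A minimal sketch of the parameter now being forwarded (URL, credentials, and space key are placeholders):

```python
from langchain_community.document_loaders import ConfluenceLoader

loader = ConfluenceLoader(
    url="https://example.atlassian.net/wiki",   # placeholder URL
    username="me@example.com",                  # placeholder credentials
    api_key="...",
    space_key="SPACE",
    keep_newlines=True,   # previously dropped before reaching process_pages
)
docs = loader.load()
```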

---------

Co-authored-by: ccurme <chester.curme@gmail.com>
2024-08-23 12:59:38 +00:00
Nobuhiko Otoba
4b63a217c2 "community: Fix GithubFileLoader source code", "docs: Fix GithubFileLoader code sample" (#19943)
This PR adds tiny improvements to the `GithubFileLoader` document loader
and its code sample, addressing the following issues:

1. Currently, the `file_extension` argument of `GithubFileLoader` does
not change its behavior at all.
1. The `GithubFileLoader` sample code in
`docs/docs/integrations/document_loaders/github.ipynb` does not work as
it stands.

The respective solutions I propose are the following:

1. Remove `file_extension` argument from `GithubFileLoader`.
1. Specify the branch as `master` (not the default `main`) and rename
`documents` as `document`.

---------

Co-authored-by: Isaac Francisco <78627776+isahers1@users.noreply.github.com>
2024-08-22 18:24:57 -04:00
Rajendra Kadam
1f1679e960 community: Refactor PebbloSafeLoader (#25582)
**Refactor PebbloSafeLoader**
  - Created `APIWrapper` and moved API logic into it.
  - Moved helper functions to the utility file.
  - Created smaller functions and methods for better readability.
  - Properly read environment variables.
  - Removed unused code.

**Issue:** NA
**Dependencies:** NA
**tests**:  Updated
2024-08-22 11:46:52 -04:00
Dristy Srivastava
b002702af6 [Community][minor]: Updating metadata with full_path in SharePoint loader (#25593)
- **Description:** Updating metadata for the SharePoint loader with the full
path, i.e., webUrl
- **Issue:** NA
- **Dependencies:** NA
- **Tests:** NA
- **Docs** NA

Co-authored-by: dristy.cd <dristy@clouddefense.io>
Co-authored-by: ccurme <chester.curme@gmail.com>
2024-08-21 13:10:14 +00:00
Mohammad Mohtashim
75c3c81b8c [Community]: Fix - Open AI Whisper client.audio.transcriptions returning Text Object which raises error (#25271)
- **Description:** The following
[line](fd546196ef/libs/community/langchain_community/document_loaders/parsers/audio.py (L117))
in `OpenAIWhisperParser` returns a plain text object, despite the official
documentation saying it should return a `Transcript` instance, which has the
text attribute. For the example given in the issue, and when I ran it myself,
I got the text directly. This small PR accounts for that.
- **Issue:** #25218
 

I was able to replicate the error even without the GenericLoader, as shown
below; the issue was with `OpenAIWhisperParser`:

```python
from langchain_community.document_loaders.blob_loaders import Blob
from langchain_community.document_loaders.parsers.audio import OpenAIWhisperParser

parser = OpenAIWhisperParser(
    api_key="sk-fxxxxxxxxx",
    response_format="srt",
    temperature=0,
)

list(parser.lazy_parse(Blob.from_path("path_to_file.m4a")))
```
2024-08-19 09:36:42 -04:00
Bagatur
253ceca76a docs: fix mimetype parser docstring (#25463) 2024-08-15 16:16:52 -07:00
Isaac Francisco
966b408634 [docs]: doc loader changes (#25417) 2024-08-14 19:46:33 -07:00
ccurme
27690506d0 multiple: update removal targets (#25361) 2024-08-14 09:50:39 -04:00
Harrison Chase
967b6f21f6 docs: improve document loaders index (#25365)
Co-authored-by: Erick Friis <erick@langchain.dev>
2024-08-14 01:48:48 +00:00
Isaac Francisco
f4ffd692a3 [docs]: standardize doc loader doc strings (#25325) 2024-08-13 23:18:56 +00:00
Eugene Yurtsev
bd6c31617e community[patch]: Remove more @allow_reuse=True validators (#25236)
Remove some additional allow_reuse=True usage in @root_validators.
2024-08-09 11:10:27 -04:00
Shivendra Soni
66b7206ab6 community: Add llm-extraction option to FireCrawl Document Loader (#25231)
**Description:** This minor PR aims to add `llm_extraction` to the Firecrawl
loader. This feature is supported by the API and Python SDK, but the
langchain loader omits it from the response.
**Twitter handle:** [scalable_pizza](https://x.com/scalablepizza)
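A hedged sketch of what using the option might look like; the `params` shape follows the Firecrawl v0 `extractorOptions` convention and is an assumption, not taken from the PR:

```python
from langchain_community.document_loaders import FireCrawlLoader

loader = FireCrawlLoader(
    url="https://example.com",   # placeholder URL
    api_key="fc-...",            # placeholder key
    mode="scrape",
    params={"extractorOptions": {"mode": "llm-extraction"}},  # assumed shape
)
docs = loader.load()  # extraction output should now be carried on the documents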

---------

Co-authored-by: Chester Curme <chester.curme@gmail.com>
2024-08-09 13:59:10 +00:00
Rajendra Kadam
663638d6a8 community[minor]: [SharePointLoader] Load extended metadata for the root folder (#24872)
- **Title:** [SharePointLoader] Load extended metadata for the root
folder
- **Description:** 
    - Ensure extended metadata loads correctly for the root folder.
- Cleanup: Refactor SharePointLoader to remove unused fields (`file_id` &
`site_id`).
- **Dependencies:** NA
- **Add tests and docs:** NA
2024-08-08 14:39:16 -04:00
Eugene Yurtsev
bf5193bb99 community[patch]: Upgrade pydantic extra (#25185)
Upgrade to using a literal for specifying the extra, which is the
recommended approach in pydantic 2.

This also works correctly in pydantic v1.

```python
from pydantic.v1 import BaseModel

class Foo(BaseModel, extra="forbid"):
    x: int

Foo(x=5, y=1)
```

And 


```python
from pydantic.v1 import BaseModel

class Foo(BaseModel):
    x: int

    class Config:
      extra = "forbid"

Foo(x=5, y=1)
```


## Enum -> literal using grit pattern:

```
engine marzano(0.1)
language python
or {
    `extra=Extra.allow` => `extra="allow"`,
    `extra=Extra.forbid` => `extra="forbid"`,
    `extra=Extra.ignore` => `extra="ignore"`
}
```

Re-sorted attributes in Config and removed doc-strings in case we need to
go back and forth between pydantic v1 and v2 during the 0.3 release.
(This will reduce merge conflicts.)


## Sort attributes in Config:

```
engine marzano(0.1)
language python


function sort($values) js {
    return $values.text.split(',').sort().join("\n");
}


class_definition($name, $body) as $C where {
    $name <: `Config`,
    $body <: block($statements),
    $values = [],
    $statements <: some bubble($values) assignment() as $A where {
        $values += $A
    },
    $body => sort($values),
}

```
2024-08-08 17:20:39 +00:00
Dominik Fladung
ffa0c838d8 Allow ConfluenceLoader authorization via Personal Access Tokens (#25096)
- community: Allow authorization to Confluence with bearer token

- **Description:** Allow authorization to Confluence with [Personal
Access
Token](https://confluence.atlassian.com/enterprise/using-personal-access-tokens-1026032365.html)
by checking for the keys `['client_id', token: ['access_token',
'token_type']]`

- **Issue:** 

Currently, the following error occurs when using a personal access token
for authorization.

```python
loader = ConfluenceLoader(
    url=os.getenv('CONFLUENCE_URL'),
    oauth2={
        'token': {"access_token": os.getenv("CONFLUENCE_ACCESS_TOKEN"), "token_type": "bearer"},
        'client_id': 'client_id',
    },
    page_ids=['12345678'], 
)
```

```
ValueError: Error(s) while validating input: ["You have either omitted require keys or added extra keys to the oauth2 dictionary. key values should be `['access_token', 'access_token_secret', 'consumer_key', 'key_cert']`"]
```

With this PR the loader runs as expected.

---------

Co-authored-by: Chester Curme <chester.curme@gmail.com>
2024-08-06 13:42:47 +00:00
Jim Baldwin
6890daa90c community: make AthenaLoader profile_name optional and fix type hint (#24958)
- **Description:** This PR makes the AthenaLoader profile_name optional
and fixes the type hint, which says the type is `str` but should be
`Optional[str]` since `None` is handled in the loader init. This is a minor
issue, but it confused me when using the AthenaLoader as to why a profile
was required, since I want one locally but not in production.
- **Issue:** #24957 
- **Dependencies:** None.
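A minimal sketch with `profile_name` omitted (query, database, and bucket are placeholders):

```python
from langchain_community.document_loaders.athena import AthenaLoader

loader = AthenaLoader(
    query="SELECT * FROM my_table LIMIT 10",         # placeholder query
    database="my_database",                          # placeholder database
    s3_output_uri="s3://my-bucket/athena-results/",  # placeholder bucket
    profile_name=None,  # now optional; credentials can come from the environment
)
docs = loader.load()
```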
2024-08-05 14:28:58 +00:00
Bagatur
e81ddb32a6 docs: fix kwargs docstring (#25010)
Fix:
![Screenshot 2024-08-02 at 5 33 37
PM](https://github.com/user-attachments/assets/7c56cdeb-ee81-454c-b3eb-86aa8a9bdc8d)
2024-08-02 19:54:54 -07:00
Bagatur
8e2316b8c2 community[patch]: Release 0.2.11 (#24989) 2024-08-02 20:08:44 +00:00
BottlePumpkin
bfc59c1d26 community: Fix KeyError in NotionDB loader when 'name' is missing (#24224)
**Description:** This PR fixes a KeyError in NotionDBLoader when the
"name" key is missing in the "people" property.

**Issue:** Fixes #24223 

**Dependencies:** None

---------

Co-authored-by: Chester Curme <chester.curme@gmail.com>
2024-08-01 13:55:40 +00:00
Eugene Yurtsev
d24b82357f community[patch]: Add missing annotations (#24890)
This PR adds annotations in the community package.

Annotations are only strictly needed in subclasses of BaseModel for
pydantic 2 compatibility.

This PR adds some unnecessary annotations, but they're not bad to have
regardless for documentation pages.
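An illustrative (not from the PR) example of why the annotations matter: pydantic 2 rejects non-annotated attributes on BaseModel subclasses.

```python
from pydantic import BaseModel


class MyIntegration(BaseModel):
    # Without the `str` annotation, pydantic 2 raises an error at class
    # definition time; with it, `client` is a proper model field.
    client: str = "default-client"
    timeout: int = 60
```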
2024-07-31 18:13:44 +00:00
Rajendra Kadam
a6add89bd4 community[minor]: [PebbloSafeLoader] Implement content-size-based batching (#24871)
- **Title:** [PebbloSafeLoader] Implement content-size-based batching in
the classification flow(loader/doc API)
- **Description:** 
- Implemented content-size-based batching in the loader/doc API, set to
100KB with no external configuration option, intentionally hard-coded to
prevent timeouts (see the sketch after this list).
    - Removed unused field (`pb_id`) from doc_metadata
- **Issue:** NA
- **Dependencies:** NA
- **Add tests and docs:** Updated
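A rough, hypothetical sketch of content-size-based batching; this is not the actual PebbloSafeLoader code, and the 100 KB limit comes from the description above:

```python
MAX_BATCH_BYTES = 100 * 1024  # hard-coded limit per the PR description


def batch_by_content_size(docs, max_bytes=MAX_BATCH_BYTES):
    """Group docs so each batch's total page_content stays under max_bytes."""
    batches, batch, size = [], [], 0
    for doc in docs:
        doc_size = len(doc.page_content.encode("utf-8"))
        if batch and size + doc_size > max_bytes:
            batches.append(batch)
            batch, size = [], 0
        batch.append(doc)
        size += doc_size
    if batch:
        batches.append(batch)
    return batches
```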
2024-07-31 09:10:28 -04:00
AmosDinh
c113682328 community:Add support for specifying document_loaders.firecrawl api url. (#24747)
Add support for specifying the document_loaders.firecrawl API URL.
This is mainly to support the
[self-hosting](https://github.com/mendableai/firecrawl/blob/main/SELF_HOST.md)
option firecrawl provides, e.g. now I can specify localhost:....

The corresponding firecrawl class already provides functionality to pass
the argument. See here:
4c9d62f6d3/apps/python-sdk/firecrawl/firecrawl.py (L29)
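A short sketch of pointing the loader at a self-hosted instance (the scrape URL and local endpoint are placeholders):

```python
from langchain_community.document_loaders import FireCrawlLoader

loader = FireCrawlLoader(
    url="https://example.com",        # placeholder URL to scrape
    api_key="fc-...",                 # placeholder key
    api_url="http://localhost:3002",  # assumed self-hosted Firecrawl endpoint
    mode="scrape",
)
docs = loader.load()
```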

---------

Co-authored-by: Chester Curme <chester.curme@gmail.com>
2024-07-28 14:30:36 -04:00
Oleg Kulyk
4b1b7959a2 community[minor]: Add ScrapingAnt Loader Community Integration (#24514)
Added [ScrapingAnt](https://scrapingant.com/) Web Loader integration.
ScrapingAnt is a web scraping API that allows extracting web page data
into accessible and well-formatted markdown.

Description: Added ScrapingAnt web loader for retrieving web page data
as markdown
Dependencies: scrapingant-client
Twitter: @WeRunTheWorld3
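A minimal sketch of the new loader (API key and URL are placeholders):

```python
from langchain_community.document_loaders import ScrapingAntLoader

loader = ScrapingAntLoader(
    ["https://example.com"],      # placeholder URLs
    api_key="...",                # placeholder ScrapingAnt API key
    continue_on_failure=True,
)
docs = loader.load()  # page content returned as markdown
```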

---------

Co-authored-by: Oleg Kulyk <oleg@scrapingant.com>
2024-07-24 21:11:43 -04:00
John
d59c656ea5 unstructured, community, initialize langchain-unstructured package (#22779)
#### Update (2): 
A single `UnstructuredLoader` is added to handle both local and api
partitioning. This loader also handles single or multiple documents.
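A hedged sketch of the unified loader, assuming the `langchain-unstructured` package's `partition_via_api` and `api_key` parameters (file paths and key are placeholders):

```python
from langchain_unstructured import UnstructuredLoader

# Local partitioning over one or more files.
local_loader = UnstructuredLoader(file_path=["report.pdf", "notes.txt"])

# API partitioning via the hosted Unstructured API.
api_loader = UnstructuredLoader(
    file_path="report.pdf",
    partition_via_api=True,
    api_key="...",  # placeholder Unstructured API key
)
docs = local_loader.load()
```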

#### Changes in `community`:
Changes here do not affect users. In the initial process of using the
SDK for the API Loaders, the Loaders in community were refactored.
Other changes include:
The `UnstructuredBaseLoader` has a new check to see if both
`mode="paged"` and `chunking_strategy="by_page"` are set. It also now adds
`Element.element_id` to the `Document.metadata`.
`UnstructuredAPIFileLoader` and `UnstructuredAPIFileIOLoader` were refactored:
both now directly inherit from `UnstructuredBaseLoader`, initialize
their `file_path`/`file` attributes respectively, and implement their own
`_post_process_elements` methods.

--------
#### Update:
New SDK Loaders in a [partner
package](https://python.langchain.com/v0.1/docs/contributing/integrations/#partner-package-in-langchain-repo)
are introduced to prevent breaking changes for users (see discussion
below).

##### TODO:
- [x] Test docstring examples
--------
- **Description:** UnstructuredAPIFileIOLoader and
UnstructuredAPIFileLoader calls to the unstructured api are now made
using the unstructured-client sdk.
- **New Dependencies:** unstructured-client

- [x] **Add tests and docs**: If you're adding a new integration, please
include
- [x] a test for the integration, preferably unit tests that do not rely
on network access,
- [x] update the description in
`docs/docs/integrations/providers/unstructured.mdx`
- [x] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified. See contribution
guidelines for more: https://python.langchain.com/docs/contributing/

Additional guidelines:
- Make sure optional dependencies are imported within a function.
- Please do not add dependencies to pyproject.toml files (even optional
ones) unless they are required for unit tests.
- Most PRs should not touch more than one package.
- Changes should be backwards compatible.
- If you are adding something to community, do not re-import it in
langchain.

TODO:
- [x] Update
https://python.langchain.com/v0.1/docs/integrations/document_loaders/unstructured_file/#unstructured-api
-
`langchain/docs/docs/integrations/document_loaders/unstructured_file.ipynb`
- The description here needs to indicate that users should install
`unstructured-client` instead of `unstructured`. Read over closely to
look for any other changes that need to be made.
- [x] Update the `lazy_load` method in `UnstructuredBaseLoader` to
handle json responses from the API instead of just lists of elements.
- This method may need to be overwritten by the API loaders instead of
changing it in the `UnstructuredBaseLoader`.
- [x] Update the documentation links in the class docstrings (the
Unstructured documents have moved)
- [x] Update Document.metadata to include `element_id` (see thread
[here](https://unstructuredw-kbe4326.slack.com/archives/C044N0YV08G/p1718187499818419))

---------

Signed-off-by: ChengZi <chen.zhang@zilliz.com>
Co-authored-by: Erick Friis <erick@langchain.dev>
Co-authored-by: Isaac Francisco <78627776+isahers1@users.noreply.github.com>
Co-authored-by: ChengZi <chen.zhang@zilliz.com>
2024-07-24 23:21:20 +00:00
Morteza Hosseini
9e06991aae community[patch]: Update URL to the 2markdown API (#24546)
Update the URL to Markdown endpoint.

API information is available here: https://2markdown.com/docs#url2md
2024-07-23 14:27:55 +00:00
Alexander Golodkov
2a70a07aad community[minor]: added new document loaders based on dedoc library (#24303)
### Description
This pull request added new document loaders to load documents of
various formats using [Dedoc](https://github.com/ispras/dedoc):
  - `DedocFileLoader` (determine file types automatically and parse)
  - `DedocPDFLoader` (for `PDF` and images parsing)
- `DedocAPIFileLoader` (determine file types automatically and parse
using Dedoc API without library installation)

[Dedoc](https://dedoc.readthedocs.io) is an open-source library/service
that extracts texts, tables, attached files and document structure
(e.g., titles, list items, etc.) from files of various formats. The
library is actively developed and maintained by a group of developers.

`Dedoc` supports `DOCX`, `XLSX`, `PPTX`, `EML`, `HTML`, `PDF`, images
and more.
The full list of supported formats can be found
[here](https://dedoc.readthedocs.io/en/latest/#id1).
For `PDF` documents, `Dedoc` can determine textual layer
correctness and split the document into paragraphs.
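A minimal sketch of the automatic loader (the file name is a placeholder; requires the optional `dedoc` dependency):

```python
from langchain_community.document_loaders import DedocFileLoader

loader = DedocFileLoader(
    "example.docx",     # placeholder file; type is detected automatically
    split="document",   # assumed default: return the whole document as one piece
)
docs = loader.load()
```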


### Issue
This pull request extends variety of document loaders supported by
`langchain_community` allowing users to choose the most suitable option
for raw documents parsing.

### Dependencies
The PR added a new (optional) dependency `dedoc>=2.2.5` ([library
documentation](https://dedoc.readthedocs.io)) to the
`extended_testing_deps.txt`

### Twitter handle
None

### Add tests and docs
1. Test for the integration:
`libs/community/tests/integration_tests/document_loaders/test_dedoc.py`
2. Example notebook:
`docs/docs/integrations/document_loaders/dedoc.ipynb`
3. Information about the library:
`docs/docs/integrations/providers/dedoc.mdx`

### Lint and test

Done locally:

  - `make format`
  - `make lint`
  - `make integration_tests`
  - `make docs_build` (from the project root)

---------

Co-authored-by: Nasty <bogatenkova.anastasiya@mail.ru>
2024-07-23 02:04:53 +00:00
Naka Masato
884f76e05a fix: load google credentials properly in GoogleDriveLoader (#12871)
- **Description:** 
- Fix #12870: set scope in `default` func (ref:
https://google-auth.readthedocs.io/en/master/reference/google.auth.html)
- Moved the code to load default credentials to the bottom for clarity
of the logic
	- Add docstring and comment for each credential loading logic
- **Issue:** https://github.com/langchain-ai/langchain/issues/12870
- **Dependencies:** no dependencies change
- **Tag maintainer:** for a quicker response, tag the relevant
maintainer (see below),
- **Twitter handle:** @gymnstcs
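A hedged sketch of the underlying idea: request the Drive scope when falling back to application-default credentials (the scope value is shown for illustration):

```python
import google.auth

SCOPES = ["https://www.googleapis.com/auth/drive.readonly"]

# Application-default credentials now receive the Drive scope explicitly.
creds, project = google.auth.default(scopes=SCOPES)
```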


---------

Co-authored-by: Chester Curme <chester.curme@gmail.com>
2024-07-22 17:43:33 +00:00
clement.l
d98b830e4b community: add flag to toggle progress bar (#24463)
- **Description:** Add a flag to determine whether to show progress bar 
- **Issue:** n/a
- **Dependencies:** n/a
- **Twitter handle:** n/a

---------

Co-authored-by: Chester Curme <chester.curme@gmail.com>
2024-07-20 13:18:02 +00:00
Asi Greenholts
372c27f2e5 community[minor]: [GoogleApiYoutubeLoader] Replace API used in _get_document_for_channel from search to playlistItem (#24034)
- **Description:** Search has a limit of 500 results; playlistItems
doesn't. Also added a class to the except clause to catch another common error.
- **Issue:** None
- **Dependencies:** None
- **Twitter handle:** @TupleType

---------

Co-authored-by: asi-cider <88270351+asi-cider@users.noreply.github.com>
Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
2024-07-19 14:04:34 -04:00
Dristy Srivastava
020cc1cf3e Community[minor]: Added checksum while sending data to pebblo-cloud (#23968)
- **Description:** 
    - Updated checksum in doc metadata
- Sending checksum and removing actual content when sending data to
`pebblo-cloud` if `classifier-location` is `pebblo-cloud` in the
`/loader/doc` API
    - Adding `pb_id`, i.e. pebblo id, to doc metadata
    - Refactoring as needed
- Sending `content-checksum` and removing actual content when sending
data to `pebblo-cloud` if `classifier-location` is `pebblo-cloud` in the
`prompt` API
- **Issue:** NA
- **Dependencies:** NA
- **Tests:** Updated
- **Docs** NA

---------

Co-authored-by: dristy.cd <dristy@clouddefense.io>
2024-07-19 13:52:54 -04:00
Brice Fotzo
034a8c7c1b community: support advanced text extraction options for pdf documents (#20265)
**Description:** 
- Updated constructors in PyPDFParser and PyPDFLoader to handle
`extraction_mode` and additional kwargs, aligning with the capabilities
of `PageObject.extract_text()` from pypdf.

- Added `test_pypdf_loader_with_layout` along with a corresponding
example text file to validate layout extraction from PDFs.

**Issue:** fixes #19735 

**Dependencies:** This change requires updating the pypdf dependency
from version 3.4.0 to at least 4.0.0.
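A minimal sketch of the new pass-through (the file name is a placeholder; the extra kwarg is an assumed pypdf layout-mode option):

```python
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader(
    "example.pdf",             # placeholder file
    extraction_mode="layout",  # forwarded to pypdf's PageObject.extract_text()
    extraction_kwargs={"layout_mode_space_vertically": False},  # assumed pypdf kwarg
)
docs = loader.load()
```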


---------

Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Co-authored-by: Erick Friis <erick@langchain.dev>
2024-07-17 20:47:09 +00:00
Eugene Yurtsev
c4e149d4f1 community[patch]: Add linter to catch @root_validator (#24070)
- Add linter to prevent further usage of vanilla root validator
- Update remaining root validators
2024-07-10 14:51:03 +00:00
Rajendra Kadam
ee8aa54f53 community[patch]: Fix source path mismatch in PebbloSafeLoader (#23857)
**Description:** Fix for source path mismatch in PebbloSafeLoader. The
fix involves storing the full path in the doc metadata in VectorDB
**Issue:** NA, caught in internal testing
**Dependencies:** NA
**Add tests**:  Updated tests
2024-07-05 15:24:17 -04:00
Klaudia Lemiec
a2082bc1f8 docs: Arxiv docs update (#23871)
- **Description:** Update of docstrings and doc pages
- **Issue:**
[22866](https://github.com/langchain-ai/langchain/issues/22866)
2024-07-05 11:43:51 -04:00
Bagatur
a0c2281540 infra: update mypy 1.10, ruff 0.5 (#23721)
```python
"""python scripts/update_mypy_ruff.py"""
import glob
import tomllib
from pathlib import Path

import toml
import subprocess
import re

ROOT_DIR = Path(__file__).parents[1]


def main():
    for path in glob.glob(str(ROOT_DIR / "libs/**/pyproject.toml"), recursive=True):
        print(path)
        with open(path, "rb") as f:
            pyproject = tomllib.load(f)
        try:
            pyproject["tool"]["poetry"]["group"]["typing"]["dependencies"]["mypy"] = (
                "^1.10"
            )
            pyproject["tool"]["poetry"]["group"]["lint"]["dependencies"]["ruff"] = (
                "^0.5"
            )
        except KeyError:
            continue
        with open(path, "w") as f:
            toml.dump(pyproject, f)
        cwd = "/".join(path.split("/")[:-1])
        completed = subprocess.run(
            "poetry lock --no-update; poetry install --with typing; poetry run mypy . --no-color",
            cwd=cwd,
            shell=True,
            capture_output=True,
            text=True,
        )
        logs = completed.stdout.split("\n")

        to_ignore = {}
        for l in logs:
            # raw string so the regex escapes aren't treated as string escapes
            match = re.match(r"^(.*):(\d+): error:.*\[(.*)\]", l)
            if match:
                path, line_no, error_type = match.groups()
                if (path, line_no) in to_ignore:
                    to_ignore[(path, line_no)].append(error_type)
                else:
                    to_ignore[(path, line_no)] = [error_type]
        print(len(to_ignore))
        for (error_path, line_no), error_types in to_ignore.items():
            all_errors = ", ".join(error_types)
            full_path = f"{cwd}/{error_path}"
            try:
                with open(full_path, "r") as f:
                    file_lines = f.readlines()
            except FileNotFoundError:
                continue
            file_lines[int(line_no) - 1] = (
                file_lines[int(line_no) - 1][:-1] + f"  # type: ignore[{all_errors}]\n"
            )
            with open(full_path, "w") as f:
                f.write("".join(file_lines))

        subprocess.run(
            "poetry run ruff format .; poetry run ruff --select I --fix .",
            cwd=cwd,
            shell=True,
            capture_output=True,
            text=True,
        )


if __name__ == "__main__":
    main()

```
2024-07-03 10:33:27 -07:00
Eugene Yurtsev
f24e38876a community[patch]: Update root_validators to use explicit pre=True or pre=False (#23736) 2024-07-01 17:13:23 -04:00
Alireza Kashani
c39521b70d Update grobid.py (#23399)
Fixed a potential `IndexError: list index out of range` in case there is
no title.

2024-06-26 09:11:02 -04:00
Rahul Triptahi
9ef93ecd7c community[minor]: Added classification_location parameter in PebbloSafeLoader. (#22565)
Description: Add classifier_location feature flag. This flag enables
Pebblo to decide the classifier location, local or pebblo-cloud.
Unit Tests: N/A
Documentation: N/A
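A minimal sketch of selecting the classifier location (the wrapped loader and app details are placeholders):

```python
from langchain_community.document_loaders import CSVLoader, PebbloSafeLoader

loader = PebbloSafeLoader(
    CSVLoader("data.csv"),               # placeholder wrapped loader
    name="my-app",                       # placeholder app name
    owner="me@example.com",              # placeholder owner
    description="demo",
    classifier_location="pebblo-cloud",  # or "local"
)
docs = loader.load()
```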

---------

Signed-off-by: Rahul Tripathi <rauhl.psit.ec@gmail.com>
Co-authored-by: Rahul Tripathi <rauhl.psit.ec@gmail.com>
2024-06-24 17:30:38 -04:00