langchain

mirror of https://github.com/hwchase17/langchain.git synced 2025-10-14 21:40:51 +00:00

Author	SHA1	Message	Date
Naka Masato	884f76e05a	fix: load google credentials properly in GoogleDriveLoader (#12871 ) - Description: - Fix #12870: set scope in `default` func (ref: https://google-auth.readthedocs.io/en/master/reference/google.auth.html) - Moved the code to load default credentials to the bottom for clarity of the logic - Add docstring and comment for each credential loading logic - Issue: https://github.com/langchain-ai/langchain/issues/12870 - Dependencies: no dependencies change - Twitter handle: @gymnstcs --------- Co-authored-by: Chester Curme <chester.curme@gmail.com>	2024-07-22 17:43:33 +00:00
clement.l	d98b830e4b	community: add flag to toggle progress bar (#24463 ) - Description: Add a flag to determine whether to show progress bar - Issue: n/a - Dependencies: n/a - Twitter handle: n/a --------- Co-authored-by: Chester Curme <chester.curme@gmail.com>	2024-07-20 13:18:02 +00:00
Asi Greenholts	372c27f2e5	community[minor]: [GoogleApiYoutubeLoader] Replace API used in _get_document_for_channel from search to playlistItem (#24034 ) - Description: Search has a limit of 500 results, playlistItems doesn't. Added a class in except clause to catch another common error. - Issue: None - Dependencies: None - Twitter handle: @TupleType --------- Co-authored-by: asi-cider <88270351+asi-cider@users.noreply.github.com> Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>	2024-07-19 14:04:34 -04:00
Dristy Srivastava	020cc1cf3e	Community[minor]: Added checksum in while send data to pebblo-cloud (#23968 ) - Description: - Updated checksum in doc metadata - Sending checksum and removing actual content, while sending data to `pebblo-cloud` if `classifier-location` is `pebblo-cloud` in `/loader/doc` API - Adding `pb_id` i.e. pebblo id to doc metadata - Refactoring as needed. - Sending `content-checksum` and removing actual content, while sending data to `pebblo-cloud` if `classifier-location` is `pebblo-cloud` in `prompt` API - Issue: NA - Dependencies: NA - Tests: Updated - Docs NA --------- Co-authored-by: dristy.cd <dristy@clouddefense.io>	2024-07-19 13:52:54 -04:00
Brice Fotzo	034a8c7c1b	community: support advanced text extraction options for pdf documents (#20265 ) Description: - Updated constructors in PyPDFParser and PyPDFLoader to handle `extraction_mode` and additional kwargs, aligning with the capabilities of `PageObject.extract_text()` from pypdf. - Added `test_pypdf_loader_with_layout` along with a corresponding example text file to validate layout extraction from PDFs. Issue: fixes #19735 Dependencies: This change requires updating the pypdf dependency from version 3.4.0 to at least 4.0.0. Additional changes include the addition of a new test test_pypdf_loader_with_layout and an example text file to ensure the functionality of layout extraction from PDFs aligns with the new capabilities. --------- Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com> Co-authored-by: Bagatur <baskaryan@gmail.com> Co-authored-by: Erick Friis <erick@langchain.dev>	2024-07-17 20:47:09 +00:00
Eugene Yurtsev	c4e149d4f1	community[patch]: Add linter to catch @root_validator (#24070 ) - Add linter to prevent further usage of vanilla root validator - Update remaining root validators	2024-07-10 14:51:03 +00:00
Rajendra Kadam	ee8aa54f53	community[patch]: Fix source path mismatch in PebbloSafeLoader (#23857 ) Description: Fix for source path mismatch in PebbloSafeLoader. The fix involves storing the full path in the doc metadata in VectorDB Issue: NA, caught in internal testing Dependencies: NA Add tests: Updated tests	2024-07-05 15:24:17 -04:00
Klaudia Lemiec	a2082bc1f8	docs: Arxiv docs update (#23871 ) - Description: Update of docstrings and docpages - Issue: [22866](https://github.com/langchain-ai/langchain/issues/22866)	2024-07-05 11:43:51 -04:00
Bagatur	a0c2281540	infra: update mypy 1.10, ruff 0.5 (#23721 )
```python
"""python scripts/update_mypy_ruff.py"""
import glob
import re
import subprocess
import tomllib
from pathlib import Path

import toml

ROOT_DIR = Path(__file__).parents[1]

# Matches mypy output lines of the form "<path>:<line>: error: ... [<code>]".
MYPY_ERROR = r"^(.*)\:(\d+)\: error:.*\[(.*)\]"


def main():
    for path in glob.glob(str(ROOT_DIR / "libs/*/pyproject.toml"), recursive=True):
        print(path)
        with open(path, "rb") as f:
            pyproject = tomllib.load(f)
        # Bump the pinned versions; skip packages without these dependency groups.
        try:
            pyproject["tool"]["poetry"]["group"]["typing"]["dependencies"]["mypy"] = (
                "^1.10"
            )
            pyproject["tool"]["poetry"]["group"]["lint"]["dependencies"]["ruff"] = (
                "^0.5"
            )
        except KeyError:
            continue
        with open(path, "w") as f:
            toml.dump(pyproject, f)
        cwd = "/".join(path.split("/")[:-1])
        completed = subprocess.run(
            "poetry lock --no-update; poetry install --with typing; poetry run mypy . --no-color",
            cwd=cwd,
            shell=True,
            capture_output=True,
            text=True,
        )
        logs = completed.stdout.split("\n")

        # Collect the mypy error codes reported for each (file, line) pair.
        to_ignore = {}
        for l in logs:
            match = re.match(MYPY_ERROR, l)
            if match:
                path, line_no, error_type = match.groups()
                if (path, line_no) in to_ignore:
                    to_ignore[(path, line_no)].append(error_type)
                else:
                    to_ignore[(path, line_no)] = [error_type]
        print(len(to_ignore))
        # Append a "# type: ignore[...]" comment to every offending line.
        for (error_path, line_no), error_types in to_ignore.items():
            all_errors = ", ".join(error_types)
            full_path = f"{cwd}/{error_path}"
            try:
                with open(full_path, "r") as f:
                    file_lines = f.readlines()
            except FileNotFoundError:
                continue
            file_lines[int(line_no) - 1] = (
                file_lines[int(line_no) - 1][:-1] + f"  # type: ignore[{all_errors}]\n"
            )
            with open(full_path, "w") as f:
                f.write("".join(file_lines))

        subprocess.run(
            "poetry run ruff format .; poetry run ruff --select I --fix .",
            cwd=cwd,
            shell=True,
            capture_output=True,
            text=True,
        )


if __name__ == "__main__":
    main()
```	2024-07-03 10:33:27 -07:00
Eugene Yurtsev	f24e38876a	community[patch]: Update root_validators to use explicit pre=True or pre=False (#23736 )	2024-07-01 17:13:23 -04:00
Alireza Kashani	c39521b70d	Update grobid.py (#23399 ) fixed potential `IndexError: list index out of range` in case there is no title	2024-06-26 09:11:02 -04:00
Rahul Triptahi	9ef93ecd7c	community[minor]: Added classification_location parameter in PebbloSafeLoader. (#22565 ) Description: Add classifier_location feature flag. This flag enables Pebblo to decide the classifier location, local or pebblo-cloud. Unit Tests: N/A Documentation: N/A --------- Signed-off-by: Rahul Tripathi <rauhl.psit.ec@gmail.com> Co-authored-by: Rahul Tripathi <rauhl.psit.ec@gmail.com>	2024-06-24 17:30:38 -04:00
RUO	2b87e330b0	community: fix issue with nested field extraction in MongodbLoader (#22801 )
Description: This PR addresses an issue in the `MongodbLoader` where nested fields were not being correctly extracted. The loader now correctly handles nested fields specified in the `field_names` parameter.
Issue: Fixes an issue where attempting to extract nested fields from MongoDB documents resulted in `KeyError`.
Dependencies: No new dependencies are required for this change.
### Changes
1. Field Name Parsing: Added logic to parse nested field names and safely extract their values from the MongoDB documents.
2. Projection Construction: Updated the projection dictionary to include nested fields correctly.
3. Field Extraction: Updated the `aload` method to handle nested field extraction using a recursive approach to traverse the nested dictionaries.
### Example Usage
Updated usage example to demonstrate how to specify nested fields in the `field_names` parameter:
```python
loader = MongodbLoader(
    connection_string=MONGO_URI,
    db_name=MONGO_DB,
    collection_name=MONGO_COLLECTION,
    filter_criteria={"data.job.company.industry_name": "IT", "data.job.detail": {"$exists": True}},
    field_names=[
        "data.job.detail.id",
        "data.job.detail.position",
        "data.job.detail.intro",
        "data.job.detail.main_tasks",
        "data.job.detail.requirements",
        "data.job.detail.preferred_points",
        "data.job.detail.benefits",
    ],
)
docs = loader.load()
print(len(docs))
for doc in docs:
    print(doc.page_content)
```
### Testing
Tested with a MongoDB collection containing nested documents to ensure that the nested fields are correctly extracted and concatenated into a single page_content string.
### Note
This change ensures backward compatibility for non-nested fields and improves functionality for nested field extraction.
### Output Sample
```python
print(docs[:3])
```
```shell
# output sample:
[
    Document(
        # Here in this example, page_content is the combined text from the fields below
        # "position", "intro", "main_tasks", "requirements", "preferred_points", "benefits"
        page_content='all combined contents from the requested fields in the document',
        metadata={'database': 'Your Database name', 'collection': 'Your Collection name'}
    ),
    ...
]
```
--------- Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com> Co-authored-by: Bagatur <baskaryan@gmail.com>	2024-06-24 19:29:11 +00:00
Rahul Triptahi	0cd3f93361	Enhance metadata of sharepointLoader. (#22248 ) Description: 2 feature flags added to SharePointLoader in this PR: 1. load_auth: if set to True, adds authorised identities to metadata 2. load_extended_metadata, adds source, owner and full_path to metadata Unit tests:N/A Documentation: To be done. --------- Signed-off-by: Rahul Tripathi <rauhl.psit.ec@gmail.com> Co-authored-by: Rahul Tripathi <rauhl.psit.ec@gmail.com>	2024-06-21 17:03:38 -07:00
Rahul Triptahi	4bb3d5c488	[community][quick-fix]: changed from blob.path to blob.path.name in O365BaseLoader. (#22287 ) Description: file_metadata_ was not getting propagated to returned documents. Changed the metadata_dict lookup key from blob.path to blob.path.name, the name of the blob's path. Documentation: N/A Unit tests: N/A Co-authored-by: ccurme <chester.curme@gmail.com>	2024-06-21 15:51:03 -04:00
Zheng Robert Jia	a349fce880	docs[minor],community[patch]: Minor tutorial docs improvement, minor import error quick fix. (#22725 ) minor changes to module import error handling and minor issues in tutorial documents. --------- Co-authored-by: Bagatur <baskaryan@gmail.com> Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com> Co-authored-by: Eugene Yurtsev <eugene@langchain.dev>	2024-06-20 15:36:49 -04:00
Michał Krassowski	710197e18c	community[patch]: restore compatibility with SQLAlchemy 1.x (#22546 ) - Description: Restores compatibility with SQLAlchemy 1.4.x that was broken since #18992 and adds a test run for this version on CI (only for Python 3.11) - Issue: fixes #19681 - Dependencies: None - Twitter handle: `@krassowski_m` --------- Co-authored-by: Erick Friis <erick@langchain.dev>	2024-06-19 17:58:57 +00:00
鹿鹿鹿鲨	6b46b5e9ce	community: add request_kwargs and expect TimeError AsyncHtmlLoader (#23068 ) - Description: add `request_kwargs` and expect `TimeError` in `_fetch` function for AsyncHtmlLoader. This allows you to fill in the kwargs parameter when using the `load()` method of the `AsyncHtmlLoader` class. Co-authored-by: Yucolu <yucolu@tencent.com>	2024-06-18 20:02:46 -07:00
Finlay Macklon	616d06d7fe	community: glob multiple patterns when using DirectoryLoader (#22852 ) - Description: Updated community.langchain_community.document_loaders.directory.py to enable the use of multiple glob patterns in the `DirectoryLoader` class. Now, the glob parameter is of type `list[str] | str` and still defaults to the same value as before. I updated the docstring of the class to reflect this, and added a unit test to community.tests.unit_tests.document_loaders.test_directory.py named `test_directory_loader_glob_multiple`. This test also shows an example of how to use the new functionality. - Discussion Thread: https://github.com/langchain-ai/langchain/discussions/18559 - Dependencies: None - Twitter handle: N/a - [x] Add tests and docs - Added test (described above) - Updated class docstring - [x] Lint and test --------- Co-authored-by: isaac hershenson <ihershenson@hmc.edu> Co-authored-by: Harrison Chase <hw.chase.17@gmail.com> Co-authored-by: Isaac Francisco <78627776+isahers1@users.noreply.github.com>	2024-06-18 09:24:50 -07:00
Lucas Tucker	e25a5966b5	docs: Standardize DocumentLoader docstrings (#22932 ) Standardizing DocumentLoader docstrings (of which there are many) This PR addresses issue #22866 and adds docstrings according to the issue's specified format (in the appendix) for files csv_loader.py and json_loader.py in langchain_community.document_loaders. In particular, the following sections have been added to both CSVLoader and JSONLoader: Setup, Instantiate, Load, Async load, and Lazy load. It may be worth adding a 'Metadata' section to the JSONLoader docstring to clarify how we want to extract the JSON metadata (using the `metadata_func` argument). The files I used to walkthrough the various sections were `example_2.json` from [HERE](https://support.oneskyapp.com/hc/en-us/articles/208047697-JSON-sample-files) and `hw_200.csv` from [HERE](https://people.sc.fsu.edu/~jburkardt/data/csv/csv.html). --------- Co-authored-by: lucast2021 <lucast2021@headroyce.org> Co-authored-by: isaac hershenson <ihershenson@hmc.edu>	2024-06-18 03:26:36 +00:00
Baskar Gopinath	c4f2bc9540	docs: Fix wrongly referenced class name in confluence.py (#22879 ) Fixes #22542 Changed ConfluenceReader to ConfluenceLoader	2024-06-14 14:00:48 -07:00
Eugene Yurtsev	9a877c7adb	community[patch]: SitemapLoader restrict depth of parsing sitemap (CVE-2024-2965) (#22903 ) This PR restricts the depth to which the sitemap can be parsed. Fix for: CVE-2024-2965	2024-06-14 13:04:40 -04:00
Isaac Francisco	75e966a2fa	docs, cli[patch]: document loaders doc template (#22862 ) From: https://github.com/langchain-ai/langchain/pull/22290 --------- Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>	2024-06-13 19:28:57 -07:00
Isaac Francisco	034257e9bf	docs: improved recursive url loader docs (#22648 ) Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>	2024-06-13 11:09:35 -07:00
Mr. Lance E Sloan «UMich»	08c466c603	community[patch]: bugfix for `YoutubeLoader`'s `LINES` format (#22815 ) - Description: A change I submitted recently introduced a bug in `YoutubeLoader`'s `LINES` output format. In those conditions, curly braces ("`{}`") creates a set, not a dictionary. This bugfix explicitly specifies that a dictionary is created. - Issue: N/A - Dependencies: N/A - Twitter: lsloan_umich - Mastodon: [lsloan@mastodon.social](https://mastodon.social/@lsloan)	2024-06-12 12:29:34 -04:00
Mr. Lance E Sloan «UMich»	84dc2dd059	community[patch]: Load YouTube transcripts (captions) as fixed-duration chunks with start times (#21710 ) - Description: Add a new format, `CHUNKS`, to `langchain_community.document_loaders.youtube.YoutubeLoader` which creates multiple `Document` objects from YouTube video transcripts (captions), each of a fixed duration. The metadata of each chunk `Document` includes the start time of each one and a URL to that time in the video on the YouTube website. I had implemented this for UMich (@umich-its-ai) in a local module, but it makes sense to contribute this to LangChain community for all to benefit and to simplify maintenance. - Issue: N/A - Dependencies: N/A - Twitter: lsloan_umich - Mastodon: [lsloan@mastodon.social](https://mastodon.social/@lsloan) With regards to tests and documentation, most existing features of the `YoutubeLoader` class are not tested. Only the `YoutubeLoader.extract_video_id()` static method had a test. However, while I was waiting for this PR to be reviewed and merged, I had time to add a test for the chunking feature I've proposed in this PR. I have added an example of using chunking to the `docs/docs/integrations/document_loaders/youtube_transcript.ipynb` notebook. --------- Co-authored-by: Bagatur <baskaryan@gmail.com>	2024-06-11 17:44:36 +00:00
am-kinetica	ad101adec8	community[patch]: Kinetica Integrations handled error in querying; quotes in table names; updated gpudb API (#22724 ) - [ ] Miscellaneous updates and fixes: - Description: Handled error in querying; quotes in table names; updated gpudb API - Issue: Threw an error with an error message difficult to understand if a query failed or returned no records - Dependencies: Updated GPUDB API version to `7.2.0.9` @baskaryan @hwchase17	2024-06-11 10:01:26 -04:00
Mohammad Mohtashim	c3cce98d86	community[patch]: Small Fix in OutlookMessageLoader (Close the Message once Open) (#22744 ) - Description: A very small fix where we close the message when it opened - Issue: #22729	2024-06-10 13:08:39 -07:00
Max Mulatz	058a64c563	Community[minor]: Add language parser for Elixir (#22742 ) Hi 👋 First off, thanks a ton for your work on this 💚 Really appreciate what you're providing here for the community. ## Description This PR adds a basic language parser for the [Elixir](https://elixir-lang.org/) programming language. The parser code is based upon the approach outlined in https://github.com/langchain-ai/langchain/pull/13318: it's using `tree-sitter` under the hood and aligns with all the other `tree-sitter` based parsers added in that PR. The `CHUNK_QUERY` I'm using here is probably not the most sophisticated one, but it worked for my application. It's a starting point to provide "core" parsing support for Elixir in LangChain. It enables people to use the language parser out in real world applications which may then lead to further tweaking of the queries. I consider this PR just the groundwork. - Dependencies: requires `tree-sitter` and `tree-sitter-languages` from the extended dependencies - Twitter handle: `@bitcrowd`	2024-06-10 15:56:57 +00:00
Isaac Francisco	ba3e219d83	community[patch]: recursive url loader fix and unit tests (#22521 ) Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com> Co-authored-by: Bagatur <baskaryan@gmail.com>	2024-06-05 17:56:20 -07:00
Emilien Chauvet	c3d4126eb1	community[minor]: add user agent for web scraping loaders (#22480 ) Description: This PR adds a `USER_AGENT` env variable that is to be used for web scraping. It creates a util to get that user agent and uses it in the classes used for scraping in [this piece of doc](https://python.langchain.com/v0.1/docs/use_cases/web_scraping/). Identifying your scraper is considered a good politeness practice, this PR aims at easing it. Issue: `None` Dependencies: `None` Twitter handle: `None`	2024-06-05 15:20:34 +00:00
Dristy Srivastava	ef3df45d9d	community[minor]: Updating payload for pebblo discover API (#22309 ) Description: Updating response for pebblo discover API. Also updating field name case type Documentation: N/A Unit tests: N/A	2024-06-03 15:36:17 -07:00
Brandon Sharp	56e5aa4dd9	community[patch]: Airtable to allow for addtl params (#22092 ) - Description: Adds optional kwargs to AirtableLoader for Airtable `table.all()`: https://pyairtable.readthedocs.io/en/latest/api.html#pyairtable.Table.all - Issue: N/A - Dependencies: N/A - Twitter handle: parakoopa88 --------- Co-authored-by: Bagatur <baskaryan@gmail.com> Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>	2024-06-03 13:05:56 -07:00
Will Higgins	83d10df78d	community[patch]: Update firecrawl api key name (#22183 ) Change 'FIREWALL' to 'FIRECRAWL' as I believe this may have been in error. Other docs refer to 'FIRECRAWL_API_KEY'. --------- Co-authored-by: Bagatur <baskaryan@gmail.com>	2024-05-27 21:39:29 +00:00
Rahul Triptahi	1a485f59b9	community[patch]: Put authorized identities behind a feature flag in SharepointLoader (#22125 ) Description: Put authorised identities behind a feature flag, load_auth. Documentation: N/A Unit tests: N/A --------- Signed-off-by: Rahul Tripathi <rauhl.psit.ec@gmail.com> Co-authored-by: Rahul Tripathi <rauhl.psit.ec@gmail.com>	2024-05-24 12:42:57 -04:00
Eugene Yurtsev	d3db83abe3	community[major]: lint for usage of xml library (#22132 ) * Lint for usage of standard xml library * Add forced opt-in for quip client * Actual security issue is with underlying QuipClient not LangChain integration (since the client is doing the parsing), but adding enforcement at the LangChain level.	2024-05-24 15:23:53 +00:00
Martin Triska	2df8ac402a	community[minor]: Added propagation of document metadata from O365BaseLoader (#20663 ) Description: - Added propagation of document metadata from O365BaseLoader to FileSystemBlobLoader (O365BaseLoader uses FileSystemBlobLoader under the hood). - This is done by passing dictionary `metadata_dict`: key=filename and value=dictionary containing document's metadata - Modified `FileSystemBlobLoader` to accept the `metadata_dict`, use `mimetype` from it (if available) and pass metadata further into blob loader. Issue: - `O365BaseLoader` under the hood downloads documents to temp folder and then uses `FileSystemBlobLoader` on it. - However metadata about the document in question is lost in this process. In particular: - `mime_type`: `FileSystemBlobLoader` guesses `mime_type` from the file extension, but that does not work 100% of the time. - `web_url`: this is useful to keep around since in RAG LLM we might want to provide link to the source document. In order to work well with document parsers, we pass the `web_url` as `source` (`web_url` is ignored by parsers, `source` is preserved) Dependencies: None Twitter handle: @martintriska1 Please review @baskaryan --------- Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com> Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>	2024-05-23 11:42:19 -04:00
Eugene Yurtsev	e5541d1da7	community[patch]: Update doc-string in CloudBlobLoader (#22069 ) Update doc-string	2024-05-23 15:31:41 +00:00
Philippe PRADOS	6dd621d636	community[minor]: Add CloudBlobLoader that supports loading data from cloud buckets (#21957 ) - Description: Langchain provides several approaches to read different file formats: Specific loaders (`CSVLoader`) or blob-compatible loaders (`FileSystemBlobLoader`). The only implementation proposed for BlobLoader is `FileSystemBlobLoader`. Many projects retrieve files from cloud storage. We propose a new implementation of `BlobLoader` to read files from the three cloud storage systems. The interface is strictly identical to `FileSystemBlobLoader`. The only difference is the constructor, which takes a cloud "url" object such as `s3://my-bucket`, `az://my-bucket`, or `gs://my-bucket`. By streamlining the process, this novel implementation eliminates the requirement to pre-download files from cloud storage to local temporary files (which are seldom removed). The code relies on the [CloudPathLib](https://cloudpathlib.drivendata.org/stable/) library to interpret cloud URLs. This has been added as an optional dependency.
```Python
loader = CloudBlobLoader("s3://mybucket/id")
for blob in loader.yield_blobs():
    print(blob)
```
- Dependencies: CloudPathLib - Twitter handle: pprados - Tests and docs: Add unit test, but it's easy to convert to integration test, with some files in a cloud storage (see `test_cloud_blob_loader.py`) Hello from Paris @hwchase17. Can you review this PR? --------- Co-authored-by: Eugene Yurtsev <eugene@langchain.dev>	2024-05-23 10:59:55 -04:00
Christophe Bornet	74947ec894	community[minor]: Add Cassandra ByteStore (#22064 )	2024-05-23 10:46:23 -04:00
Bagatur	50186da0a1	infra: rm unused # noqa violations (#22049 ) Updating #21137	2024-05-22 15:21:08 -07:00
arpitkumar980	444c2a3d9f	community[patch]: sharepoint loader identity enabled (#21176 ) --------- Co-authored-by: Harrison Chase <hw.chase.17@gmail.com> Co-authored-by: Bagatur <baskaryan@gmail.com> Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>	2024-05-22 22:08:31 +00:00
Mazen Ramadan	3c1d77dd64	community[minor]: Add Scrapfly Loader community integration (#22036 ) Added [Scrapfly](https://scrapfly.io/) Web Loader integration. Scrapfly is a web scraping API that allows extracting web page data into accessible markdown or text datasets. - __Description__: Added Scrapfly web loader for retrieving web page data as markdown or text. - Dependencies: scrapfly-sdk - Twitter: @thealchemi1st --------- Co-authored-by: Bagatur <baskaryan@gmail.com>	2024-05-22 21:29:13 +00:00
maang-h	fc93bed8c4	community: Fix CSVLoader columns is None (#20701 ) - Bug code: In langchain_community/document_loaders/csv_loader.py:100 - Description: Currently, when `CSVLoader` reads a column as None from a CSV file, it raises an error, because it does not verify that the column is of type str and does not handle the corresponding `row_data` when the column is None. This PR provides a solution. - Issue: Fix #20699 - Thinking: 1. Follow the handling used at 'langchain_community/document_loaders/csv_loader.py:100' when 'v' equals None, and apply the same method to 'k'. (With `csv.DictReader`, 'k' will only be None when `len(columns) < len(number_row_data)`.) 2. 'k' equals None only for the last column, and its corresponding 'v' is a list. Therefore, following the data format in 'Document', I used ',' to concatenate the elements of the list. (I'm not sure whether this form is acceptable; if you have other ideas, let me know.) --------- Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>	2024-05-22 12:57:46 -07:00
Sihan Chen	1f81277b9b	community[minor]: allow enabling proxy in aiohttp session in AsyncHTML (#19499 ) Allow enabling proxy in aiohttp session async html	2024-05-22 18:25:06 +00:00
Eugene Yurtsev	36813d2f00	community[patch]: Fix remaining __inits__ in community (#22037 ) Fixes the __init__ files in community to use __all__ which is statically defined.	2024-05-22 17:42:17 +00:00
Kefan You	24b5c27bb1	community[patch]: raise_for_status logic missing in async _fetch of WebBaseLoader (#21948 )
## 'raise_for_status' parameter of WebBaseLoader works in sync load but not in async load.
In WebBaseLoader: Sync load is calling `_scrape` and has `raise_for_status` properly handled.
```
def _scrape(
    self,
    url: str,
    parser: Union[str, None] = None,
    bs_kwargs: Optional[dict] = None,
) -> Any:
    from bs4 import BeautifulSoup

    if parser is None:
        if url.endswith(".xml"):
            parser = "xml"
        else:
            parser = self.default_parser

    self._check_parser(parser)

    html_doc = self.session.get(url, **self.requests_kwargs)
    # The sync path honours the flag here.
    if self.raise_for_status:
        html_doc.raise_for_status()

    if self.encoding is not None:
        html_doc.encoding = self.encoding
    elif self.autoset_encoding:
        html_doc.encoding = html_doc.apparent_encoding
    return BeautifulSoup(html_doc.text, parser, **(bs_kwargs or {}))
```
Async load is calling `_fetch` but missing `raise_for_status` logic.
```
async def _fetch(
    self, url: str, retries: int = 3, cooldown: int = 2, backoff: float = 1.5
) -> str:
    async with aiohttp.ClientSession() as session:
        for i in range(retries):
            try:
                async with session.get(
                    url,
                    headers=self.session.headers,
                    ssl=None if self.session.verify else False,
                    cookies=self.session.cookies.get_dict(),
                ) as response:
                    # No raise_for_status check before returning the body.
                    return await response.text()
            # (excerpt ends here in the original message; the except clause is not shown)
```
Co-authored-by: kefan.you <darkfss@sina.com>	2024-05-21 23:51:03 +00:00
Rahul Triptahi	96bd0b0844	community[patch]: Remove redundant pebblo cloud api call (#21589 ) Description: removed redundant pebblo cloud api call. Changed classified `doc` key to `ai_apps_data`. Documentation: N/A Unit tests: N/A	2024-05-20 17:15:16 -07:00
Jiří Spilka	6499897c87	community[patch]: update apify integration to attribute API activity to langchain (#21909 ) Description: Add `Origin/langchain` to Apify's client's user-agent to attribute API activity to LangChain (at Apify, we aim to monitor our integrations to evaluate whether we should invest more in the LangChain integration regarding functionality and content) Issue: None Dependencies: None Twitter handle: None	2024-05-20 14:49:23 -07:00
WilliamEspegren	30bca57aae	doc list not empty (#21208 ) Make sure the doc list is not empty, and set Metadata: true in param, to enable the user to disable metadata for slightly faster crawls.	2024-05-20 08:24:06 -07:00
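
The nested-field handling described in the MongodbLoader fix (#22801) above can be sketched roughly as follows. This is a minimal illustration in plain Python under stated assumptions, not the loader's actual code: `get_nested`, `build_projection`, and `combine_fields` are hypothetical helper names, the sample document is invented, joining values with a space is an assumption, and a simple loop stands in for the recursive traversal the commit mentions.

```python
from typing import Any, Dict, List, Optional


def get_nested(doc: Dict[str, Any], dotted_name: str) -> Optional[Any]:
    """Follow a dotted name such as "data.job.detail.id" through nested dicts.

    Returns None instead of raising KeyError when any level is missing.
    """
    current: Any = doc
    for key in dotted_name.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def build_projection(field_names: List[str]) -> Dict[str, int]:
    """Build a MongoDB-style projection that includes each requested field."""
    return {name: 1 for name in field_names}


def combine_fields(doc: Dict[str, Any], field_names: List[str]) -> str:
    """Concatenate the values that exist into one string (hypothetical format)."""
    values = [get_nested(doc, name) for name in field_names]
    return " ".join(str(value) for value in values if value is not None)


# Invented sample document; "intro" is deliberately missing.
sample = {"data": {"job": {"detail": {"id": 7, "position": "Engineer"}}}}
fields = [
    "data.job.detail.id",
    "data.job.detail.position",
    "data.job.detail.intro",
]
page_content = combine_fields(sample, fields)
projection = build_projection(fields)
```

Missing nested keys are skipped rather than raising, which mirrors the `KeyError` the commit reports fixing; the real loader may format and order the combined text differently.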

259 Commits