langchain

mirror of https://github.com/hwchase17/langchain.git synced 2025-08-15 15:46:47 +00:00

Author	SHA1	Message	Date
Philippe PRADOS	2921597c71	community[patch]: Refactoring PDF loaders: 01 prepare (#29062 ) - Refactoring PDF loaders step 1: "community: Refactoring PDF loaders to standardize approaches" - Description: Declare CloudBlobLoader in __init__.py. file_path is Union[str, PurePath] anywhere - Twitter handle: pprados This is one part of a larger Pull Request (PR) that is too large to be submitted all at once. This specific part focuses to prepare the update of all parsers. For more details, see [PR 28970](https://github.com/langchain-ai/langchain/pull/28970). @eyurtsev it's the start of a PR series.	2025-01-07 11:00:04 -05:00
Mohammad Mohtashim	49a26c1fca	(Community): Fix Keyword argument for `AzureAIDocumentIntelligenceParser` (#28959 ) - Description: Fix the `body` keyword argument for AzureAIDocumentIntelligenceParser` - Issue: #28948	2025-01-02 11:27:12 -05:00
Darien Schettler	32917a0b98	Update dataframe.py (#28871 ) community: optimize DataFrame document loader Description: Simplify the `lazy_load` method in the DataFrame document loader by combining text extraction and metadata cleanup into a single operation. This makes the code more concise while maintaining the same functionality. Issue: N/A Dependencies: None Twitter handle: N/A	2024-12-22 19:16:16 -05:00
yeounhak	f38fc89f35	community: Corrected aload func to be asynchronous from webBaseLoader (#28337 ) - Description: The aload function, contrary to its name, is not an asynchronous function, so it cannot work concurrently with other asynchronous functions. - Issue: #28336 - Test: : Done - Docs: [here](`e0a95e5646/docs/docs/integrations/document_loaders/web_base.ipynb (L201)`) - Lint: All checks passed If no one reviews your PR within a few days, please @-mention one of baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17. --------- Co-authored-by: Chester Curme <chester.curme@gmail.com>	2024-12-20 14:42:52 -05:00
Anusha Karkhanis	26bdf40072	Langchain_Community: SQL LanguageParser (#28430 ) ## Description (This PR has contributions from @khushiDesai, @ashvini8, and @ssumaiyaahmed). This PR addresses Issue #11229 which addresses the need for SQL support in document parsing. This is integrated into the generic TreeSitter parsing library, allowing LangChain users to easily load codebases in SQL into smaller, manageable "documents." This pull request adds a new ```SQLSegmenter``` class, which provides the SQL integration. ## Issue Issue #11229: Add support for a variety of languages to LanguageParser ## Testing We created a file ```test_sql.py``` with several tests to ensure the ```SQLSegmenter``` is functional. Below are the tests we added: - ```def test_is_valid```: Checks SQL validity. - ```def test_extract_functions_classes```: Extracts individual SQL statements. - ```def test_simplify_code```: Simplifies SQL code with comments. --------- Co-authored-by: Syeda Sumaiya Ahmed <114104419+ssumaiyaahmed@users.noreply.github.com> Co-authored-by: ashvini hunagund <97271381+ashvini8@users.noreply.github.com> Co-authored-by: Khushi Desai <khushi.desai@advantawitty.com> Co-authored-by: Khushi Desai <59741309+khushiDesai@users.noreply.github.com> Co-authored-by: ccurme <chester.curme@gmail.com>	2024-12-19 20:30:57 +00:00
Martin Triska	e6b41d081d	community: DocumentLoaderAsParser wrapper (#27749 ) ## Description This pull request introduces the `DocumentLoaderAsParser` class, which acts as an adapter to transform document loaders into parsers within the LangChain framework. The class enables document loaders that accept a `file_path` parameter to be utilized as blob parsers. This is particularly useful for integrating various document loading capabilities seamlessly into the LangChain ecosystem. When merged in together with PR https://github.com/langchain-ai/langchain/pull/27716 It opens options for `SharePointLoader` / `OneDriveLoader` to process any filetype that has a document loader. ### Features - Flexible Parsing: The `DocumentLoaderAsParser` class can adapt any document loader that meets the criteria of accepting a `file_path` argument, allowing for lazy parsing of documents. - Compatibility: The class has been designed to work with various document loaders, making it versatile for different use cases. ### Usage Example To use the `DocumentLoaderAsParser`, you would initialize it with a suitable document loader class and any required parameters. Here’s an example of how to do this with the `UnstructuredExcelLoader`: ```python from langchain_community.document_loaders.blob_loaders import Blob from langchain_community.document_loaders.parsers.documentloader_adapter import DocumentLoaderAsParser from langchain_community.document_loaders.excel import UnstructuredExcelLoader # Initialize the parser adapter with UnstructuredExcelLoader xlsx_parser = DocumentLoaderAsParser(UnstructuredExcelLoader, mode="paged") # Use parser, for ex. pass it to MimeTypeBasedParser MimeTypeBasedParser( handlers={ "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": xlsx_parser } ) ``` - Dependencies: None - Twitter handle: @martintriska1 If no one reviews your PR within a few days, please @-mention one of baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17. --------- Co-authored-by: Chester Curme <chester.curme@gmail.com>	2024-12-18 12:47:08 -05:00
Mohammad Mohtashim	d49df4871d	[Community]: Image Extraction Fixed for `PDFPlumberParser` (#28491 ) - Description: One-Bit Images was raising error which has been fixed in this PR for `PDFPlumberParser` - Issue: #28480 --------- Co-authored-by: Chester Curme <chester.curme@gmail.com>	2024-12-18 11:45:48 -05:00
Dong Shin	0b1359801e	community: add trust_env at web_base_loader (#28514 ) - Description: I am working to address a similar issue to the one mentioned in https://github.com/langchain-ai/langchain/pull/19499. Specifically, there is a problem with the Webbase loader used in open-webui, where it fails to load the proxy configuration. This PR aims to resolve that issue. <!--If no one reviews your PR within a few days, please @-mention one of baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17.--> --------- Co-authored-by: Chester Curme <chester.curme@gmail.com>	2024-12-17 21:18:16 -05:00
Frank Dai	e81433497b	community: support Confluence cookies (#28760 ) Description: Some confluence instances don't support personal access token, then cookie is a convenient way to authenticate. This PR adds support for Confluence cookies. Twitter handle: soulmachine	2024-12-17 12:16:36 -05:00
Hiros	8f5e72de05	community: Correctly handle multi-element rich text (#25762 ) Description: - Add _concatenate_rich_text method to combine all elements in rich text arrays - Update load_page method to use _concatenate_rich_text for rich text properties - Ensure all text content is captured, including inline code and formatted text - Add unit tests to verify correct handling of multi-element rich text This fix prevents truncation of content after backticks or other formatting elements. Issue: Using Notion DB Loader, the text for `richtext` and `title` is truncated after 1st element was loaded as Notion Loader only read the first element. Dependencies: any dependencies required for this change None. --------- Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com> Co-authored-by: Erick Friis <erick@langchain.dev>	2024-12-16 20:20:27 +00:00
Vimpas	337fed80a5	community: 🐛 PDF Filter Type Error (#27154 ) Thank you for contributing to LangChain! PR title: "community: fix PDF Filter Type Error" - Description: fix PDF Filter Type Error" - Issue: the issue #27153 it fixes, - Dependencies: no - Twitter handle: if your PR gets announced, and you'd like a mention, we'll gladly shout you out! - [x] Lint and test: Run `make format`, `make lint` and `make test` from the root of the package(s) you've modified. See contribution guidelines for more: https://python.langchain.com/docs/contributing/ Additional guidelines: - Make sure optional dependencies are imported within a function. - Please do not add dependencies to pyproject.toml files (even optional ones) unless they are required for unit tests. - Most PRs should not touch more than one package. - Changes should be backwards compatible. - If you are adding something to community, do not re-import it in langchain. If no one reviews your PR within a few days, please @-mention one of baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17. --------- Co-authored-by: Erick Friis <erick@langchain.dev>	2024-12-13 23:30:29 +00:00
Philippe PRADOS	f3fb5a9c68	community[minor]: Fix json._validate_metadata_func() (#22842 ) JSONparse, in _validate_metadata_func(), checks the consistency of the _metadata_func() function. To do this, it invokes it and makes sure it receives a dictionary in response. However, during the call, it does not respect future calls, as shown on line 100. This generates errors if, for example, the function is like this: ```python def generate_metadata(json_node:Dict[str,Any],kwargs:Dict[str,Any]) -> Dict[str,Any]: return { "source": url, "row": kwargs['seq_num'], "question":json_node.get("question"), } loader = JSONLoader( file_path=file_path, content_key="answer", jq_schema='.[]', metadata_func=generate_metadata, text_content=False) ``` To avoid this, the verification must comply with the specifications. This patch does just that. --------- Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>	2024-12-13 21:24:20 +00:00
Martin Triska	05ebe1e66b	Community: add `modified_since` argument to `O365BaseLoader` (#28708 ) ## What are we doing in this PR We're adding `modified_since` optional argument to `O365BaseLoader`. When set, O365 loader will only load documents newer than `modified_since` datetime. ## Why? OneDrives / Sharepoints can contain large number of documents. Current approach is to download and parse all files and let indexer to deal with duplicates. This can be prohibitively time-consuming. Especially when using OCR-based parser like [zerox](`fa06188834/libs/community/langchain_community/document_loaders/pdf.py (L948)`). This argument allows to skip documents that are older than known time of indexing. _Q: What if a file was modfied during last indexing process? A: Users can set the `modified_since` conservatively and indexer will still take care of duplicates._ If no one reviews your PR within a few days, please @-mention one of baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17. --------- Co-authored-by: Erick Friis <erick@langchain.dev>	2024-12-13 17:30:17 +00:00
Botong Zhu	13c3c4a210	community: fixes json loader not getting texts with json standard (#27327 ) This PR fixes JSONLoader._get_text not converting objects to json string correctly. If an object is serializable and is not a dict, JSONLoader will use python built-in str() method to convert it to string. This may cause object converted to strings not following json standard. For example, a list will be converted to string with single quotes, and if json.loads try to load this string, it will cause error. --------- Co-authored-by: Erick Friis <erick@langchain.dev>	2024-12-12 19:33:45 +00:00
Mohammad Mohtashim	fa155a422f	[Community]: `requests_kwargs` not being used in _fetch (#28646 ) - Description: `requests_kwargs` is not being passed to `_fetch` which is fetching pages asynchronously. In this PR, making sure that we are passing `requests_kwargs` to `_fetch` just like `_scrape`. - Issue: #28634 --------- Co-authored-by: Erick Friis <erick@langchain.dev>	2024-12-11 23:46:54 +00:00
Johannes Mohren	c1d348e95d	doc-loader: retain Azure Doc Intelligence API metadata in Document parser (#28382 ) Description: This PR modifies the doc_intelligence.py parser in the community package to include all metadata returned by the Azure Doc Intelligence API in the Document object. Previously, only the parsed content (markdown) was retained, while other important metadata such as bounding boxes (bboxes) for images and tables was discarded. These image bboxes are crucial for supporting use cases like multi-modal RAG workflows when using Azure Doc Intelligence. The change ensures that all information returned by the Azure Doc Intelligence API is preserved by setting the metadata attribute of the Document object to the entire result returned by the API, rather than an empty dictionary. This extends the parser's utility for complex use cases without breaking existing functionality. Issue: This change does not address a specific issue number, but it resolves a critical limitation in supporting multimodal workflows when using the LangChain wrapper for the Azure API. Dependencies: No additional dependencies are required for this change. --------- Co-authored-by: jmohren <johannes.mohren@aol.de>	2024-12-10 11:22:58 -05:00
Alex Tonkonozhenko	0d20c314dd	Confluence Loader: Fix CQL loading (#27620 ) fix #12082 <!--- If no one reviews your PR within a few days, please @-mention one of baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17. -->	2024-12-10 11:05:23 -05:00
Martin Triska	75bc6bb191	community: [bugfix] fix source path for office files in O365 (#28260 ) # What problem are we fixing? Currently documents loaded using `O365BaseLoader` fetch source from `file.web_url` (where `file` is `<class 'O365.drive.File'>`). This works well for `.pdf` documents. Unfortunately office documents (`.xlsx`, `.docx` ...) pass their `web_url` in following format: `https://sharepoint_address/sites/path/to/library/root/Doc.aspx?sourcedoc=%XXXXXXXX-1111-1111-XXXX-XXXXXXXXXX%7D&file=filename.xlsx&action=default&mobileredirect=true` This obfuscates the path to the file. This PR utilizes the parrent folder's path and file name to reconstruct the actual location of the file. Knowing the file's location can be crucial for some RAG applications (path to the file can carry information we don't want to loose). @vbarda Could you please look at this one? I'm @-mentioning you since we've already closed some PRs together :-) Co-authored-by: Erick Friis <erick@langchain.dev>	2024-12-09 12:34:59 -08:00
Naka Masato	ce3b69aa05	community: add include_labels option to ConfluenceLoader (#28259 ) ## Description: Enable `ConfluenceLoader` to include labels with `include_labels` option (`false` by default for backward compatibility). and the labels are set to `metadata` in the `Document`. e.g. `{"labels": ["l1", "l2"]}` ## Notes Confluence API supports to get labels by providing `metadata.labels` to `expand` query parameter All of the following functions support `expand` in the same way: - confluence.get_page_by_id - confluence.get_all_pages_by_label - confluence.get_all_pages_from_space - cql (internally using [/api/content/search](https://developer.atlassian.com/cloud/confluence/rest/v1/api-group-content/#api-wiki-rest-api-content-search-get)) ## Issue: No issue related to this PR. ## Dependencies: No changes. ## Twitter handle: [@gymnstcs](https://x.com/gymnstcs) - [x] Add tests and docs: If you're adding a new integration, please include 1. a test for the integration, preferably unit tests that do not rely on network access, 2. an example notebook showing its use. It lives in `docs/docs/integrations` directory. - [x] Lint and test: Run `make format`, `make lint` and `make test` from the root of the package(s) you've modified. See contribution guidelines for more: https://python.langchain.com/docs/contributing/ --------- Co-authored-by: Erick Friis <erick@langchain.dev>	2024-12-09 19:35:01 +00:00
Jan Heimes	ef365543cb	community: add Needle retriever and document loader integration (#28157 ) - [x] PR title: "community: add Needle retriever and document loader integration" - Where "package" is whichever of langchain, community, core, etc. is being modified. Use "docs: ..." for purely docs changes, "infra: ..." for CI changes. - Example: "community: add foobar LLM" - [x] PR message: *Delete this entire checklist* and replace with - Description: This PR adds a new integration for Needle, which includes: - NeedleRetriever: A retriever for fetching documents from Needle collections. - NeedleLoader: A document loader for managing and loading documents into Needle collections. - Example notebooks demonstrating usage have been added in: - `docs/docs/integrations/retrievers/needle.ipynb` - `docs/docs/integrations/document_loaders/needle.ipynb`. - Dependencies: The `needle-python` package is required as an external dependency for accessing Needle's API. It has been added to the extended testing dependencies list. - Twitter handle: Feel free to mention me if this PR gets announced: [needlexai](https://x.com/NeedlexAI). - [x] Add tests and docs: If you're adding a new integration, please include 1. Unit tests have been added for both `NeedleRetriever` and `NeedleLoader` in `libs/community/tests/unit_tests`. These tests mock API calls to avoid relying on network access. 2. Example notebooks have been added to `docs/docs/integrations/`, showcasing both retriever and loader functionality. - [x] Lint and test: Run `make format`, `make lint`, and `make test` from the root of the package(s) you've modified. See contribution guidelines for more: https://python.langchain.com/docs/contributing/ - `make format`: Passed - `make lint`: Passed - `make test`: Passed (requires `needle-python` to be installed locally; this package is not added to LangChain dependencies). Additional guidelines: - [x] Optional dependencies are imported only within functions. - [x] No dependencies have been added to pyproject.toml files except for those required for unit tests. - [x] The PR does not touch more than one package. - [x] Changes are fully backwards compatible. - [x] Community additions are not re-imported into LangChain core. If no one reviews your PR within a few days, please @-mention one of baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17. --------- Co-authored-by: Erick Friis <erick@langchain.dev>	2024-12-03 22:06:25 +00:00
Aksel Joonas Reedi	2cb39270ec	community: bytes as a source to `AzureAIDocumentIntelligenceLoader` (#26618 ) - Description: This PR adds functionality to pass in in-memory bytes as a source to `AzureAIDocumentIntelligenceLoader`. - Issue: I needed the functionality, so I added it. - Dependencies: NA - Twitter handle: @akseljoonas if this is a big enough change :) --------- Co-authored-by: Aksel Joonas Reedi <aksel@klippa.com> Co-authored-by: Erick Friis <erick@langchain.dev>	2024-11-07 03:40:21 +00:00
Martin Triska	7a9149f5dd	community: ZeroxPDFLoader (#27800 ) # OCR-based PDF loader This implements [Zerox](https://github.com/getomni-ai/zerox) PDF document loader. Zerox utilizes simple but very powerful (even though slower and more costly) approach to parsing PDF documents: it converts PDF to series of images and passes it to a vision model requesting the contents in markdown. It is especially suitable for complex PDFs that are not parsed well by other alternatives. ## Example use: ```python from langchain_community.document_loaders.pdf import ZeroxPDFLoader os.environ["OPENAI_API_KEY"] = "" ## your-api-key model = "gpt-4o-mini" ## openai model pdf_url = "https://assets.ctfassets.net/f1df9zr7wr1a/soP1fjvG1Wu66HJhu3FBS/034d6ca48edb119ae77dec5ce01a8612/OpenAI_Sacra_Teardown.pdf" loader = ZeroxPDFLoader(file_path=pdf_url, model=model) docs = loader.load() ``` The Zerox library supports wide range of provides/models. See Zerox documentation for details. - Dependencies: `zerox` - Twitter handle: @martintriska1 If no one reviews your PR within a few days, please @-mention one of baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17. --------- Co-authored-by: Erick Friis <erickfriis@gmail.com>	2024-11-07 03:14:57 +00:00
Martin Triska	90189f5639	community: Allow other than default parsers in SharePointLoader and OneDriveLoader (#27716 ) ## What this PR does? ### Currently `O365BaseLoader` (and consequently both derived loaders) are limited to `pdf`, `doc`, `docx` files. - Solution: here we introduce _handlers_ attribute that allows for custom handlers to be passed in. This is done in _dict_ form: Example: ```python from langchain_community.document_loaders.parsers.documentloader_adapter import DocumentLoaderAsParser # PR for DocumentLoaderAsParser here: https://github.com/langchain-ai/langchain/pull/27749 from langchain_community.document_loaders.excel import UnstructuredExcelLoader xlsx_parser = DocumentLoaderAsParser(UnstructuredExcelLoader, mode="paged") # create dictionary mapping file types to handlers (parsers) handlers = { "doc": MsWordParser() "pdf": PDFMinerParser() "txt": TextParser() "xlsx": xlsx_parser } loader = SharePointLoader(document_library_id="...", handlers=handlers # pass handlers to SharePointLoader ) documents = loader.load() # works the same in OneDriveLoader loader = OneDriveLoader(document_library_id="...", handlers=handlers ) ``` This dictionary is then passed to `MimeTypeBasedParser` same as in the [current implementation](`5a2cfb49e0/libs/community/langchain_community/document_loaders/parsers/registry.py (L13)`). ### Currently `SharePointLoader` and `OneDriveLoader` are separate loaders that both inherit from `O365BaseLoader` However both of these implement the same functionality. The only differences are: - `SharePointLoader` requires argument `document_library_id` whereas `OneDriveLoader` requires `drive_id`. These are just different names for the same thing. - `SharePointLoader` implements significantly more features. - Solution: `OneDriveLoader` is replaced with an empty shell just renaming `drive_id` to `document_library_id` and inheriting from `SharePointLoader` Dependencies: None Twitter handle: @martintriska1 If no one reviews your PR within a few days, please @-mention one of baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17.	2024-11-06 17:44:34 -05:00
Ankan Mahapatra	905f43377b	Update word_document.py \| Fixed metadata["source"] for web paths (#27220 ) The metadata["source"] value for the web paths was being set to temporary path (/tmp). Fixed it by creating a new variable self.original_file_path, which will store the original path. Thank you for contributing to LangChain! - [ ] PR title: "package: description" - Where "package" is whichever of langchain, community, core, etc. is being modified. Use "docs: ..." for purely docs changes, "templates: ..." for template changes, "infra: ..." for CI changes. - Example: "community: add foobar LLM" - [ ] PR message: *Delete this entire checklist* and replace with - Description: a description of the change - Issue: the issue # it fixes, if applicable - Dependencies: any dependencies required for this change - Twitter handle: if your PR gets announced, and you'd like a mention, we'll gladly shout you out! - [ ] Add tests and docs: If you're adding a new integration, please include 1. a test for the integration, preferably unit tests that do not rely on network access, 2. an example notebook showing its use. It lives in `docs/docs/integrations` directory. - [ ] Lint and test: Run `make format`, `make lint` and `make test` from the root of the package(s) you've modified. See contribution guidelines for more: https://python.langchain.com/docs/contributing/ Additional guidelines: - Make sure optional dependencies are imported within a function. - Please do not add dependencies to pyproject.toml files (even optional ones) unless they are required for unit tests. - Most PRs should not touch more than one package. - Changes should be backwards compatible. - If you are adding something to community, do not re-import it in langchain. If no one reviews your PR within a few days, please @-mention one of baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17.	2024-10-31 18:37:41 +00:00
ccurme	0172d938b4	community: add AzureOpenAIWhisperParser (#27796 ) Commandeered from https://github.com/langchain-ai/langchain/pull/26757. --------- Co-authored-by: Sheepsta300 <128811766+Sheepsta300@users.noreply.github.com>	2024-10-31 12:37:41 -04:00
Erick Friis	600b7bdd61	all: test 3.13 ci (#27197 ) Co-authored-by: Bagatur <baskaryan@gmail.com> Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>	2024-10-25 12:56:58 -07:00
ZhangShenao	455ab7d714	Improvement[Community] Improve Document Loaders and Splitters (#27568 ) - Fix word spelling error - Add static method decorator - Fix language splitter Co-authored-by: Erick Friis <erick@langchain.dev>	2024-10-24 21:42:16 +00:00
orkhank	9a277cbe00	community: Update `file_path` type in `JSONLoader.__init__()` signature (#27535 ) - Description: Change the type of the `file_path` argument from `str \| pathlib.Path` to `str \| os.PathLike`, since the latter is more widely used: https://stackoverflow.com/a/58541858 This is a very minor fix. I was just annoyed to see the red underline displayed by Pylance in VS Code: `reportArgumentType`. ![image](https://github.com/user-attachments/assets/719a7f8e-acca-4dfa-89df-925e1d938c71) The changes do not affect the behavior of the code.	2024-10-22 11:18:36 -07:00
sByteman	31e7664afd	community[minor]: add proxy support to RecursiveUrlLoader (#27364 ) Description This PR introduces the proxies parameter to the RecursiveUrlLoader class, allowing the user to specify proxy servers for requests. This update enables crawling through proxy servers, providing enhanced flexibility for network configurations. The key changes include: 1.Added an optional proxies parameter to the constructor (__init__). 2.Updated the documentation to explain the proxies parameter usage with an example. 3.Modified the _get_child_links_recursive method to pass the proxies parameter to the requests.get function. Sample Usage ```python from bs4 import BeautifulSoup as Soup from langchain_community.document_loaders.recursive_url_loader import RecursiveUrlLoader proxies = { "http": "http://localhost:1080", "https": "http://localhost:1080", } url = "https://python.langchain.com/docs/concepts/#langchain-expression-language-lcel" loader = RecursiveUrlLoader( url=url, max_depth=1, extractor=lambda x: Soup(x, "html.parser").text,proxies=proxies ) docs = loader.load() ``` --------- Co-authored-by: root <root@thb>	2024-10-16 16:29:59 +00:00
xsai9101	15c1ddaf99	community: Add support for clob datatype in oracle database (#27330 ) Description: This PR add support of clob/blob data type for oracle document loader, clob/blob can only be read by oracledb package when connection is open, so reformat code to process data before connection closes. Dependencies: oracledb package same as before. pip install oracledb Co-authored-by: Erick Friis <erick@langchain.dev>	2024-10-16 02:19:20 +00:00
Rafael Miller	fc14f675f1	Community: Updated Firecrawl Document Loader to v1 (#26548 ) This PR updates the Firecrawl Document Loader to use the recently released V1 API of Firecrawl. Key Updates: Firecrawl V1 Integration: Updated the document loader to leverage the new Firecrawl V1 API for improved performance, reliability, and developer experience. Map Functionality Added: Introduced the map mode for more flexible document loading options. These updates enhance the integration and provide access to the latest features of Firecrawl. --------- Co-authored-by: Erick Friis <erick@langchain.dev> Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>	2024-10-15 13:13:28 +00:00
Martin Triska	8dc4bec947	[community] [Bugfix] base_o365 document loader metadata needs to be JSON serializable (#26322 ) In order for indexer to work, all metadata in the documents need to be JSON serializable. Timestamps are not. See here: https://github.com/langchain-ai/langchain/blob/master/libs/core/langchain_core/indexing/api.py#L83-L89 @eyurtsev could you please review? It's a tiny PR :-)	2024-10-14 12:48:31 -04:00
Erick Friis	7ecd720120	multiple: update docs urls to latest 2 (#26837 )	2024-09-30 17:37:07 -07:00
Rajendra Kadam	51c4393298	community[patch]: Fix validation error in SettingsConfigDict across multiple Langchain modules (#26852 ) - Description: This pull request addresses the validation error in `SettingsConfigDict` due to extra fields in the `.env` file. The issue is prevalent across multiple Langchain modules. This fix ensures that extra fields in the `.env` file are ignored, preventing validation errors. Changes include: - Applied fixes to modules using `SettingsConfigDict`. - Issue: NA, similar https://github.com/langchain-ai/langchain/issues/26850 - Dependencies: NA	2024-09-25 10:02:14 -04:00
Rajendra Kadam	7e5a9c317f	community[minor]: [Pebblo] Enhance PebbloSafeLoader to take anonymize flag (#26812 ) - Description: The flag is named `anonymize_snippets`. When set to true, the Pebblo server will anonymize snippets by redacting all personally identifiable information (PII) from the snippets going into VectorDB and the generated reports - Issue: NA - Dependencies: NA - docs: Updated	2024-09-25 09:33:06 -04:00
Rajendra Kadam	92003b3724	community[patch]: [SharePointLoader] Fix validation error in _O365Settings due to extra fields in .env file (#26851 ) Description: Fix validation error in _O365Settings by ignoring extra fields in .env file Issue: https://github.com/langchain-ai/langchain/issues/26850 Dependencies: NA	2024-09-25 09:31:59 -04:00
Rajendra Kadam	60dc19da30	[community] Added PebbloTextLoader for loading text data in PebbloSafeLoader (#26582 ) - Description: Added PebbloTextLoader for loading text in PebbloSafeLoader. - Since PebbloSafeLoader wraps document loaders, this new loader enables direct loading of text into Documents using PebbloSafeLoader. - Issue: NA - Dependencies: NA - [x] Tests: Added/Updated tests	2024-09-19 09:59:04 -04:00
Erik	4e0a6ebe7d	community: Add warning when page_content is empty (#25955 ) Page content sometimes is empty when PyMuPDF can not find text on pages. For example, this can happen when the text of the PDF is not copyable "by hand". Then an OCR solution is need - which is not integrated here. This warning should accurately warn the user that some pages are lost during this process. Thank you for contributing to LangChain! - [ ] PR title: "package: description" - Where "package" is whichever of langchain, community, core, experimental, etc. is being modified. Use "docs: ..." for purely docs changes, "templates: ..." for template changes, "infra: ..." for CI changes. - Example: "community: add foobar LLM" - [ ] PR message: *Delete this entire checklist* and replace with - Description: a description of the change - Issue: the issue # it fixes, if applicable - Dependencies: any dependencies required for this change - Twitter handle: if your PR gets announced, and you'd like a mention, we'll gladly shout you out! - [ ] Add tests and docs: If you're adding a new integration, please include 1. a test for the integration, preferably unit tests that do not rely on network access, 2. an example notebook showing its use. It lives in `docs/docs/integrations` directory. - [ ] Lint and test: Run `make format`, `make lint` and `make test` from the root of the package(s) you've modified. See contribution guidelines for more: https://python.langchain.com/docs/contributing/ Additional guidelines: - Make sure optional dependencies are imported within a function. - Please do not add dependencies to pyproject.toml files (even optional ones) unless they are required for unit tests. - Most PRs should not touch more than one package. - Changes should be backwards compatible. - If you are adding something to community, do not re-import it in langchain. If no one reviews your PR within a few days, please @-mention one of baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17. --------- Co-authored-by: Erick Friis <erick@langchain.dev>	2024-09-19 05:22:09 +00:00
Tibor Reiss	85caaa773f	docs[community]: Fix raw string in docstring (#26350 ) Fixes #26212: replaced the raw string with backslashes. Alternative: raw-stringif the full docstring. --------- Co-authored-by: Erick Friis <erickfriis@gmail.com>	2024-09-19 04:18:56 +00:00
RUO	0a177ec2cc	community: Enhance MongoDBLoader with flexible metadata and optimized field extraction (#23376 ) ### Description: This pull request significantly enhances the MongodbLoader class in the LangChain community package by adding robust metadata customization and improved field extraction capabilities. The updated class now allows users to specify additional metadata fields through the metadata_names parameter, enabling the extraction of both top-level and deeply nested document attributes as metadata. This flexibility is crucial for users who need to include detailed contextual information without altering the database schema. Moreover, the include_db_collection_in_metadata flag offers optional inclusion of database and collection names in the metadata, allowing for even greater customization depending on the user's needs. The loader's field extraction logic has been refined to handle missing or nested fields more gracefully. It now employs a safe access mechanism that avoids the KeyError previously encountered when a specified nested field was absent in a document. This update ensures that the loader can handle diverse and complex data structures without failure, making it more resilient and user-friendly. ### Issue: This pull request addresses a critical issue where the MongodbLoader class in the LangChain community package could throw a KeyError when attempting to access nested fields that may not exist in some documents. The previous implementation did not handle the absence of specified nested fields gracefully, leading to runtime errors and interruptions in data processing workflows. This enhancement ensures robust error handling by safely accessing nested document fields, using default values for missing data, thus preventing KeyError and ensuring smoother operation across various data structures in MongoDB. This improvement is crucial for users working with diverse and complex data sets, ensuring the loader can adapt to documents with varying structures without failing. ### Dependencies: Requires motor for asynchronous MongoDB interaction. ### Twitter handle: N/A ### Add tests and docs Tests: Unit tests have been added to verify that the metadata inclusion toggle works as expected and that the field extraction correctly handles nested fields. Docs: An example notebook demonstrating the use of the enhanced MongodbLoader is included in the docs/docs/integrations directory. This notebook includes setup instructions, example usage, and outputs. (Here is the notebook link : [colab link](https://colab.research.google.com/drive/1tp7nyUnzZa3dxEFF4Kc3KS7ACuNF6jzH?usp=sharing)) Lint and test Before submitting, I ran make format, make lint, and make test as per the contribution guidelines. All tests pass, and the code style adheres to the LangChain standards. ```python import unittest from unittest.mock import patch, MagicMock import asyncio from langchain_community.document_loaders.mongodb import MongodbLoader class TestMongodbLoader(unittest.TestCase): def setUp(self): """Setup the MongodbLoader test environment by mocking the motor client and database collection interactions.""" # Mocking the AsyncIOMotorClient self.mock_client = MagicMock() self.mock_db = MagicMock() self.mock_collection = MagicMock() self.mock_client.get_database.return_value = self.mock_db self.mock_db.get_collection.return_value = self.mock_collection # Initialize the MongodbLoader with test data self.loader = MongodbLoader( connection_string="mongodb://localhost:27017", db_name="testdb", collection_name="testcol" ) @patch('langchain_community.document_loaders.mongodb.AsyncIOMotorClient', return_value=MagicMock()) def test_constructor(self, mock_motor_client): """Test if the constructor properly initializes with the correct database and collection names.""" loader = MongodbLoader( connection_string="mongodb://localhost:27017", db_name="testdb", collection_name="testcol" ) self.assertEqual(loader.db_name, "testdb") self.assertEqual(loader.collection_name, "testcol") def test_aload(self): """Test the aload method to ensure it correctly queries and processes documents.""" # Setup mock data and responses for the database operations self.mock_collection.count_documents.return_value = asyncio.Future() self.mock_collection.count_documents.return_value.set_result(1) self.mock_collection.find.return_value = [ {"_id": "1", "content": "Test document content"} ] # Run the aload method and check responses loop = asyncio.get_event_loop() results = loop.run_until_complete(self.loader.aload()) self.assertEqual(len(results), 1) self.assertEqual(results[0].page_content, "Test document content") def test_construct_projection(self): """Verify that the projection dictionary is constructed correctly based on field names.""" self.loader.field_names = ['content', 'author'] self.loader.metadata_names = ['timestamp'] expected_projection = {'content': 1, 'author': 1, 'timestamp': 1} projection = self.loader._construct_projection() self.assertEqual(projection, expected_projection) if __name__ == '__main__': unittest.main() ``` ### Additional Example for Documentation Sample Data: ```json [ { "_id": "1", "title": "Artificial Intelligence in Medicine", "content": "AI is transforming the medical industry by providing personalized medicine solutions.", "author": { "name": "John Doe", "email": "john.doe@example.com" }, "tags": ["AI", "Healthcare", "Innovation"] }, { "_id": "2", "title": "Data Science in Sports", "content": "Data science provides insights into player performance and strategic planning in sports.", "author": { "name": "Jane Smith", "email": "jane.smith@example.com" }, "tags": ["Data Science", "Sports", "Analytics"] } ] ``` Example Code: ```python loader = MongodbLoader( connection_string="mongodb://localhost:27017", db_name="example_db", collection_name="articles", filter_criteria={"tags": "AI"}, field_names=["title", "content"], metadata_names=["author.name", "author.email"], include_db_collection_in_metadata=True ) documents = loader.load() for doc in documents: print("Page Content:", doc.page_content) print("Metadata:", doc.metadata) ``` Expected Output: ``` Page Content: Artificial Intelligence in Medicine AI is transforming the medical industry by providing personalized medicine solutions. Metadata: {'author_name': 'John Doe', 'author_email': 'john.doe@example.com', 'database': 'example_db', 'collection': 'articles'} ``` Thank you. --- Additional guidelines: - Make sure optional dependencies are imported within a function. - Please do not add dependencies to pyproject.toml files (even optional ones) unless they are required for unit tests. - Most PRs should not touch more than one package. - Changes should be backwards compatible. - If you are adding something to community, do not re-import it in langchain. If no one reviews your PR within a few days, please @-mention one of baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17. --------- Co-authored-by: ccurme <chester.curme@gmail.com>	2024-09-17 10:23:17 -04:00
Erick Friis	c2a3021bb0	multiple: pydantic 2 compatibility, v0.3 (#26443 ) Signed-off-by: ChengZi <chen.zhang@zilliz.com> Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com> Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com> Co-authored-by: Dan O'Donovan <dan.odonovan@gmail.com> Co-authored-by: Tom Daniel Grande <tomdgrande@gmail.com> Co-authored-by: Grande <Tom.Daniel.Grande@statsbygg.no> Co-authored-by: Bagatur <baskaryan@gmail.com> Co-authored-by: ccurme <chester.curme@gmail.com> Co-authored-by: Harrison Chase <hw.chase.17@gmail.com> Co-authored-by: Tomaz Bratanic <bratanic.tomaz@gmail.com> Co-authored-by: ZhangShenao <15201440436@163.com> Co-authored-by: Friso H. Kingma <fhkingma@gmail.com> Co-authored-by: ChengZi <chen.zhang@zilliz.com> Co-authored-by: Nuno Campos <nuno@langchain.dev> Co-authored-by: Morgante Pell <morgantep@google.com>	2024-09-13 14:38:45 -07:00
Alexander KIRILOV	6a8f8a56ac	community[patch]: added content_columns option to CSVLoader (#23809 ) Description: Adding a new option to the CSVLoader that allows us to implicitly specify the columns that are used for generating the Document content. Currently these are implicitly set as "all fields not part of the metadata_columns". In some cases however it is useful to have a field both as a metadata and as part of the document content.	2024-09-02 20:25:53 +00:00
mehdiosa	c6f00e6bdc	community: Fix branch not being considered when using GithubFileLoader (#20075 ) - Description: Added `ref` query parameter so data is not loaded only from the default branch but any branch passed --------- Co-authored-by: Osama Mehdi <mehdi@hm.edu> Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com> Co-authored-by: Erick Friis <erick@langchain.dev>	2024-08-30 21:47:11 +00:00
Isaac Francisco	d5ddaac1fc	docs minor fix (#25794 )	2024-08-28 04:14:36 +00:00
Dristy Srivastava	7205057c3e	[Community][minor]: Added langchain_version while calling discover API (#24428 ) - Description: Added langchain version while calling discover API during both ingestion and retrieval - Issue: NA - Dependencies: NA - Tests: NA - Docs NA --------- Co-authored-by: dristy.cd <dristy@clouddefense.io>	2024-08-26 08:47:48 -04:00
Dristy Srivastava	fbb4761199	[Community][minor]: Updating source path, and file path for SharePoint loader in PebbloSafeLoader (#25592 ) - Description: Updating source path and file path in Pebblo safe loader for SharePoint apps during loading - Issue: NA - Dependencies: NA - Tests: NA - Docs NA --------- Co-authored-by: dristy.cd <dristy@clouddefense.io>	2024-08-26 08:38:40 -04:00
clement.l	642f9530cd	community: add supported blockchains to Blockchain Document Loader (#25428 ) - Remove deprecated chains. - Add more supported chains. --------- Co-authored-by: Chester Curme <chester.curme@gmail.com>	2024-08-23 14:39:42 +00:00
Kevin Engelke	3c7f12cbf5	community[minor]: Fix missing 'keep_newlines' parameter forward-pass to 'process_pages' function in confluence loader (#20086 ) (#20087 ) - Description: Fixed missing `keep_newlines` parameter forward-pass in confluence-loader - Issue: #20086 - Dependencies: None --------- Co-authored-by: ccurme <chester.curme@gmail.com>	2024-08-23 12:59:38 +00:00
Nobuhiko Otoba	4b63a217c2	"community: Fix GithubFileLoader source code", "docs: Fix GithubFileLoader code sample" (#19943 ) This PR adds tiny improvements to the `GithubFileLoader` document loader and its code sample, addressing the following issues: 1. Currently, the `file_extension` argument of `GithubFileLoader` does not change its behavior at all. 1. The `GithubFileLoader` sample code in `docs/docs/integrations/document_loaders/github.ipynb` does not work as it stands. The respective solutions I propose are the following: 1. Remove `file_extension` argument from `GithubFileLoader`. 1. Specify the branch as `master` (not the default `main`) and rename `documents` as `document`. --------- Co-authored-by: Isaac Francisco <78627776+isahers1@users.noreply.github.com>	2024-08-22 18:24:57 -04:00
Rajendra Kadam	1f1679e960	community: Refactor PebbloSafeLoader (#25582 ) Refactor PebbloSafeLoader - Created `APIWrapper` and moved API logic into it. - Moved helper functions to the utility file. - Created smaller functions and methods for better readability. - Properly read environment variables. - Removed unused code. Issue: NA Dependencies: NA tests: Updated	2024-08-22 11:46:52 -04:00

1 2 3 4 5 ...

282 Commits