langchain

mirror of https://github.com/hwchase17/langchain.git synced 2026-04-02 10:25:07 +00:00

RUO 2b87e330b0 community: fix issue with nested field extraction in MongodbLoader (#22801 )

**Description:** 
This PR addresses an issue in the `MongodbLoader` where nested fields
were not being correctly extracted. The loader now correctly handles
nested fields specified in the `field_names` parameter.

**Issue:** 
Fixes a `KeyError` raised when extracting nested fields from MongoDB
documents.

**Dependencies:** 
No new dependencies are required for this change.

### Changes
1. **Field Name Parsing**:
- Added logic to parse nested field names and safely extract their
values from the MongoDB documents.

2. **Projection Construction**:
- Updated the projection dictionary to include nested fields correctly.

3. **Field Extraction**:
- Updated the `aload` method to handle nested field extraction using a
recursive approach to traverse the nested dictionaries.
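The three changes above can be sketched roughly as follows. This is a minimal illustration, not the loader's actual code: the helper names `build_projection` and `get_nested` are hypothetical, and returning an empty string for a missing key is an assumption made here so that a missing path does not raise `KeyError`.

```python
from typing import Any, Dict, List


def build_projection(field_names: List[str]) -> Dict[str, int]:
    """Map each dotted path to 1, the form a MongoDB find() projection accepts."""
    return {name: 1 for name in field_names}


def get_nested(doc: Any, keys: List[str], default: Any = "") -> Any:
    """Recursively walk `keys` into nested dicts; return `default` if a key is missing."""
    if not keys:
        return doc
    head, rest = keys[0], keys[1:]
    if isinstance(doc, dict) and head in doc:
        return get_nested(doc[head], rest, default)
    return default


# Hypothetical document shaped like the example collection.
sample = {"data": {"job": {"detail": {"id": 7, "position": "Engineer"}}}}
fields = ["data.job.detail.id", "data.job.detail.salary"]

projection = build_projection(fields)
values = [get_nested(sample, path.split(".")) for path in fields]
print(projection)
print(values)
```

Splitting each name on `.` keeps plain top-level field names working unchanged, since a name without a dot yields a single-element key list.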

### Example Usage
Updated usage example to demonstrate how to specify nested fields in the
`field_names` parameter:

```python
loader = MongodbLoader(
    connection_string=MONGO_URI,
    db_name=MONGO_DB,
    collection_name=MONGO_COLLECTION,
    filter_criteria={"data.job.company.industry_name": "IT", "data.job.detail": { "$exists": True }},
    field_names=[
        "data.job.detail.id",
        "data.job.detail.position",
        "data.job.detail.intro",
        "data.job.detail.main_tasks",
        "data.job.detail.requirements",
        "data.job.detail.preferred_points",
        "data.job.detail.benefits",
    ],
)

docs = loader.load()
print(len(docs))
for doc in docs:
    print(doc.page_content)
```

### Testing
Tested with a MongoDB collection containing nested documents to ensure
that the nested fields are correctly extracted and concatenated into a
single `page_content` string.
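
A check of this kind can be approximated without a running MongoDB instance by using an in-memory dictionary. The sketch below is an assumption-laden stand-in: `to_page_content` is a hypothetical helper, and joining values with a single space is assumed here rather than taken from the loader.

```python
from typing import Any, Dict, List


def to_page_content(doc: Dict[str, Any], field_names: List[str], sep: str = " ") -> str:
    """Resolve each dotted field name against `doc` and join the values into one string."""
    parts = []
    for name in field_names:
        value: Any = doc
        for key in name.split("."):
            if isinstance(value, dict) and key in value:
                value = value[key]
            else:
                value = ""  # assumed fallback for a missing path
                break
        parts.append(str(value))
    return sep.join(parts)


# Hypothetical document with one flat field and two nested fields.
fake_doc = {
    "title": "Backend role",
    "data": {"job": {"detail": {"position": "Engineer", "intro": "Build APIs"}}},
}

nested = to_page_content(fake_doc, ["data.job.detail.position", "data.job.detail.intro"])
flat = to_page_content(fake_doc, ["title"])
assert nested == "Engineer Build APIs"
assert flat == "Backend role"
```

The second assertion covers the non-nested case, which should behave the same as before the change.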

### Note
This change ensures backward compatibility for non-nested fields and
improves functionality for nested field extraction.

### Output Sample
```python
print(docs[:3])
```
```python
# output sample:
[
    Document(
        # Here in this example, page_content is the combined text from the fields below
        # "position", "intro", "main_tasks", "requirements", "preferred_points", "benefits"
        page_content='all combined contents from the requested fields in the document',
        metadata={'database': 'Your Database name', 'collection': 'Your Collection name'}
    ),
    ...
]
```

---------

Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>

2024-06-24 19:29:11 +00:00

blob_loaders

community[patch]: Update doc-string in CloudBlobLoader (#22069 )

2024-05-23 15:31:41 +00:00

parsers

docs[minor],community[patch]: Minor tutorial docs improvement, minor import error quick fix. (#22725 )

2024-06-20 15:36:49 -04:00

__init__.py

infra: rm unused # noqa violations (#22049 )

2024-05-22 15:21:08 -07:00

acreom.py

community: better support of pathlib paths in document loaders (#18396 )

2024-03-26 11:51:52 -04:00

airbyte_json.py

community: better support of pathlib paths in document loaders (#18396 )

2024-03-26 11:51:52 -04:00

airbyte.py

community: Use default load() implementation in doc loaders (#18385 )

2024-03-01 14:46:52 -05:00

airtable.py

community[patch]: Airtable to allow for addtl params (#22092 )

2024-06-03 13:05:56 -07:00

apify_dataset.py

community[patch]: update apify integration to attribute API activity to langchain (#21909 )

2024-05-20 14:49:23 -07:00

arcgis_loader.py

community: Use default load() implementation in doc loaders (#18385 )

2024-03-01 14:46:52 -05:00

arxiv.py

community[minor]: Implement lazy_load() for ArxivLoader (#18664 )

2024-03-06 09:16:49 -05:00

assemblyai.py

community[patch]: docstrings update (#20301 )

2024-04-11 16:23:27 -04:00

astradb.py

(all): update removal in deprecation warnings from 0.2 to 0.3 (#21265 )

2024-05-03 14:29:36 -04:00

async_html.py

community: add **request_kwargs and expect TimeError AsyncHtmlLoader (#23068 )

2024-06-18 20:02:46 -07:00

athena.py

community[minor]: import fix (#20995 )

2024-04-29 10:32:50 -04:00

azlyrics.py

…

azure_ai_data.py

community: Use default load() implementation in doc loaders (#18385 )

2024-03-01 14:46:52 -05:00

azure_blob_storage_container.py

community[patch]: type ignore fixes (#18395 )

2024-03-01 11:21:02 -08:00

azure_blob_storage_file.py

…

baiducloud_bos_directory.py

community: Use default load() implementation in doc loaders (#18385 )

2024-03-01 14:46:52 -05:00

baiducloud_bos_file.py

community: Use default load() implementation in doc loaders (#18385 )

2024-03-01 14:46:52 -05:00

base_o365.py

Enhance metadata of sharepointLoader. (#22248 )

2024-06-21 17:03:38 -07:00

base.py

core: Move document loader interfaces to core (#17723 )

2024-03-06 13:59:00 -05:00

bibtex.py

community: Use default load() implementation in doc loaders (#18385 )

2024-03-01 14:46:52 -05:00

bigquery.py

(all): update removal in deprecation warnings from 0.2 to 0.3 (#21265 )

2024-05-03 14:29:36 -04:00

bilibili.py

community[patch]: docstrings update (#20301 )

2024-04-11 16:23:27 -04:00

blackboard.py

infra: rm unused # noqa violations (#22049 )

2024-05-22 15:21:08 -07:00

blockchain.py

…

brave_search.py

…

browserbase.py

community: updated Browserbase loader (#21757 )

2024-05-16 08:21:23 -07:00

browserless.py

community: Use default load() implementation in doc loaders (#18385 )

2024-03-01 14:46:52 -05:00

cassandra.py

community[minor]: Add Cassandra ByteStore (#22064 )

2024-05-23 10:46:23 -04:00

chatgpt.py

…

chm.py

community[patch]: docstrings (#16810 )

2024-02-09 12:48:57 -08:00

chromium.py

community[minor]: add user agent for web scraping loaders (#22480 )

2024-06-05 15:20:34 +00:00

college_confidential.py

…

concurrent.py

community[patch]: import flattening fix (#20110 )

2024-04-10 13:01:19 -04:00

confluence.py

docs: Fix wrongly referenced class name in confluence.py (#22879 )

2024-06-14 14:00:48 -07:00

conllu.py

community: better support of pathlib paths in document loaders (#18396 )

2024-03-26 11:51:52 -04:00

couchbase.py

community: Use default load() implementation in doc loaders (#18385 )

2024-03-01 14:46:52 -05:00

csv_loader.py

docs: Standardize DocumentLoader docstrings (#22932 )

2024-06-18 03:26:36 +00:00

cube_semantic.py

community[patch]: Implement lazy_load() for CubeSemanticLoader (#18535 )

2024-03-05 17:32:31 -08:00

datadog_logs.py

…

dataframe.py

community[patch]: support modin document loader (#18866 )

2024-03-10 18:40:04 -07:00

diffbot.py

…

directory.py

community: glob multiple patterns when using DirectoryLoader (#22852 )

2024-06-18 09:24:50 -07:00

discord.py

…

doc_intelligence.py

docs: community docstring updates (#21040 )

2024-04-29 17:40:23 -04:00

docugami.py

(all): update removal in deprecation warnings from 0.2 to 0.3 (#21265 )

2024-05-03 14:29:36 -04:00

docusaurus.py

…

dropbox.py

infra: add print rule to ruff (#16221 )

2024-02-09 16:13:30 -08:00

duckdb_loader.py

…

email.py

community[patch]: Small Fix in OutlookMessageLoader (Close the Message once Open) (#22744 )

2024-06-10 13:08:39 -07:00

epub.py

…

etherscan.py

community: Use default load() implementation in doc loaders (#18385 )

2024-03-01 14:46:52 -05:00

evernote.py

infra: rm unused # noqa violations (#22049 )

2024-05-22 15:21:08 -07:00

excel.py

community: better support of pathlib paths in document loaders (#18396 )

2024-03-26 11:51:52 -04:00

facebook_chat.py

community: better support of pathlib paths in document loaders (#18396 )

2024-03-26 11:51:52 -04:00

fauna.py

community: Use default load() implementation in doc loaders (#18385 )

2024-03-01 14:46:52 -05:00

figma.py

…

firecrawl.py

community[patch]: Update firecrawl api key name (#22183 )

2024-05-27 21:39:29 +00:00

gcs_directory.py

(all): update removal in deprecation warnings from 0.2 to 0.3 (#21265 )

2024-05-03 14:29:36 -04:00

gcs_file.py

(all): update removal in deprecation warnings from 0.2 to 0.3 (#21265 )

2024-05-03 14:29:36 -04:00

generic.py

community[patch]: import flattening fix (#20110 )

2024-04-10 13:01:19 -04:00

geodataframe.py

community: Use default load() implementation in doc loaders (#18385 )

2024-03-01 14:46:52 -05:00

git.py

Merge pull request #18539

2024-03-06 13:25:14 -05:00

gitbook.py

community[minor]: Implement lazy_load() for GitbookLoader (#18670 )

2024-03-06 09:14:36 -05:00

github.py

community: Implement lazy_load() for GithubFileLoader (#18584 )

2024-03-05 09:35:50 -08:00

glue_catalog.py

community[minor]: Add glue catalog loader (#20220 )

2024-04-16 11:39:23 -04:00

google_speech_to_text.py

(all): update removal in deprecation warnings from 0.2 to 0.3 (#21265 )

2024-05-03 14:29:36 -04:00

googledrive.py

(all): update removal in deprecation warnings from 0.2 to 0.3 (#21265 )

2024-05-03 14:29:36 -04:00

gutenberg.py

…

helpers.py

community: better support of pathlib paths in document loaders (#18396 )

2024-03-26 11:51:52 -04:00

hn.py

…

html_bs.py

community: better support of pathlib paths in document loaders (#18396 )

2024-03-26 11:51:52 -04:00

html.py

…

hugging_face_dataset.py

community: Use default load() implementation in doc loaders (#18385 )

2024-03-01 14:46:52 -05:00

hugging_face_model.py

community: Use default load() implementation in doc loaders (#18385 )

2024-03-01 14:46:52 -05:00

ifixit.py

…

image_captions.py

community: better support of pathlib paths in document loaders (#18396 )

2024-03-26 11:51:52 -04:00

image.py

…

imsdb.py

…

iugu.py

…

joplin.py

community: Use default load() implementation in doc loaders (#18385 )

2024-03-01 14:46:52 -05:00

json_loader.py

docs: Standardize DocumentLoader docstrings (#22932 )

2024-06-18 03:26:36 +00:00

kinetica_loader.py

community[patch]: Kinetica Integrations handled error in querying; quotes in table names; updated gpudb API (#22724 )

2024-06-11 10:01:26 -04:00

lakefs.py

…

larksuite.py

community[minor]: Add LarkSuite wiki document loader. (#21016 )

2024-04-29 10:37:50 -04:00

llmsherpa.py

community[minor]: add support for llmsherpa (#19741 )

2024-03-29 16:04:57 -07:00

markdown.py

corrected outdated link (#15053 )

2023-12-22 12:39:38 -08:00

mastodon.py

Merge pull request #18671

2024-03-06 13:23:14 -05:00

max_compute.py

community: Use default load() implementation in doc loaders (#18385 )

2024-03-01 14:46:52 -05:00

mediawikidump.py

community: Use default load() implementation in doc loaders (#18385 )

2024-03-01 14:46:52 -05:00

merge.py

community: Use default load() implementation in doc loaders (#18385 )

2024-03-01 14:46:52 -05:00

mhtml.py

community[patch]: upgrade to recent version of mypy (#21616 )

2024-05-13 14:55:07 -04:00

mintbase.py

community[minor]: add mintbase loader to langchain (#20089 )

2024-04-30 04:11:56 +00:00

modern_treasury.py

…

mongodb.py

community: fix issue with nested field extraction in MongodbLoader (#22801 )

2024-06-24 19:29:11 +00:00

news.py

multiple: Remove unnecessary Ruff suppression comments (#21050 )

2024-04-30 17:13:48 +00:00

notebook.py

community[patch]: add NotebookLoader unit test (#17721 )

2024-03-29 00:27:46 +00:00

notion.py

community: better support of pathlib paths in document loaders (#18396 )

2024-03-26 11:51:52 -04:00

notiondb.py

community[patch]: Fix NotionDBLoader 400 Error by conditionally adding filter parameter (#19075 )

2024-03-14 13:56:57 +00:00

nuclia.py

infra: add print rule to ruff (#16221 )

2024-02-09 16:13:30 -08:00

obs_directory.py

…

obs_file.py

…

obsidian.py

community: better support of pathlib paths in document loaders (#18396 )

2024-03-26 11:51:52 -04:00

odt.py

community: better support of pathlib paths in document loaders (#18396 )

2024-03-26 11:51:52 -04:00

onedrive_file.py

…

onedrive.py

community: Use default load() implementation in doc loaders (#18385 )

2024-03-01 14:46:52 -05:00

onenote.py

community[patch]: upgrade to recent version of mypy (#21616 )

2024-05-13 14:55:07 -04:00

open_city_data.py

community: Use default load() implementation in doc loaders (#18385 )

2024-03-01 14:46:52 -05:00

oracleadb_loader.py

community[minor]: add oracle autonomous database doc loader integration (#19536 )

2024-03-26 17:02:18 -07:00

oracleai.py

community[minor]: Oraclevs integration (#21123 )

2024-05-04 03:15:35 +00:00

org_mode.py

community: better support of pathlib paths in document loaders (#18396 )

2024-03-26 11:51:52 -04:00

pdf.py

community[patch]: upgrade to recent version of mypy (#21616 )

2024-05-13 14:55:07 -04:00

pebblo.py

community[minor]: Updating payload for pebblo discover API (#22309 )

2024-06-03 15:36:17 -07:00

polars_dataframe.py

…

powerpoint.py

…

psychic.py

multiple: Remove unnecessary Ruff suppression comments (#21050 )

2024-04-30 17:13:48 +00:00

pubmed.py

community[patch]: upgrade to recent version of mypy (#21616 )

2024-05-13 14:55:07 -04:00

pyspark_dataframe.py

…

python.py

community: better support of pathlib paths in document loaders (#18396 )

2024-03-26 11:51:52 -04:00

quip.py

community[major]: lint for usage of xml library (#22132 )

2024-05-24 15:23:53 +00:00

readthedocs.py

community: Use default load() implementation in doc loaders (#18385 )

2024-03-01 14:46:52 -05:00

recursive_url_loader.py

docs, cli[patch]: document loaders doc template (#22862 )

2024-06-13 19:28:57 -07:00

reddit.py

…

roam.py

community: better support of pathlib paths in document loaders (#18396 )

2024-03-26 11:51:52 -04:00

rocksetdb.py

community: Use default load() implementation in doc loaders (#18385 )

2024-03-01 14:46:52 -05:00

rspace.py

community: Use default load() implementation in doc loaders (#18385 )

2024-03-01 14:46:52 -05:00

rss.py

multiple: Remove unnecessary Ruff suppression comments (#21050 )

2024-04-30 17:13:48 +00:00

rst.py

community: better support of pathlib paths in document loaders (#18396 )

2024-03-26 11:51:52 -04:00

rtf.py

community: better support of pathlib paths in document loaders (#18396 )

2024-03-26 11:51:52 -04:00

s3_directory.py

community[patch]: Skip nested directories when using S3DirectoryLoader (#17829 )

2024-03-08 16:50:58 -08:00

s3_file.py

community[patch]: support unstructured_kwargs for s3 loader (#15473 )

2024-03-27 22:03:48 +00:00

scrapfly.py

community[minor]: Add Scrapfly Loader community integration (#22036 )

2024-05-22 21:29:13 +00:00

sharepoint.py

Enhance metadata of sharepointLoader. (#22248 )

2024-06-21 17:03:38 -07:00

sitemap.py

community[patch]: SitemapLoader restrict depth of parsing sitemap (CVE-2024-2965) (#22903 )

2024-06-14 13:04:40 -04:00

slack_directory.py

community: better support of pathlib paths in document loaders (#18396 )

2024-03-26 11:51:52 -04:00

snowflake_loader.py

community[patch]: upgrade to recent version of mypy (#21616 )

2024-05-13 14:55:07 -04:00

spider.py

doc list not empty (#21208 )

2024-05-20 08:24:06 -07:00

spreedly.py

…

sql_database.py

community[patch]: restore compatibility with SQLAlchemy 1.x (#22546 )

2024-06-19 17:58:57 +00:00

srt.py

community: better support of pathlib paths in document loaders (#18396 )

2024-03-26 11:51:52 -04:00

stripe.py

…

surrealdb.py

community[patch]: SurrealDB fix for asyncio (#16092 )

2024-01-23 19:46:19 -08:00

telegram.py

community: better support of pathlib paths in document loaders (#18396 )

2024-03-26 11:51:52 -04:00

tencent_cos_directory.py

community: Use default load() implementation in doc loaders (#18385 )

2024-03-01 14:46:52 -05:00

tencent_cos_file.py

community: Use default load() implementation in doc loaders (#18385 )

2024-03-01 14:46:52 -05:00

tensorflow_datasets.py

community[patch]: upgrade to recent version of mypy (#21616 )

2024-05-13 14:55:07 -04:00

text.py

community: better support of pathlib paths in document loaders (#18396 )

2024-03-26 11:51:52 -04:00

tidb.py

community: Use default load() implementation in doc loaders (#18385 )

2024-03-01 14:46:52 -05:00

tomarkdown.py

community: Use default load() implementation in doc loaders (#18385 )

2024-03-01 14:46:52 -05:00

toml.py

community: Use default load() implementation in doc loaders (#18385 )

2024-03-01 14:46:52 -05:00

trello.py

community: Implement lazy_load() for TrelloLoader (#18658 )

2024-03-06 13:04:36 -05:00

tsv.py

community: better support of pathlib paths in document loaders (#18396 )

2024-03-26 11:51:52 -04:00

twitter.py

…

unstructured.py

community[minor]: import fix (#20995 )

2024-04-29 10:32:50 -04:00

url_playwright.py

docs: community docstring updates (#21040 )

2024-04-29 17:40:23 -04:00

url_selenium.py

…

url.py

…

vsdx.py

community[patch]: import flattening fix (#20110 )

2024-04-10 13:01:19 -04:00

weather.py

community[patch]: upgrade to recent version of mypy (#21616 )

2024-05-13 14:55:07 -04:00

web_base.py

community[minor]: add user agent for web scraping loaders (#22480 )

2024-06-05 15:20:34 +00:00

whatsapp_chat.py

community: Implement lazy_load() for WhatsAppChatLoader (#18677 )

2024-03-06 13:03:46 -05:00

wikipedia.py

community[patch]: upgrade to recent version of mypy (#21616 )

2024-05-13 14:55:07 -04:00

word_document.py

community: better support of pathlib paths in document loaders (#18396 )

2024-03-26 11:51:52 -04:00

xml.py

community: better support of pathlib paths in document loaders (#18396 )

2024-03-26 11:51:52 -04:00

xorbits.py

…

youtube.py

community[patch]: bugfix for YoutubeLoader's LINES format (#22815 )

2024-06-12 12:29:34 -04:00

yuque.py

community[minor]: add Yuque document loader (#17924 )

2024-03-05 15:54:07 -08:00