community: Fix ConfluenceLoader load() failure caused by deleted pages (#29232)

## Description
This PR modifies the `is_public_page` function in `ConfluenceLoader` so
that deleted pages no longer raise an exception during
`ConfluenceLoader.process_pages()`.


**Example scenario:**
Consider the following usage of ConfluenceLoader:
```python
import os
from langchain_community.document_loaders import ConfluenceLoader

loader = ConfluenceLoader(
    url=os.getenv("BASE_URL"),
    token=os.getenv("TOKEN"),
    max_pages=1000,
    # Plain string: there are no placeholders, so no f-prefix is needed.
    cql='type=page and lastmodified >= "2020-01-01 00:00"',
    include_restricted_content=False,
)

# Before this fix, load() raised:
# HTTPError: Outdated version/old_draft/trashed? Cannot find content Please provide valid ContentId.
documents = loader.load()
```

If the query result contains a deleted page, `is_public_page` previously
raised an exception when it called `get_all_restrictions_for_content`,
and `loader.load()` failed for all pages.

The fix checks whether the page's status is "current" before making that
call, so `get_all_restrictions_for_content` is never invoked for
non-current pages. Such pages are skipped without affecting the rest of
the loading process.
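
The effect of the pre-check can be sketched in isolation. The snippet below is a minimal illustration, not the loader's actual code path: `FakeConfluence` is a hypothetical stand-in for the real client, the page IDs are made up, and the `"trashed"` status value and `RuntimeError` are assumptions used to simulate the `HTTPError` shown above.

```python
from typing import Any, Dict, List


class FakeConfluence:
    """Hypothetical stand-in for the Confluence client (assumption, not the real API)."""

    def __init__(self) -> None:
        # Records every content ID the restriction endpoint was asked about.
        self.calls: List[str] = []

    def get_all_restrictions_for_content(self, content_id: str) -> Dict[str, Any]:
        self.calls.append(content_id)
        if content_id == "trashed-1":
            # Simulates the error the real service returns for deleted content.
            raise RuntimeError("Cannot find content")
        # No user or group restrictions: the page is public.
        return {
            "read": {
                "restrictions": {
                    "user": {"results": []},
                    "group": {"results": []},
                }
            }
        }


def is_public_page(confluence: FakeConfluence, page: Dict[str, Any]) -> bool:
    """Mirror of the patched check: test the status before calling the API."""
    if page["status"] != "current":
        return False
    restrictions = confluence.get_all_restrictions_for_content(page["id"])
    return (
        not restrictions["read"]["restrictions"]["user"]["results"]
        and not restrictions["read"]["restrictions"]["group"]["results"]
    )


client = FakeConfluence()
pages = [
    {"id": "live-1", "status": "current"},
    {"id": "trashed-1", "status": "trashed"},  # illustrative deleted page
]

# The trashed page is filtered out before the restriction lookup can fail.
public_ids = [p["id"] for p in pages if is_public_page(client, p)]
print(public_ids)    # ['live-1']
print(client.calls)  # ['live-1']
```

Because the status test short-circuits, the restriction endpoint is queried only for the current page, which is why the simulated error never surfaces.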





## Issue
N/A (No specific issue number)

## Dependencies
No new dependencies are introduced with this change.

## Twitter handle
[@zenoengine](https://x.com/zenoengine)
Jin Hyung Ahn 2025-01-15 23:56:23 +09:00 committed by GitHub
parent 21eb39dff0
commit 05554265b4

@@ -523,11 +523,14 @@ class ConfluenceLoader(BaseLoader):
     def is_public_page(self, page: dict) -> bool:
         """Check if a page is publicly accessible."""
+        if page["status"] != "current":
+            return False
         restrictions = self.confluence.get_all_restrictions_for_content(page["id"])
         return (
-            page["status"] == "current"
-            and not restrictions["read"]["restrictions"]["user"]["results"]
+            not restrictions["read"]["restrictions"]["user"]["results"]
             and not restrictions["read"]["restrictions"]["group"]["results"]
         )