Adds an `attachment_filter_func` parameter to the ConfluenceLoader class
which can be used to determine which files are indexed. This is useful
if you are interested in excluding files based on their media type or
other metadata.
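
As a rough illustration, a filter could look like the sketch below. It assumes the callable receives the attachment's dictionary as returned by Confluence and returns `True` to keep the attachment. The `metadata.mediaType` field layout and the helper name `only_pdfs` are assumptions made for this example and are not confirmed by this description.

```python
from typing import Any, Dict


def only_pdfs(attachment: Dict[str, Any]) -> bool:
    """Keep an attachment only if its media type is PDF.

    Assumes the media type is stored under
    attachment["metadata"]["mediaType"] (an assumption about the payload).
    """
    media_type = attachment.get("metadata", {}).get("mediaType", "")
    return media_type == "application/pdf"


# Hypothetical usage (not run here, requires a Confluence instance):
# loader = ConfluenceLoader(url=..., token=..., attachment_filter_func=only_pdfs)
```

Returning `False` for attachments without a media type is a design choice of this sketch; a real filter might prefer to keep unknown types.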
## Description
This PR modifies the is_public_page function in ConfluenceLoader to
prevent exceptions caused by deleted pages during the execution of
ConfluenceLoader.process_pages().
**Example scenario:**
Consider the following usage of ConfluenceLoader:
```python
import os
from langchain_community.document_loaders import ConfluenceLoader
loader = ConfluenceLoader(
    url=os.getenv("BASE_URL"),
    token=os.getenv("TOKEN"),
    max_pages=1000,
    cql='type=page and lastmodified >= "2020-01-01 00:00"',
    include_restricted_content=False,
)
# Raised Exception : HTTPError: Outdated version/old_draft/trashed? Cannot find content Please provide valid ContentId.
documents = loader.load()
```
If a deleted page exists within the query result, the is_public_page
function would previously raise an exception when calling
get_all_restrictions_for_content, causing the loader.load() process to
fail for all pages.
By adding a pre-check for the page's "current" status, unnecessary API
calls to get_all_restrictions_for_content for non-current pages are
avoided.
This fix ensures that such pages are skipped without affecting the rest
of the loading process.
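
The pre-check can be sketched roughly as follows. This is a simplified, standalone version rather than the actual method: `get_restrictions` is a hypothetical stand-in for `get_all_restrictions_for_content`, and the shape of the restrictions payload is assumed.

```python
from typing import Any, Callable, Dict


def is_public_page(
    page: Dict[str, Any],
    get_restrictions: Callable[[str], Dict[str, Any]],
) -> bool:
    """Return True only for current pages without read restrictions."""
    # Pre-check: skip trashed, draft, or historical pages before any API call,
    # since the restrictions lookup fails for content that no longer exists.
    if page.get("status") != "current":
        return False

    # Assumed payload shape: read restrictions split into users and groups.
    read = get_restrictions(page["id"])["read"]["restrictions"]
    return not read["user"]["results"] and not read["group"]["results"]
```

Because the status check comes first, the restrictions lookup is never invoked for a non-current page, so a single deleted page cannot abort the whole load.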
## Issue
N/A (No specific issue number)
## Dependencies
No new dependencies are introduced with this change.
## Twitter handle
[@zenoengine](https://x.com/zenoengine)
## Description
This PR enables label inclusion for documents loaded via CQL in the
confluence-loader.
- Updated _lazy_load to pass the include_labels parameter instead of
False in process_pages calls for documents loaded via CQL.
- Ensured that labels can now be fetched and added to the metadata for
documents queried with cql.
## Related Modification History
This PR builds on the previous functionality introduced in
[#28259](https://github.com/langchain-ai/langchain/pull/28259), which
added support for including labels with the include_labels option.
However, this functionality did not work as expected for CQL queries,
and this PR fixes that issue.
If the False handling was intentional due to another issue, please let
me know. I have verified with our Confluence instance that this change
allows labels to be correctly fetched for documents loaded via CQL.
## Issue
Fixes #29088
## Dependencies
No changes.
## Twitter Handle
[@zenoengine](https://x.com/zenoengine)
**Description**: Some Confluence instances don't support personal access
tokens; in that case, cookies are a convenient way to authenticate. This
PR adds support for Confluence cookies.
**Twitter handle**: soulmachine
## **Description:**
Enable `ConfluenceLoader` to include labels with the `include_labels`
option (`false` by default for backward compatibility). The labels are set
in the `metadata` of the `Document`, e.g. `{"labels": ["l1", "l2"]}`
## Notes
The Confluence API returns labels when `metadata.labels` is passed to the
`expand` query parameter.
All of the following functions support `expand` in the same way:
- confluence.get_page_by_id
- confluence.get_all_pages_by_label
- confluence.get_all_pages_from_space
- cql (internally using
[/api/content/search](https://developer.atlassian.com/cloud/confluence/rest/v1/api-group-content/#api-wiki-rest-api-content-search-get))
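
A minimal sketch of turning an expanded response into the metadata shape described above. It assumes label entries appear under `metadata.labels.results` with a `name` field; the helper name `extract_labels` is hypothetical and not part of the loader.

```python
from typing import Any, Dict, List


def extract_labels(page: Dict[str, Any]) -> List[str]:
    """Collect label names from a page fetched with expand=metadata.labels."""
    # Assumed response shape: page["metadata"]["labels"]["results"] is a list
    # of dicts, each carrying the label text under "name".
    results = page.get("metadata", {}).get("labels", {}).get("results", [])
    return [label["name"] for label in results]


# Sample page dict standing in for an API response.
page = {"metadata": {"labels": {"results": [{"name": "l1"}, {"name": "l2"}]}}}
metadata = {"labels": extract_labels(page)}
```

Pages fetched without the expansion simply yield an empty list here, which keeps the default behaviour unchanged.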
## **Issue:**
No issue related to this PR.
## **Dependencies:**
No changes.
## **Twitter handle:**
[@gymnstcs](https://x.com/gymnstcs)
- [x] **Add tests and docs**: If you're adding a new integration, please
include
1. a test for the integration, preferably unit tests that do not rely on
network access,
2. an example notebook showing its use. It lives in
`docs/docs/integrations` directory.
- [x] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified. See contribution
guidelines for more: https://python.langchain.com/docs/contributing/
---------
Co-authored-by: Erick Friis <erick@langchain.dev>
- community: Allow authorization to Confluence with bearer token
- **Description:** Allow authorization to Confluence with a [Personal
Access
Token](https://confluence.atlassian.com/enterprise/using-personal-access-tokens-1026032365.html)
by accepting an `oauth2` dictionary with the keys `client_id` and `token`,
where `token` contains `access_token` and `token_type`.
- **Issue:**
Currently the following error occurs when using a personal access token
for authorization.
```python
import os

from langchain_community.document_loaders import ConfluenceLoader

loader = ConfluenceLoader(
    url=os.getenv("CONFLUENCE_URL"),
    oauth2={
        "token": {"access_token": os.getenv("CONFLUENCE_ACCESS_TOKEN"), "token_type": "bearer"},
        "client_id": "client_id",
    },
    page_ids=["12345678"],
)
```
```
ValueError: Error(s) while validating input: ["You have either omitted require keys or added extra keys to the oauth2 dictionary. key values should be `['access_token', 'access_token_secret', 'consumer_key', 'key_cert']`"]
```
With this PR the loader runs as expected.
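
The relaxed key check might look something like the sketch below. This is an illustration of the idea, not the loader's actual validation code; the function name `oauth2_keys_ok` and the constant names are hypothetical.

```python
from typing import Any, Dict

# Key set named in the original error message.
LEGACY_KEYS = {"access_token", "access_token_secret", "consumer_key", "key_cert"}
# Keys expected inside the nested "token" dictionary for bearer tokens.
TOKEN_KEYS = {"access_token", "token_type"}


def oauth2_keys_ok(oauth2: Dict[str, Any]) -> bool:
    """Accept either the original key set or client_id plus a token dict."""
    if set(oauth2) == LEGACY_KEYS:
        return True
    token = oauth2.get("token")
    return (
        set(oauth2) == {"client_id", "token"}
        and isinstance(token, dict)
        and set(token) == TOKEN_KEYS
    )
```

With a check of this shape, the dictionary from the example above passes validation, while dictionaries with missing or extra keys are still rejected.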
---------
Co-authored-by: Chester Curme <chester.curme@gmail.com>
## Summary
I ran `ruff check --extend-select RUF100 -n` to identify `# noqa`
comments that weren't having any effect in Ruff, and then `ruff check
--extend-select RUF100 -n --fix` on select files to remove all of the
unnecessary `# noqa: F401` comments. It's possible that these were
needed at some point in the past, but they're not necessary in Ruff
v0.1.15 (used by LangChain) or in the latest release.
Co-authored-by: Erick Friis <erick@langchain.dev>
**Description:**
Expand `version` in all Confluence API calls so that the time a page was
last modified or created is available in all cases.
**Issue:** #12812
**Twitter handle:** zzste
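
For illustration, reading the timestamp from an expanded page could look like this sketch. It assumes the expanded response carries the timestamp under `version.when`; the helper name `version_metadata` is hypothetical and not part of the loader.

```python
from typing import Any, Dict


def version_metadata(page: Dict[str, Any]) -> Dict[str, str]:
    """Pull the version timestamp from a page fetched with expand=version."""
    # Assumed response shape: page["version"]["when"] holds an ISO timestamp.
    version = page.get("version", {})
    return {"when": version["when"]} if "when" in version else {}
```

Returning an empty dict when the field is absent keeps the sketch safe for responses that were fetched without the expansion.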