community[minor]: Added propagation of document metadata from O365BaseLoader (#20663)

**Description:**
- Added propagation of document metadata from `O365BaseLoader` to
`FileSystemBlobLoader` (`O365BaseLoader` uses `FileSystemBlobLoader`
under the hood).
- This is done by passing a dictionary, `metadata_dict`, in which each
key is a filename and each value is a dictionary containing that
document's metadata.
- Modified `FileSystemBlobLoader` to accept `metadata_dict`, use the
`mimetype` from it (if available), and pass the metadata further into
the blob loader.
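
A minimal sketch of the lookup described above, using only the standard library. The helper name `resolve_blob_info` and the `mime_type` key are assumptions made for illustration; this is not the actual `FileSystemBlobLoader` code.

```python
import mimetypes
from pathlib import Path
from typing import Any, Dict, Optional, Tuple


def resolve_blob_info(
    path: str,
    metadata_dict: Optional[Dict[str, Dict[str, Any]]] = None,
) -> Tuple[Optional[str], Dict[str, Any]]:
    """Hypothetical helper: look up per-file metadata by filename.

    Prefers a MIME type supplied in the metadata (assumed key: "mime_type")
    and falls back to guessing from the file extension.
    """
    # Copy the entry so the caller's dictionary is never mutated.
    metadata = dict((metadata_dict or {}).get(Path(path).name, {}))
    mime_type = metadata.get("mime_type") or mimetypes.guess_type(str(path))[0]
    return mime_type, metadata
```

With no entry for a file, the helper behaves like plain extension-based guessing, so callers that do not pass `metadata_dict` would see no change.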

**Issue:**
- Under the hood, `O365BaseLoader` downloads documents to a temp folder
and then runs `FileSystemBlobLoader` on it.
- However, metadata about each document is lost in this process. In
particular:
  - `mime_type`: `FileSystemBlobLoader` guesses `mime_type` from the
  file extension, but that does not always work.
  - `web_url`: this is useful to keep, since a RAG LLM application might
  want to provide a link to the source document. To work well with
  document parsers, we pass the `web_url` as `source` (`web_url` is
  ignored by parsers, `source` is preserved).
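
As a rough illustration of the producer side, the sketch below builds a filename-keyed dictionary and stores the web URL under `source`. `RemoteFile` and `build_metadata_dict` are hypothetical stand-ins, not classes or functions from the O365 library or LangChain, and the `mime_type` key is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable


@dataclass
class RemoteFile:
    """Hypothetical stand-in for a downloaded remote file."""

    name: str
    web_url: str
    mime_type: str


def build_metadata_dict(files: Iterable[RemoteFile]) -> Dict[str, Dict[str, Any]]:
    """Map each filename to the metadata that should survive the download."""
    return {
        # The web URL is stored under "source" because, per the description
        # above, parsers preserve "source" but ignore "web_url".
        f.name: {"source": f.web_url, "mime_type": f.mime_type}
        for f in files
    }
```

Keying by filename matches how the temp folder is read back: the downloaded file's name is the one piece of information both sides share.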

**Dependencies:**
None

**Twitter handle:**
@martintriska1

Please review @baskaryan

---------

Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
This commit is contained in:
Martin Triska
2024-05-23 17:42:19 +02:00
committed by GitHub
parent e5541d1da7
commit 2df8ac402a
2 changed files with 25 additions and 4 deletions

@@ -1,4 +1,5 @@
"""Loader that loads data from Sharepoint Document Library"""
from __future__ import annotations
import json
@@ -82,7 +83,9 @@ class SharePointLoader(O365BaseLoader, BaseLoader):
             if not isinstance(target_folder, Folder):
                 raise ValueError("Unable to fetch root folder")
             for blob in self._load_from_folder(target_folder):
-                yield from blob_parser.lazy_parse(blob)
+                for blob_part in blob_parser.lazy_parse(blob):
+                    blob_part.metadata.update(blob.metadata)
+                    yield blob_part

     def authorized_identities(self) -> List:
         data = self._fetch_access_token()
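
The propagation pattern in this hunk can be tried in isolation with stand-in classes. `StubBlob`, `StubDocument`, and `stub_lazy_parse` below are hypothetical and only mimic the shape of the real blob, document, and parser objects. Because `dict.update` overwrites existing keys, a `source` set by the parser is replaced by the one carried on the blob.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, Iterator


@dataclass
class StubBlob:
    """Hypothetical stand-in for a blob: raw text plus metadata."""

    data: str
    metadata: Dict[str, Any] = field(default_factory=dict)


@dataclass
class StubDocument:
    """Hypothetical stand-in for a parsed document part."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def stub_lazy_parse(blob: StubBlob) -> Iterator[StubDocument]:
    """Toy parser: one document per line, tagged with a local temp path."""
    for i, line in enumerate(blob.data.splitlines()):
        yield StubDocument(line, {"source": "/tmp/local-copy.txt", "line": i})


def parse_with_blob_metadata(blobs: Iterable[StubBlob]) -> Iterator[StubDocument]:
    """Copy each blob's metadata onto every part parsed from it."""
    for blob in blobs:
        for part in stub_lazy_parse(blob):
            # Blob metadata wins on key collisions such as "source".
            part.metadata.update(blob.metadata)
            yield part
```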