community: fix duplicate content (#28003)

Thank you for reading my first PR!

**Description:**
Deduplicate content in AzureSearch vectorstore.
Currently, by default, the retrieved content is placed in both the
metadata and the page_content of a Document.
This PR removes the content from the metadata and keeps it only in
page_content.
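
A minimal sketch of the intended result, not the actual vectorstore code. The `Doc` class and `result_to_doc` function are hypothetical stand-ins so the example runs without dependencies, and the string values assigned to `FIELDS_CONTENT` and `FIELDS_CONTENT_VECTOR` are assumed for illustration:

```python
from dataclasses import dataclass, field
from typing import Any, Dict

# Assumed field names for illustration only; the real values are
# configured in the vectorstore module.
FIELDS_CONTENT = "content"
FIELDS_CONTENT_VECTOR = "content_vector"


@dataclass
class Doc:
    """Hypothetical stand-in for Document, to keep the sketch dependency-free."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def result_to_doc(result: Dict[str, Any]) -> Doc:
    # Build metadata from every field except the vector and the content itself,
    # so the content appears only in page_content.
    metadata = {
        key: value
        for key, value in result.items()
        if key not in [FIELDS_CONTENT_VECTOR, FIELDS_CONTENT]
    }
    return Doc(page_content=result[FIELDS_CONTENT], metadata=metadata)


doc = result_to_doc(
    {"content": "hello", "content_vector": [0.1, 0.2], "source": "a.txt"}
)
```

With this input, the content ends up only in `doc.page_content`, and `doc.metadata` holds the remaining non-vector fields.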

**Issue:**
Previously, the content was popped from the result before the metadata
was populated.
In #25828, the order was changed, which leads to a response with
duplicated content.
This was not the intention of that PR and seems undesirable.
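
The ordering effect can be illustrated with plain dictionaries. This is a simplified sketch under assumptions, not the code from either PR: both function names are hypothetical, and the value of `FIELDS_CONTENT` is assumed:

```python
from typing import Any, Dict, Tuple

FIELDS_CONTENT = "content"  # assumed field name, for illustration only


def pop_then_copy(result: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    # Content is removed first, so the metadata copy never sees it.
    content = result.pop(FIELDS_CONTENT)
    metadata = dict(result)
    return content, metadata


def copy_then_pop(result: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    # Metadata is copied first, so it still holds the content when it is popped.
    metadata = dict(result)
    content = result.pop(FIELDS_CONTENT)
    return content, metadata


before = pop_then_copy({"content": "hello", "source": "a.txt"})
after = copy_then_pop({"content": "hello", "source": "a.txt"})
```

In the first ordering the metadata contains only `source`; in the second it also carries a copy of the content, which is the duplication described above.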

Looking forward to seeing my contribution in the next version!

Cheers, 
Renzo
Renzo-vS 2024-11-20 21:49:03 +01:00 committed by GitHub
parent abaea28417
commit 567dc1e422
GPG Key ID: B5690EEEBB952194


@@ -1798,7 +1798,9 @@ def _result_to_document(result: Dict) -> Document:
             fields_metadata = json.loads(result[FIELDS_METADATA])
     else:
         fields_metadata = {
-            key: value for key, value in result.items() if key != FIELDS_CONTENT_VECTOR
+            key: value
+            for key, value in result.items()
+            if key not in [FIELDS_CONTENT_VECTOR, FIELDS_CONTENT]
         }
     # IDs
     if FIELDS_ID in result:
@@ -1806,7 +1808,7 @@ def _result_to_document(result: Dict) -> Document:
     else:
         fields_id = {}
     return Document(
-        page_content=result.pop(FIELDS_CONTENT),
+        page_content=result[FIELDS_CONTENT],
         metadata={
             **fields_id,
             **fields_metadata,