This PR fixes the PostgreSQL NUL byte issue that causes `psycopg.DataError` when inserting documents containing `\x00` bytes into PostgreSQL-based vector stores.

## Problem

PostgreSQL text fields cannot contain NUL (0x00) bytes. When documents with such characters are processed by PGVector or langchain-postgres implementations, insertion fails with:

```
(psycopg.DataError) PostgreSQL text fields cannot contain NUL (0x00) bytes
```

This commonly occurs with PDFs, documents from various loaders, and text extracted by libraries such as unstructured, all of which can contain embedded NUL bytes.

## Solution

Added a `sanitize_for_postgres()` utility function to `langchain_core.utils.strings` that removes or replaces NUL bytes in text content.

### Key Features

- **Simple API**: `sanitize_for_postgres(text, replacement="")`
- **Configurable**: Replace NUL bytes with an empty string (default) or a space for readability
- **Comprehensive**: Handles all problematic examples from the original issue
- **Well-tested**: Complete unit tests with real-world examples
- **Backward compatible**: No breaking changes, purely additive

### Usage Example

```python
from langchain_core.documents import Document
from langchain_core.utils import sanitize_for_postgres

# Before: this would fail with DataError
problematic_content = "Getting\x00Started with embeddings"

# After: clean the content before database insertion
clean_content = sanitize_for_postgres(problematic_content)
# Result: "GettingStarted with embeddings"

# Or preserve readability with spaces
readable_content = sanitize_for_postgres(problematic_content, " ")
# Result: "Getting Started with embeddings"

# Use in Document processing
doc = Document(page_content=clean_content, metadata={...})
```

### Integration Pattern

PostgreSQL vector store implementations should sanitize content before insertion:

```python
from typing import List

from langchain_core.documents import Document
from langchain_core.utils import sanitize_for_postgres


# Method on a PostgreSQL vector store class
def add_documents(self, documents: List[Document]) -> List[str]:
    # Sanitize documents before insertion, replacing NUL bytes with spaces
    sanitized_docs = []
    for doc in documents:
        sanitized_content = sanitize_for_postgres(doc.page_content, " ")
        sanitized_doc = Document(
            page_content=sanitized_content,
            metadata=doc.metadata,
            id=doc.id,
        )
        sanitized_docs.append(sanitized_doc)
    # `_insert_documents_to_db` stands in for the store's own insert logic
    return self._insert_documents_to_db(sanitized_docs)
```

## Changes Made

- Added the `sanitize_for_postgres()` function in `langchain_core/utils/strings.py`
- Updated `langchain_core/utils/__init__.py` to export the new function
- Added comprehensive unit tests in `tests/unit_tests/utils/test_strings.py`
- Validated against all examples from the original issue report

## Testing

All tests pass, including:

- Basic NUL byte removal and replacement
- Multiple consecutive NUL bytes
- Empty string handling
- Real examples from the GitHub issue
- Backward compatibility with existing string utilities

This utility enables PostgreSQL integrations in both langchain-community and langchain-postgres packages to handle documents with NUL bytes reliably.

Fixes #26033.
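For reference, the utility can be as simple as a `str.replace` call. The following is a minimal sketch of the approach described above, assuming the function does nothing beyond substituting NUL bytes; it is not necessarily the exact code merged into `langchain_core/utils/strings.py`:

```python
def sanitize_for_postgres(text: str, replacement: str = "") -> str:
    """Remove or replace NUL (0x00) bytes so text is safe for PostgreSQL.

    PostgreSQL text fields reject NUL bytes, so each occurrence is
    replaced with `replacement` (empty string by default, i.e. removed).
    """
    return text.replace("\x00", replacement)
```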
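Likewise, a hypothetical pytest sketch mirroring the cases listed under Testing (the test name and parametrized cases below are illustrative, not the contents of the actual test file):

```python
import pytest

from langchain_core.utils import sanitize_for_postgres


@pytest.mark.parametrize(
    ("text", "replacement", "expected"),
    [
        ("Getting\x00Started", "", "GettingStarted"),    # basic removal
        ("Getting\x00Started", " ", "Getting Started"),  # replacement with space
        ("a\x00\x00\x00b", "", "ab"),                    # consecutive NUL bytes
        ("", "", ""),                                    # empty string
        ("no nul bytes", "", "no nul bytes"),            # clean text untouched
    ],
)
def test_sanitize_for_postgres(text: str, replacement: str, expected: str) -> None:
    assert sanitize_for_postgres(text, replacement) == expected
```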
---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: mdrxy <61371264+mdrxy@users.noreply.github.com>
Co-authored-by: Mason Daugherty <github@mdrxy.com>