langchain/libs/core/tests/unit_tests/utils
Copilot 18c64aed6d
feat(core): add sanitize_for_postgres utility to fix PostgreSQL NUL byte DataError (#32157)
This PR fixes the PostgreSQL NUL byte issue that causes
`psycopg.DataError` when inserting documents containing `\x00` bytes
into PostgreSQL-based vector stores.

## Problem

PostgreSQL text fields cannot contain NUL (0x00) bytes. When PGVector or
langchain-postgres implementations process documents containing such
bytes, the insert fails with:

```
(psycopg.DataError) PostgreSQL text fields cannot contain NUL (0x00) bytes
```

This commonly occurs when processing PDFs, documents from various
loaders, or text extracted by libraries like unstructured that may
contain embedded NUL bytes.
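
As a quick diagnostic before insertion, a minimal sketch like the following can report where NUL bytes sit in extracted text. The helper name `find_nul_offsets` is hypothetical and not part of langchain-core; it relies only on the standard library.

```python
def find_nul_offsets(text: str) -> list[int]:
    """Return the index of every NUL character in ``text``."""
    return [i for i, ch in enumerate(text) if ch == "\x00"]


# Sample string taken from the usage example in this PR description.
sample = "Getting\x00Started with embeddings"

# A plain membership test is enough to decide whether sanitizing is needed.
has_nul = "\x00" in sample

# The offsets help trace which part of the source document produced the byte.
offsets = find_nul_offsets(sample)
```

An empty list means the text can be passed to PostgreSQL without triggering this particular error.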

## Solution

Added `sanitize_for_postgres()` utility function to
`langchain_core.utils.strings` that removes or replaces NUL bytes from
text content.

### Key Features

- **Simple API**: `sanitize_for_postgres(text, replacement="")`
- **Configurable**: Replace NUL bytes with empty string (default) or
space for readability
- **Comprehensive**: Handles all problematic examples from the original
issue
- **Well-tested**: Complete unit tests with real-world examples
- **Backward compatible**: No breaking changes, purely additive
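
The behaviour listed above can be approximated with a single `str.replace` call. The sketch below is a stand-in under that assumption, not the code merged in this PR; the name `strip_nul_bytes` is hypothetical and chosen to avoid confusion with the real `sanitize_for_postgres`.

```python
def strip_nul_bytes(text: str, replacement: str = "") -> str:
    """Return ``text`` with every NUL character swapped for ``replacement``.

    Hypothetical stand-in mirroring the documented signature
    ``sanitize_for_postgres(text, replacement="")``.
    """
    return text.replace("\x00", replacement)


raw = "Getting\x00Started with embeddings"

# Default: drop the NUL byte entirely.
removed = strip_nul_bytes(raw)

# Alternative: substitute a space so adjacent words stay separated.
spaced = strip_nul_bytes(raw, " ")
```

Removing the byte keeps the text length minimal, while substituting a space avoids fusing words that the NUL byte happened to separate.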

### Usage Example

```python
from langchain_core.utils import sanitize_for_postgres
from langchain_core.documents import Document

# Before: This would fail with DataError
problematic_content = "Getting\x00Started with embeddings"

# After: Clean the content before database insertion
clean_content = sanitize_for_postgres(problematic_content)
# Result: "GettingStarted with embeddings"

# Or preserve readability with spaces
readable_content = sanitize_for_postgres(problematic_content, " ")
# Result: "Getting Started with embeddings"

# Use in Document processing
doc = Document(page_content=clean_content, metadata={...})
```

### Integration Pattern

PostgreSQL vector store implementations should sanitize content before
insertion:

```python
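from typing import List

from langchain_core.documents import Document
from langchain_core.utils import sanitize_for_postgres


def add_documents(self, documents: List[Document]) -> List[str]:
    # Method of a PostgreSQL-backed vector store class (class body omitted).
    # Build sanitized copies so the caller's documents are left unmodified.
    sanitized_docs = []
    for doc in documents:
        # Replace NUL bytes with a space to keep adjacent words separated.
        sanitized_content = sanitize_for_postgres(doc.page_content, " ")
        sanitized_doc = Document(
            page_content=sanitized_content,
            metadata=doc.metadata,
            id=doc.id,
        )
        sanitized_docs.append(sanitized_doc)

    # _insert_documents_to_db stands in for the store's own insert routine.
    return self._insert_documents_to_db(sanitized_docs)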
```

## Changes Made

- Added `sanitize_for_postgres()` function in
`langchain_core/utils/strings.py`
- Updated `langchain_core/utils/__init__.py` to export the new function
- Added comprehensive unit tests in
`tests/unit_tests/utils/test_strings.py`
- Validated against all examples from the original issue report

## Testing

All tests pass, including:
- Basic NUL byte removal and replacement
- Multiple consecutive NUL bytes
- Empty string handling
- Real examples from the GitHub issue
- Backward compatibility with existing string utilities
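
The first three categories above can be illustrated with a small table-driven check. This is a sketch, not the tests added in `tests/unit_tests/utils/test_strings.py`; `strip_nul_bytes` is a hypothetical local stand-in for `sanitize_for_postgres`, and the cases are illustrative rather than taken from the issue report.

```python
def strip_nul_bytes(text: str, replacement: str = "") -> str:
    """Hypothetical stand-in: replace every NUL character in ``text``."""
    return text.replace("\x00", replacement)


# (input text, replacement, expected output)
CASES = [
    ("Hello\x00world", "", "Helloworld"),    # basic removal
    ("Hello\x00world", " ", "Hello world"),  # basic replacement
    ("a\x00\x00\x00b", "", "ab"),            # consecutive NUL bytes removed
    ("a\x00\x00\x00b", " ", "a   b"),        # each NUL becomes one space
    ("", "", ""),                            # empty string is unchanged
    ("no nul here", " ", "no nul here"),     # clean text is unchanged
]


def run_cases() -> int:
    """Assert every case and return how many were checked."""
    for text, replacement, expected in CASES:
        assert strip_nul_bytes(text, replacement) == expected
    return len(CASES)


checked = run_cases()
```

Under pytest, the same table could be passed to `pytest.mark.parametrize` so that each case is reported separately.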

This utility enables PostgreSQL integrations in both langchain-community
and langchain-postgres packages to handle documents with NUL bytes
reliably.

Fixes #26033.


---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: mdrxy <61371264+mdrxy@users.noreply.github.com>
Co-authored-by: Mason Daugherty <github@mdrxy.com>
2025-07-21 20:33:20 -04:00

| File | Last commit | Date |
| --- | --- | --- |
| `__init__.py` | | |
| `test_aiter.py` | core: Add mypy strict-equality rule (#31286) | 2025-06-02 18:24:35 +00:00 |
| `test_env.py` | core: Add ruff rules PT (pytest) (#29381) | 2025-04-01 13:31:07 -04:00 |
| `test_function_calling.py` | core: support Union type args in strict mode of OpenAI function calling / structured output (#30971) | 2025-05-16 16:20:32 -04:00 |
| `test_html.py` | | |
| `test_imports.py` | feat(core): add sanitize_for_postgres utility to fix PostgreSQL NUL byte DataError (#32157) | 2025-07-21 20:33:20 -04:00 |
| `test_iter.py` | core: Add mypy strict-equality rule (#31286) | 2025-06-02 18:24:35 +00:00 |
| `test_json_schema.py` | fix(core): JSON Schema reference resolution for list indices (#32088) | 2025-07-17 15:54:38 -04:00 |
| `test_pydantic.py` | core: Cleanup Pydantic models and handle deprecation warnings (#30799) | 2025-06-20 10:42:52 -04:00 |
| `test_rm_titles.py` | core: Add ruff rules PT (pytest) (#29381) | 2025-04-01 13:31:07 -04:00 |
| `test_strings.py` | feat(core): add sanitize_for_postgres utility to fix PostgreSQL NUL byte DataError (#32157) | 2025-07-21 20:33:20 -04:00 |
| `test_usage.py` | core: Add ruff rules PT (pytest) (#29381) | 2025-04-01 13:31:07 -04:00 |
| `test_utils.py` | core[patch]: Int Combine when Merging Dicts (#31572) | 2025-07-04 14:44:16 -04:00 |