langchain

mirror of https://github.com/hwchase17/langchain.git synced 2025-08-01 09:04:03 +00:00

⚡ Building applications with LLMs through composability ⚡


Copilot 18c64aed6d feat(core): add `sanitize_for_postgres` utility to fix PostgreSQL NUL byte DataError (#32157)	2025-07-21 20:33:20 -04:00

This PR fixes the PostgreSQL NUL byte issue that causes `psycopg.DataError` when inserting documents containing `\x00` bytes into PostgreSQL-based vector stores.

## Problem

PostgreSQL text fields cannot contain NUL (0x00) bytes. When documents with such characters are processed by PGVector or langchain-postgres implementations, they fail with:

```
(psycopg.DataError) PostgreSQL text fields cannot contain NUL (0x00) bytes
```

This commonly occurs when processing PDFs, documents from various loaders, or text extracted by libraries like unstructured that may contain embedded NUL bytes.

## Solution

Added `sanitize_for_postgres()` utility function to `langchain_core.utils.strings` that removes or replaces NUL bytes from text content.

### Key Features

- Simple API: `sanitize_for_postgres(text, replacement="")`
- Configurable: Replace NUL bytes with empty string (default) or space for readability
- Comprehensive: Handles all problematic examples from the original issue
- Well-tested: Complete unit tests with real-world examples
- Backward compatible: No breaking changes, purely additive

### Usage Example

```python
from langchain_core.utils import sanitize_for_postgres
from langchain_core.documents import Document

# Before: This would fail with DataError
problematic_content = "Getting\x00Started with embeddings"

# After: Clean the content before database insertion
clean_content = sanitize_for_postgres(problematic_content)
# Result: "GettingStarted with embeddings"

# Or preserve readability with spaces
readable_content = sanitize_for_postgres(problematic_content, " ")
# Result: "Getting Started with embeddings"

# Use in Document processing
doc = Document(page_content=clean_content, metadata={...})
```

### Integration Pattern

PostgreSQL vector store implementations should sanitize content before insertion:

```python
def add_documents(self, documents: List[Document]) -> List[str]:
    # Sanitize documents before insertion
    sanitized_docs = []
    for doc in documents:
        sanitized_content = sanitize_for_postgres(doc.page_content, " ")
        sanitized_doc = Document(
            page_content=sanitized_content,
            metadata=doc.metadata,
            id=doc.id
        )
        sanitized_docs.append(sanitized_doc)
    return self._insert_documents_to_db(sanitized_docs)
```

## Changes Made

- Added `sanitize_for_postgres()` function in `langchain_core/utils/strings.py`
- Updated `langchain_core/utils/__init__.py` to export the new function
- Added comprehensive unit tests in `tests/unit_tests/utils/test_strings.py`
- Validated against all examples from the original issue report

## Testing

All tests pass, including:

- Basic NUL byte removal and replacement
- Multiple consecutive NUL bytes
- Empty string handling
- Real examples from the GitHub issue
- Backward compatibility with existing string utilities

This utility enables PostgreSQL integrations in both langchain-community and langchain-postgres packages to handle documents with NUL bytes reliably.

Fixes #26033.

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: mdrxy <61371264+mdrxy@users.noreply.github.com>
Co-authored-by: Mason Daugherty <github@mdrxy.com>
.devcontainer	community[minor]: Add ApertureDB as a vectorstore (#24088)	2024-07-16 09:32:59 -07:00
.github	chore: update `copilot-instructions.md` (#32159)	2025-07-21 20:17:41 -04:00
cookbook	chore(docs): bump langgraph in docs & reformat all docs (#32044)	2025-07-15 15:06:59 +00:00
docs	docs: fix vectorstore feature table - correct "IDs in add Documents" values (#32153)	2025-07-21 20:29:34 -04:00
libs	feat(core): add `sanitize_for_postgres` utility to fix PostgreSQL NUL byte DataError (#32157)	2025-07-21 20:33:20 -04:00
scripts	fix: automatically fix issues with ruff (#31897)	2025-07-07 14:13:10 -04:00
.gitattributes
.gitignore	[performance]: Adding benchmarks for common `langchain-core` imports (#30747)	2025-04-09 13:00:15 -04:00
.pre-commit-config.yaml	voyageai: remove from monorepo (#31281)	2025-05-19 16:33:38 +00:00
.readthedocs.yaml	docs(readthedocs): streamline config (#30307)	2025-03-18 11:47:45 -04:00
CITATION.cff
LICENSE
Makefile	ruff: more rules across the board & fixes (#31898)	2025-07-07 17:48:01 -04:00
MIGRATE.md	Proofreading and Editing Report for Migration Guide (#28084)	2024-11-13 11:03:09 -05:00
poetry.toml
pyproject.toml	fix(infra): update some notebook cassettes (#32087)	2025-07-17 13:57:29 -04:00
README.md	chore: update readme with forum link (#32027)	2025-07-14 09:15:26 -07:00
SECURITY.md	chore: update SECURITY.md (#32060)	2025-07-16 10:20:59 -04:00
uv.lock	fix(infra): update some notebook cassettes (#32087)	2025-07-17 13:57:29 -04:00
yarn.lock	box: add langchain box package and DocumentLoader (#25506)	2024-08-21 02:23:43 +00:00
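
The latest commit listed above describes `sanitize_for_postgres(text, replacement="")` as removing or replacing NUL bytes. The behavior it describes can be sketched in plain Python without installing anything. This is a minimal stand-in written for illustration, not the code from `langchain_core/utils/strings.py`; the function name `sanitize_nul_bytes` is hypothetical.

```python
def sanitize_nul_bytes(text: str, replacement: str = "") -> str:
    """Return `text` with every NUL (0x00) character swapped for `replacement`.

    Hypothetical stand-in that mirrors the behavior described in the commit
    message; it is not the library's own implementation.
    """
    return text.replace("\x00", replacement)


problematic = "Getting\x00Started with embeddings"

# Default: drop NUL characters entirely.
removed = sanitize_nul_bytes(problematic)

# Alternative: substitute a space to keep word boundaries readable.
spaced = sanitize_nul_bytes(problematic, " ")

print(repr(removed))
print(repr(spaced))
```

Replacing with a space keeps words separated when the NUL byte stood in for whitespace, at the cost of changing the text length.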

README.md

Note

Looking for the JS/TS library? Check out LangChain.js.

LangChain is a framework for building LLM-powered applications. It helps you chain together interoperable components and third-party integrations to simplify AI application development — all while future-proofing decisions as the underlying technology evolves.

pip install -U langchain

To learn more about LangChain, check out the docs. If you’re looking for more advanced customization or agent orchestration, check out LangGraph, our framework for building controllable agent workflows.

Why use LangChain?

LangChain helps developers build applications powered by LLMs through a standard interface for models, embeddings, vector stores, and more.

Use LangChain for:

Real-time data augmentation. Easily connect LLMs to diverse data sources and external / internal systems, drawing from LangChain’s vast library of integrations with model providers, tools, vector stores, retrievers, and more.
Model interoperability. Swap models in and out as your engineering team experiments to find the best choice for your application’s needs. As the industry frontier evolves, adapt quickly — LangChain’s abstractions keep you moving without losing momentum.
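
The idea behind model interoperability can be illustrated with a small, dependency-free sketch: application code is written against one interface, so the model behind it can be swapped without touching the caller. Everything below (`ChatModel`, `EchoModel`, `ShoutModel`, `summarize`) is hypothetical and written for illustration only; it is not LangChain's API, and the two "models" are toy stand-ins rather than real LLM clients.

```python
from dataclasses import dataclass
from typing import Protocol


class ChatModel(Protocol):
    """Hypothetical minimal interface: anything with invoke(str) -> str."""

    def invoke(self, prompt: str) -> str: ...


@dataclass
class EchoModel:
    """Toy stand-in for one provider: echoes the prompt with a prefix."""

    prefix: str = "echo"

    def invoke(self, prompt: str) -> str:
        return f"{self.prefix}: {prompt}"


class ShoutModel:
    """Toy stand-in for another provider: upper-cases the prompt."""

    def invoke(self, prompt: str) -> str:
        return prompt.upper()


def summarize(model: ChatModel, text: str) -> str:
    """Application logic that depends only on the shared interface."""
    return model.invoke(f"Summarize: {text}")


# Swapping the model requires no change to summarize().
echo_result = summarize(EchoModel(), "hello")
shout_result = summarize(ShoutModel(), "hello")
print(echo_result)
print(shout_result)
```

Structural typing via `Protocol` is one way to express such an interface; an abstract base class would work as well.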

LangChain’s ecosystem

While the LangChain framework can be used standalone, it also integrates seamlessly with any LangChain product, giving developers a full suite of tools when building LLM applications.

To improve your LLM application development, pair LangChain with:

LangSmith - Helpful for agent evals and observability. Debug poor-performing LLM app runs, evaluate agent trajectories, gain visibility in production, and improve performance over time.
LangGraph - Build agents that can reliably handle complex tasks with LangGraph, our low-level agent orchestration framework. LangGraph offers customizable architecture, long-term memory, and human-in-the-loop workflows — and is trusted in production by companies like LinkedIn, Uber, Klarna, and GitLab.
LangGraph Platform - Deploy and scale agents effortlessly with a purpose-built deployment platform for long-running, stateful workflows. Discover, reuse, configure, and share agents across teams, and iterate quickly with visual prototyping in LangGraph Studio.

Additional resources

Tutorials: Simple walkthroughs with guided examples on getting started with LangChain.
How-to Guides: Quick, actionable code snippets for topics such as tool calling, RAG use cases, and more.
Conceptual Guides: Explanations of key concepts behind the LangChain framework.
LangChain Forum: Connect with the community and share all of your technical questions, ideas, and feedback.
API Reference: Detailed reference on navigating base packages and integrations for LangChain.
