Building applications with LLMs through composability
Go to file
Ahmed Tammaa d3ed9b86be
text-splitters[minor]: Replace lxml and XSLT with BeautifulSoup in HTMLHeaderTextSplitter for Improved Large HTML File Processing (#27678)
This pull request updates the `HTMLHeaderTextSplitter` by replacing the
`split_text_from_file` method's implementation. The original method used
`lxml` and XSLT for processing HTML files, which caused
`lxml.etree.xsltapplyerror maxhead` when handling large HTML documents
due to limitations in the XSLT processor. Fixes #13149

By switching to BeautifulSoup (`bs4`), we achieve:

- **Improved Performance and Reliability:** BeautifulSoup efficiently
processes large HTML files without the errors associated with `lxml` and
XSLT.
- **Simplified Dependencies:** Removes the dependency on `lxml` and
external XSLT files, relying instead on the widely used `beautifulsoup4`
library.
- **Maintained Functionality:** The new method replicates the original
behavior, ensuring compatibility with existing code and preserving the
extraction of content and metadata.

**Issue:**

This change addresses issues related to processing large HTML files with
the existing `HTMLHeaderTextSplitter` implementation. It resolves
problems where users encounter lxml.etree.xsltapplyerror maxhead due to
large HTML documents.

**Dependencies:**

- **BeautifulSoup (`beautifulsoup4`):** The `beautifulsoup4` library is
now used for parsing HTML content.
  - Installation: `pip install beautifulsoup4`

**Code Changes:**

Updated the `split_text_from_file` method in `HTMLHeaderTextSplitter` as
follows:

```python
def split_text_from_file(self, file: Any) -> List[Document]:
    """Split HTML file using BeautifulSoup.

    Args:
        file: HTML file path or file-like object.

    Returns:
        List of Document objects with page_content and metadata.
    """
    from bs4 import BeautifulSoup
    from langchain.docstore.document import Document
    import bs4

    # Read the HTML content from the file or file-like object
    if isinstance(file, str):
        with open(file, 'r', encoding='utf-8') as f:
            html_content = f.read()
    else:
        # Assuming file is a file-like object
        html_content = file.read()

    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(html_content, 'html.parser')

    # Extract the header tags and their corresponding metadata keys
    headers_to_split_on = [tag[0] for tag in self.headers_to_split_on]
    header_mapping = dict(self.headers_to_split_on)

    documents = []

    # Find the body of the document
    body = soup.body if soup.body else soup

    # Find all header tags in the order they appear
    all_headers = body.find_all(headers_to_split_on)

    # If there's content before the first header, collect it
    first_header = all_headers[0] if all_headers else None
    if first_header:
        pre_header_content = ''
        for elem in first_header.find_all_previous():
            if isinstance(elem, bs4.Tag):
                text = elem.get_text(separator=' ', strip=True)
                if text:
                    pre_header_content = text + ' ' + pre_header_content
        if pre_header_content.strip():
            documents.append(Document(
                page_content=pre_header_content.strip(),
                metadata={}  # No metadata since there's no header
            ))
    else:
        # If no headers are found, return the whole content
        full_text = body.get_text(separator=' ', strip=True)
        if full_text.strip():
            documents.append(Document(
                page_content=full_text.strip(),
                metadata={}
            ))
        return documents

    # Process each header and its associated content
    for header in all_headers:
        current_metadata = {}
        header_name = header.name
        header_text = header.get_text(separator=' ', strip=True)
        current_metadata[header_mapping[header_name]] = header_text

        # Collect all sibling elements until the next header of the same or higher level
        content_elements = []
        for sibling in header.find_next_siblings():
            if sibling.name in headers_to_split_on:
                # Stop at the next header
                break
            if isinstance(sibling, bs4.Tag):
                content_elements.append(sibling)

        # Get the text content of the collected elements
        current_content = ''
        for elem in content_elements:
            text = elem.get_text(separator=' ', strip=True)
            if text:
                current_content += text + ' '

        # Create a Document if there is content
        if current_content.strip():
            documents.append(Document(
                page_content=current_content.strip(),
                metadata=current_metadata.copy()
            ))
        else:
            # If there's no content, but we have metadata, still create a Document
            documents.append(Document(
                page_content='',
                metadata=current_metadata.copy()
            ))

    return documents
```

---------

Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
2025-01-20 16:10:37 -05:00
.devcontainer community[minor]: Add ApertureDB as a vectorstore (#24088) 2024-07-16 09:32:59 -07:00
.github infra: fix api build (#29274) 2025-01-17 10:41:59 -08:00
cookbook cookbook: fix typo in cookbook/mongodb-langchain-cache-memory.ipynb (#29149) 2025-01-13 10:35:34 -05:00
docs community[minor]: Refactoring PyMuPDF parser, loader and add image blob parsers (#29063) 2025-01-20 15:15:43 -05:00
libs text-splitters[minor]: Replace lxml and XSLT with BeautifulSoup in HTMLHeaderTextSplitter for Improved Large HTML File Processing (#27678) 2025-01-20 16:10:37 -05:00
scripts infra: update mypy 1.10, ruff 0.5 (#23721) 2024-07-03 10:33:27 -07:00
.gitattributes Update dev container (#6189) 2023-06-16 15:42:14 -07:00
.gitignore infra: gitignore api_ref mds (#25705) 2024-08-23 09:50:30 -07:00
.pre-commit-config.yaml all: Add pre-commit hook (#26993) 2024-12-18 22:22:58 +00:00
.readthedocs.yaml infra: update rtd yaml (#17502) 2024-02-13 18:16:44 -08:00
CITATION.cff rename repo namespace to langchain-ai (#11259) 2023-10-01 15:30:58 -04:00
LICENSE Library Licenses (#13300) 2023-11-28 17:34:27 -08:00
Makefile docs: more api ref links, add linting step to prevent more (#28495) 2024-12-04 04:19:42 +00:00
MIGRATE.md Proofreading and Editing Report for Migration Guide (#28084) 2024-11-13 11:03:09 -05:00
poetry.lock infra: fix notebook tests (#28722) 2024-12-14 15:13:19 +00:00
poetry.toml multiple: use modern installer in poetry (#23998) 2024-07-08 18:50:48 -07:00
pyproject.toml Add Google-style docstring linting and update pyproject.toml (#29303) 2025-01-19 14:37:21 -05:00
README.md docs: fix readme link (#28770) 2024-12-17 18:51:01 +00:00
SECURITY.md docs: single security doc (#28515) 2024-12-04 18:15:34 +00:00
yarn.lock box: add langchain box package and DocumentLoader (#25506) 2024-08-21 02:23:43 +00:00

🦜🔗 LangChain

Build context-aware reasoning applications

Release Notes CI PyPI - License PyPI - Downloads GitHub star chart Open Issues Open in Dev Containers Open in GitHub Codespaces Twitter

Looking for the JS/TS library? Check out LangChain.js.

To help you ship LangChain apps to production faster, check out LangSmith. LangSmith is a unified developer platform for building, testing, and monitoring LLM applications. Fill out this form to speak with our sales team.

Quick Install

With pip:

pip install langchain

With conda:

conda install langchain -c conda-forge

🤔 What is LangChain?

LangChain is a framework for developing applications powered by large language models (LLMs).

For these applications, LangChain simplifies the entire application lifecycle:

  • Open-source libraries: Build your applications using LangChain's open-source components and third-party integrations. Use LangGraph to build stateful agents with first-class streaming and human-in-the-loop support.
  • Productionization: Inspect, monitor, and evaluate your apps with LangSmith so that you can constantly optimize and deploy with confidence.
  • Deployment: Turn your LangGraph applications into production-ready APIs and Assistants with LangGraph Platform.

Open-source libraries

  • langchain-core: Base abstractions.
  • Integration packages (e.g. langchain-openai, langchain-anthropic, etc.): Important integrations have been split into lightweight packages that are co-maintained by the LangChain team and the integration developers.
  • langchain: Chains, agents, and retrieval strategies that make up an application's cognitive architecture.
  • langchain-community: Third-party integrations that are community maintained.
  • LangGraph: Build robust and stateful multi-actor applications with LLMs by modeling steps as edges and nodes in a graph. Integrates smoothly with LangChain, but can be used without it. To learn more about LangGraph, check out our first LangChain Academy course, Introduction to LangGraph, available here.

Productionization:

  • LangSmith: A developer platform that lets you debug, test, evaluate, and monitor chains built on any LLM framework and seamlessly integrates with LangChain.

Deployment:

  • LangGraph Platform: Turn your LangGraph applications into production-ready APIs and Assistants.

Diagram outlining the hierarchical organization of the LangChain framework, displaying the interconnected parts across multiple layers. Diagram outlining the hierarchical organization of the LangChain framework, displaying the interconnected parts across multiple layers.

🧱 What can you build with LangChain?

Question answering with RAG

🧱 Extracting structured output

🤖 Chatbots

And much more! Head to the Tutorials section of the docs for more.

🚀 How does LangChain help?

The main value props of the LangChain libraries are:

  1. Components: composable building blocks, tools and integrations for working with language models. Components are modular and easy-to-use, whether you are using the rest of the LangChain framework or not.
  2. Easy orchestration with LangGraph: LangGraph, built on top of langchain-core, has built-in support for messages, tools, and other LangChain abstractions. This makes it easy to combine components into production-ready applications with persistence, streaming, and other key features. Check out the LangChain tutorials page for examples.

Components

Components fall into the following modules:

📃 Model I/O

This includes prompt management and a generic interface for chat models, including a consistent interface for tool-calling and structured output across model providers.

📚 Retrieval

Retrieval Augmented Generation involves loading data from a variety of sources, preparing it, then searching over (a.k.a. retrieving from) it for use in the generation step.

🤖 Agents

Agents allow an LLM autonomy over how a task is accomplished. Agents make decisions about which Actions to take, then take that Action, observe the result, and repeat until the task is complete. LangGraph makes it easy to use LangChain components to build both custom and built-in LLM agents.

📖 Documentation

Please see here for full documentation, which includes:

  • Introduction: Overview of the framework and the structure of the docs.
  • Tutorials: If you're looking to build something specific or are more of a hands-on learner, check out our tutorials. This is the best place to get started.
  • How-to guides: Answers to “How do I….?” type questions. These guides are goal-oriented and concrete; they're meant to help you complete a specific task.
  • Conceptual guide: Conceptual explanations of the key parts of the framework.
  • API Reference: Thorough documentation of every class and method.

🌐 Ecosystem

  • 🦜🛠️ LangSmith: Trace and evaluate your language model applications and intelligent agents to help you move from prototype to production.
  • 🦜🕸️ LangGraph: Create stateful, multi-actor applications with LLMs. Integrates smoothly with LangChain, but can be used without it.
  • 🦜🕸️ LangGraph Platform: Deploy LLM applications built with LangGraph into production.

💁 Contributing

As an open-source project in a rapidly developing field, we are extremely open to contributions, whether it be in the form of a new feature, improved infrastructure, or better documentation.

For detailed information on how to contribute, see here.

🌟 Contributors

langchain contributors