langchain/libs/community/langchain_community/document_loaders/parsers/language/sql.py
Anusha Karkhanis 26bdf40072
Langchain_Community: SQL LanguageParser (#28430)
## Description
(This PR has contributions from @khushiDesai, @ashvini8, and
@ssumaiyaahmed).

This PR addresses **Issue #11229**, which covers the need for SQL
support in document parsing. The new segmenter is integrated into the
generic TreeSitter parsing library, allowing LangChain users to easily
load SQL codebases as smaller, manageable "documents."

This pull request adds a new ```SQLSegmenter``` class, which provides
the SQL integration.
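
As a rough illustration of which statements the segmenter targets, the sketch below labels a statement with the capture name used in the `CHUNK_QUERY` of `sql.py` (`create`, `select`, `insert`, `update`, `delete`). The regex and the `statement_kind` helper are hypothetical stand-ins for illustration only; the actual class relies on the Tree-sitter SQL grammar, not on keyword matching.

```python
import re
from typing import Optional

# Capture names mirror the tags in the segmenter's Tree-sitter query.
# The regex is a hypothetical approximation, not how SQLSegmenter works.
_KIND_PATTERN = re.compile(
    r"^\s*(create\s+table|select|insert|update|delete)\b", re.IGNORECASE
)


def statement_kind(statement: str) -> Optional[str]:
    """Return the capture-style label for a statement, or None if untargeted."""
    match = _KIND_PATTERN.match(statement)
    if match is None:
        return None
    # "create   table" collapses to "create"; other keywords pass through.
    return match.group(1).split()[0].lower()


print(statement_kind("SELECT id FROM users"))       # select
print(statement_kind("create   table t (id INT)"))  # create
print(statement_kind("DROP TABLE users"))           # None
```

Statements outside those five kinds, such as `DROP TABLE`, are not listed in the query, so this sketch returns `None` for them.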

## Issue
**Issue #11229**: Add support for a variety of languages to
LanguageParser

## Testing
We created a file ```test_sql.py``` with several tests to ensure the
```SQLSegmenter``` is functional. Below are the tests we added:

- ```def test_is_valid```: Checks SQL validity.
- ```def test_extract_functions_classes```: Extracts individual SQL
statements.
- ```def test_simplify_code```: Simplifies SQL code with comments.
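
The behavior the last two tests exercise can be sketched without Tree-sitter. The example below assumes the statements have already been extracted and only mirrors the post-processing in `sql.py`: each statement is stripped and given a trailing semicolon, then rendered as a `-- Code for: ...` comment. The function names `normalize_statements` and `simplify` are hypothetical and are not part of the PR.

```python
from typing import List


def normalize_statements(statements: List[str]) -> List[str]:
    """Strip each statement and make sure it ends with a semicolon."""
    normalized = []
    for stmt in statements:
        stmt = stmt.strip()
        normalized.append(stmt if stmt.endswith(";") else stmt + ";")
    return normalized


def simplify(statements: List[str]) -> str:
    """Render each normalized statement as a SQL line comment."""
    return "\n".join(
        f"-- Code for: {stmt}" for stmt in normalize_statements(statements)
    )


# Hypothetical input standing in for Tree-sitter's extracted chunks.
extracted = ["  SELECT id FROM users  ", "DELETE FROM users WHERE id = 2;"]
print(simplify(extracted))
# -- Code for: SELECT id FROM users;
# -- Code for: DELETE FROM users WHERE id = 2;
```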

---------

Co-authored-by: Syeda Sumaiya Ahmed <114104419+ssumaiyaahmed@users.noreply.github.com>
Co-authored-by: ashvini hunagund <97271381+ashvini8@users.noreply.github.com>
Co-authored-by: Khushi Desai <khushi.desai@advantawitty.com>
Co-authored-by: Khushi Desai <59741309+khushiDesai@users.noreply.github.com>
Co-authored-by: ccurme <chester.curme@gmail.com>
2024-12-19 20:30:57 +00:00


from typing import TYPE_CHECKING

from langchain_community.document_loaders.parsers.language.tree_sitter_segmenter import (  # noqa: E501
    TreeSitterSegmenter,
)

if TYPE_CHECKING:
    from tree_sitter import Language

CHUNK_QUERY = """
    [
        (create_table_statement) @create
        (select_statement) @select
        (insert_statement) @insert
        (update_statement) @update
        (delete_statement) @delete
    ]
"""


class SQLSegmenter(TreeSitterSegmenter):
    """Code segmenter for SQL.

    This class uses Tree-sitter to segment SQL code into its
    constituent statements (e.g., SELECT, CREATE TABLE).
    It also provides functionality to extract these
    statements and simplify the code into commented descriptions.
    """

    def get_language(self) -> "Language":
        """Return the SQL language grammar for Tree-sitter."""
        from tree_sitter_languages import get_language

        return get_language("sql")

    def get_chunk_query(self) -> str:
        """Return the Tree-sitter query for SQL segmentation."""
        return CHUNK_QUERY

    def extract_functions_classes(self) -> list[str]:
        """Extract SQL statements from the code.

        Ensures that all SQL statements end with a semicolon
        for consistency.
        """
        extracted = super().extract_functions_classes()
        # Ensure all statements end with a semicolon
        return [
            stmt.strip() + ";" if not stmt.strip().endswith(";") else stmt.strip()
            for stmt in extracted
        ]

    def simplify_code(self) -> str:
        """Simplify the extracted SQL code into comments.

        Converts SQL statements into commented descriptions
        for easy readability.
        """
        return "\n".join(
            [
                f"-- Code for: {stmt.strip()}"
                for stmt in self.extract_functions_classes()
            ]
        )

    def make_line_comment(self, text: str) -> str:
        """Create a line comment in SQL style."""
        return f"-- {text}"