## Description

I am submitting this for a school project as part of a team of 5. Other team members are @LeilaChr, @maazh10, @Megabear137, @jelalalamy. This PR also has contributions from community members @Harrolee and @Mario928. Initial context is in the issue we opened (#11229).

This pull request adds:
- Generic framework for expanding the languages that `LanguageParser` can handle, using the [tree-sitter](https://github.com/tree-sitter/py-tree-sitter#py-tree-sitter) parsing library and existing language-specific parsers written for it
- Support for the following additional languages in `LanguageParser`:
  - C
  - C++
  - C#
  - Go
  - Java (contributed by @Mario928 https://github.com/ThatsJustCheesy/langchain/pull/2)
  - Kotlin
  - Lua
  - Perl
  - Ruby
  - Rust
  - Scala
  - TypeScript (contributed by @Harrolee https://github.com/ThatsJustCheesy/langchain/pull/1)

Here is the [design document](https://docs.google.com/document/d/17dB14cKCWAaiTeSeBtxHpoVPGKrsPye8W0o_WClz2kk) if curious, but no need to read it.

## Issues
- Closes #11229
- Closes #10996
- Closes #8405

## Dependencies
`tree_sitter` and `tree_sitter_languages` on PyPI. We have tried to add these as optional dependencies.

## Documentation
We have updated the list of supported languages, and also added a section to `source_code.ipynb` detailing how to add support for additional languages using our framework.

## Maintainer
- @hwchase17 (previously reviewed https://github.com/langchain-ai/langchain/pull/6486)

Thanks!!

## Git commits
We will gladly squash any/all of our commits (esp merge commits) if necessary. Let us know if this is desirable, or if you will be squash-merging anyway.

---------

Co-authored-by: Maaz Hashmi <mhashmi373@gmail.com>
Co-authored-by: LeilaChr <87657694+LeilaChr@users.noreply.github.com>
Co-authored-by: Jeremy La <jeremylai511@gmail.com>
Co-authored-by: Megabear137 <zubair.alnoor27@gmail.com>
Co-authored-by: Lee Harrold <lhharrold@sep.com>
Co-authored-by: Mario928 <88029051+Mario928@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
{
"cells": [
{
"cell_type": "markdown",
"id": "213a38a2",
"metadata": {},
"source": [
"# Source Code\n",
"\n",
"This notebook covers how to load source code files using a special approach with language parsing: each top-level function and class in the code is loaded into a separate document. Any remaining top-level code outside the already loaded functions and classes will be loaded into a separate document.\n",
"\n",
"This approach can potentially improve the accuracy of QA models over source code.\n",
"\n",
"The supported languages for code parsing are:\n",
"\n",
"- C (*)\n",
"- C++ (*)\n",
"- C# (*)\n",
"- COBOL\n",
"- Go (*)\n",
"- Java (*)\n",
"- JavaScript (requires package `esprima`)\n",
"- Kotlin (*)\n",
"- Lua (*)\n",
"- Perl (*)\n",
"- Python\n",
"- Ruby (*)\n",
"- Rust (*)\n",
"- Scala (*)\n",
"- TypeScript (*)\n",
"\n",
"Items marked with (*) require the packages `tree_sitter` and `tree_sitter_languages`.\n",
"It is straightforward to add support for additional languages using `tree_sitter`,\n",
"although this currently requires modifying LangChain.\n",
"\n",
"The language used for parsing can be configured, along with the minimum number of\n",
"lines required to activate the splitting based on syntax.\n",
"\n",
"If a language is not explicitly specified, `LanguageParser` will infer one from\n",
"filename extensions, if present."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7fa47b2e",
"metadata": {},
"outputs": [],
"source": [
"%pip install -qU esprima tree_sitter tree_sitter_languages"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "beb55c2f",
"metadata": {},
"outputs": [],
"source": [
"import warnings\n",
"\n",
"warnings.filterwarnings(\"ignore\")\n",
"from pprint import pprint\n",
"\n",
"from langchain.text_splitter import Language\n",
"from langchain_community.document_loaders.generic import GenericLoader\n",
"from langchain_community.document_loaders.parsers import LanguageParser"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "64056e07",
"metadata": {},
"outputs": [],
"source": [
"loader = GenericLoader.from_filesystem(\n",
"    \"./example_data/source_code\",\n",
"    glob=\"*\",\n",
"    suffixes=[\".py\", \".js\"],\n",
"    parser=LanguageParser(),\n",
")\n",
"docs = loader.load()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "8af79bd7",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"6"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(docs)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "85edf3fc",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'content_type': 'functions_classes',\n",
" 'language': <Language.PYTHON: 'python'>,\n",
" 'source': 'example_data/source_code/example.py'}\n",
"{'content_type': 'functions_classes',\n",
" 'language': <Language.PYTHON: 'python'>,\n",
" 'source': 'example_data/source_code/example.py'}\n",
"{'content_type': 'simplified_code',\n",
" 'language': <Language.PYTHON: 'python'>,\n",
" 'source': 'example_data/source_code/example.py'}\n",
"{'content_type': 'functions_classes',\n",
" 'language': <Language.JS: 'js'>,\n",
" 'source': 'example_data/source_code/example.js'}\n",
"{'content_type': 'functions_classes',\n",
" 'language': <Language.JS: 'js'>,\n",
" 'source': 'example_data/source_code/example.js'}\n",
"{'content_type': 'simplified_code',\n",
" 'language': <Language.JS: 'js'>,\n",
" 'source': 'example_data/source_code/example.js'}\n"
]
}
],
"source": [
"for document in docs:\n",
"    pprint(document.metadata)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "f44e3e37",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"class MyClass:\n",
"    def __init__(self, name):\n",
"        self.name = name\n",
"\n",
"    def greet(self):\n",
"        print(f\"Hello, {self.name}!\")\n",
"\n",
"--8<--\n",
"\n",
"def main():\n",
"    name = input(\"Enter your name: \")\n",
"    obj = MyClass(name)\n",
"    obj.greet()\n",
"\n",
"--8<--\n",
"\n",
"# Code for: class MyClass:\n",
"\n",
"\n",
"# Code for: def main():\n",
"\n",
"\n",
"if __name__ == \"__main__\":\n",
"    main()\n",
"\n",
"--8<--\n",
"\n",
"class MyClass {\n",
"  constructor(name) {\n",
"    this.name = name;\n",
"  }\n",
"\n",
"  greet() {\n",
"    console.log(`Hello, ${this.name}!`);\n",
"  }\n",
"}\n",
"\n",
"--8<--\n",
"\n",
"function main() {\n",
"  const name = prompt(\"Enter your name:\");\n",
"  const obj = new MyClass(name);\n",
"  obj.greet();\n",
"}\n",
"\n",
"--8<--\n",
"\n",
"// Code for: class MyClass {\n",
"\n",
"// Code for: function main() {\n",
"\n",
"main();\n"
]
}
],
"source": [
"print(\"\\n\\n--8<--\\n\\n\".join([document.page_content for document in docs]))"
]
},
{
"cell_type": "markdown",
"id": "69aad0ed",
"metadata": {},
"source": [
"The parser can be disabled for small files. \n",
"\n",
"The parameter `parser_threshold` indicates the minimum number of lines that the source code file must have to be segmented using the parser."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "ae024794",
"metadata": {},
"outputs": [],
"source": [
"loader = GenericLoader.from_filesystem(\n",
"    \"./example_data/source_code\",\n",
"    glob=\"*\",\n",
"    suffixes=[\".py\"],\n",
"    parser=LanguageParser(language=Language.PYTHON, parser_threshold=1000),\n",
")\n",
"docs = loader.load()"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "5d3b372a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(docs)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "89e546ad",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"class MyClass:\n",
"    def __init__(self, name):\n",
"        self.name = name\n",
"\n",
"    def greet(self):\n",
"        print(f\"Hello, {self.name}!\")\n",
"\n",
"\n",
"def main():\n",
"    name = input(\"Enter your name: \")\n",
"    obj = MyClass(name)\n",
"    obj.greet()\n",
"\n",
"\n",
"if __name__ == \"__main__\":\n",
"    main()\n",
"\n"
]
}
],
"source": [
"print(docs[0].page_content)"
]
},
{
"cell_type": "markdown",
"id": "c9c71e61",
"metadata": {},
"source": [
"## Splitting\n",
"\n",
"Additional splitting could be needed for those functions, classes, or scripts that are too big."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "adbaa79f",
"metadata": {},
"outputs": [],
"source": [
"loader = GenericLoader.from_filesystem(\n",
"    \"./example_data/source_code\",\n",
"    glob=\"*\",\n",
"    suffixes=[\".js\"],\n",
"    parser=LanguageParser(language=Language.JS),\n",
")\n",
"docs = loader.load()"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "c44c0d3f",
"metadata": {},
"outputs": [],
"source": [
"from langchain.text_splitter import (\n",
"    Language,\n",
"    RecursiveCharacterTextSplitter,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "b1e0053d",
"metadata": {},
"outputs": [],
"source": [
"js_splitter = RecursiveCharacterTextSplitter.from_language(\n",
"    language=Language.JS, chunk_size=60, chunk_overlap=0\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "7dbe6188",
"metadata": {},
"outputs": [],
"source": [
"result = js_splitter.split_documents(docs)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "8a80d089",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"7"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(result)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "000a6011",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"class MyClass {\n",
"  constructor(name) {\n",
"    this.name = name;\n",
"\n",
"--8<--\n",
"\n",
"}\n",
"\n",
"--8<--\n",
"\n",
"greet() {\n",
"    console.log(`Hello, ${this.name}!`);\n",
"  }\n",
"}\n",
"\n",
"--8<--\n",
"\n",
"function main() {\n",
"  const name = prompt(\"Enter your name:\");\n",
"\n",
"--8<--\n",
"\n",
"const obj = new MyClass(name);\n",
"  obj.greet();\n",
"}\n",
"\n",
"--8<--\n",
"\n",
"// Code for: class MyClass {\n",
"\n",
"// Code for: function main() {\n",
"\n",
"--8<--\n",
"\n",
"main();\n"
]
}
],
"source": [
"print(\"\\n\\n--8<--\\n\\n\".join([document.page_content for document in result]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Adding Languages using Tree-sitter Template\n",
"\n",
"Expanding language support using the Tree-sitter template involves a few essential steps:\n",
"\n",
"1. **Creating a New Language File**:\n",
"   - Begin by creating a new file in the designated directory (langchain/libs/community/langchain_community/document_loaders/parsers/language).\n",
"   - Model this file on the structure and parsing logic of existing language files like **`cpp.py`**.\n",
"   - You will also need to create a file in the langchain directory (langchain/libs/langchain/langchain/document_loaders/parsers/language).\n",
"2. **Parsing Language Specifics**:\n",
"   - Mimic the structure used in the **`cpp.py`** file, adapting it to suit the language you are incorporating.\n",
"   - The primary alteration involves adjusting the chunk query array to suit the syntax and structure of the language you are parsing (a sketch of such a file is shown in the cells below).\n",
"3. **Testing the Language Parser**:\n",
"   - For thorough validation, generate a test file specific to the new language. Create **`test_language.py`** in the designated directory (langchain/libs/community/tests/unit_tests/document_loaders/parsers/language).\n",
"   - Follow the example set by **`test_cpp.py`** to establish fundamental tests for the parsed elements in the new language.\n",
"4. **Integration into the Parser and Text Splitter**:\n",
"   - Incorporate your new language within the **`language_parser.py`** file. Be sure to update LANGUAGE_EXTENSIONS and LANGUAGE_SEGMENTERS, along with the docstring for LanguageParser, so that the added language is recognized and handled.\n",
"   - Also confirm that your language is included in the `Language` enum in **`text_splitter.py`** so that it is parsed properly.\n",
"\n",
"By following these steps and ensuring comprehensive testing and integration, you'll successfully extend language support using the Tree-sitter template.\n",
"\n",
"Best of luck!"
]
},
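{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough, hypothetical sketch of steps 1 and 2, the cell below shows what a new segmenter module might look like. It is modeled on the existing **`cpp.py`** and assumes the shared `TreeSitterSegmenter` base class provided by this framework (in `langchain_community/document_loaders/parsers/language/tree_sitter_segmenter.py`). The language name `mylanguage`, the class `MyLanguageSegmenter`, and the node types in `CHUNK_QUERY` are placeholders: replace them with a grammar actually bundled in `tree_sitter_languages` and the node names that grammar defines."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical sketch only, modeled on cpp.py; adapt the names to your language.\n",
"from typing import TYPE_CHECKING\n",
"\n",
"from langchain_community.document_loaders.parsers.language.tree_sitter_segmenter import (\n",
"    TreeSitterSegmenter,\n",
")\n",
"\n",
"if TYPE_CHECKING:\n",
"    from tree_sitter import Language\n",
"\n",
"# Placeholder node types; use the ones defined by the target tree-sitter grammar.\n",
"CHUNK_QUERY = \"\"\"\n",
"    [\n",
"        (function_declaration) @function\n",
"        (class_declaration) @class\n",
"    ]\n",
"\"\"\".strip()\n",
"\n",
"\n",
"class MyLanguageSegmenter(TreeSitterSegmenter):\n",
"    \"\"\"Code segmenter for the hypothetical 'mylanguage'.\"\"\"\n",
"\n",
"    def get_language(self) -> \"Language\":\n",
"        from tree_sitter_languages import get_language\n",
"\n",
"        # The name must match a grammar shipped with tree_sitter_languages.\n",
"        return get_language(\"mylanguage\")\n",
"\n",
"    def get_chunk_query(self) -> str:\n",
"        return CHUNK_QUERY\n",
"\n",
"    def make_line_comment(self, text: str) -> str:\n",
"        # Used for the \"// Code for: ...\" placeholders in the simplified code.\n",
"        return f\"// {text}\"\n",
"\n",
"\n",
"# Step 4 then registers the segmenter, e.g. by adding entries to\n",
"# LANGUAGE_EXTENSIONS and LANGUAGE_SEGMENTERS in language_parser.py and a new\n",
"# member to the Language enum in text_splitter.py."
]
}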
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}