community[minor]: add github file loader to load any github file content b… (#15305)

### Description
Adds support for loading the content of any GitHub file, selected by file extension.

Why not use the [git loader](https://python.langchain.com/docs/integrations/document_loaders/git#load-existing-repository-from-disk)? The git loader clones the whole repository even when only some of its files are of interest, which is too heavy. `GithubFileLoader` downloads only the files you are interested in.

### Twitter handle
my twitter: @shufanhaotop

---------

Co-authored-by: Hao Fan <h_fan@apple.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
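
The `file_filter` argument used in the notebook change is a plain callable that takes a file path and returns a bool. As a rough sketch of how such a predicate could be built for several extensions at once: the helper name `make_extension_filter` and the sample paths are hypothetical and not part of LangChain, and the sketch assumes the paths are repository-relative strings such as `docs/index.md`.

```python
from typing import Callable, Iterable


def make_extension_filter(extensions: Iterable[str]) -> Callable[[str], bool]:
    """Return a predicate that accepts paths ending in any of the given extensions.

    Hypothetical helper for illustration; it is not part of LangChain.
    """
    # str.endswith accepts a tuple of suffixes, so one call covers every extension.
    suffixes = tuple(ext.lower() for ext in extensions)

    def file_filter(file_path: str) -> bool:
        # Compare case-insensitively so "NOTES.MD" matches ".md".
        return file_path.lower().endswith(suffixes)

    return file_filter


markdown_or_notebook = make_extension_filter([".md", ".ipynb"])

# Hypothetical sample paths standing in for the files of a repository.
sample_paths = ["README.md", "docs/intro.ipynb", "libs/core/setup.py"]
selected = [path for path in sample_paths if markdown_or_notebook(path)]
print(selected)
```

A predicate like this could presumably be passed as `file_filter=markdown_or_notebook` in place of the lambda in the notebook example, so that only matching files are downloaded.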
2025-09-08 14:31:55 +00:00 · 2024-02-07 01:42:33 +08:00
parent ac662b3698
commit ef082c77b1
5 changed files with 232 additions and 99 deletions
--- a/docs/docs/integrations/document_loaders/github.ipynb
+++ b/docs/docs/integrations/document_loaders/github.ipynb
@@ -6,7 +6,7 @@
   "source": [
    "# GitHub\n",
    "\n",
-    "This notebooks shows how you can load issues and pull requests (PRs) for a given repository on [GitHub](https://github.com/). We will use the LangChain Python repository as an example."
+    "This notebook shows how you can load issues and pull requests (PRs) for a given repository on [GitHub](https://github.com/). It also shows how you can load GitHub files for a given repository. We will use the LangChain Python repository as an example."
   ]
  },
  {
@@ -46,7 +46,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 10,
+   "execution_count": null,
   "metadata": {
    "tags": []
   },
@@ -57,7 +57,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 11,
+   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -91,7 +91,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 12,
+   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -100,27 +100,9 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 13,
+   "execution_count": null,
   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "# Creates GitHubLoader (#5257)\r\n",
-      "\r\n",
-      "GitHubLoader is a DocumentLoader that loads issues and PRs from GitHub.\r\n",
-      "\r\n",
-      "Fixes #5257\r\n",
-      "\r\n",
-      "Community members can review the PR once tests pass. Tag maintainers/contributors who might be interested:\r\n",
-      "DataLoaders\r\n",
-      "- @eyurtsev\r\n",
-      "\n",
-      "{'url': 'https://github.com/langchain-ai/langchain/pull/5408', 'title': 'DocumentLoader for GitHub', 'creator': 'UmerHA', 'created_at': '2023-05-29T14:50:53Z', 'comments': 0, 'state': 'open', 'labels': ['enhancement', 'lgtm', 'doc loader'], 'assignee': None, 'milestone': None, 'locked': False, 'number': 5408, 'is_pull_request': True}\n"
-     ]
-    }
-   ],
+   "outputs": [],
   "source": [
    "print(docs[0].page_content)\n",
    "print(docs[0].metadata)"
@@ -142,7 +124,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 14,
+   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -157,84 +139,68 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 15,
+   "execution_count": null,
   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "### System Info\n",
-      "\n",
-      "LangChain version = 0.0.167\r\n",
-      "Python version = 3.11.0\r\n",
-      "System = Windows 11 (using Jupyter)\n",
-      "\n",
-      "### Who can help?\n",
-      "\n",
-      "- @hwchase17\r\n",
-      "- @agola11\r\n",
-      "- @UmerHA (I have a fix ready, will submit a PR)\n",
-      "\n",
-      "### Information\n",
-      "\n",
-      "- [ ] The official example notebooks/scripts\n",
-      "- [X] My own modified scripts\n",
-      "\n",
-      "### Related Components\n",
-      "\n",
-      "- [X] LLMs/Chat Models\n",
-      "- [ ] Embedding Models\n",
-      "- [X] Prompts / Prompt Templates / Prompt Selectors\n",
-      "- [ ] Output Parsers\n",
-      "- [ ] Document Loaders\n",
-      "- [ ] Vector Stores / Retrievers\n",
-      "- [ ] Memory\n",
-      "- [ ] Agents / Agent Executors\n",
-      "- [ ] Tools / Toolkits\n",
-      "- [ ] Chains\n",
-      "- [ ] Callbacks/Tracing\n",
-      "- [ ] Async\n",
-      "\n",
-      "### Reproduction\n",
-      "\n",
-      "```\r\n",
-      "import os\r\n",
-      "os.environ[\"OPENAI_API_KEY\"] = \"...\"\r\n",
-      "\r\n",
-      "from langchain.chains import LLMChain\r\n",
-      "from langchain_openai import ChatOpenAI\r\n",
-      "from langchain.prompts import PromptTemplate\r\n",
-      "from langchain.prompts.chat import ChatPromptTemplate\r\n",
-      "from langchain.schema import messages_from_dict\r\n",
-      "\r\n",
-      "role_strings = [\r\n",
-      "    (\"system\", \"you are a bird expert\"), \r\n",
-      "    (\"human\", \"which bird has a point beak?\")\r\n",
-      "]\r\n",
-      "prompt = ChatPromptTemplate.from_role_strings(role_strings)\r\n",
-      "chain = LLMChain(llm=ChatOpenAI(), prompt=prompt)\r\n",
-      "chain.run({})\r\n",
-      "```\n",
-      "\n",
-      "### Expected behavior\n",
-      "\n",
-      "Chain should run\n",
-      "{'url': 'https://github.com/langchain-ai/langchain/issues/5027', 'title': \"ChatOpenAI models don't work with prompts created via ChatPromptTemplate.from_role_strings\", 'creator': 'UmerHA', 'created_at': '2023-05-20T10:39:18Z', 'comments': 1, 'state': 'open', 'labels': [], 'assignee': None, 'milestone': None, 'locked': False, 'number': 5027, 'is_pull_request': False}\n"
-     ]
-    }
-   ],
+   "outputs": [],
   "source": [
    "print(docs[0].page_content)\n",
    "print(docs[0].metadata)"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Load GitHub File Content\n",
+    "\n",
+    "The code below loads all Markdown files in the repo `langchain-ai/langchain`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "from langchain_community.document_loaders import GithubFileLoader"
+   ]
+  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
-   "source": []
+   "source": [
+    "loader = GithubFileLoader(\n",
+    "    repo=\"langchain-ai/langchain\",  # the repo name\n",
+    "    access_token=ACCESS_TOKEN,\n",
+    "    github_api_url=\"https://api.github.com\",\n",
+    "    file_filter=lambda file_path: file_path.endswith(\n",
+    "        \".md\"\n",
+    "    ),  # load all Markdown files\n",
+    ")\n",
+    "documents = loader.load()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Example output for one of the documents:\n",
+    "\n",
+    "```json\n",
+    "documents.metadata: \n",
+    "    {\n",
+    "      \"path\": \"README.md\",\n",
+    "      \"sha\": \"82f1c4ea88ecf8d2dfsfx06a700e84be4\",\n",
+    "      \"source\": \"https://github.com/langchain-ai/langchain/blob/master/README.md\"\n",
+    "    }\n",
+    "documents.content:\n",
+    "    mock content\n",
+    "```"
+   ]
  }
 ],
 "metadata": {
@@ -253,7 +219,7 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-   "version": "3.11.3"
+   "version": "3.9.1"
  }
 },
 "nbformat": 4,