{ "cells": [ { "cell_type": "markdown", "id": "1f3a5ebf", "metadata": {}, "source": [ "# AirbyteLoader" ] }, { "cell_type": "markdown", "id": "35ac77b1-449b-44f7-b8f3-3494d55c286e", "metadata": {}, "source": [ ">[Airbyte](https://github.com/airbytehq/airbyte) is a data integration platform for ELT pipelines from APIs, databases & files to warehouses & lakes. It has the largest catalog of ELT connectors to data warehouses and databases.\n", "\n", "This covers how to load any source from Airbyte into LangChain documents\n", "\n", "## Installation\n", "\n", "In order to use `AirbyteLoader` you need to install the `langchain-airbyte` integration package." ] }, { "cell_type": "code", "execution_count": 1, "id": "180c8b74", "metadata": {}, "outputs": [], "source": [ "% pip install -qU langchain-airbyte" ] }, { "cell_type": "markdown", "id": "3dd92c62", "metadata": {}, "source": [ "Note: Currently, the `airbyte` library does not support Pydantic v2.\n", "Please downgrade to Pydantic v1 to use this package.\n", "\n", "Note: This package also currently requires Python 3.10+.\n", "\n", "## Loading Documents\n", "\n", "By default, the `AirbyteLoader` will load any structured data from a stream and output yaml-formatted documents." ] }, { "cell_type": "code", "execution_count": 6, "id": "721d9316", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "```yaml\n", "academic_degree: PhD\n", "address:\n", " city: Lauderdale Lakes\n", " country_code: FI\n", " postal_code: '75466'\n", " province: New Jersey\n", " state: Hawaii\n", " street_name: Stoneyford\n", " street_number: '1112'\n", "age: 44\n", "blood_type: \"O\\u2212\"\n", "created_at: '2004-04-02T13:05:27+00:00'\n", "email: bread2099+1@outlook.com\n", "gender: Fluid\n", "height: '1.62'\n", "id: 1\n", "language: Belarusian\n", "name: Moses\n", "nationality: Dutch\n", "occupation: Track Worker\n", "telephone: 1-467-194-2318\n", "title: M.Sc.Tech.\n", "updated_at: '2024-02-27T16:41:01+00:00'\n", "weight: 6\n" ] } ], "source": [ "from langchain_airbyte import AirbyteLoader\n", "\n", "loader = AirbyteLoader(\n", " source=\"source-faker\",\n", " stream=\"users\",\n", " config={\"count\": 10},\n", ")\n", "docs = loader.load()\n", "print(docs[0].page_content[:500])" ] }, { "cell_type": "markdown", "id": "fca024cb", "metadata": { "scrolled": true }, "source": [ "You can also specify a custom prompt template for formatting documents:" ] }, { "cell_type": "code", "execution_count": 7, "id": "9fa002a5", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "My name is Verdie and I am 1.73 meters tall.\n" ] } ], "source": [ "from langchain_core.prompts import PromptTemplate\n", "\n", "loader_templated = AirbyteLoader(\n", " source=\"source-faker\",\n", " stream=\"users\",\n", " config={\"count\": 10},\n", " template=PromptTemplate.from_template(\n", " \"My name is {name} and I am {height} meters tall.\"\n", " ),\n", ")\n", "docs_templated = loader_templated.load()\n", "print(docs_templated[0].page_content)" ] }, { "cell_type": "markdown", "id": "d3e6d887", "metadata": {}, "source": [ "## Lazy Loading Documents\n", "\n", "One of the powerful features of `AirbyteLoader` is its ability to load large documents from upstream sources. When working with large datasets, the default `.load()` behavior can be slow and memory-intensive. To avoid this, you can use the `.lazy_load()` method to load documents in a more memory-efficient manner." ] }, { "cell_type": "code", "execution_count": 11, "id": "684b9187", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Just calling lazy load is quick! This took 0.0001 seconds\n" ] } ], "source": [ "import time\n", "\n", "loader = AirbyteLoader(\n", " source=\"source-faker\",\n", " stream=\"users\",\n", " config={\"count\": 3},\n", " template=PromptTemplate.from_template(\n", " \"My name is {name} and I am {height} meters tall.\"\n", " ),\n", ")\n", "\n", "start_time = time.time()\n", "my_iterator = loader.lazy_load()\n", "print(\n", " f\"Just calling lazy load is quick! This took {time.time() - start_time:.4f} seconds\"\n", ")" ] }, { "cell_type": "markdown", "id": "6b24a64b", "metadata": {}, "source": [ "And you can iterate over documents as they're yielded:" ] }, { "cell_type": "code", "execution_count": 12, "id": "3e8355d0", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "My name is Andera and I am 1.91 meters tall.\n", "My name is Jody and I am 1.85 meters tall.\n", "My name is Zonia and I am 1.53 meters tall.\n" ] } ], "source": [ "for doc in my_iterator:\n", " print(doc.page_content)" ] }, { "cell_type": "markdown", "id": "d1040d81", "metadata": {}, "source": [ "You can also lazy load documents in an async manner with `.alazy_load()`:" ] }, { "cell_type": "code", "execution_count": 13, "id": "dc5d0911", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "My name is Carmelina and I am 1.74 meters tall.\n", "My name is Ali and I am 1.90 meters tall.\n", "My name is Rochell and I am 1.83 meters tall.\n" ] } ], "source": [ "loader = AirbyteLoader(\n", " source=\"source-faker\",\n", " stream=\"users\",\n", " config={\"count\": 3},\n", " template=PromptTemplate.from_template(\n", " \"My name is {name} and I am {height} meters tall.\"\n", " ),\n", ")\n", "\n", "my_async_iterator = loader.alazy_load()\n", "\n", "async for doc in my_async_iterator:\n", " print(doc.page_content)" ] }, { "cell_type": "markdown", "id": "ba4ede33", "metadata": {}, "source": [ "## Configuration\n", "\n", "`AirbyteLoader` can be configured with the following options:\n", "\n", "- `source` (str, required): The name of the Airbyte source to load from.\n", "- `stream` (str, required): The name of the stream to load from (Airbyte sources can return multiple streams)\n", "- `config` (dict, required): The configuration for the Airbyte source\n", "- `template` (PromptTemplate, optional): A custom prompt template for formatting documents\n", "- `include_metadata` (bool, optional, default True): Whether to include all fields as metadata in the output documents\n", "\n", "The majority of the configuration will be in `config`, and you can find the specific configuration options in the \"Config field reference\" for each source in the [Airbyte documentation](https://docs.airbyte.com/integrations/)." ] }, { "cell_type": "markdown", "id": "2e2ed269", "metadata": {}, "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" } }, "nbformat": 4, "nbformat_minor": 5 }