Synthetic Data generation (#9472)

--------- Co-authored-by: William Fu-Hinthorn <13333726+hinthornw@users.noreply.github.com> Co-authored-by: Bagatur <baskaryan@gmail.com>
2025-08-31 10:23:18 +00:00 · 2023-09-28 18:16:05 -07:00
parent a4e0cf6300
commit 5d7c6d1bca
7 changed files with 516 additions and 17 deletions
--- a/docs/extras/use_cases/more/data_generation.ipynb
+++ b/docs/extras/use_cases/more/data_generation.ipynb
@@ -1,46 +1,229 @@
 {
 "cells": [
  {
+   "attachments": {},
   "cell_type": "markdown",
   "id": "aa3571cc",
   "metadata": {},
   "source": [
-    "# Data generation\n",
+    "# Synthetic Data generation\n",
    "\n",
    "[![Open In Collab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/extras/use_cases/data_generation.ipynb)\n",
    "\n",
    "## Use case\n",
    "\n",
-    "Creating synthethic language data can be beneficial for multiple reasons:\n",
-    "- providing data augmentation\n",
-    "- obtaining domain-specific examples\n",
-    "- increasing data diversity\n",
-    "- enabling quick iteration and experimentation\n",
+    "Synthetic data is artificially generated data, rather than data collected from real-world events. It's used to simulate real data without compromising privacy or encountering real-world limitations. \n",
+    "\n",
+    "Benefits of Synthetic Data:\n",
+    "\n",
+    "1. **Privacy and Security**: No real personal data at risk of breaches.\n",
+    "2. **Data Augmentation**: Expands datasets for machine learning.\n",
+    "3. **Flexibility**: Create specific or rare scenarios.\n",
+    "4. **Cost-effective**: Often cheaper than real-world data collection.\n",
+    "5. **Regulatory Compliance**: Helps navigate strict data protection laws.\n",
+    "6. **Model Robustness**: Can lead to better generalizing AI models.\n",
+    "7. **Rapid Prototyping**: Enables quick testing without real data.\n",
+    "8. **Controlled Experimentation**: Simulate specific conditions.\n",
+    "9. **Access to Data**: Alternative when real data isn't available.\n",
+    "\n",
+    "Note: Despite the benefits, synthetic data should be used carefully, as it may not always capture real-world complexities.\n",
    "\n",
    "## Quickstart\n",
    "\n",
-    "Let's see a very straightforward example of how we can use OpenAI functions for creating synthetic data in LangChain."
+    "In this notebook, we'll dive deep into generating synthetic medical billing records using the langchain library. This tool is particularly useful when you want to develop or test algorithms but don't want to use real patient data due to privacy concerns or data availability issues."
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "bca57012",
+   "metadata": {},
+   "source": [
+    "### Setup\n",
+    "First, install the langchain library and its dependencies. We're using the OpenAI generator chain, so we'll install `openai` as well. The synthetic data generator is experimental, so we also need `langchain_experimental`. We'll then import the necessary modules."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
-   "id": "7ae36b66",
-   "metadata": {
-    "scrolled": true
-   },
+   "id": "a0377478",
+   "metadata": {},
   "outputs": [],
   "source": [
-    "!pip install langchain openai \n",
-    "\n",
+    "!pip install -U langchain langchain_experimental openai\n",
    "# Set env var OPENAI_API_KEY or load from a .env file:\n",
    "# import dotenv\n",
-    "# dotenv.load_dotenv()"
+    "# dotenv.load_dotenv()\n",
+    "\n",
+    "from langchain.prompts import FewShotPromptTemplate, PromptTemplate\n",
+    "from langchain.chat_models import ChatOpenAI\n",
+    "from langchain.pydantic_v1 import BaseModel\n",
+    "from langchain_experimental.tabular_synthetic_data.base import SyntheticDataGenerator\n",
+    "from langchain_experimental.tabular_synthetic_data.openai import create_openai_data_generator, OPENAI_TEMPLATE\n",
+    "from langchain_experimental.tabular_synthetic_data.prompts import SYNTHETIC_FEW_SHOT_SUFFIX, SYNTHETIC_FEW_SHOT_PREFIX\n"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "a5a0917b",
+   "metadata": {},
+   "source": [
+    "## 1. Define Your Data Model\n",
+    "Every dataset has a structure or a \"schema\". The `MedicalBilling` class below serves as our schema for the synthetic data. By defining it, we tell the synthetic data generator the shape and types of the data we expect."
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": 2,
+   "execution_count": null,
+   "id": "291bad6e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "class MedicalBilling(BaseModel):\n",
+    "    patient_id: int\n",
+    "    patient_name: str\n",
+    "    diagnosis_code: str\n",
+    "    procedure_code: str\n",
+    "    total_charge: float\n",
+    "    insurance_claim_amount: float\n"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "2059ca63",
+   "metadata": {},
+   "source": [
+    "For instance, every record will have a `patient_id` that's an integer, a `patient_name` that's a string, and so on.\n",
+    "\n",
+    "## 2. Sample Data\n",
+    "To guide the synthetic data generator, it's useful to provide it with a few real-world-like examples. These examples serve as a \"seed\" - they're representative of the kind of data you want, and the generator will use them to create more data that looks similar.\n",
+    "\n",
+    "Here are some fictional medical billing records:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b989b792",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "examples = [\n",
+    "    {\"example\": \"\"\"Patient ID: 123456, Patient Name: John Doe, Diagnosis Code: \n",
+    "        J20.9, Procedure Code: 99203, Total Charge: $500, Insurance Claim Amount: $350\"\"\"},\n",
+    "    {\"example\": \"\"\"Patient ID: 789012, Patient Name: Johnson Smith, Diagnosis \n",
+    "        Code: M54.5, Procedure Code: 99213, Total Charge: $150, Insurance Claim Amount: $120\"\"\"},\n",
+    "    {\"example\": \"\"\"Patient ID: 345678, Patient Name: Emily Stone, Diagnosis Code: \n",
+    "        E11.9, Procedure Code: 99214, Total Charge: $300, Insurance Claim Amount: $250\"\"\"},\n",
+    "]\n"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "57e28809",
+   "metadata": {},
+   "source": [
+    "## 3. Craft a Prompt Template\n",
+    "The generator doesn't magically know how to create our data; we need to guide it. We do this by creating a prompt template. This template helps instruct the underlying language model on how to produce synthetic data in the desired format."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ea6e042e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "OPENAI_TEMPLATE = PromptTemplate(input_variables=[\"example\"], template=\"{example}\")\n",
+    "\n",
+    "prompt_template = FewShotPromptTemplate(\n",
+    "    prefix=SYNTHETIC_FEW_SHOT_PREFIX,\n",
+    "    examples=examples,\n",
+    "    suffix=SYNTHETIC_FEW_SHOT_SUFFIX,\n",
+    "    input_variables=[\"subject\", \"extra\"],\n",
+    "    example_prompt=OPENAI_TEMPLATE,\n",
+    ")\n"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "fa6da3cb",
+   "metadata": {},
+   "source": [
+    "The `FewShotPromptTemplate` includes:\n",
+    "\n",
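+    "- `prefix` and `suffix`: The instruction text placed before and after the examples. Here they are `SYNTHETIC_FEW_SHOT_PREFIX` and `SYNTHETIC_FEW_SHOT_SUFFIX`, which contain the `{subject}` and `{extra}` placeholders.\n",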
+    "- `examples`: The sample data we defined earlier.\n",
+    "- `input_variables`: These variables (\"subject\", \"extra\") are placeholders you can dynamically fill later. For instance, \"subject\" might be filled with \"medical_billing\" to guide the model further.\n",
+    "- `example_prompt`: This prompt template is the format we want each example row to take in our prompt.\n",
+    "\n",
+    "## 4. Creating the Data Generator\n",
+    "With the schema and the prompt ready, the next step is to create the data generator. This object knows how to communicate with the underlying language model to get synthetic data."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1b9ba911",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "synthetic_data_generator = create_openai_data_generator(\n",
+    "    output_schema=MedicalBilling,\n",
+    "    llm=ChatOpenAI(temperature=1),  # You'll need to replace with your actual Language Model instance\n",
+    "    prompt=prompt_template,\n",
+    ")\n"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "a4198bd6",
+   "metadata": {},
+   "source": [
+    "## 5. Generate Synthetic Data\n",
+    "Finally, let's get our synthetic data!"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a424c890",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "synthetic_results = synthetic_data_generator.generate(\n",
+    "    subject=\"medical_billing\",\n",
+    "    extra=\"the name must be chosen at random. Make it something you wouldn't normally choose.\",\n",
+    "    runs=10,\n",
+    ")\n"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "fa4402e9",
+   "metadata": {},
+   "source": [
+    "This command asks the generator to produce 10 synthetic medical billing records. The results are stored in `synthetic_results`. The output will be a list of the MedicalBilling pydantic models."
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "53a4cbf9",
+   "metadata": {},
+   "source": [
+    "### Other implementations\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
   "id": "9e715d94",
   "metadata": {
    "scrolled": true
@@ -429,7 +612,7 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-   "version": "3.9.16"
+   "version": "3.11.4"
  }
 },
 "nbformat": 4,
--- a/libs/experimental/langchain_experimental/tabular_synthetic_data/__init__.py
+++ b/libs/experimental/langchain_experimental/tabular_synthetic_data/__init__.py
--- a/libs/experimental/langchain_experimental/tabular_synthetic_data/base.py
+++ b/libs/experimental/langchain_experimental/tabular_synthetic_data/base.py
@@ -0,0 +1,135 @@
+import asyncio
+from typing import Any, Dict, List, Optional, Union
+
+from langchain.chains.base import Chain
+from langchain.chains.llm import LLMChain
+from langchain.prompts.few_shot import FewShotPromptTemplate
+from langchain.pydantic_v1 import BaseModel, root_validator
+from langchain.schema.language_model import BaseLanguageModel
+
+
+class SyntheticDataGenerator(BaseModel):
+    """Generates synthetic data using the given LLM and few-shot template.
+
+    Utilizes the provided LLM to produce synthetic data based on the
+    few-shot prompt template.
+
+    Attributes:
+        template (FewShotPromptTemplate): Template for few-shot prompting.
+        llm (Optional[BaseLanguageModel]): Large Language Model to use for generation.
+        llm_chain (Optional[Chain]): LLM chain with the LLM and few-shot template.
+        example_input_key (str): Key to use for storing example inputs.
+
+    Usage Example:
+        >>> template = FewShotPromptTemplate(...)
+        >>> llm = BaseLanguageModel(...)
+        >>> generator = SyntheticDataGenerator(template=template, llm=llm)
+        >>> results = generator.generate(subject="climate change", runs=5)
+    """
+
+    template: FewShotPromptTemplate
+    llm: Optional[BaseLanguageModel] = None
+    results: list = []
+    llm_chain: Optional[Chain] = None
+    example_input_key: str = "example"
+
+    class Config:
+        validate_assignment = True
+
+    @root_validator(pre=False, skip_on_failure=True)
+    def set_llm_chain(cls, values: Dict[str, Any]) -> Dict[str, Any]:
+        llm_chain = values.get("llm_chain")
+        llm = values.get("llm")
+        few_shot_template = values.get("template")
+
+        if not llm_chain:  # If llm_chain is None or not present
+            if llm is None or few_shot_template is None:
+                raise ValueError(
+                    "Both llm and few_shot_template must be provided if llm_chain is "
+                    "not given."
+                )
+            values["llm_chain"] = LLMChain(llm=llm, prompt=few_shot_template)
+
+        return values
+
+    @staticmethod
+    def _format_dict_to_string(input_dict: Dict) -> str:
+        formatted_str = ", ".join(
+            [f"{key}: {value}" for key, value in input_dict.items()]
+        )
+        return formatted_str
+
+    def _update_examples(self, example: Union[BaseModel, Dict[str, Any], str]) -> None:
+        """Reduces duplicates by replacing the oldest few-shot example with the
+        most recently generated one."""
+        if self.template and self.template.examples:
+            if isinstance(example, BaseModel):
+                formatted_example = self._format_dict_to_string(example.dict())
+            elif isinstance(example, dict):
+                formatted_example = self._format_dict_to_string(example)
+            else:
+                formatted_example = str(example)
+            self.template.examples.pop(0)
+            self.template.examples.append({self.example_input_key: formatted_example})
+
+    def generate(self, subject: str, runs: int, *args: Any, **kwargs: Any) -> List[str]:
+        """Generate synthetic data using the given subject string.
+
+        Args:
+            subject (str): The subject the synthetic data will be about.
+            runs (int): Number of times to generate the data.
+            extra (str): Extra steering instructions, passed via **kwargs to the chain.
+
+        Returns:
+            List[str]: List of generated synthetic data.
+
+        Usage Example:
+            >>> results = generator.generate(subject="climate change", runs=5,
+            ...     extra="Focus on environmental impacts.")
+        """
+        if self.llm_chain is None:
+            raise ValueError(
+                "llm_chain is None; set either llm_chain or llm at generator "
+                "construction"
+            )
+        for _ in range(runs):
+            result = self.llm_chain.run(subject=subject, *args, **kwargs)
+            self.results.append(result)
+            self._update_examples(result)
+        return self.results
+
+    async def agenerate(
+        self, subject: str, runs: int, extra: str = "", *args: Any, **kwargs: Any
+    ) -> List[str]:
+        """Generate synthetic data using the given subject asynchronously.
+
+        Note: Since the LLM calls run concurrently, earlier results are not fed
+        back as examples, so duplicates are more likely. Adding specific
+        instructions to the "extra" keyword argument can reduce them.
+
+        Args:
+            subject (str): The subject the synthetic data will be about.
+            runs (int): Number of times to generate the data asynchronously.
+            extra (str): Extra instructions for steerability in data generation.
+
+        Returns:
+            List[str]: List of generated synthetic data for the given subject.
+
+        Usage Example:
+            >>> results = await generator.agenerate(subject="climate change", runs=5,
+            ...     extra="Focus on env impacts.")
+        """
+
+        async def run_chain(
+            subject: str, extra: str = "", *args: Any, **kwargs: Any
+        ) -> None:
+            if self.llm_chain is not None:
+                result = await self.llm_chain.arun(
+                    subject=subject, extra=extra, *args, **kwargs
+                )
+                self.results.append(result)
+
+        await asyncio.gather(
+            *(run_chain(subject=subject, extra=extra) for _ in range(runs))
+        )
+        return self.results
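
Note on `_update_examples` in base.py: after each synchronous run, the oldest few-shot example is dropped and the newest result is appended, so the prompt keeps a fixed-size rolling window. The sketch below reproduces that bookkeeping on a plain list. `format_row` and `roll_examples` are hypothetical names used only for illustration; they are not part of the module.

```python
from typing import Any, Dict, List


def format_row(row: Dict[str, Any]) -> str:
    """Mirror _format_dict_to_string: 'key: value' pairs joined by commas."""
    return ", ".join(f"{key}: {value}" for key, value in row.items())


def roll_examples(
    examples: List[Dict[str, str]], new_row: Dict[str, Any], key: str = "example"
) -> None:
    """Drop the oldest example and append the newest one, in place."""
    if examples:  # like the original, do nothing when there are no examples
        examples.pop(0)
        examples.append({key: format_row(new_row)})


examples = [{"example": "seed A"}, {"example": "seed B"}, {"example": "seed C"}]
roll_examples(examples, {"patient_id": 1, "patient_name": "Zed Quill"})
print(examples)
```

The window size never changes, so after as many runs as there are seed examples, the prompt contains only generated rows.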
--- a/libs/experimental/langchain_experimental/tabular_synthetic_data/openai.py
+++ b/libs/experimental/langchain_experimental/tabular_synthetic_data/openai.py
@@ -0,0 +1,64 @@
+from typing import Any, Dict, Optional, Type, Union
+
+from langchain.chains.openai_functions import create_structured_output_chain
+from langchain.chat_models import ChatOpenAI
+from langchain.prompts import PromptTemplate
+from langchain.pydantic_v1 import BaseModel
+from langchain.schema import BaseLLMOutputParser, BasePromptTemplate
+
+from langchain_experimental.tabular_synthetic_data.base import SyntheticDataGenerator
+
+OPENAI_TEMPLATE = PromptTemplate(input_variables=["example"], template="{example}")
+
+
+def create_openai_data_generator(
+    output_schema: Union[Dict[str, Any], Type[BaseModel]],
+    llm: ChatOpenAI,
+    prompt: BasePromptTemplate,
+    output_parser: Optional[BaseLLMOutputParser] = None,
+    **kwargs: Any
+) -> SyntheticDataGenerator:
+    """
+    Create an instance of SyntheticDataGenerator tailored for OpenAI models.
+
+    This function creates an LLM chain designed for structured output based on the
+    provided schema, language model, and prompt template. The resulting chain is then
+    used to instantiate and return a SyntheticDataGenerator.
+
+    Args:
+        output_schema (Union[Dict[str, Any], Type[BaseModel]]): Schema for expected
+        output. This can be either a dictionary representing a valid JsonSchema or a
+        Pydantic BaseModel class.
+
+
+        llm (ChatOpenAI): OpenAI language model to use.
+
+        prompt (BasePromptTemplate): Template to be used for generating prompts.
+
+
+        output_parser (Optional[BaseLLMOutputParser], optional): Parser for
+        processing model outputs. If none is provided, a default will be inferred
+        from the function types.
+
+
+        **kwargs: Additional keyword arguments to be passed to
+        `create_structured_output_chain`.
+
+    Returns:
+        SyntheticDataGenerator: An instance of the data generator set up with
+        the constructed chain.
+
+    Usage:
+        To generate synthetic data with a structured output, first define your desired
+        output schema. Then, use this function to create a SyntheticDataGenerator
+        instance. After obtaining the generator, you can utilize its methods to produce
+        the desired synthetic data.
+    """
+    # Create function calling chain to ensure structured output
+    chain = create_structured_output_chain(
+        output_schema, llm, prompt, output_parser=output_parser, **kwargs
+    )
+
+    # Create the SyntheticDataGenerator instance with the created chain
+    generator = SyntheticDataGenerator(template=prompt, llm_chain=chain)
+    return generator
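
Note on structured output in openai.py: the chain is built so that each result comes back as an instance of the output schema rather than free text. As a rough illustration of that idea only, the sketch below coerces a JSON arguments string into a typed row using the standard library. The dataclass stands in for the pydantic model, and `parse_billing` and the sample payload are hypothetical; this is not how `create_structured_output_chain` is implemented.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class MedicalBilling:
    """Stdlib stand-in for the pydantic schema used in this commit."""

    patient_id: int
    patient_name: str
    diagnosis_code: str
    procedure_code: str
    total_charge: float
    insurance_claim_amount: float


def parse_billing(arguments_json: str) -> MedicalBilling:
    """Coerce each JSON field to its annotated type; a missing field raises KeyError."""
    raw = json.loads(arguments_json)
    coerced = {f.name: f.type(raw[f.name]) for f in fields(MedicalBilling)}
    return MedicalBilling(**coerced)


# Hypothetical payload shaped like the first example record in the notebook.
payload = json.dumps(
    {
        "patient_id": "123456",
        "patient_name": "John Doe",
        "diagnosis_code": "J20.9",
        "procedure_code": "99203",
        "total_charge": 500,
        "insurance_claim_amount": 350,
    }
)
row = parse_billing(payload)
print(row)
```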
--- a/libs/experimental/langchain_experimental/tabular_synthetic_data/prompts.py
+++ b/libs/experimental/langchain_experimental/tabular_synthetic_data/prompts.py
@@ -0,0 +1,13 @@
+from langchain.prompts.prompt import PromptTemplate
+
+DEFAULT_INPUT_KEY = "example"
+DEFAULT_PROMPT = PromptTemplate(
+    input_variables=[DEFAULT_INPUT_KEY], template="{example}"
+)
+
+SYNTHETIC_FEW_SHOT_PREFIX = (
+    "This is a test about generating synthetic data about {subject}. Examples below:"
+)
+SYNTHETIC_FEW_SHOT_SUFFIX = (
+    """Now you generate synthetic data about {subject}. Make sure to {extra}:"""
+)
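
Note on how these two strings are used: in the notebook they become the `prefix` and `suffix` of a `FewShotPromptTemplate`, with the example rows in between. The sketch below approximates the final prompt text with plain string formatting. `render_few_shot` is a hypothetical helper, and the blank-line separator is an assumption about the template's default; the real class does more (validation, example formatting via `example_prompt`).

```python
from typing import Dict, List

# Copied from prompts.py in this diff.
SYNTHETIC_FEW_SHOT_PREFIX = (
    "This is a test about generating synthetic data about {subject}. Examples below:"
)
SYNTHETIC_FEW_SHOT_SUFFIX = (
    "Now you generate synthetic data about {subject}. Make sure to {extra}:"
)


def render_few_shot(
    examples: List[Dict[str, str]],
    subject: str,
    extra: str,
    separator: str = "\n\n",  # assumed separator between prompt pieces
) -> str:
    """Join prefix, example rows, and suffix into one prompt string."""
    prefix = SYNTHETIC_FEW_SHOT_PREFIX.format(subject=subject)
    suffix = SYNTHETIC_FEW_SHOT_SUFFIX.format(subject=subject, extra=extra)
    rows = [row["example"] for row in examples]
    return separator.join([prefix, *rows, suffix])


examples = [
    {"example": "Patient ID: 123456, Patient Name: John Doe"},
    {"example": "Patient ID: 789012, Patient Name: Johnson Smith"},
]
prompt = render_few_shot(examples, "medical_billing", "pick unusual names")
print(prompt)
```

Because the suffix reads "Make sure to {extra}:", the `extra` value works best when phrased as an imperative clause.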
--- a/libs/experimental/tests/integration_tests/chains/test_synthetic_data_openai.py
+++ b/libs/experimental/tests/integration_tests/chains/test_synthetic_data_openai.py
@@ -0,0 +1,104 @@
+import pytest
+from langchain.chat_models import ChatOpenAI
+from langchain.prompts.few_shot import FewShotPromptTemplate
+from langchain.pydantic_v1 import BaseModel
+
+from langchain_experimental.tabular_synthetic_data.base import SyntheticDataGenerator
+from langchain_experimental.tabular_synthetic_data.openai import (
+    OPENAI_TEMPLATE,
+    create_openai_data_generator,
+)
+from langchain_experimental.tabular_synthetic_data.prompts import (
+    SYNTHETIC_FEW_SHOT_PREFIX,
+    SYNTHETIC_FEW_SHOT_SUFFIX,
+)
+
+
+# Define the desired output schema for individual medical billing record
+class MedicalBilling(BaseModel):
+    patient_id: int
+    patient_name: str
+    diagnosis_code: str
+    procedure_code: str
+    total_charge: float
+    insurance_claim_amount: float
+
+
+examples = [
+    {
+        "example": """Patient ID: 123456, Patient Name: John Doe, Diagnosis Code: 
+        J20.9, Procedure Code: 99203, Total Charge: $500, Insurance Claim Amount: 
+        $350"""
+    },
+    {
+        "example": """Patient ID: 789012, Patient Name: Johnson Smith, Diagnosis 
+        Code: M54.5, Procedure Code: 99213, Total Charge: $150, Insurance Claim 
+        Amount: $120"""
+    },
+    {
+        "example": """Patient ID: 345678, Patient Name: Emily Stone, Diagnosis Code: 
+        E11.9, Procedure Code: 99214, Total Charge: $300, Insurance Claim Amount: 
+        $250"""
+    },
+    {
+        "example": """Patient ID: 901234, Patient Name: Robert Miles, Diagnosis Code: 
+        B07.9, Procedure Code: 99204, Total Charge: $200, Insurance Claim Amount: 
+        $160"""
+    },
+    {
+        "example": """Patient ID: 567890, Patient Name: Clara Jensen, Diagnosis Code: 
+        F41.9, Procedure Code: 99205, Total Charge: $450, Insurance Claim Amount: 
+        $310"""
+    },
+    {
+        "example": """Patient ID: 234567, Patient Name: Alan Turing, Diagnosis Code: 
+        G40.909, Procedure Code: 99215, Total Charge: $220, Insurance Claim Amount: 
+        $180"""
+    },
+]
+
+prompt_template = FewShotPromptTemplate(
+    prefix=SYNTHETIC_FEW_SHOT_PREFIX,
+    examples=examples,
+    suffix=SYNTHETIC_FEW_SHOT_SUFFIX,
+    input_variables=["subject", "extra"],
+    example_prompt=OPENAI_TEMPLATE,
+)
+
+
+@pytest.fixture(scope="function")
+def synthetic_data_generator() -> SyntheticDataGenerator:
+    return create_openai_data_generator(
+        output_schema=MedicalBilling,
+        llm=ChatOpenAI(temperature=1),  # replace with your LLM instance
+        prompt=prompt_template,
+    )
+
+
+@pytest.mark.requires("openai")
+def test_generate_synthetic(synthetic_data_generator: SyntheticDataGenerator) -> None:
+    synthetic_results = synthetic_data_generator.generate(
+        subject="medical_billing",
+        extra="""the name must be chosen at random. Make it something you wouldn't 
+        normally choose.""",
+        runs=10,
+    )
+    assert len(synthetic_results) == 10
+    for row in synthetic_results:
+        assert isinstance(row, MedicalBilling)
+
+
+@pytest.mark.requires("openai")
+@pytest.mark.asyncio
+async def test_agenerate_synthetic(
+    synthetic_data_generator: SyntheticDataGenerator,
+) -> None:
+    synthetic_results = await synthetic_data_generator.agenerate(
+        subject="medical_billing",
+        extra="""the name must be chosen at random. Make it something you wouldn't 
+        normally choose.""",
+        runs=10,
+    )
+    assert len(synthetic_results) == 10
+    for row in synthetic_results:
+        assert isinstance(row, MedicalBilling)
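
Note on the `len(synthetic_results) == 10` assertions: `generate` and `agenerate` append to `self.results` and return that same list, so the count holds only for the first call on a given generator. The function-scoped fixture gives each test a fresh instance. The sketch below shows the accumulation with a hypothetical stand-in class that makes no LLM calls.

```python
from typing import List


class AccumulatingGenerator:
    """Hypothetical stand-in that keeps results on the instance, like base.py."""

    def __init__(self) -> None:
        self.results: List[str] = []

    def generate(self, runs: int) -> List[str]:
        for _ in range(runs):
            self.results.append(f"row-{len(self.results)}")
        return self.results  # the same list object on every call


generator = AccumulatingGenerator()
first_count = len(generator.generate(runs=10))
second_count = len(generator.generate(runs=10))
print(first_count, second_count)
```

A caller who reuses one generator and wants only the latest batch would need to slice the returned list or clear `results` between calls.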
--- a/libs/langchain/langchain/prompts/base.py
+++ b/libs/langchain/langchain/prompts/base.py
@@ -27,7 +27,7 @@ def jinja2_formatter(template: str, **kwargs: Any) -> str:
 def validate_jinja2(template: str, input_variables: List[str]) -> None:
    """
    Validate that the input variables are valid for the template.
-    Issues an warning if missing or extra variables are found.
+    Issues a warning if missing or extra variables are found.

    Args:
        template: The template string.