mirror of
https://github.com/hwchase17/langchain.git
synced 2025-06-22 23:00:00 +00:00
feat: Add Google Speech to Text API Document Loader (#12298)
- Add Document Loader for Google Speech-to-Text
- Similar structure to the [AssemblyAI Document Loader][1]

[1]: https://python.langchain.com/docs/integrations/document_loaders/assemblyai
This commit is contained in:
parent
52c194ec3a
commit
134f085824
File diff suppressed because one or more lines are too long
@ -0,0 +1,202 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Google Speech-to-Text Audio Transcripts\n",
    "\n",
    "The `GoogleSpeechToTextLoader` allows you to transcribe audio files with the [Google Cloud Speech-to-Text API](https://cloud.google.com/speech-to-text) and load the transcribed text into documents.\n",
    "\n",
    "To use it, you should have the `google-cloud-speech` python package installed, and a Google Cloud project with the [Speech-to-Text API enabled](https://cloud.google.com/speech-to-text/v2/docs/transcribe-client-libraries#before_you_begin).\n",
    "\n",
    "- [Bringing the power of large models to Google Cloud’s Speech API](https://cloud.google.com/blog/products/ai-machine-learning/bringing-power-large-models-google-clouds-speech-api)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Installation & setup\n",
    "\n",
    "First, you need to install the `google-cloud-speech` python package.\n",
    "\n",
    "You can find more info about it on the [Speech-to-Text client libraries](https://cloud.google.com/speech-to-text/v2/docs/libraries) page.\n",
    "\n",
    "Follow the [quickstart guide](https://cloud.google.com/speech-to-text/v2/docs/sync-recognize) in the Google Cloud documentation to create a project and enable the API."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install google-cloud-speech\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Example\n",
    "\n",
    "The `GoogleSpeechToTextLoader` requires the `project_id` and `file_path` arguments. Audio files can be specified as a Google Cloud Storage URI (`gs://...`) or a local file path.\n",
    "\n",
    "Only synchronous requests are supported by the loader, which has a [limit of 60 seconds or 10MB](https://cloud.google.com/speech-to-text/v2/docs/sync-recognize#:~:text=60%20seconds%20and/or%2010%20MB) per audio file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.document_loaders import GoogleSpeechToTextLoader\n",
    "\n",
    "project_id = \"<PROJECT_ID>\"\n",
    "file_path = \"gs://cloud-samples-data/speech/audio.flac\"\n",
    "# or a local file path: file_path = \"./audio.wav\"\n",
    "\n",
    "loader = GoogleSpeechToTextLoader(project_id=project_id, file_path=file_path)\n",
    "\n",
    "docs = loader.load()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note: Calling `loader.load()` blocks until the transcription is finished."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The transcribed text is available in the `page_content`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "docs[0].page_content\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "\"How old is the Brooklyn Bridge?\"\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `metadata` contains the full JSON response with more meta information:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "docs[0].metadata\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```json\n",
    "{\n",
    "  'language_code': 'en-US',\n",
    "  'result_end_offset': datetime.timedelta(seconds=1)\n",
    "}\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Recognition Config\n",
    "\n",
    "You can specify the `config` argument to use different speech recognition models and enable specific features.\n",
    "\n",
    "Refer to the [Speech-to-Text recognizers documentation](https://cloud.google.com/speech-to-text/v2/docs/recognizers) and the [`RecognizeRequest`](https://cloud.google.com/python/docs/reference/speech/latest/google.cloud.speech_v2.types.RecognizeRequest) API reference for information on how to set a custom configuration.\n",
    "\n",
    "If you don't specify a `config`, the following options will be selected automatically:\n",
    "\n",
    "- Model: [Chirp Universal Speech Model](https://cloud.google.com/speech-to-text/v2/docs/chirp-model)\n",
    "- Language: `en-US`\n",
    "- Audio Encoding: Automatically Detected\n",
    "- Automatic Punctuation: Enabled"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "from google.cloud.speech_v2 import AutoDetectDecodingConfig, RecognitionConfig, RecognitionFeatures\n",
    "from langchain.document_loaders import GoogleSpeechToTextLoader\n",
    "\n",
    "project_id = \"<PROJECT_ID>\"\n",
    "location = \"global\"\n",
    "recognizer_id = \"<RECOGNIZER_ID>\"\n",
    "file_path = \"./audio.wav\"\n",
    "\n",
    "config = RecognitionConfig(\n",
    "    auto_decoding_config=AutoDetectDecodingConfig(),\n",
    "    language_codes=[\"en-US\"],\n",
    "    model=\"long\",\n",
    "    features=RecognitionFeatures(\n",
    "        enable_automatic_punctuation=False,\n",
    "        profanity_filter=True,\n",
    "        enable_spoken_punctuation=True,\n",
    "        enable_spoken_emojis=True,\n",
    "    ),\n",
    ")\n",
    "\n",
    "loader = GoogleSpeechToTextLoader(\n",
    "    project_id=project_id,\n",
    "    location=location,\n",
    "    recognizer_id=recognizer_id,\n",
    "    file_path=file_path,\n",
    "    config=config,\n",
    ")\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.0"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
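The notebook notes that only synchronous requests are supported, with a limit of 60 seconds or 10 MB per audio file. As a rough pre-flight sketch, a local file can be screened against the size half of that limit before calling `loader.load()`; the helper name and size-only check are assumptions, not part of the loader's API, and the 60-second duration limit still has to be checked separately.

```python
import os

# 10 MB limit for synchronous Speech-to-Text requests (hypothetical
# pre-flight helper; not part of GoogleSpeechToTextLoader's API).
MAX_SYNC_BYTES = 10 * 1024 * 1024


def fits_sync_size_limit(file_path: str) -> bool:
    """Return True if a local audio file is within the 10 MB sync limit.

    This checks file size only; audio duration must be validated separately.
    """
    return os.path.getsize(file_path) <= MAX_SYNC_BYTES
```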
@ -89,10 +89,28 @@ See a [usage example and authorizing instructions](/docs/integrations/document_l
from langchain.document_loaders import GoogleDriveLoader
```

### Speech-to-Text

> [Google Cloud Speech-to-Text](https://cloud.google.com/speech-to-text) is an audio transcription API powered by Google's speech recognition models.

This document loader transcribes audio files and outputs the text results as Documents.

First, we need to install the python package.

```bash
pip install google-cloud-speech
```

See a [usage example and authorizing instructions](/docs/integrations/document_loaders/google_speech_to_text).

```python
from langchain.document_loaders import GoogleSpeechToTextLoader
```

## Vector Store

### Vertex AI Vector Search

> [Vertex AI Vector Search](https://cloud.google.com/vertex-ai/docs/matching-engine/overview),
> formerly known as Vertex AI Matching Engine, provides the industry's leading high-scale
> low latency vector database. These vector databases are commonly
> referred to as vector similarity-matching or an approximate nearest neighbor (ANN) service.
@ -87,6 +87,7 @@ from langchain.document_loaders.geodataframe import GeoDataFrameLoader
from langchain.document_loaders.git import GitLoader
from langchain.document_loaders.gitbook import GitbookLoader
from langchain.document_loaders.github import GitHubIssuesLoader
from langchain.document_loaders.google_speech_to_text import GoogleSpeechToTextLoader
from langchain.document_loaders.googledrive import GoogleDriveLoader
from langchain.document_loaders.gutenberg import GutenbergLoader
from langchain.document_loaders.hn import HNLoader
@ -267,6 +268,7 @@ __all__ = [
    "GitbookLoader",
    "GoogleApiClient",
    "GoogleApiYoutubeLoader",
    "GoogleSpeechToTextLoader",
    "GoogleDriveLoader",
    "GutenbergLoader",
    "HNLoader",
@ -0,0 +1,136 @@
from __future__ import annotations

from typing import TYPE_CHECKING, List, Optional

from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader
from langchain.utilities.vertexai import get_client_info

if TYPE_CHECKING:
    from google.cloud.speech_v2 import RecognitionConfig
    from google.protobuf.field_mask_pb2 import FieldMask


class GoogleSpeechToTextLoader(BaseLoader):
    """
    Loader for Google Cloud Speech-to-Text audio transcripts.

    It uses the Google Cloud Speech-to-Text API to transcribe audio files
    and loads the transcribed text into one or more Documents,
    depending on the specified format.

    To use, you should have the ``google-cloud-speech`` python package installed.

    Audio files can be specified via a Google Cloud Storage uri or a local file path.

    For a detailed explanation of Google Cloud Speech-to-Text, refer to the product
    documentation.
    https://cloud.google.com/speech-to-text
    """

    def __init__(
        self,
        project_id: str,
        file_path: str,
        location: str = "us-central1",
        recognizer_id: str = "_",
        config: Optional[RecognitionConfig] = None,
        config_mask: Optional[FieldMask] = None,
    ):
        """
        Initializes the GoogleSpeechToTextLoader.

        Args:
            project_id: Google Cloud Project ID.
            file_path: A Google Cloud Storage URI or a local file path.
            location: Speech-to-Text recognizer location.
            recognizer_id: Speech-to-Text recognizer id.
            config: Recognition options and features.
                For more information:
                https://cloud.google.com/python/docs/reference/speech/latest/google.cloud.speech_v2.types.RecognitionConfig
            config_mask: The list of fields in config that override the values in the
                ``default_recognition_config`` of the recognizer during this
                recognition request.
                For more information:
                https://cloud.google.com/python/docs/reference/speech/latest/google.cloud.speech_v2.types.RecognizeRequest
        """
        try:
            from google.api_core.client_options import ClientOptions
            from google.cloud.speech_v2 import (
                AutoDetectDecodingConfig,
                RecognitionConfig,
                RecognitionFeatures,
                SpeechClient,
            )
        except ImportError as exc:
            raise ImportError(
                "Could not import google-cloud-speech python package. "
                "Please install it with `pip install google-cloud-speech`."
            ) from exc

        self.project_id = project_id
        self.file_path = file_path
        self.location = location
        self.recognizer_id = recognizer_id
        # Config must be set in speech recognition request.
        self.config = config or RecognitionConfig(
            auto_decoding_config=AutoDetectDecodingConfig(),
            language_codes=["en-US"],
            model="chirp",
            features=RecognitionFeatures(
                # Automatic punctuation could be useful for language applications
                enable_automatic_punctuation=True,
            ),
        )
        self.config_mask = config_mask

        self._client = SpeechClient(
            client_info=get_client_info(module="speech-to-text"),
            client_options=(
                ClientOptions(api_endpoint=f"{location}-speech.googleapis.com")
                if location != "global"
                else None
            ),
        )
        self._recognizer_path = self._client.recognizer_path(
            project_id, location, recognizer_id
        )

    def load(self) -> List[Document]:
        """Transcribes the audio file and loads the transcript into documents.

        It uses the Google Cloud Speech-to-Text API to transcribe the audio file
        and blocks until the transcription is finished.
        """
        try:
            from google.cloud.speech_v2 import RecognizeRequest
        except ImportError as exc:
            raise ImportError(
                "Could not import google-cloud-speech python package. "
                "Please install it with `pip install google-cloud-speech`."
            ) from exc

        request = RecognizeRequest(
            recognizer=self._recognizer_path,
            config=self.config,
            config_mask=self.config_mask,
        )

        if "gs://" in self.file_path:
            request.uri = self.file_path
        else:
            with open(self.file_path, "rb") as f:
                request.content = f.read()

        response = self._client.recognize(request=request)

        return [
            Document(
                page_content=result.alternatives[0].transcript,
                metadata={
                    "language_code": result.language_code,
                    "result_end_offset": result.result_end_offset,
                },
            )
            for result in response.results
        ]
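The `load` method above either passes a Google Cloud Storage URI by reference or reads a local file into the request body. A minimal standalone sketch of that branching (the function name is illustrative and not part of the loader; it also uses `startswith`, a slightly stricter test than the substring check in the loader):

```python
def request_source(file_path: str) -> dict:
    """Illustrative helper mirroring the loader's branching:
    GCS URIs are passed by reference under `uri`; local files
    are read into `content` as raw bytes."""
    if file_path.startswith("gs://"):
        return {"uri": file_path}
    with open(file_path, "rb") as f:
        return {"content": f.read()}
```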
@ -0,0 +1,34 @@
"""Test Google Speech-to-Text document loader.

You need to create a Google Cloud project and enable the Speech-to-Text API to run the
integration tests.
Follow the instructions in the example notebook:
google_speech_to_text.ipynb
to set up the app and configure authentication.
"""

import pytest

from langchain.document_loaders.google_speech_to_text import GoogleSpeechToTextLoader


@pytest.mark.requires("google.api_core")
def test_initialization() -> None:
    loader = GoogleSpeechToTextLoader(
        project_id="test_project_id", file_path="./testfile.mp3"
    )
    assert loader.project_id == "test_project_id"
    assert loader.file_path == "./testfile.mp3"
    assert loader.location == "us-central1"
    assert loader.recognizer_id == "_"


@pytest.mark.requires("google.api_core")
def test_load() -> None:
    loader = GoogleSpeechToTextLoader(
        project_id="test_project_id", file_path="./testfile.mp3"
    )
    docs = loader.load()
    assert len(docs) == 1
    assert docs[0].page_content == "Test transcription text"
    assert docs[0].metadata["language_code"] == "en-US"
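The `test_load` test above needs a real Google Cloud project. The result-to-Document mapping itself can also be exercised offline; a hedged sketch with stand-in objects (all names below are illustrative stand-ins, not the real `speech_v2` response types, and plain dicts stand in for `Document` fields):

```python
from dataclasses import dataclass
from typing import List


# Illustrative stand-ins for speech_v2 response objects (assumptions,
# not the real API), used to exercise the mapping without network calls.
@dataclass
class FakeAlternative:
    transcript: str


@dataclass
class FakeResult:
    alternatives: List[FakeAlternative]
    language_code: str = "en-US"
    result_end_offset: float = 0.0


def results_to_docs(results) -> list:
    """Apply the same field mapping as the loader's return expression,
    emitting plain dicts in place of langchain Documents."""
    return [
        {
            "page_content": r.alternatives[0].transcript,
            "metadata": {
                "language_code": r.language_code,
                "result_end_offset": r.result_end_offset,
            },
        }
        for r in results
    ]
```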