mirror of
https://github.com/hwchase17/langchain.git
synced 2025-06-22 23:00:00 +00:00
feat: Add Google Speech to Text API Document Loader (#12298)
- Add Document Loader for Google Speech-to-Text
- Similar structure to the [AssemblyAI Document Loader][1]

[1]: https://python.langchain.com/docs/integrations/document_loaders/assemblyai
This commit is contained in:
parent
52c194ec3a
commit
134f085824
File diff suppressed because one or more lines are too long
@ -0,0 +1,202 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Google Speech-to-Text Audio Transcripts\n",
    "\n",
    "The `GoogleSpeechToTextLoader` allows you to transcribe audio files with the [Google Cloud Speech-to-Text API](https://cloud.google.com/speech-to-text) and load the transcribed text into documents.\n",
    "\n",
    "To use it, you should have the `google-cloud-speech` python package installed, and a Google Cloud project with the [Speech-to-Text API enabled](https://cloud.google.com/speech-to-text/v2/docs/transcribe-client-libraries#before_you_begin).\n",
    "\n",
    "- [Bringing the power of large models to Google Cloud’s Speech API](https://cloud.google.com/blog/products/ai-machine-learning/bringing-power-large-models-google-clouds-speech-api)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Installation & setup\n",
    "\n",
    "First, you need to install the `google-cloud-speech` python package.\n",
    "\n",
    "You can find more info about it on the [Speech-to-Text client libraries](https://cloud.google.com/speech-to-text/v2/docs/libraries) page.\n",
    "\n",
    "Follow the [quickstart guide](https://cloud.google.com/speech-to-text/v2/docs/sync-recognize) in the Google Cloud documentation to create a project and enable the API."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install google-cloud-speech\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Example\n",
    "\n",
    "The `GoogleSpeechToTextLoader` requires the `project_id` and `file_path` arguments. Audio files can be specified as a Google Cloud Storage URI (`gs://...`) or a local file path.\n",
    "\n",
    "Only synchronous requests are supported by the loader, which has a [limit of 60 seconds or 10MB](https://cloud.google.com/speech-to-text/v2/docs/sync-recognize#:~:text=60%20seconds%20and/or%2010%20MB) per audio file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.document_loaders import GoogleSpeechToTextLoader\n",
    "\n",
    "project_id = \"<PROJECT_ID>\"\n",
    "file_path = \"gs://cloud-samples-data/speech/audio.flac\"\n",
    "# or a local file path: file_path = \"./audio.wav\"\n",
    "\n",
    "loader = GoogleSpeechToTextLoader(project_id=project_id, file_path=file_path)\n",
    "\n",
    "docs = loader.load()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note: Calling `loader.load()` blocks until the transcription is finished."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The transcribed text is available in the `page_content`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "docs[0].page_content\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "\"How old is the Brooklyn Bridge?\"\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `metadata` contains the full JSON response with more meta information:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "docs[0].metadata\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```json\n",
    "{\n",
    "  'language_code': 'en-US',\n",
    "  'result_end_offset': datetime.timedelta(seconds=1)\n",
    "}\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Recognition Config\n",
    "\n",
    "You can specify the `config` argument to use different speech recognition models and enable specific features.\n",
    "\n",
    "Refer to the [Speech-to-Text recognizers documentation](https://cloud.google.com/speech-to-text/v2/docs/recognizers) and the [`RecognizeRequest`](https://cloud.google.com/python/docs/reference/speech/latest/google.cloud.speech_v2.types.RecognizeRequest) API reference for information on how to set a custom configuration.\n",
    "\n",
    "If you don't specify a `config`, the following options will be selected automatically:\n",
    "\n",
    "- Model: [Chirp Universal Speech Model](https://cloud.google.com/speech-to-text/v2/docs/chirp-model)\n",
    "- Language: `en-US`\n",
    "- Audio Encoding: Automatically Detected\n",
    "- Automatic Punctuation: Enabled"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "from google.cloud.speech_v2 import AutoDetectDecodingConfig, RecognitionConfig, RecognitionFeatures\n",
    "from langchain.document_loaders import GoogleSpeechToTextLoader\n",
    "\n",
    "project_id = \"<PROJECT_ID>\"\n",
    "location = \"global\"\n",
    "recognizer_id = \"<RECOGNIZER_ID>\"\n",
    "file_path = \"./audio.wav\"\n",
    "\n",
    "config = RecognitionConfig(\n",
    "    auto_decoding_config=AutoDetectDecodingConfig(),\n",
    "    language_codes=[\"en-US\"],\n",
    "    model=\"long\",\n",
    "    features=RecognitionFeatures(\n",
    "        enable_automatic_punctuation=False,\n",
    "        profanity_filter=True,\n",
    "        enable_spoken_punctuation=True,\n",
    "        enable_spoken_emojis=True,\n",
    "    ),\n",
    ")\n",
    "\n",
    "loader = GoogleSpeechToTextLoader(\n",
    "    project_id=project_id,\n",
    "    location=location,\n",
    "    recognizer_id=recognizer_id,\n",
    "    file_path=file_path,\n",
    "    config=config,\n",
    ")\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.0"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
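The notebook notes that only synchronous requests are supported, with a limit of 60 seconds or 10 MB per audio file. As a rough pre-flight sketch, a local file can be screened against the size half of that limit before calling `loader.load()`; the helper name and size-only check are assumptions, not part of the loader's API, and the 60-second duration limit still has to be checked separately.

```python
import os

# 10 MB limit for synchronous Speech-to-Text requests (hypothetical
# pre-flight helper; not part of GoogleSpeechToTextLoader's API).
MAX_SYNC_BYTES = 10 * 1024 * 1024


def fits_sync_size_limit(file_path: str) -> bool:
    """Return True if a local audio file is within the 10 MB sync limit.

    This checks file size only; audio duration must be validated separately.
    """
    return os.path.getsize(file_path) <= MAX_SYNC_BYTES
```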
@ -89,10 +89,28 @@ See a [usage example and authorizing instructions](/docs/integrations/document_l
from langchain.document_loaders import GoogleDriveLoader
```

### Speech-to-Text

> [Google Cloud Speech-to-Text](https://cloud.google.com/speech-to-text) is an audio transcription API powered by Google's speech recognition models.

This document loader transcribes audio files and outputs the text results as Documents.

First, we need to install the python package.

```bash
pip install google-cloud-speech
```

See a [usage example and authorizing instructions](/docs/integrations/document_loaders/google_speech_to_text).

```python
from langchain.document_loaders import GoogleSpeechToTextLoader
```

## Vector Store

### Vertex AI Vector Search

> [Vertex AI Vector Search](https://cloud.google.com/vertex-ai/docs/matching-engine/overview),
> formerly known as Vertex AI Matching Engine, provides the industry's leading high-scale
> low latency vector database. These vector databases are commonly
> referred to as vector similarity-matching or an approximate nearest neighbor (ANN) service.
@ -87,6 +87,7 @@ from langchain.document_loaders.geodataframe import GeoDataFrameLoader
from langchain.document_loaders.git import GitLoader
from langchain.document_loaders.gitbook import GitbookLoader
from langchain.document_loaders.github import GitHubIssuesLoader
from langchain.document_loaders.google_speech_to_text import GoogleSpeechToTextLoader
from langchain.document_loaders.googledrive import GoogleDriveLoader
from langchain.document_loaders.gutenberg import GutenbergLoader
from langchain.document_loaders.hn import HNLoader
@ -267,6 +268,7 @@ __all__ = [
    "GitbookLoader",
    "GoogleApiClient",
    "GoogleApiYoutubeLoader",
    "GoogleSpeechToTextLoader",
    "GoogleDriveLoader",
    "GutenbergLoader",
    "HNLoader",
@ -0,0 +1,136 @@
from __future__ import annotations

from typing import TYPE_CHECKING, List, Optional

from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader
from langchain.utilities.vertexai import get_client_info

if TYPE_CHECKING:
    from google.cloud.speech_v2 import RecognitionConfig
    from google.protobuf.field_mask_pb2 import FieldMask


class GoogleSpeechToTextLoader(BaseLoader):
    """
    Loader for Google Cloud Speech-to-Text audio transcripts.

    It uses the Google Cloud Speech-to-Text API to transcribe audio files
    and loads the transcribed text into one or more Documents,
    depending on the specified format.

    To use, you should have the ``google-cloud-speech`` python package installed.

    Audio files can be specified via a Google Cloud Storage uri or a local file path.

    For a detailed explanation of Google Cloud Speech-to-Text, refer to the product
    documentation.
    https://cloud.google.com/speech-to-text
    """

    def __init__(
        self,
        project_id: str,
        file_path: str,
        location: str = "us-central1",
        recognizer_id: str = "_",
        config: Optional[RecognitionConfig] = None,
        config_mask: Optional[FieldMask] = None,
    ):
        """
        Initializes the GoogleSpeechToTextLoader.

        Args:
            project_id: Google Cloud Project ID.
            file_path: A Google Cloud Storage URI or a local file path.
            location: Speech-to-Text recognizer location.
            recognizer_id: Speech-to-Text recognizer id.
            config: Recognition options and features.
                For more information:
                https://cloud.google.com/python/docs/reference/speech/latest/google.cloud.speech_v2.types.RecognitionConfig
            config_mask: The list of fields in config that override the values in the
                ``default_recognition_config`` of the recognizer during this
                recognition request.
                For more information:
                https://cloud.google.com/python/docs/reference/speech/latest/google.cloud.speech_v2.types.RecognizeRequest
        """
        try:
            from google.api_core.client_options import ClientOptions
            from google.cloud.speech_v2 import (
                AutoDetectDecodingConfig,
                RecognitionConfig,
                RecognitionFeatures,
                SpeechClient,
            )
        except ImportError as exc:
            raise ImportError(
                "Could not import google-cloud-speech python package. "
                "Please install it with `pip install google-cloud-speech`."
            ) from exc

        self.project_id = project_id
        self.file_path = file_path
        self.location = location
        self.recognizer_id = recognizer_id
        # Config must be set in speech recognition request.
        self.config = config or RecognitionConfig(
            auto_decoding_config=AutoDetectDecodingConfig(),
            language_codes=["en-US"],
            model="chirp",
            features=RecognitionFeatures(
                # Automatic punctuation could be useful for language applications
                enable_automatic_punctuation=True,
            ),
        )
        self.config_mask = config_mask

        self._client = SpeechClient(
            client_info=get_client_info(module="speech-to-text"),
            client_options=(
                ClientOptions(api_endpoint=f"{location}-speech.googleapis.com")
                if location != "global"
                else None
            ),
        )
        self._recognizer_path = self._client.recognizer_path(
            project_id, location, recognizer_id
        )

    def load(self) -> List[Document]:
        """Transcribes the audio file and loads the transcript into documents.

        It uses the Google Cloud Speech-to-Text API to transcribe the audio file
        and blocks until the transcription is finished.
        """
        try:
            from google.cloud.speech_v2 import RecognizeRequest
        except ImportError as exc:
            raise ImportError(
                "Could not import google-cloud-speech python package. "
                "Please install it with `pip install google-cloud-speech`."
            ) from exc

        request = RecognizeRequest(
            recognizer=self._recognizer_path,
            config=self.config,
            config_mask=self.config_mask,
        )

        if "gs://" in self.file_path:
            request.uri = self.file_path
        else:
            with open(self.file_path, "rb") as f:
                request.content = f.read()

        response = self._client.recognize(request=request)

        return [
            Document(
                page_content=result.alternatives[0].transcript,
                metadata={
                    "language_code": result.language_code,
                    "result_end_offset": result.result_end_offset,
                },
            )
            for result in response.results
        ]
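The `load` method above either passes a Google Cloud Storage URI by reference or reads a local file into the request body. A minimal standalone sketch of that branching (the function name is illustrative and not part of the loader; it also uses `startswith`, a slightly stricter test than the substring check in the loader):

```python
def request_source(file_path: str) -> dict:
    """Illustrative helper mirroring the loader's branching:
    GCS URIs are passed by reference under `uri`; local files
    are read into `content` as raw bytes."""
    if file_path.startswith("gs://"):
        return {"uri": file_path}
    with open(file_path, "rb") as f:
        return {"content": f.read()}
```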
@ -0,0 +1,34 @@
"""Test Google Speech-to-Text document loader.

You need to create a Google Cloud project and enable the Speech-to-Text API to run the
integration tests.
Follow the instructions in the example notebook:
google_speech_to_text.ipynb
to set up the app and configure authentication.
"""

import pytest

from langchain.document_loaders.google_speech_to_text import GoogleSpeechToTextLoader


@pytest.mark.requires("google.api_core")
def test_initialization() -> None:
    loader = GoogleSpeechToTextLoader(
        project_id="test_project_id", file_path="./testfile.mp3"
    )
    assert loader.project_id == "test_project_id"
    assert loader.file_path == "./testfile.mp3"
    assert loader.location == "us-central1"
    assert loader.recognizer_id == "_"


@pytest.mark.requires("google.api_core")
def test_load() -> None:
    loader = GoogleSpeechToTextLoader(
        project_id="test_project_id", file_path="./testfile.mp3"
    )
    docs = loader.load()
    assert len(docs) == 1
    assert docs[0].page_content == "Test transcription text"
    assert docs[0].metadata["language_code"] == "en-US"
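The `test_load` test above needs a real Google Cloud project. The result-to-Document mapping itself can also be exercised offline; a hedged sketch with stand-in objects (all names below are illustrative stand-ins, not the real `speech_v2` response types, and plain dicts stand in for `Document` fields):

```python
from dataclasses import dataclass
from typing import List


# Illustrative stand-ins for speech_v2 response objects (assumptions,
# not the real API), used to exercise the mapping without network calls.
@dataclass
class FakeAlternative:
    transcript: str


@dataclass
class FakeResult:
    alternatives: List[FakeAlternative]
    language_code: str = "en-US"
    result_end_offset: float = 0.0


def results_to_docs(results) -> list:
    """Apply the same field mapping as the loader's return expression,
    emitting plain dicts in place of langchain Documents."""
    return [
        {
            "page_content": r.alternatives[0].transcript,
            "metadata": {
                "language_code": r.language_code,
                "result_end_offset": r.result_end_offset,
            },
        }
        for r in results
    ]
```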