community[patch]: Load YouTube transcripts (captions) as fixed-duration chunks with start times (#21710)

- **Description:** Add a new format, `CHUNKS`, to
`langchain_community.document_loaders.youtube.YoutubeLoader`, which
creates multiple `Document` objects from a YouTube video transcript
(captions), each covering a fixed duration. The metadata of each chunk
`Document` includes the chunk's start time and a URL that opens the
video at that time on the YouTube website.
  
I had implemented this for UMich (@umich-its-ai) in a local module, but
it makes sense to contribute this to LangChain community for all to
benefit and to simplify maintenance.

- **Issue:** N/A
- **Dependencies:** N/A
- **Twitter:** lsloan_umich
- **Mastodon:**
[lsloan@mastodon.social](https://mastodon.social/@lsloan)
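
As a rough illustration of the chunking described above, the sketch below groups transcript pieces into fixed-duration windows using plain dictionaries instead of `Document` objects. The function names `make_chunk` and `chunk_transcript`, the sample pieces, and the placeholder video ID are assumptions made for this example; they are not part of the loader's API.

```python
from typing import Any, Dict, Iterator, List


def make_chunk(
    pieces: List[Dict[str, Any]], start_seconds: int, video_id: str
) -> Dict[str, Any]:
    """Join the text of the pieces and attach start-time metadata."""
    minutes, seconds = divmod(start_seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return {
        "text": " ".join(piece["text"].strip(" ") for piece in pieces),
        "start_seconds": start_seconds,
        "start_timestamp": f"{hours:02d}:{minutes:02d}:{seconds:02d}",
        # URL that opens the video at the start of this chunk
        "source": f"https://www.youtube.com/watch?v={video_id}&t={start_seconds}s",
    }


def chunk_transcript(
    pieces: List[Dict[str, Any]], chunk_size_seconds: int, video_id: str
) -> Iterator[Dict[str, Any]]:
    """Yield one chunk per fixed-duration window of the transcript."""
    chunk_pieces: List[Dict[str, Any]] = []
    chunk_start = 0
    chunk_limit = chunk_size_seconds
    for piece in pieces:
        piece_end = piece["start"] + piece["duration"]
        if piece_end > chunk_limit:
            # The piece ends past the current window: emit what we have
            # and move on to the next window.
            if chunk_pieces:
                yield make_chunk(chunk_pieces, chunk_start, video_id)
            chunk_pieces = []
            chunk_start = chunk_limit
            chunk_limit += chunk_size_seconds
        chunk_pieces.append(piece)
    if chunk_pieces:
        yield make_chunk(chunk_pieces, chunk_start, video_id)


# Hypothetical sample data in the shape returned by the transcript API.
pieces = [
    {"text": "one", "start": 0.0, "duration": 10.0},
    {"text": "two", "start": 12.0, "duration": 10.0},
    {"text": "three", "start": 25.0, "duration": 10.0},
    {"text": "four", "start": 40.0, "duration": 5.0},
]
chunks = list(chunk_transcript(pieces, 30, "VIDEO_ID"))
for chunk in chunks:
    print(chunk["start_timestamp"], chunk["text"], chunk["source"])
```

Note that a piece whose end time crosses a window boundary is placed in the next chunk, and each chunk's start time is the window boundary rather than the start of its first piece.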

With regard to **tests and documentation**, most existing features of
the `YoutubeLoader` class are not tested. Only the
`YoutubeLoader.extract_video_id()` static method had a test. However,
while waiting for this PR to be reviewed and merged, I added a test for
the chunking feature proposed in this PR.

I have added an example of using chunking to the
`docs/docs/integrations/document_loaders/youtube_transcript.ipynb`
notebook.

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
Mr. Lance E Sloan «UMich» 2024-06-11 13:44:36 -04:00 committed by GitHub
parent 71811e0547
commit 84dc2dd059
3 changed files with 336 additions and 56 deletions


@ -15,47 +15,45 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null,
"id": "427d5745", "id": "427d5745",
"metadata": {}, "metadata": {},
"source": "from langchain_community.document_loaders import YoutubeLoader",
"outputs": [], "outputs": [],
"source": [ "execution_count": null
"from langchain_community.document_loaders import YoutubeLoader"
]
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null,
"id": "34a25b57", "id": "34a25b57",
"metadata": { "metadata": {
"scrolled": true "scrolled": true
}, },
"outputs": [],
"source": [ "source": [
"%pip install --upgrade --quiet youtube-transcript-api" "%pip install --upgrade --quiet youtube-transcript-api"
] ],
"outputs": [],
"execution_count": null
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null,
"id": "bc8b308a", "id": "bc8b308a",
"metadata": {}, "metadata": {},
"outputs": [],
"source": [ "source": [
"loader = YoutubeLoader.from_youtube_url(\n", "loader = YoutubeLoader.from_youtube_url(\n",
" \"https://www.youtube.com/watch?v=QsYGlZkevEg\", add_video_info=False\n", " \"https://www.youtube.com/watch?v=QsYGlZkevEg\", add_video_info=False\n",
")" ")"
] ],
"outputs": [],
"execution_count": null
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null,
"id": "d073dd36", "id": "d073dd36",
"metadata": {}, "metadata": {},
"outputs": [],
"source": [ "source": [
"loader.load()" "loader.load()"
] ],
"outputs": [],
"execution_count": null
}, },
{ {
"attachments": {}, "attachments": {},
@ -68,26 +66,26 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null,
"id": "ba28af69", "id": "ba28af69",
"metadata": {}, "metadata": {},
"outputs": [],
"source": [ "source": [
"%pip install --upgrade --quiet pytube" "%pip install --upgrade --quiet pytube"
] ],
"outputs": [],
"execution_count": null
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null,
"id": "9b8ea390", "id": "9b8ea390",
"metadata": {}, "metadata": {},
"outputs": [],
"source": [ "source": [
"loader = YoutubeLoader.from_youtube_url(\n", "loader = YoutubeLoader.from_youtube_url(\n",
" \"https://www.youtube.com/watch?v=QsYGlZkevEg\", add_video_info=True\n", " \"https://www.youtube.com/watch?v=QsYGlZkevEg\", add_video_info=True\n",
")\n", ")\n",
"loader.load()" "loader.load()"
] ],
"outputs": [],
"execution_count": null
}, },
{ {
"attachments": {}, "attachments": {},
@ -104,10 +102,8 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null,
"id": "08510625", "id": "08510625",
"metadata": {}, "metadata": {},
"outputs": [],
"source": [ "source": [
"loader = YoutubeLoader.from_youtube_url(\n", "loader = YoutubeLoader.from_youtube_url(\n",
" \"https://www.youtube.com/watch?v=QsYGlZkevEg\",\n", " \"https://www.youtube.com/watch?v=QsYGlZkevEg\",\n",
@ -116,7 +112,41 @@
" translation=\"en\",\n", " translation=\"en\",\n",
")\n", ")\n",
"loader.load()" "loader.load()"
] ],
"outputs": [],
"execution_count": null
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"### Get transcripts as timestamped chunks\n",
"\n",
"Get one or more `Document` objects, each containing a chunk of the video transcript. The length of the chunks, in seconds, may be specified. Each chunk's metadata includes a URL of the video on YouTube, which will start the video at the beginning of the specific chunk.\n",
"\n",
"`transcript_format` param: One of the `langchain_community.document_loaders.youtube.TranscriptFormat` values. In this case, `TranscriptFormat.CHUNKS`.\n",
"\n",
"`chunk_size_seconds` param: An integer number of video seconds to be represented by each chunk of transcript data. Default is 120 seconds."
],
"id": "69f4e399a9764d73"
},
{
"metadata": {},
"cell_type": "code",
"source": [
"from langchain_community.document_loaders.youtube import TranscriptFormat\n",
"\n",
"loader = YoutubeLoader.from_youtube_url(\n",
" \"https://www.youtube.com/watch?v=TKCMw0utiak\",\n",
" add_video_info=True,\n",
" transcript_format=TranscriptFormat.CHUNKS,\n",
" chunk_size_seconds=30,\n",
")\n",
"print(\"\\n\\n\".join(map(repr, loader.load())))"
],
"id": "540bbf19182f38bc",
"outputs": [],
"execution_count": null
}, },
{ {
"attachments": {}, "attachments": {},
@ -142,10 +172,8 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null,
"id": "c345bc43", "id": "c345bc43",
"metadata": {}, "metadata": {},
"outputs": [],
"source": [ "source": [
"# Init the GoogleApiClient\n", "# Init the GoogleApiClient\n",
"from pathlib import Path\n", "from pathlib import Path\n",
@ -170,7 +198,9 @@
"\n", "\n",
"# returns a list of Documents\n", "# returns a list of Documents\n",
"youtube_loader_channel.load()" "youtube_loader_channel.load()"
] ],
"outputs": [],
"execution_count": null
} }
], ],
"metadata": { "metadata": {


@ -4,7 +4,7 @@ from __future__ import annotations
import logging import logging
from enum import Enum from enum import Enum
from pathlib import Path from pathlib import Path
from typing import Any, Dict, List, Optional, Sequence, Union from typing import Any, Dict, Generator, List, Optional, Sequence, Union
from urllib.parse import parse_qs, urlparse from urllib.parse import parse_qs, urlparse
from langchain_core.documents import Document from langchain_core.documents import Document
@ -99,8 +99,8 @@ class GoogleApiClient:
return creds return creds
ALLOWED_SCHEMAS = {"http", "https"} ALLOWED_SCHEMES = {"http", "https"}
ALLOWED_NETLOCK = { ALLOWED_NETLOCS = {
"youtu.be", "youtu.be",
"m.youtube.com", "m.youtube.com",
"youtube.com", "youtube.com",
@ -111,13 +111,13 @@ ALLOWED_NETLOCK = {
def _parse_video_id(url: str) -> Optional[str]: def _parse_video_id(url: str) -> Optional[str]:
"""Parse a youtube url and return the video id if valid, otherwise None.""" """Parse a YouTube URL and return the video ID if valid, otherwise None."""
parsed_url = urlparse(url) parsed_url = urlparse(url)
if parsed_url.scheme not in ALLOWED_SCHEMAS: if parsed_url.scheme not in ALLOWED_SCHEMES:
return None return None
if parsed_url.netloc not in ALLOWED_NETLOCK: if parsed_url.netloc not in ALLOWED_NETLOCS:
return None return None
path = parsed_url.path path = parsed_url.path
@ -141,14 +141,15 @@ def _parse_video_id(url: str) -> Optional[str]:
class TranscriptFormat(Enum): class TranscriptFormat(Enum):
"""Transcript format.""" """Output formats of transcripts from `YoutubeLoader`."""
TEXT = "text" TEXT = "text"
LINES = "lines" LINES = "lines"
CHUNKS = "chunks"
class YoutubeLoader(BaseLoader): class YoutubeLoader(BaseLoader):
"""Load `YouTube` transcripts.""" """Load `YouTube` video transcripts."""
def __init__( def __init__(
self, self,
@ -158,9 +159,11 @@ class YoutubeLoader(BaseLoader):
translation: Optional[str] = None, translation: Optional[str] = None,
transcript_format: TranscriptFormat = TranscriptFormat.TEXT, transcript_format: TranscriptFormat = TranscriptFormat.TEXT,
continue_on_failure: bool = False, continue_on_failure: bool = False,
chunk_size_seconds: int = 120,
): ):
"""Initialize with YouTube video ID.""" """Initialize with YouTube video ID."""
self.video_id = video_id self.video_id = video_id
self._metadata = {"source": video_id}
self.add_video_info = add_video_info self.add_video_info = add_video_info
self.language = language self.language = language
if isinstance(language, str): if isinstance(language, str):
@ -170,25 +173,69 @@ class YoutubeLoader(BaseLoader):
self.translation = translation self.translation = translation
self.transcript_format = transcript_format self.transcript_format = transcript_format
self.continue_on_failure = continue_on_failure self.continue_on_failure = continue_on_failure
self.chunk_size_seconds = chunk_size_seconds
@staticmethod @staticmethod
def extract_video_id(youtube_url: str) -> str: def extract_video_id(youtube_url: str) -> str:
"""Extract video id from common YT urls.""" """Extract video ID from common YouTube URLs."""
video_id = _parse_video_id(youtube_url) video_id = _parse_video_id(youtube_url)
if not video_id: if not video_id:
raise ValueError( raise ValueError(
f"Could not determine the video ID for the URL {youtube_url}" f'Could not determine the video ID for the URL "{youtube_url}".'
) )
return video_id return video_id
@classmethod @classmethod
def from_youtube_url(cls, youtube_url: str, **kwargs: Any) -> YoutubeLoader: def from_youtube_url(cls, youtube_url: str, **kwargs: Any) -> YoutubeLoader:
"""Given youtube URL, load video.""" """Given a YouTube URL, construct a loader.
See `YoutubeLoader()` constructor for a list of keyword arguments.
"""
video_id = cls.extract_video_id(youtube_url) video_id = cls.extract_video_id(youtube_url)
return cls(video_id, **kwargs) return cls(video_id, **kwargs)
def _make_chunk_document(
self, chunk_pieces: List[Dict], chunk_start_seconds: int
) -> Document:
"""Create Document from chunk of transcript pieces."""
m, s = divmod(chunk_start_seconds, 60)
h, m = divmod(m, 60)
return Document(
page_content=" ".join(
map(lambda chunk_piece: chunk_piece["text"].strip(" "), chunk_pieces)
),
metadata={
**self._metadata,
"start_seconds": chunk_start_seconds,
"start_timestamp": f"{h:02d}:{m:02d}:{s:02d}",
"source":
# replace video ID with URL to start time
f"https://www.youtube.com/watch?v={self.video_id}"
f"&t={chunk_start_seconds}s",
},
)
def _get_transcript_chunks(
self, transcript_pieces: List[Dict]
) -> Generator[Document, None, None]:
chunk_pieces: List[Dict[str, Any]] = []
chunk_start_seconds = 0
chunk_time_limit = self.chunk_size_seconds
for transcript_piece in transcript_pieces:
piece_end = transcript_piece["start"] + transcript_piece["duration"]
if piece_end > chunk_time_limit:
if chunk_pieces:
yield self._make_chunk_document(chunk_pieces, chunk_start_seconds)
chunk_pieces = []
chunk_start_seconds = chunk_time_limit
chunk_time_limit += self.chunk_size_seconds
chunk_pieces.append(transcript_piece)
if len(chunk_pieces) > 0:
yield self._make_chunk_document(chunk_pieces, chunk_start_seconds)
def load(self) -> List[Document]: def load(self) -> List[Document]:
"""Load documents.""" """Load YouTube transcripts into `Document` objects."""
try: try:
from youtube_transcript_api import ( from youtube_transcript_api import (
NoTranscriptFound, NoTranscriptFound,
@ -197,17 +244,15 @@ class YoutubeLoader(BaseLoader):
) )
except ImportError: except ImportError:
raise ImportError( raise ImportError(
"Could not import youtube_transcript_api python package. " 'Could not import "youtube_transcript_api" Python package. '
"Please install it with `pip install youtube-transcript-api`." "Please install it with `pip install youtube-transcript-api`."
) )
metadata = {"source": self.video_id}
if self.add_video_info: if self.add_video_info:
# Get more video meta info # Get more video meta info
# Such as title, description, thumbnail url, publish_date # Such as title, description, thumbnail url, publish_date
video_info = self._get_video_info() video_info = self._get_video_info()
metadata.update(video_info) self._metadata.update(video_info)
try: try:
transcript_list = YouTubeTranscriptApi.list_transcripts(self.video_id) transcript_list = YouTubeTranscriptApi.list_transcripts(self.video_id)
@ -222,31 +267,45 @@ class YoutubeLoader(BaseLoader):
if self.translation is not None: if self.translation is not None:
transcript = transcript.translate(self.translation) transcript = transcript.translate(self.translation)
transcript_pieces = transcript.fetch() transcript_pieces: List[Dict[str, Any]] = transcript.fetch()
if self.transcript_format == TranscriptFormat.TEXT: if self.transcript_format == TranscriptFormat.TEXT:
transcript = " ".join([t["text"].strip(" ") for t in transcript_pieces]) transcript = " ".join(
return [Document(page_content=transcript, metadata=metadata)] map(
elif self.transcript_format == TranscriptFormat.LINES: lambda transcript_piece: transcript_piece["text"].strip(" "),
return [ transcript_pieces,
Document(
page_content=t["text"].strip(" "),
metadata=dict((key, t[key]) for key in t if key != "text"),
) )
for t in transcript_pieces )
] return [Document(page_content=transcript, metadata=self._metadata)]
elif self.transcript_format == TranscriptFormat.LINES:
return list(
map(
lambda transcript_piece: Document(
page_content=transcript_piece["text"].strip(" "),
metadata=dict(
filter(
lambda item: item[0] != "text", transcript_piece.items()
)
),
),
transcript_pieces,
)
)
elif self.transcript_format == TranscriptFormat.CHUNKS:
return list(self._get_transcript_chunks(transcript_pieces))
else: else:
raise ValueError("Unknown transcript format.") raise ValueError("Unknown transcript format.")
def _get_video_info(self) -> dict: def _get_video_info(self) -> Dict:
"""Get important video information. """Get important video information.
Components are: Components include:
- title - title
- description - description
- thumbnail url, - thumbnail URL,
- publish_date - publish_date
- channel_author - channel author
- and more. - and more.
""" """
try: try:
@ -254,7 +313,7 @@ class YoutubeLoader(BaseLoader):
except ImportError: except ImportError:
raise ImportError( raise ImportError(
"Could not import pytube python package. " 'Could not import "pytube" Python package. '
"Please install it with `pip install pytube`." "Please install it with `pip install pytube`."
) )
yt = YouTube(f"https://www.youtube.com/watch?v={self.video_id}") yt = YouTube(f"https://www.youtube.com/watch?v={self.video_id}")


@ -1,6 +1,8 @@
import pytest import pytest
from langchain_core.documents import Document
from langchain_community.document_loaders import YoutubeLoader from langchain_community.document_loaders import YoutubeLoader
from langchain_community.document_loaders.youtube import TranscriptFormat
@pytest.mark.parametrize( @pytest.mark.parametrize(
@ -25,3 +27,192 @@ from langchain_community.document_loaders import YoutubeLoader
def test_video_id_extraction(youtube_url: str, expected_video_id: str) -> None: def test_video_id_extraction(youtube_url: str, expected_video_id: str) -> None:
"""Test that the video id is extracted from a youtube url""" """Test that the video id is extracted from a youtube url"""
assert YoutubeLoader.extract_video_id(youtube_url) == expected_video_id assert YoutubeLoader.extract_video_id(youtube_url) == expected_video_id
def test__get_transcript_chunks() -> None:
test_transcript_pieces = [
{"text": "♪ Hail to the victors valiant ♪", "start": 3.719, "duration": 5.0},
{"text": "♪ Hail to the conquering heroes ♪", "start": 8.733, "duration": 5.0},
{"text": "♪ Hail, hail to Michigan ♪", "start": 14.541, "duration": 5.0},
{"text": "♪ The leaders and best ♪", "start": 19.785, "duration": 5.0},
{"text": "♪ Hail to the victors valiant ♪", "start": 25.661, "duration": 4.763},
{"text": "♪ Hail to the conquering heroes ♪", "start": 30.424, "duration": 5.0},
{"text": "♪ Hail, hail to Michigan ♪", "start": 36.37, "duration": 4.91},
{"text": "♪ The champions of the west ♪", "start": 41.28, "duration": 2.232},
{"text": "♪ Hail to the victors valiant ♪", "start": 43.512, "duration": 4.069},
{
"text": "♪ Hail to the conquering heroes ♪",
"start": 47.581,
"duration": 4.487,
},
{"text": "♪ Hail, hail to Michigan ♪", "start": 52.068, "duration": 4.173},
{"text": "♪ The leaders and best ♪", "start": 56.241, "duration": 4.542},
{"text": "♪ Hail to victors valiant ♪", "start": 60.783, "duration": 3.944},
{
"text": "♪ Hail to the conquering heroes ♪",
"start": 64.727,
"duration": 4.117,
},
{"text": "♪ Hail, hail to Michigan ♪", "start": 68.844, "duration": 3.969},
{"text": "♪ The champions of the west ♪", "start": 72.813, "duration": 4.232},
{"text": "(choir clapping rhythmically)", "start": 77.045, "duration": 3.186},
{"text": "- Go blue!", "start": 80.231, "duration": 0.841},
{"text": "(choir clapping rhythmically)", "start": 81.072, "duration": 3.149},
{"text": "Go blue!", "start": 84.221, "duration": 0.919},
{"text": "♪ It's great to be ♪", "start": 85.14, "duration": 1.887},
{
"text": "♪ A Michigan Wolverine ♪\n- Go blue!",
"start": 87.027,
"duration": 2.07,
},
{"text": "♪ It's great to be ♪", "start": 89.097, "duration": 1.922},
{
"text": "♪ A Michigan Wolverine ♪\n- Go blue!",
"start": 91.019,
"duration": 2.137,
},
{
"text": "♪ It's great to be ♪\n(choir scatting)",
"start": 93.156,
"duration": 1.92,
},
{
"text": "♪ a Michigan Wolverine ♪\n(choir scatting)",
"start": 95.076,
"duration": 2.118,
},
{
"text": "♪ It's great to be ♪\n(choir scatting)",
"start": 97.194,
"duration": 1.85,
},
{
"text": "♪ A Michigan ♪\n(choir scatting)",
"start": 99.044,
"duration": 1.003,
},
{"text": "- Let's go blue!", "start": 100.047, "duration": 1.295},
{
"text": "♪ Hail to the victors valiant ♪",
"start": 101.342,
"duration": 1.831,
},
{
"text": "♪ Hail to the conquering heroes ♪",
"start": 103.173,
"duration": 2.21,
},
{"text": "♪ Hail, hail to Michigan ♪", "start": 105.383, "duration": 1.964},
{"text": "♪ The leaders and best ♪", "start": 107.347, "duration": 2.21},
{
"text": "♪ Hail to the victors valiant ♪",
"start": 109.557,
"duration": 1.643,
},
{
"text": "♪ Hail to the conquering heroes ♪",
"start": 111.2,
"duration": 2.129,
},
{"text": "♪ Hail, hail to Michigan ♪", "start": 113.329, "duration": 2.091},
{"text": "♪ The champions of the west ♪", "start": 115.42, "duration": 2.254},
{
"text": "♪ Hail to the victors valiant ♪",
"start": 117.674,
"duration": 4.039,
},
{
"text": "♪ Hail to the conquering heroes ♪",
"start": 121.713,
"duration": 4.103,
},
{
"text": "♪ Hail to the blue, hail to the blue ♪",
"start": 125.816,
"duration": 1.978,
},
{
"text": "♪ Hail to the blue, hail to the blue ♪",
"start": 127.794,
"duration": 2.095,
},
{
"text": "♪ Hail to the blue, hail to the blue ♪",
"start": 129.889,
"duration": 1.932,
},
{
"text": "♪ Hail to the blue, hail to the blue ♪",
"start": 131.821,
"duration": 2.091,
},
{
"text": "♪ Hail to the blue, hail to the blue ♪",
"start": 133.912,
"duration": 2.109,
},
{"text": "♪ Hail to the blue, hail ♪", "start": 136.021, "duration": 3.643},
{"text": "♪ To Michigan ♪", "start": 139.664, "duration": 4.105},
{"text": "♪ The champions of the west ♪", "start": 143.769, "duration": 3.667},
{"text": "♪ Go blue ♪", "start": 154.122, "duration": 2.167},
]
test_transcript_chunks = [
Document(
page_content="♪ Hail to the victors valiant ♪ ♪ Hail to the conquering heroes ♪ ♪ Hail, hail to Michigan ♪ ♪ The leaders and best ♪", # noqa: E501
metadata={
"source": "https://www.youtube.com/watch?v=TKCMw0utiak&t=0s",
"start_seconds": 0,
"start_timestamp": "00:00:00",
},
),
Document(
page_content="♪ Hail to the victors valiant ♪ ♪ Hail to the conquering heroes ♪ ♪ Hail, hail to Michigan ♪ ♪ The champions of the west ♪ ♪ Hail to the victors valiant ♪ ♪ Hail to the conquering heroes ♪ ♪ Hail, hail to Michigan ♪", # noqa: E501
metadata={
"source": "https://www.youtube.com/watch?v=TKCMw0utiak&t=30s",
"start_seconds": 30,
"start_timestamp": "00:00:30",
},
),
Document(
page_content="♪ The leaders and best ♪ ♪ Hail to victors valiant ♪ ♪ Hail to the conquering heroes ♪ ♪ Hail, hail to Michigan ♪ ♪ The champions of the west ♪ (choir clapping rhythmically) - Go blue! (choir clapping rhythmically) Go blue! ♪ It's great to be ♪ ♪ A Michigan Wolverine ♪\n- Go blue!", # noqa: E501
metadata={
"source": "https://www.youtube.com/watch?v=TKCMw0utiak&t=60s",
"start_seconds": 60,
"start_timestamp": "00:01:00",
},
),
Document(
page_content="♪ It's great to be ♪ ♪ A Michigan Wolverine ♪\n- Go blue! ♪ It's great to be ♪\n(choir scatting) ♪ a Michigan Wolverine ♪\n(choir scatting) ♪ It's great to be ♪\n(choir scatting) ♪ A Michigan ♪\n(choir scatting) - Let's go blue! ♪ Hail to the victors valiant ♪ ♪ Hail to the conquering heroes ♪ ♪ Hail, hail to Michigan ♪ ♪ The leaders and best ♪ ♪ Hail to the victors valiant ♪ ♪ Hail to the conquering heroes ♪ ♪ Hail, hail to Michigan ♪ ♪ The champions of the west ♪", # noqa: E501
metadata={
"source": "https://www.youtube.com/watch?v=TKCMw0utiak&t=90s",
"start_seconds": 90,
"start_timestamp": "00:01:30",
},
),
Document(
page_content="♪ Hail to the victors valiant ♪ ♪ Hail to the conquering heroes ♪ ♪ Hail to the blue, hail to the blue ♪ ♪ Hail to the blue, hail to the blue ♪ ♪ Hail to the blue, hail to the blue ♪ ♪ Hail to the blue, hail to the blue ♪ ♪ Hail to the blue, hail to the blue ♪ ♪ Hail to the blue, hail ♪ ♪ To Michigan ♪ ♪ The champions of the west ♪", # noqa: E501
metadata={
"source": "https://www.youtube.com/watch?v=TKCMw0utiak&t=120s",
"start_seconds": 120,
"start_timestamp": "00:02:00",
},
),
Document(
page_content="♪ Go blue ♪",
metadata={
"source": "https://www.youtube.com/watch?v=TKCMw0utiak&t=150s",
"start_seconds": 150,
"start_timestamp": "00:02:30",
},
),
]
ytl = YoutubeLoader(
"TKCMw0utiak",
transcript_format=TranscriptFormat.CHUNKS,
chunk_size_seconds=30,
)
assert (
list(ytl._get_transcript_chunks(test_transcript_pieces))
== test_transcript_chunks
)