langchain/docs
Mr. Lance E Sloan «UMich» 84dc2dd059
community[patch]: Load YouTube transcripts (captions) as fixed-duration chunks with start times (#21710)
- **Description:** Add a new format, `CHUNKS`, to
`langchain_community.document_loaders.youtube.YoutubeLoader` which
creates multiple `Document` objects from YouTube video transcripts
(captions), each covering a fixed duration. The metadata of each chunk
`Document` includes the chunk's start time and a URL to that time in
the video on the YouTube website.
  
I had implemented this for UMich (@umich-its-ai) in a local module, but
it makes sense to contribute this to LangChain community for all to
benefit and to simplify maintenance.
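
To illustrate the idea described above, here is a minimal, self-contained sketch of fixed-duration chunking. It is not the `YoutubeLoader` implementation: the `Chunk` class, the `chunk_transcript` function, the metadata keys, and the input shape (a list of dicts with `text` and `start` in seconds) are all assumptions made for illustration. Only the `&t=<seconds>s` URL suffix reflects how YouTube links to a point in a video.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in for a LangChain `Document`."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def chunk_transcript(pieces, video_id, chunk_size_seconds=120):
    """Group caption pieces into fixed-duration chunks.

    Assumes `pieces` is a list of dicts with "text" and "start" (seconds)
    keys. Each piece is assigned to the window that contains its start time.
    """
    if chunk_size_seconds <= 0:
        raise ValueError("chunk_size_seconds must be positive")

    # Map window index -> list of caption texts that start in that window.
    buckets = {}
    for piece in pieces:
        index = int(piece["start"] // chunk_size_seconds)
        buckets.setdefault(index, []).append(piece["text"].strip())

    chunks = []
    for index in sorted(buckets):
        start = index * chunk_size_seconds
        chunks.append(
            Chunk(
                page_content=" ".join(buckets[index]),
                metadata={
                    # Hypothetical metadata keys, chosen for this sketch.
                    "start_seconds": start,
                    "source": (
                        f"https://www.youtube.com/watch?v={video_id}&t={start}s"
                    ),
                },
            )
        )
    return chunks


if __name__ == "__main__":
    captions = [
        {"text": "hello", "start": 0.0},
        {"text": "world", "start": 5.5},
        {"text": "again", "start": 31.0},
    ]
    for chunk in chunk_transcript(captions, "VIDEO_ID", chunk_size_seconds=30):
        print(chunk.metadata["source"], chunk.page_content)
```

In this sketch each chunk's start time is the window boundary rather than the first caption's timestamp, so the generated URLs land on evenly spaced offsets; windows with no captions produce no chunk.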

- **Issue:** N/A
- **Dependencies:** N/A
- **Twitter:** lsloan_umich
- **Mastodon:** [lsloan@mastodon.social](https://mastodon.social/@lsloan)

With regard to **tests and documentation**, most existing features of
the `YoutubeLoader` class are not tested. Only the
`YoutubeLoader.extract_video_id()` static method had a test. However,
while waiting for this PR to be reviewed and merged, I had time to
add a test for the chunking feature proposed in this PR.

I have added an example of using chunking to the
`docs/docs/integrations/document_loaders/youtube_transcript.ipynb`
notebook.

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
2024-06-11 17:44:36 +00:00
api_reference docs: Correct return type in docstring (#22597) 2024-06-06 14:51:46 +00:00
data 👥 Update LangChain people data (#22388) 2024-06-01 07:36:45 -07:00
docs community[patch]: Load YouTube transcripts (captions) as fixed-duration chunks with start times (#21710) 2024-06-11 17:44:36 +00:00
scripts docs: arxiv page update (#22574) 2024-06-06 16:51:02 -07:00
src docs: add Microsoft Azure to ChatModelTabs (#22367) 2024-06-03 10:19:00 -04:00
static docs: Fix pixelation in stack graphic (#21554) 2024-06-10 22:52:22 +00:00
.gitignore infra: cleanup docs build (#21134) 2024-05-01 17:34:05 -07:00
.yarnrc.yml docs[minor]: Add thumbs up/down to all docs pages (#18526) 2024-03-04 15:14:28 -08:00
babel.config.js Restructure docs (#11620) 2023-10-10 12:55:19 -07:00
docusaurus.config.js docs: link GH org (#22308) 2024-05-30 00:17:59 -07:00
Makefile docs: edit links, direct for notebooks (#22051) 2024-05-24 19:44:46 +00:00
package.json docs: v0.2 docs in master (#21438) 2024-05-08 12:29:59 -07:00
README.md docs: developer docs (#14776) 2023-12-17 12:55:49 -08:00
sidebars.js docs: make llm cache its own section (#22301) 2024-05-30 00:17:33 -07:00
vercel_build.sh infra: use nbconvert for docs build (#21135) 2024-05-07 12:30:17 -07:00
vercel_requirements.txt docs: remove postgres from docs build (#21847) 2024-05-17 15:36:35 -07:00
vercel.json docs[patch]: Add robots.txt and root sitemap (#22492) 2024-06-04 11:26:40 -07:00
yarn.lock docs: v0.2 docs in master (#21438) 2024-05-08 12:29:59 -07:00

LangChain Documentation

For more information on contributing to our documentation, see the Documentation Contributing Guide.