mirror of
https://github.com/hwchase17/langchain.git
synced 2025-06-01 20:49:17 +00:00
- **Description:** Add a new format, `CHUNKS`, to `langchain_community.document_loaders.youtube.YoutubeLoader`, which creates multiple `Document` objects from a YouTube video transcript (captions), each covering a fixed duration. The metadata of each chunk `Document` includes its start time and a URL to that time in the video on the YouTube website. I had implemented this for UMich (@umich-its-ai) in a local module, but it makes sense to contribute it to the LangChain community for all to benefit and to simplify maintenance. - **Issue:** N/A - **Dependencies:** N/A - **Twitter:** lsloan_umich - **Mastodon:** [lsloan@mastodon.social](https://mastodon.social/@lsloan) With regard to **tests and documentation**, most existing features of the `YoutubeLoader` class are not tested. Only the `YoutubeLoader.extract_video_id()` static method had a test. However, while waiting for this PR to be reviewed and merged, I had time to add a test for the chunking feature proposed in this PR. I have also added an example of using chunking to the `docs/docs/integrations/document_loaders/youtube_transcript.ipynb` notebook. --------- Co-authored-by: Bagatur <baskaryan@gmail.com> |
||
---|---|---|
.. | ||
api_reference | ||
data | ||
docs | ||
scripts | ||
src | ||
static | ||
.gitignore | ||
.yarnrc.yml | ||
babel.config.js | ||
docusaurus.config.js | ||
Makefile | ||
package.json | ||
README.md | ||
sidebars.js | ||
vercel_build.sh | ||
vercel_requirements.txt | ||
vercel.json | ||
yarn.lock |
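
The `CHUNKS` format described in the commit message above groups transcript captions into fixed-duration pieces, each carrying a start time and a timestamped URL. The following is a minimal standalone sketch of that idea, not the `YoutubeLoader` implementation: the `Chunk` class, the `chunk_transcript` function, the `chunk_size_seconds` parameter, the metadata key names, and the shape of the input captions are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in for a LangChain `Document`."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def chunk_transcript(pieces, video_id, chunk_size_seconds=120):
    """Group caption pieces into fixed-duration chunks.

    `pieces` is assumed to be a list of dicts with a "text" key and a
    "start" key (offset in seconds from the beginning of the video).
    """
    # Bucket each caption by the fixed-size time window its start falls into.
    buckets = {}
    for piece in pieces:
        index = int(piece["start"] // chunk_size_seconds)
        buckets.setdefault(index, []).append(piece["text"].strip())

    chunks = []
    for index in sorted(buckets):
        # This sketch uses the window start as the chunk's start time.
        start = index * chunk_size_seconds
        chunks.append(
            Chunk(
                page_content=" ".join(buckets[index]),
                metadata={
                    "start_seconds": start,
                    # YouTube's `t` query parameter jumps to a time offset.
                    "source": f"https://www.youtube.com/watch?v={video_id}&t={start}s",
                },
            )
        )
    return chunks
```

A call such as `chunk_transcript(captions, "VIDEO_ID", chunk_size_seconds=60)` would yield one `Chunk` per minute of video that contains captions. Bucketing by window index keeps the chunk boundaries independent of where individual captions happen to start, at the cost of reporting the window start rather than the first caption's exact start time.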
LangChain Documentation
For more information on contributing to our documentation, see the Documentation Contributing Guide.