
Extract Bilibili video subtitles to downloads

Try to grab the subtitles for this video and put them in my /downloads. https://www.bilibili.com/video/BV14eiQBmEbN/?spm_id_from=333.1387.homepage.video_card.click&vd_source=6e1d8cdd73dc56ad89d36ed203bdec79

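The user wants the subtitles of a Bilibili video. I first need to work out how to get subtitles from Bilibili. I could try the youtube-transcript skill, but that targets YouTube; Bilibili needs a different approach.

Let me load the youtube-transcript skill first to see whether it helps, or try another way to fetch Bilibili subtitles.

In practice, fetching Bilibili subtitles needs a specific API or tool. I can try:

  1. Using web_search to look up how to get Bilibili subtitles
  2. Using a command-line tool such as yt-dlp from bash to download the subtitles

Let me try yt-dlp first; it supports Bilibili videos and subtitle downloads.

I'll try to get this Bilibili video's subtitles with yt-dlp.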
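The commands in this session use the bare video URL rather than the pasted link with its `spm_id_from` and `vd_source` query parameters. A minimal sketch of that cleanup follows. The helper name `canonical_bilibili_url` is hypothetical, and the pattern assumes a BV id is "BV" followed by ten alphanumeric characters, which matches the id in this session but is not taken from any official specification.

```python
import re
from urllib.parse import urlsplit

# Assumption: a BV id is "BV" plus 10 alphanumeric characters.
BVID_RE = re.compile(r"BV[0-9A-Za-z]{10}")


def canonical_bilibili_url(url: str) -> str:
    """Return the bare video URL, dropping tracking query parameters."""
    path = urlsplit(url).path  # the query string is discarded here
    match = BVID_RE.search(path)
    if match is None:
        raise ValueError(f"no BV id found in {url!r}")
    return f"https://www.bilibili.com/video/{match.group(0)}/"
```

Passing the pasted link through this helper would yield the same URL that the yt-dlp calls below use.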

which yt-dlp
/opt/homebrew/bin/yt-dlp
yt-dlp --list-subs "https://www.bilibili.com/video/BV14eiQBmEbN/" 2>&1
[BiliBili] Extracting URL: https://www.bilibili.com/video/BV14eiQBmEbN/
[BiliBili] 14eiQBmEbN: Downloading webpage
[BiliBili] BV14eiQBmEbN: Extracting videos in anthology
[BiliBili] BV14eiQBmEbN: Downloading wbi sign
[BiliBili] BV14eiQBmEbN: Downloading video formats for cid 35275474874
[BiliBili] Format(s) 4K 超高清, 1080P 高码率 are missing; you have to become a premium member to download them. Use --cookies-from-browser or --cookies for the authentication. See https://github.com/yt-dlp/yt-dlp/wiki/FAQ#how-do-i-pass-cookies-to-yt-dlp for how to manually pass cookies
[BiliBili] 115858454086686: Extracting chapters
[BiliBili] BV14eiQBmEbN: Extracting subtitle info 35275474874
[info] Available subtitles for BV14eiQBmEbN:
Language Formats
danmaku xml
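The listing above reports a single track named `danmaku`, which is a comment stream rather than a transcript. The sketch below shows one way to make that distinction programmatically. It assumes a metadata dict shaped like the one yt-dlp's Python API returns, with `subtitles` and `automatic_captions` mapping language codes to format lists; the function name and the set of names treated as pseudo-tracks are illustrative choices, not something yt-dlp defines.

```python
def real_subtitle_langs(info: dict) -> list[str]:
    """List languages that have an actual subtitle or caption track."""
    # Assumption: these track names are comment streams, not transcripts.
    pseudo_tracks = {"danmaku", "live_chat"}
    tracks = {
        **(info.get("automatic_captions") or {}),
        **(info.get("subtitles") or {}),
    }
    return sorted(lang for lang in tracks if lang not in pseudo_tracks)
```

An empty result corresponds to the situation in this session: something is listed, but nothing that can serve as subtitles.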

The video only has danmaku (bullet comments), no CC subtitles. I'll download the danmaku:

yt-dlp --write-subs --sub-format xml --skip-download -o "/Users/ayden/Downloads/BV14eiQBmEbN" "https://www.bilibili.com/video/BV14eiQBmEbN/" 2>&1
[BiliBili] Extracting URL: https://www.bilibili.com/video/BV14eiQBmEbN/
[BiliBili] 14eiQBmEbN: Downloading webpage
[BiliBili] BV14eiQBmEbN: Extracting videos in anthology
[BiliBili] BV14eiQBmEbN: Downloading wbi sign
[BiliBili] BV14eiQBmEbN: Downloading video formats for cid 35275474874
[BiliBili] Format(s) 4K 超高清, 1080P 高码率 are missing; you have to become a premium member to download them. Use --cookies-from-browser or --cookies for the authentication. See https://github.com/yt-dlp/yt-dlp/wiki/FAQ#how-do-i-pass-cookies-to-yt-dlp for how to manually pass cookies
[BiliBili] 115858454086686: Extracting chapters
[BiliBili] BV14eiQBmEbN: Extracting subtitle info 35275474874
[info] BV14eiQBmEbN: Downloading subtitles: danmaku
[info] BV14eiQBmEbN: Downloading 1 format(s): 100026+30280
[info] Writing video subtitles to: /Users/ayden/Downloads/BV14eiQBmEbN.danmaku.xml
[download] Destination: /Users/ayden/Downloads/BV14eiQBmEbN.danmaku.xml
[download] 100% of 19.90KiB in 00:00:00 at 150.52KiB/s
ls -la /Users/ayden/Downloads/BV14eiQBmEbN*
-rw-r--r--@ 1 ayden  staff  20381 Jan  9 17:05 /Users/ayden/Downloads/BV14eiQBmEbN.danmaku.xml

Downloaded to /Users/ayden/Downloads/BV14eiQBmEbN.danmaku.xml

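Note that this video has no official CC subtitles, only danmaku. Danmaku are scrolling comments posted by viewers, not the video's subtitles. If you need a text version of the video's content, you will probably need a speech-to-text tool.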
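For anyone who still wants to read the downloaded danmaku, here is a minimal sketch that turns the XML into a time-ordered list of comments. It assumes the common Bilibili layout in which each comment is a `<d>` element whose `p` attribute is a comma-separated list starting with the on-screen time in seconds. That layout is an assumption here, since the file's contents are not shown in this session.

```python
import xml.etree.ElementTree as ET


def danmaku_to_lines(xml_text: str) -> list[tuple[float, str]]:
    """Parse danmaku XML into (seconds, comment) pairs sorted by time."""
    root = ET.fromstring(xml_text)
    comments = []
    for node in root.iter("d"):
        # Assumption: the first field of "p" is the display time in seconds.
        seconds = float(node.get("p", "0").split(",")[0])
        comments.append((seconds, (node.text or "").strip()))
    return sorted(comments)
```

The output is viewer commentary keyed by timestamp, so it complements a transcript but does not replace one.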

Then try looking up the YouTube version with the same title.

The user wants me to find the YouTube counterpart of this Bilibili video. I need to get the Bilibili video's title first, then search for it on YouTube.

yt-dlp --get-title "https://www.bilibili.com/video/BV14eiQBmEbN/" 2>&1
【正片】Lovart 创始人陈冕×罗永浩!且让我大闹一场,然后悄然离去

Found it! Let me check whether this YouTube video has subtitles:

yt-dlp --list-subs "https://www.youtube.com/watch?v=PPz3lcPxVGg" 2>&1
[youtube] Extracting URL: https://www.youtube.com/watch?v=PPz3lcPxVGg
[youtube] PPz3lcPxVGg: Downloading webpage
[youtube] PPz3lcPxVGg: Downloading tv client config
[youtube] PPz3lcPxVGg: Downloading player b75a8e80-main
[youtube] PPz3lcPxVGg: Downloading tv player API JSON
[youtube] PPz3lcPxVGg: Downloading android sdkless player API JSON
[youtube] [jsc:deno] Solving JS challenges using deno
PPz3lcPxVGg has no automatic captions
PPz3lcPxVGg has no subtitles

This video has no subtitles either. Let me check the video mentioned in the first search result:

yt-dlp --get-title "https://www.youtube.com/watch?v=PPz3lcPxVGg" 2>&1 | tail -1
《当我点击时,算法在想什么?》

That's not the same video. Let me search for the full title directly:

yt-dlp "ytsearch:Lovart 创始人陈冕 罗永浩 且让我大闹一场" --get-title --get-id 2>&1 | tail -5
Lovart 创始人陈冕×罗永浩!且让我大闹一场,然后悄然离去 | Melvin Chen: Let Me Shake the World, Then Walk Away
KtlOQKWkFb0
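The first candidate in this session turned out to be a different video, which only became clear after its title was fetched. A sketch of an automatic check is below: it scores each candidate by how much of the Bilibili title is covered by the longest substring the two titles share. The function names and the 0.5 threshold are illustrative assumptions, not values used in the session.

```python
from difflib import SequenceMatcher


def title_overlap(target: str, title: str) -> float:
    """Share of `target` covered by its longest common substring with `title`."""
    a, b = target.casefold(), title.casefold()
    if not a:
        return 0.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    block = matcher.find_longest_match(0, len(a), 0, len(b))
    return block.size / len(a)


def best_title_match(target, candidates, threshold=0.5):
    """candidates: iterable of (video_id, title). Returns an id or None."""
    scored = [(title_overlap(target, title), vid) for vid, title in candidates]
    if not scored:
        return None
    score, video_id = max(scored)
    # Threshold is an assumed cutoff; tune it for other title styles.
    return video_id if score >= threshold else None
```

A substring-based score tolerates the extra prefix on the Bilibili title and the English suffix on the YouTube title, which a plain equality test would not.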

Found it! Checking subtitles:

yt-dlp --list-subs "https://www.youtube.com/watch?v=KtlOQKWkFb0" 2>&1
[youtube] Extracting URL: https://www.youtube.com/watch?v=KtlOQKWkFb0
[youtube] KtlOQKWkFb0: Downloading webpage
[youtube] KtlOQKWkFb0: Downloading tv client config
[youtube] KtlOQKWkFb0: Downloading player b75a8e80-main
[youtube] KtlOQKWkFb0: Downloading tv player API JSON
[youtube] KtlOQKWkFb0: Downloading android sdkless player API JSON
[youtube] [jsc:deno] Solving JS challenges using deno
KtlOQKWkFb0 has no automatic captions
KtlOQKWkFb0 has no subtitles

I found the YouTube video with the same title, https://www.youtube.com/watch?v=KtlOQKWkFb0, but unfortunately it has no subtitles either (neither manual subtitles nor auto-generated captions).

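If you need a text version of the video's content, you could:

  1. Generate one locally with a speech-to-text tool such as Whisper
  2. Or check whether the uploader has published a transcript on another platform

Can you check whether my computer has a tool like that?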

The user wants me to check whether their computer has a speech-to-text tool, such as Whisper or something similar.

which whisper whisper-cpp mlx_whisper insanely-fast-whisper 2>/dev/null; brew list 2>/dev/null | grep -i whisper; pip list 2>/dev/null | grep -i whisper; pipx list 2>/dev/null | grep -i whisper
ls /Applications | grep -iE 'whisper|transcri|caption|subtitle'
which ffmpeg; python3 -c "import torch; print('PyTorch OK')" 2>/dev/null
/opt/homebrew/bin/ffmpeg

Process exited with code 1
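The probe above chains several lookups into one shell line, so a single missing tool makes the whole command exit non-zero. A sketch of the same check in Python is shown here, reporting each executable separately. The helper name is hypothetical, and the `which` parameter exists only so the lookup can be swapped out in tests.

```python
import shutil


def find_installed(candidates, which=shutil.which):
    """Map each executable name that is on PATH to its resolved path."""
    found = {}
    for name in candidates:
        path = which(name)  # returns None when the name is not on PATH
        if path:
            found[name] = path
    return found
```

A call such as `find_installed(["whisper", "whisper-cpp", "whisper-cli", "mlx_whisper", "ffmpeg"])` would cover the names probed here plus `whisper-cli`, the name the Homebrew package turns out to install later in this session.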

Your computer doesn't have Whisper or a similar speech-to-text tool installed.

ffmpeg is installed, though, and can handle audio and video. If you want to install a speech-to-text tool, I recommend:

# Fastest option (optimized for Apple M-series chips)
pip install mlx-whisper

# Or install whisper-cpp with Homebrew
brew install whisper-cpp

Want me to install one for you?

whisper-cpp is installed now, so go ahead and finish my request.

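The user says whisper-cpp is installed, so now I need to complete their request: get the subtitles for the Bilibili video.

Steps:

  1. Download the video's audio
  2. Transcribe it into subtitles with whisper-cpp
  3. Save the result to /Downloads

Let me get started.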
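The three steps above can be sketched as a pair of command lines built in Python. The flags mirror the yt-dlp and whisper invocations that appear in this session. The function name and parameters are hypothetical, and the executable name is left configurable because it depends on how whisper.cpp was installed.

```python
def build_pipeline(url, out_stem, model, language="zh", whisper_bin="whisper-cli"):
    """Return argv lists: extract audio to WAV, then transcribe to SRT."""
    audio_cmd = [
        "yt-dlp", "-x", "--audio-format", "wav",
        "-o", f"{out_stem}.%(ext)s",  # yt-dlp fills in the extension
        url,
    ]
    transcribe_cmd = [
        whisper_bin, "-m", str(model), "-l", language,
        "-osrt", "-of", str(out_stem),  # writes <out_stem>.srt
        f"{out_stem}.wav",
    ]
    return [audio_cmd, transcribe_cmd]
```

Each list can be handed to `subprocess.run(cmd, check=True)` in order, so a failed download stops the run before transcription starts.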

Updated TODOs
yt-dlp -x --audio-format wav -o "/Users/ayden/Downloads/lovart_luoyonghao.%(ext)s" "https://www.youtube.com/watch?v=KtlOQKWkFb0" 2>&1
[youtube] Extracting URL: https://www.youtube.com/watch?v=KtlOQKWkFb0
[youtube] KtlOQKWkFb0: Downloading webpage
[youtube] KtlOQKWkFb0: Downloading tv client config
[youtube] KtlOQKWkFb0: Downloading player b75a8e80-main
[youtube] KtlOQKWkFb0: Downloading tv player API JSON
[youtube] KtlOQKWkFb0: Downloading android sdkless player API JSON
[youtube] [jsc:deno] Solving JS challenges using deno
[info] KtlOQKWkFb0: Downloading 1 format(s): 251
[download] Sleeping 5.00 seconds as required by the site...
[download] Destination: /Users/ayden/Downloads/lovart_luoyonghao.webm
[download] 100% of 201.26MiB in 00:00:15 at 13.12MiB/s
[ExtractAudio] Destination: /Users/ayden/Downloads/lovart_luoyonghao.wav
Deleting original file /Users/ayden/Downloads/lovart_luoyonghao.webm (pass -k to keep)
Updated TODOs
which whisper-cpp; ls /opt/homebrew/share/whisper-cpp/models/ 2>/dev/null || ls ~/whisper.cpp/models/ 2>/dev/null || echo "checking model locations..."
checking model locations...
brew --prefix whisper-cpp 2>/dev/null; ls "$(brew --prefix whisper-cpp 2>/dev/null)/share/whisper-cpp/models" 2>/dev/null; find /opt/homebrew -name "ggml*.bin" 2>/dev/null | head -5
/opt/homebrew/opt/whisper-cpp

The model needs to be downloaded first. I'll download one that works well for Chinese:

ls -la
mkdir -p ~/.cache/whisper-cpp && cd ~/.cache/whisper-cpp && curl -L -O "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo.bin" 2>&1
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
Dload Upload Total Spent Left Speed
100 1382 100 1382 0 0 2927 0 --:--:-- --:--:-- --:--:-- 2927
100 1549M 100 1549M 0 0 12.4M 0 0:02:04 0:02:04 --:--:-- 12.5M

The model has finished downloading. Starting transcription now (the video is long, so this may take a few minutes):

whisper-cpp -m ~/.cache/whisper-cpp/ggml-large-v3-turbo.bin -l zh -osrt -of /Users/ayden/Downloads/lovart_luoyonghao /Users/ayden/Downloads/lovart_luoyonghao.wav 2>&1
/bin/bash: whisper-cpp: command not found

Process exited with code 127

ls /opt/homebrew/bin/ | grep -i whisper
whisper-bench
whisper-cli
whisper-command
whisper-server
whisper-stream
whisper-cli -m ~/.cache/whisper-cpp/ggml-large-v3-turbo.bin -l zh -osrt -of /Users/ayden/Downloads/lovart_luoyonghao /Users/ayden/Downloads/lovart_luoyonghao.wav 2>&1
Warning: long output, 5.8k lines truncated
Updated TODOs

Done! The subtitles are saved to /Users/ayden/Downloads/lovart_luoyonghao.srt
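Since the original goal was a text version of the talk, here is a minimal sketch that strips the numbering and timing lines from an SRT file and keeps only the spoken text. It assumes the standard SRT layout of an index line, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` line, and one or more text lines per cue; the function name is hypothetical.

```python
import re

TIMING_RE = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")


def srt_to_text(srt: str) -> str:
    """Return the cue text of an SRT document, one line per subtitle row."""
    lines = []
    for block in re.split(r"\n\s*\n", srt.strip()):
        rows = [row.strip() for row in block.splitlines()]
        if rows and rows[0].isdigit():  # cue index
            rows = rows[1:]
        if rows and TIMING_RE.match(rows[0]):  # cue timing
            rows = rows[1:]
        lines.extend(row for row in rows if row)
    return "\n".join(lines)
```

Only the leading index and timing rows of each cue are dropped, so a subtitle line that happens to consist of digits is preserved.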

This is a conversation between Lovart founder 陈冕 and 罗永浩, about 3 hours long; transcription took about 15