
[Solved] Parsing .srt subtitle files with Python

Python crifan 3051 views 0 comments
While working on:
[Unsolved] Extracting the audio as an mp3 file from a video with Python code
I had existing srt subtitle files such as:
1
00:00:02,000 --> 00:00:06,700
Careful now, I don't want to hurt you.
现在要小心了 我可不想伤到你啊

2
00:00:10,500 --> 00:00:14,550
So Mr. Teacher guy, as the real Dragon Warrior,
那么 这个作为神龙斗士老师的你

3
00:00:14,560 --> 00:00:17,950
I say to you, Shakabooey!
我想对你说 滚你的

4
00:00:24,500 --> 00:00:28,030
So, guess you can start planning my parade now.
那 我想你们可以开始我的游行了是吧
...
Or:
1
00:00:02,310 --> 00:00:04,677
I am a little turtle

2
00:00:04,752 --> 00:00:07,540
I crawl so slow

3
00:00:07,670 --> 00:00:12,120
I carry my house wherever I go.

4
00:00:12,210 --> 00:00:16,927
When I get tired, I put in my head,
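Now these files need to be processed and parsed with Python.
The goal is structured data that includes at least: the segment number, the start and end times, and the (first-line) English subtitle.
The data here looks very uniform in format, so it could be matched with regular expressions (re).
But first, check whether a mature library already exists, to save time and avoid reinventing the wheel.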
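For comparison, the regex route mentioned above could look roughly like the following. This is only a minimal sketch under the assumption that every block has an index line, a time line, and at least one text line; the names `SRT_BLOCK` and `parse_srt` are made up for illustration and are not part of any library.

```python
import re

# One SRT block: an index line, a time line, then one or more text lines.
SRT_BLOCK = re.compile(
    r"(?P<index>\d+)\s*\n"
    r"(?P<start>\d{2}:\d{2}:\d{2},\d{3}) --> (?P<end>\d{2}:\d{2}:\d{2},\d{3})\s*\n"
    r"(?P<text>.+?)(?=\n\s*\n|\Z)",
    re.DOTALL,
)


def parse_srt(content):
    """Return a list of dicts with index, start, end and the text lines."""
    # Normalise \r\n and \r so the pattern only has to deal with \n
    content = content.replace("\r\n", "\n").replace("\r", "\n")
    blocks = []
    for m in SRT_BLOCK.finditer(content):
        blocks.append({
            "index": int(m.group("index")),
            "start": m.group("start"),
            "end": m.group("end"),
            "lines": m.group("text").strip().split("\n"),
        })
    return blocks


sample = """1
00:00:02,000 --> 00:00:06,700
Careful now, I don't want to hurt you.
现在要小心了 我可不想伤到你啊

2
00:00:10,500 --> 00:00:14,550
So Mr. Teacher guy, as the real Dragon Warrior,
那么 这个作为神龙斗士老师的你
"""

parsed = parse_srt(sample)
for block in parsed:
    print(block["index"], block["start"], block["end"], block["lines"][0])
```

The times stay as plain strings in this sketch, which is one reason a library with proper time objects is attractive.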
python parse srt file
byroot/pysrt: Python parser for SubRip (srt) files
This one looks good.
srt · PyPI
->
cdown/srt: A tiny library for parsing, modifying, and composing SRT files.
This one does not look convenient enough.
python – Parsing srt subtitles – Stack Overflow
python – parsing transcript .srt files into readable text – Stack Overflow
python – parsing a .srt file with regex – Stack Overflow
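So try pysrt first.
First, install pysrt.
One wrinkle here: this Mac has several Python installations, including more than one Python 3.
I selected the one that appears to correspond to pip3:
Python 3.6.4 64-bit
Then install with pip3: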
➜  xxx_downloadDemo which pip3
/usr/local/bin/pip3
➜  xxx_downloadDemo ll /usr/local/bin/pip*
-rwxr-xr-x  1 crifan  admin   215B  4 20 15:47 /usr/local/bin/pip
-rwxr-xr-x  1 crifan  admin   235B  4 17 10:18 /usr/local/bin/pip2
-rwxr-xr-x  1 crifan  admin   235B  4 17 10:18 /usr/local/bin/pip2.7
-rwxr-xr-x  1 crifan  admin   235B  4 20 15:21 /usr/local/bin/pip3
-rwxr-xr-x  1 crifan  admin   235B  4 20 15:21 /usr/local/bin/pip3.6
➜  xxx_downloadDemo pip3 install pysrt
Collecting pysrt
  Downloading 
https://files.pythonhosted.org/packages/f6/33/16ad65a8973cb8bcb494af09ee1b9ab5ffdd6ff300bce5d3ac7d3cb1f2cc/pysrt-1.1.1.tar.gz
 (104kB)
    100% |████████████████████████████████| 112kB 320kB/s
Requirement already satisfied: chardet in /usr/local/lib/python3.6/site-packages (from pysrt) (3.0.4)
Building wheels for collected packages: pysrt
  Running setup.py bdist_wheel for pysrt ... done
  Stored in directory: /Users/crifan/Library/Caches/pip/wheels/a6/95/51/25db5b533f7c8c3bccf661a7f2bf67caaf893f6f92bb37da33
Successfully built pysrt
Installing collected packages: pysrt
Successfully installed pysrt-1.1.1
You are using pip version 10.0.1, however version 18.0 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Then import it in the code to check whether it is recognized:
import pysrt
It is recognized.
You can also click through to confirm and read the source code.
Next, try out how well pysrt parses an srt file.
Code:
subtitleList = pysrt.open(subtitleFullPath, encoding="utf-8")
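Debugging in VSCode shows the result.
Expanding data gives the list of subtitles I wanted.
But the text of each srt item turned out to contain the English and Chinese mixed together?
The English and Chinese subtitles were not separated?
Printing them out confirmed that the subtitles really were mixed together.
So this is not quite what I want.
-> Either switch libraries, or split the different subtitles myself.
-> Considering this from the demo: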
>>> first_sub.start.seconds = 20
>>> first_sub.end.minutes = 5
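Its time parsing and support look good, so I'll stick with this library and split the subtitle text myself.
This only works if each language's subtitle occupies exactly one line, i.e. a single-language subtitle such as the English one contains no internal line breaks.
I checked the contents of other srt files and they do satisfy this, so the text can be split on the \n newline into either two subtitle lines or a single one.
Here the first line is always the English subtitle; the second line may be absent, and when present it is the Chinese subtitle.
The line breaks here all appear to be \n, but it is worth considering whether \r or \r\n might also occur.
So I looked for a strict way to check:
python check whether a string contains a newline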
python check string contain newline
How can I check for a new line in string in Python 3.x? – Stack Overflow
python – How to check if \n is in a string – Stack Overflow
python find if newline is in string – Stack Overflow
python – check carriage return is there in a given string – Stack Overflow
So I'll just do a simple check:
if "\n" in "xxx"
and then split on it:
        subtitleEn = ""
        subtitleZhcn = ""
        subtitleText = eachSubtitle.text
        if "\n" in subtitleText:
          subtitleTextList = subtitleText.split("\n")
          subtitleEn = subtitleTextList[0]
          if len(subtitleTextList) > 1:
            subtitleZhcn = subtitleTextList[1]
        else:
          subtitleEn = subtitleText
        logging.info("[%d] %s | %s", curNum, subtitleEn, subtitleZhcn)
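The concern raised earlier about \r and \r\n can be handled without a separate check. A minimal sketch, assuming the first non-empty line is English and the second (if any) is Chinese; `split_bilingual` is a hypothetical helper name, not something pysrt provides:

```python
def split_bilingual(text):
    """Split a subtitle's text into an (english, chinese) tuple."""
    # str.splitlines() treats \n, \r and \r\n (and a few rarer separators)
    # as line boundaries, so no explicit newline check is needed.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    english = lines[0] if lines else ""
    chinese = lines[1] if len(lines) > 1 else ""
    return english, chinese


print(split_bilingual("I say to you, Shakabooey!\r\n我想对你说 滚你的"))
print(split_bilingual("I am a little turtle"))
```

Compared with `split("\n")`, this variant also avoids a trailing "\r" being left on the English line when a file uses Windows line endings.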
With the text split working, next get the start and end times.
Code:
startTime = eachSubtitle.start
endTime = eachSubtitle.end
The times come back nicely, with hours, minutes, seconds and milliseconds available on each one.
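Those four fields are enough to compute offsets and durations. The sketch below works on any object exposing `hours`, `minutes`, `seconds` and `milliseconds`; `FakeTime` is a stand-in invented here so the example runs without pysrt installed, and `to_milliseconds` is likewise a hypothetical helper.

```python
from collections import namedtuple

# Hypothetical stand-in for a time object with the four fields listed above.
FakeTime = namedtuple("FakeTime", "hours minutes seconds milliseconds")


def to_milliseconds(t):
    """Convert a time with hours/minutes/seconds/milliseconds to total ms."""
    return ((t.hours * 60 + t.minutes) * 60 + t.seconds) * 1000 + t.milliseconds


# Times taken from subtitle 2 of the first sample: 00:00:10,500 --> 00:00:14,550
start = FakeTime(0, 0, 10, 500)
end = FakeTime(0, 0, 14, 550)
duration_ms = to_milliseconds(end) - to_milliseconds(start)
print(duration_ms)  # 4050
```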
[Summary]
In the end I used the pysrt library to parse the srt subtitles.
Code:
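import os
import pysrt

# courseId and courseRootFolder are defined elsewhere in the project
subtitleFilename = "course_%s_subtitle.srt" % courseId
subtitleFullPath = os.path.join(courseRootFolder, subtitleFilename)
if os.path.exists(subtitleFullPath):
    # the returned object holds the list of subtitle items
    subtitleList = pysrt.open(subtitleFullPath, encoding="utf-8")
    getOk = True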
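To get from the parsed items to the structured data described at the start (segment number, start time, end time, English subtitle), something like the following could be used. It is a sketch only: `to_records` is a hypothetical helper, the items are assumed to expose `index`, `start`, `end` and `text`, and `str()` on the times is assumed to give the SRT-style timestamp. The demo uses `SimpleNamespace` stand-ins so it runs without pysrt.

```python
from types import SimpleNamespace


def to_records(items):
    """Build one dict per subtitle: index, start, end, English line."""
    records = []
    for item in items:
        lines = item.text.splitlines()
        records.append({
            "index": item.index,
            "start": str(item.start),
            "end": str(item.end),
            # By the convention above, the first line is the English subtitle
            "english": lines[0] if lines else "",
        })
    return records


# Stand-in items (hypothetical) mirroring the two sample files
demo_items = [
    SimpleNamespace(index=1, start="00:00:02,310", end="00:00:04,677",
                    text="I am a little turtle"),
    SimpleNamespace(index=3, start="00:00:14,560", end="00:00:17,950",
                    text="I say to you, Shakabooey!\n我想对你说 滚你的"),
]

records = to_records(demo_items)
for record in records:
    print(record)
```

With pysrt installed, passing the result of `pysrt.open(subtitleFullPath, encoding="utf-8")` to `to_records` should behave the same way, provided the assumptions above hold.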

Please credit when reposting: 在路上 » [Solved] Parsing .srt subtitle files with Python
