Parse subtitle file using regex C#
Question
I need to find the number, the in and out timecode points, and all lines of the text for each subtitle entry.
9
00:09:48,347 --> 00:09:52,818
- Let's see... what else she's got?
- Yea... ha, ha.

10
00:09:56,108 --> 00:09:58,788
What you got down there, missy?

11
00:09:58,830 --> 00:10:00,811
I wouldn't do that!

12
00:10:03,566 --> 00:10:07,047
-Shit, that's not enough!
-Pull her back!
I'm currently using this pattern, but it skips every entry whose text spans two lines:
(?<Order>\d+)\r\n(?<StartTime>(\d\d:){2}\d\d,\d{3}) --> (?<EndTime>(\d\d:){2}\d\d,\d{3})\r\n(?<Sub>.+)(?=\r\n\r\n\d+|$)
Any help would be appreciated.
Recommended answer
I think there are two problems with the regex. The first is that the . near the end, in (?<Sub>.+), does not match newlines. So you could modify it to:
(?<Sub>(.|[\r\n])+?)
Or you could specify RegexOptions.Singleline as an option to the regex. The only thing that option does is make the dot match newlines.
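To see the effect of RegexOptions.Singleline in isolation, here is a minimal sketch. The class name SinglelineDemo and the helper CaptureSub are made up for illustration; only Regex.Match and RegexOptions come from .NET. Note that by default the .NET dot matches every character except \n, so the carriage return still ends up in the capture.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the original question or answer.
public static class SinglelineDemo
{
    // Returns what (?<Sub>.+) captures from the text under the given options.
    public static string CaptureSub(string text, RegexOptions options)
    {
        return Regex.Match(text, "(?<Sub>.+)", options).Groups["Sub"].Value;
    }

    public static void Main()
    {
        string text = "first line\r\nsecond line";

        // Default: '.' stops at '\n', so the capture is "first line\r".
        Console.WriteLine(CaptureSub(text, RegexOptions.None).Length);       // 11

        // Singleline: '.' also matches '\n', so the capture spans both lines.
        Console.WriteLine(CaptureSub(text, RegexOptions.Singleline).Length); // 23
    }
}
```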
The second problem is that .+ is greedy: once the dot can match newlines, it matches as many lines as it can. You can make it non-greedy and stop at the end of the entry like:
(?<Sub>(.|[\r\n])+?(?=\r\n\r\n|$))
This matches the least amount of text that is followed by an empty line or the end of the string.
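Putting the pieces together, the following is a rough sketch of how the question's pattern could look with the non-greedy Sub group swapped in. The class name SrtRegexDemo and the Parse helper are hypothetical, and the sketch assumes the input uses \r\n line endings with one empty line between entries, as the pattern itself requires.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical demo class; assumes CRLF line endings and blank-line-separated entries.
public static class SrtRegexDemo
{
    // The question's pattern with the answer's non-greedy, newline-aware Sub group.
    private const string Pattern =
        @"(?<Order>\d+)\r\n" +
        @"(?<StartTime>(\d\d:){2}\d\d,\d{3}) --> (?<EndTime>(\d\d:){2}\d\d,\d{3})\r\n" +
        @"(?<Sub>(.|[\r\n])+?(?=\r\n\r\n|$))";

    // Returns one array per entry: { Order, StartTime, EndTime, Sub }.
    public static List<string[]> Parse(string srt)
    {
        var entries = new List<string[]>();
        foreach (Match m in Regex.Matches(srt, Pattern))
        {
            entries.Add(new[]
            {
                m.Groups["Order"].Value,
                m.Groups["StartTime"].Value,
                m.Groups["EndTime"].Value,
                m.Groups["Sub"].Value
            });
        }
        return entries;
    }

    public static void Main()
    {
        string srt = string.Join("\r\n", new[]
        {
            "9",
            "00:09:48,347 --> 00:09:52,818",
            "- Let's see... what else she's got?",
            "- Yea... ha, ha.",
            "",
            "10",
            "00:09:56,108 --> 00:09:58,788",
            "What you got down there, missy?"
        });

        foreach (string[] e in Parse(srt))
        {
            // Sub keeps its internal line break for two-line entries.
            Console.WriteLine("{0} | {1} | {2} | {3}", e[0], e[1], e[2], e[3].Replace("\r\n", " / "));
        }
    }
}
```

If the file ends with a trailing line break, the last Sub value may carry a stray \r, so trimming the captured text is probably a good idea in real use.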