Remove words from a subtitle file that aren't in a wordlist (of common words)

Views: 103 Published: 2021/5/13 19:31:45 Tags: python text grep subtitle

Question

I have some subtitle files, and I don't intend to learn every single word in them; there is no need to learn hard terms such as cleidocranial, dysplasia...

I found this script here: Remove words from a cell that aren't in a list. But I have no idea how to modify it or run it. (I'm using linux)

Here is an example:

Subtitle file (.srt):

2
00:00:13,000 --> 00:00:15,000
People with cleidocranial dysplasia are good.

Wordlist of 3000 common words (.txt):

...
people
with
are
good
...

The output we need (.srt):

2
00:00:13,000 --> 00:00:15,000
People with * * are good.

Or, if possible, just mark them (.srt):

2
00:00:13,000 --> 00:00:15,000
People with cleidocranial* dysplasia* are good.

A solution that works only on plain text (without timecodes) is fine too; please just explain how to run it.
Thank you.

Recommended answer

The following processes the 3rd line only of every '.srt' file. It can be easily adapted to process other lines and/or other files.

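import os
import re
from glob import glob

# Load the wordlist (one word per line) into a lowercase set.
with open('words.txt') as f:
    keep_words = {line.strip().lower() for line in f}

# Write a masked copy '<name>_new.srt' for every '.srt' file in the current directory.
for filename_in in glob('*.srt'):
    filename_out = f'{os.path.splitext(filename_in)[0]}_new.srt'
    with open(filename_in) as fin, open(filename_out, 'w') as fout:
        for i, line in enumerate(fin):
            if i == 2:  # third line of the file: the subtitle text
                # The capturing group keeps words at odd indices, separators at even ones.
                parts = re.split(r"([\w']+)", line.strip())
                parts[1::2] = [w if w.lower() in keep_words else '*' for w in parts[1::2]]
                line = ''.join(parts) + '\n'
            fout.write(line)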

Result (for the subtitle.srt you gave as example):

! cat subtitle_new.srt
2
00:00:13,000 --> 00:00:15,000
People with * * are good.

Alternative: just add a '*' next to out-of-vocabulary words:

# replace:
#                 parts[1::2] = [w if w.lower() in keep_words else '*' for w in parts[1::2]]
                parts[1::2] = [w if w.lower() in keep_words else f'{w}*' for w in parts[1::2]]

Then the output is:

2
00:00:13,000 --> 00:00:15,000
People with cleidocranial* dysplasia* are good.
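
The script above touches only the third line of each file, which is enough for the single-cue example. For a file with many cues, one possible adaptation is sketched below. It is not part of the original answer. It assumes the usual SRT layout (a cue number line, a timecode line containing '-->', then one or more text lines, then a blank line), and the names mask_srt, TIMECODE and mark_only are made up for illustration.

```python
import re

# Assumed shape of an SRT timecode line, e.g. "00:00:13,000 --> 00:00:15,000".
TIMECODE = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3}\s*-->")
WORD = re.compile(r"([\w']+)")


def mask_srt(text, keep_words, mark_only=False):
    """Mask out-of-vocabulary words on every subtitle text line of `text`."""
    out = []
    for line in text.splitlines():
        stripped = line.strip()
        # Leave blank lines, cue numbers and timecode lines untouched.
        if not stripped or stripped.isdigit() or TIMECODE.match(stripped):
            out.append(line)
            continue
        # Same capturing-group split as in the answer: words sit at odd indices.
        parts = WORD.split(line)
        parts[1::2] = [
            w if w.lower() in keep_words else (f"{w}*" if mark_only else "*")
            for w in parts[1::2]
        ]
        out.append("".join(parts))
    return "\n".join(out) + "\n"


sample = (
    "1\n00:00:01,000 --> 00:00:02,000\nHello there.\n\n"
    "2\n00:00:13,000 --> 00:00:15,000\n"
    "People with cleidocranial dysplasia are good.\n"
)
keep = {"people", "with", "are", "good", "hello"}
print(mask_srt(sample, keep))
print(mask_srt(sample, keep, mark_only=True))
```

One known limitation of this sketch: a subtitle text line consisting only of digits would be mistaken for a cue number and left unmasked. To run either version on Linux, saving the code to a file (for example mask.py, a hypothetical name) next to words.txt and the .srt files and calling python3 mask.py from that directory should be enough; f-strings require Python 3.6 or later.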

Explanation:

The first open is used to read in all wanted words, make sure they are in lowercase, and put them into a set (for fast membership test).
We use glob to find all filenames ending in '.srt'.
For each such file, we construct a new filename derived from it as '..._new.srt'.
We read in all lines, but modify only line i == 2 (i.e. the 3rd line, since enumerate by default starts at 0).
line.strip() removes the trailing newline.
We could have used line.strip().split() to split the line into words, but it would have left 'good.' as the last word; not good. The regex used is often used to split words (in particular, it leaves in single quotes such as "don't"; it may or may not be what you want, adapt at will of course).
We use a capturing group split r"([\w']+)" instead of splitting on non-word chars, so that we have both words and what separates them in parts. For example, 'People, who are good.' becomes ['', 'People', ', ', 'who', ' ', 'are', ' ', 'good', '.'].
The words themselves are every other element of parts, starting at index 1.
We replace the words by '*' if their lowercase form is not in keep_words.
Finally we re-assemble that line, and generally output all lines to the new file.
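
The difference between the two splitting approaches described above can be checked in isolation. This is a small illustrative snippet, not part of the original answer; the variable names are arbitrary.

```python
import re

line = "People, who are good."

# Plain whitespace splitting keeps punctuation glued to the words.
naive = line.split()
print(naive)   # ['People,', 'who', 'are', 'good.']

# Splitting on a capturing group returns the words (odd indices)
# as well as whatever separates them (even indices).
parts = re.split(r"([\w']+)", line)
print(parts)   # ['', 'People', ', ', 'who', ' ', 'are', ' ', 'good', '.']

words = parts[1::2]
separators = parts[0::2]

# Because nothing is dropped, joining the parts restores the line exactly.
assert "".join(parts) == line

# The apostrophe in the character class keeps contractions in one piece.
print(re.split(r"([\w']+)", "I don't know")[1::2])   # ['I', "don't", 'know']
```

Since the separators survive the split, replacing only the odd-indexed elements leaves spacing and punctuation exactly as they were in the subtitle.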
