Remove words from a subtitle file that aren't in a wordlist (of common words)


Problem description

I have some subtitle files, and I don't intend to learn every single word in them. There is no need to learn hard terms such as cleidocranial, dysplasia...

I found this script here: Remove words from a cell that aren't in a list. But I have no idea how to modify or run it. (I'm using Linux.)

Here is our example:

Subtitle file (.srt):

2
00:00:13,000 --> 00:00:15,000
People with cleidocranial dysplasia are good.

Word list of 3000 common words (.txt):

...
people
with
are
good
...

Output we need (.srt):

2
00:00:13,000 --> 00:00:15,000
People with * * are good.

Or, if possible, just mark them (.srt):

2
00:00:13,000 --> 00:00:15,000
People with cleidocranial* dysplasia* are good.

If there is a solution that works only with plain text (without timecodes), that's fine; just explain how to run it.
Thank you.

Recommended answer

The following processes only the 3rd line of every '.srt' file. It can easily be adapted to process other lines and/or other files.

import os
import re
from glob import glob

# Load the wanted words, lowercased, into a set for fast membership tests.
with open('words.txt') as f:
    keep_words = {line.strip().lower() for line in f}

for filename_in in glob('*.srt'):
    # 'name.srt' is written out as 'name_new.srt'.
    filename_out = f'{os.path.splitext(filename_in)[0]}_new.srt'
    with open(filename_in) as fin, open(filename_out, 'w') as fout:
        for i, line in enumerate(fin):
            if i == 2:  # only the 3rd line of the file is modified
                # The capturing group keeps the separators; words sit at odd indices.
                parts = re.split(r"([\w']+)", line.strip())
                parts[1::2] = [w if w.lower() in keep_words else '*' for w in parts[1::2]]
                line = ''.join(parts) + '\n'
            fout.write(line)

Result (for the subtitle.srt you gave as an example):

! cat subtitle_new.srt
2
00:00:13,000 --> 00:00:15,000
People with * * are good.

Alternative: just add a '*' next to out-of-vocabulary words:

# replace:
#                 parts[1::2] = [w if w.lower() in keep_words else '*' for w in parts[1::2]]
                parts[1::2] = [w if w.lower() in keep_words else f'{w}*' for w in parts[1::2]]

Then the output is:

2
00:00:13,000 --> 00:00:15,000
People with cleidocranial* dysplasia* are good.
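
Both variants can be tried on a single line without touching any files. The following is a minimal sketch: the function name mask_line, its mark_only flag, and the four-word keep_words set are illustrative assumptions, not part of the original script.

```python
import re

def mask_line(line, keep_words, mark_only=False):
    """Replace (or only mark) words whose lowercase form is not in keep_words."""
    # Splitting on a capturing group keeps the separators,
    # so the words end up at the odd indices of parts.
    parts = re.split(r"([\w']+)", line)
    parts[1::2] = [
        w if w.lower() in keep_words else (f"{w}*" if mark_only else "*")
        for w in parts[1::2]
    ]
    return "".join(parts)

# Tiny stand-in for the 3000-word list (assumption for the demo).
keep_words = {"people", "with", "are", "good"}
text = "People with cleidocranial dysplasia are good."

print(mask_line(text, keep_words))                  # People with * * are good.
print(mask_line(text, keep_words, mark_only=True))  # People with cleidocranial* dysplasia* are good.
```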

Explanation:

  • The first open reads in all wanted words, lowercases them, and puts them into a set (for fast membership tests).
  • We use glob to find all filenames ending in '.srt'.
  • For each such file, we construct a new filename derived from it, of the form '..._new.srt'.
  • We read in all lines, but modify only line i == 2 (i.e. the 3rd line, since enumerate starts at 0 by default).
  • line.strip() removes leading and trailing whitespace, including the trailing newline.
  • We could have used line.strip().split() to split the line into words, but that would have left 'good.' (with the period) as the last word, which would not match the word list. The regex used here is a common way to split words (in particular, it keeps apostrophes inside words such as "don't"; that may or may not be what you want, so adapt it as needed).
  • We split with a capturing group, r"([\w']+)", instead of splitting on non-word chars, so that parts holds both the words and what separates them. For example, 'People, who are good.' becomes ['', 'People', ', ', 'who', ' ', 'are', ' ', 'good', '.'].
  • The words themselves are every other element of parts, starting at index 1.
  • We replace a word with '*' if its lowercase form is not in keep_words.
  • Finally, we re-assemble that line and write all lines, modified or not, to the new file.
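
The answer notes that the script can be adapted to process other lines. One possible adaptation, sketched below, masks the text lines of every cue rather than only the 3rd line of the file. It assumes a conventional SRT layout (a numeric cue index, a timecode line containing '-->', one or more text lines, then a blank line). The name mask_srt, the TIMECODE pattern, and the sample data are hypothetical and not from the original answer.

```python
import re

# Assumed timecode shape: HH:MM:SS,mmm --> HH:MM:SS,mmm
TIMECODE = re.compile(r"\d{2}:\d{2}:\d{2},\d{3}\s*-->\s*\d{2}:\d{2}:\d{2},\d{3}")

def mask_srt(lines, keep_words):
    """Yield SRT lines, replacing out-of-vocabulary words in text lines with '*'."""
    for line in lines:
        stripped = line.strip()
        # Blank lines, cue numbers and timecode lines pass through unchanged.
        if not stripped or stripped.isdigit() or TIMECODE.match(stripped):
            yield line
            continue
        parts = re.split(r"([\w']+)", stripped)
        parts[1::2] = [w if w.lower() in keep_words else "*" for w in parts[1::2]]
        yield "".join(parts) + "\n"

# Small in-memory sample standing in for an opened .srt file.
sample = [
    "2\n",
    "00:00:13,000 --> 00:00:15,000\n",
    "People with cleidocranial dysplasia are good.\n",
    "\n",
    "3\n",
    "00:00:15,500 --> 00:00:17,000\n",
    "Good people are rare.\n",
]
keep_words = {"people", "with", "are", "good"}
result = list(mask_srt(sample, keep_words))
print("".join(result), end="")
```

To use this on real files, an open file object can be passed in place of sample and the yielded lines written to the output file, as in the loop of the original script. A limitation of this sketch is that a text line made up only of digits is treated as a cue number and left unchanged. To run either version, save it in the directory that holds words.txt and the .srt files (for example as mask_subs.py, a hypothetical name) and start it with python3 mask_subs.py.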
