Remove all lines containing word NOT in dictionary
Question
I have a dictionary of English words from /usr/share/dict/words.
I have a huge file of sentences, line by line. I'm trying to remove these weird sentences with foreign and out of vocabulary words by comparing against the dictionary.
Master.txt
Thanks to Your Greatness (谢谢你的美好)
Himatnagar has a small Railway Station
Pu$haz Ink
Can anyone help? I tried using diff, but it can only compare at the word level, not the sentence level.
Recommended answer
You need to do this in stages.
First, using tr (or maybe sed, which is slightly slower but more flexible and allows more precise removal of punctuation and so on), you chunk the sentence file into words:
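# -x makes grep match whole lines, so a word is dropped only when it equals a dictionary entry exactly
tr " " "\n" < hugefile | sort | uniq | grep -v -x -F -f dictionary > blacklist.txt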
Add the -i option to grep for case insensitivity (see the comment by Scott).
You then use uniq to gather unique words, and grep -v -F -f dictionary to get all words that are not in the dictionary.
Once you have this "blacklist", you can request all the lines that do not contain any word in the blacklist itself. Again, you may want to consider upper/lower case, or not:
grep -v -F -f blacklist.txt hugefile > goodlines.txt
In Python you can follow the same approach, perhaps more efficiently:
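- Load the dictionary into a list D.
- For each line of the input hugefile:
  - Split it into words and make this small list unique. Call it W.
  - Compute the intersection of the two lists W and D.
  - If its length is the same as the length of W, the line contains no unknown words.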
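The Python steps above could be sketched roughly as follows. This is a minimal illustration, not the answerer's own code: the function names load_dictionary, keep_known_lines and filter_file are hypothetical, sets are used in place of lists so the intersection is cheap, and it assumes that words are separated by whitespace, that comparison is case-insensitive, and that punctuation is left attached to words.

```python
def load_dictionary(lines):
    """Build the set D of known words (lowercased) from dictionary lines."""
    return {line.strip().lower() for line in lines if line.strip()}


def keep_known_lines(lines, dictionary):
    """Yield the lines whose words all appear in the dictionary."""
    for line in lines:
        # W: the unique words of this line (assumption: whitespace-separated,
        # lowercased, punctuation not stripped).
        words = {word.lower() for word in line.split()}
        # The line has no unknown word exactly when |W & D| == |W|.
        # Blank lines have an empty W and are therefore kept.
        if len(words & dictionary) == len(words):
            yield line


def filter_file(dict_path, in_path, out_path):
    """Hypothetical wrapper: stream in_path to out_path, keeping known lines."""
    with open(dict_path, encoding="utf-8") as dict_file:
        dictionary = load_dictionary(dict_file)
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as out:
        out.writelines(keep_known_lines(src, dictionary))
```

A call such as filter_file("/usr/share/dict/words", "hugefile", "goodlines.txt") would then mirror the shell pipeline. Because the huge file is read line by line, only the dictionary has to fit in memory.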