Use grep to remove words from a dictionary whose roots are already present
Question
I am trying to write a random passphrase generator. I have a dictionary with a bunch of words and I would like to remove words whose root is already in the dictionary, so that a dictionary that looks like:
ablaze
able
abler
ablest
abloom
ably
would end up with only
ablaze
able
abloom
ably
because abler and ablest contain able, which was previously used.
I would prefer to do this with grep so that I can learn more about how it works. I am capable of writing a program in C or Python that will do this.
Solution
If the list is sorted so that shorter strings always precede longer strings, you might be able to get fairly good performance out of a simple Awk script.
awk '$1~r && p in k { next } { k[$1]++; print; r= "^" $1; p=$1 }' words
If the current word matches the prefix regex r (defined in a moment) and the prefix p (ditto) is in the list of seen keys k, skip it. Otherwise, add the current word to the seen keys, print the current line, create a regex which matches the current word at the beginning of a line (this is now the prefix regex r), and also remember the prefix string in p.
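To make those steps easier to follow, here is the same one-liner spread over several lines with comments, run against the sample list from the question. This is only a sketch: the wrapper function name dedupe_prefixes is hypothetical, and the input is assumed to be sorted with one word per line.

```shell
# Hypothetical wrapper (name is illustrative, not from the original answer).
# Reads a sorted word list from stdin or from the files given as arguments.
dedupe_prefixes() {
  awk '
    # r is the regex "^" followed by the last kept word; p is that word.
    # "p in k" is false only before any word has been kept, so the
    # first line always falls through to the second rule.
    $1 ~ r && p in k { next }

    # Keep this word: record it, print it, and make it the new prefix.
    { k[$1]++; print; r = "^" $1; p = $1 }
  ' "$@"
}

# Sample list from the question.
printf '%s\n' ablaze able abler ablest abloom ably | dedupe_prefixes
```

With the sample list, abler and ablest match ^able and are skipped, while abloom and ably do not match and each becomes the new prefix in turn.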
If all the similar strings are always adjacent (as they would be if you sort the file lexically), you could do away with k and p entirely too, I guess.
awk 'NR>1 && $1~r { next } { print; r="^" $1 }' words
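Since the question asked about grep specifically, one possible way to get the same result is to turn every word into an anchored pattern and let grep -v drop the lines that match any of them. This is a rough sketch, not part of the original answer: the function name remove_extended_words is hypothetical, and it assumes the list has no blank lines, no duplicates, and no regex metacharacters in the words. It is likely to be much slower than the Awk script on a large dictionary, because every line is checked against every pattern.

```shell
# Hypothetical helper (name is illustrative). $1 is the word list file.
remove_extended_words() {
  rew_patterns=$(mktemp) || return 1

  # Turn each word w into the pattern "^w." which matches w followed
  # by at least one more character, so a word never matches itself.
  sed 's/^/^/; s/$/./' "$1" > "$rew_patterns"

  # Print only the lines that match none of the patterns.
  grep -v -f "$rew_patterns" "$1"

  rm -f "$rew_patterns"
}

# Demo with the sample list from the question.
demo_words=$(mktemp)
printf '%s\n' ablaze able abler ablest abloom ably > "$demo_words"
remove_extended_words "$demo_words"
rm -f "$demo_words"
```

Unlike the Awk version, this does not depend on the input being sorted, since each word is compared against the whole list.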