How to remove duplicate words from a plain text file using Linux commands
Problem description
I have a plain text file with words separated by commas, for example:
word1, word2, word3, word2, word4, word5, word 3, word6, word7, word3
I want to delete the duplicates so that it becomes:
word1, word2, word3, word4, word5, word6, word7
Any ideas? I think egrep can help me, but I'm not sure how to use it exactly....
Recommended answer
Assuming the words are one per line and the file is already sorted:
uniq filename
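The sorting requirement matters because uniq only collapses *adjacent* duplicate lines. A quick illustration (sample words are made up for the demo):

```shell
# duplicates that are not adjacent survive uniq unchanged
printf 'word2\nword1\nword2\n' | uniq   # all three lines remain

# once duplicates sit next to each other, uniq collapses them
printf 'word1\nword2\nword2\n' | uniq   # word2 appears only once
```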
If the file is not sorted:
sort filename | uniq
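As a minor shorthand, sort's -u flag (standard in POSIX sort) does both steps in one command:

```shell
# sorts the lines and drops duplicates in a single pass,
# equivalent to: sort filename | uniq
sort -u filename
```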
If they're not one per line, and you don't mind them ending up one per line:
tr -s '[:space:]' '\n' < filename | sort | uniq
That doesn't remove punctuation, though, so maybe you want:
tr -s '[:space:][:punct:]' '\n' < filename | sort | uniq
But that also removes the hyphen from hyphenated words. Run "man tr" for more options.