How to remove duplicate words from a plain text file using linux command


Problem description

I have a plain text file with words that are separated by commas, for example:

word1, word2, word3, word2, word4, word5, word 3, word6, word7, word3

I want to delete the duplicates so that it becomes:

word1, word2, word3, word4, word5, word6, word7

Any ideas? I think egrep can help me, but I'm not sure how to use it exactly.

Recommended answer

Assuming that the words are one per line and the file is already sorted:

uniq filename
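
Note that uniq only collapses adjacent duplicate lines, which is why the input has to be sorted first; a quick illustration:

$ printf 'word2\nword2\nword1\nword2\n' | uniq
word2
word1
word2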

If the file isn't sorted:

sort filename | uniq
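
Equivalently, sort can drop the duplicate lines itself in one step:

sort -u filename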

If they're not one per line, and you don't mind them being one per line:

tr -s '[:space:]' '\n' < filename | sort | uniq

That doesn't remove punctuation, though, so maybe you want:

tr -s '[:space:][:punct:]' '\n' < filename | sort | uniq

But that removes the hyphen from hyphenated words. See "man tr" for more options.
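
If you also want the deduplicated words back on a single comma-separated line, as in the desired output above, here is a minimal sketch; it assumes GNU awk (for the regular-expression record separator) and keeps the first occurrence of each word in its original order:

# Split the input on ", " or newline, print each word only the first time it
# is seen, and rejoin the survivors with ", " on one line (assumes GNU awk).
awk -v RS=', *|\n' '!seen[$0]++ { printf "%s%s", sep, $0; sep=", " } END { print "" }' filename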
