How to remove duplicate words from a plain text file using linux command


Problem description

I have a plain text file with words that are separated by commas, for example:

word1, word2, word3, word2, word4, word5, word 3, word6, word7, word3

I want to delete the duplicates so that it becomes:

word1, word2, word3, word4, word5, word6, word7

Any ideas? I think egrep can help me, but I'm not sure how to use it exactly.

Recommended answer

Assuming that the words are one per line and the file is already sorted:

uniq filename
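
Note that uniq only collapses adjacent duplicate lines, which is why the input has to be sorted first; a quick illustration:

$ printf 'word2\nword2\nword1\nword2\n' | uniq
word2
word1
word2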

If the file isn't sorted:

sort filename | uniq
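
Equivalently, sort can drop the duplicate lines itself in one step:

sort -u filename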

If they're not one per line, and you don't mind them being one per line:

tr -s '[:space:]' '\n' < filename | sort | uniq

That doesn't remove punctuation, though, so maybe you want:

tr -s '[:space:][:punct:]' '\n' < filename | sort | uniq

But that removes the hyphen from hyphenated words. See "man tr" for more options.
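
If you also want the deduplicated words back on a single comma-separated line, as in the desired output above, here is a minimal sketch; it assumes GNU awk (for the regular-expression record separator) and keeps the first occurrence of each word in its original order:

# Split the input on ", " or newline, print each word only the first time it
# is seen, and rejoin the survivors with ", " on one line (assumes GNU awk).
awk -v RS=', *|\n' '!seen[$0]++ { printf "%s%s", sep, $0; sep=", " } END { print "" }' filename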
