How to generate a list of (unique) words from a text file in Ubuntu?


Question

I have an ASCII text file. I want to generate a list of all "words" from that file using one or more Ubuntu commands. A word is defined as an alphanumeric sequence between delimiters. Delimiters are whitespace by default, but I also want to experiment with other characters such as punctuation. In other words, I want to be able to specify a delimiter character set. How do I produce only a unique set of words? What if I also want to list only those words that are at least N characters long?

Recommended answer

You could use grep:

-E '\w+' searches for words (runs of letters, digits, and underscores)

-o prints only the matching part of each line

% cat temp
Some examples use "The quick brown fox jumped over the lazy dog," rather than "Lorem ipsum dolor sit amet, consectetur adipiscing elit" for example text.

% grep -o -E '\w+' temp
Some
examples
use
The
quick
brown
fox
jumped
over
the
lazy
dog
rather
than
Lorem
ipsum
dolor
sit
amet
consectetur
adipiscing
elit
for
example
text
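
Note that \w also matches the underscore, which is not strictly alphanumeric. A minimal sketch of a stricter variant, assuming a grep that supports -o and POSIX character classes (as GNU grep on Ubuntu does); the helper name extract_words is hypothetical and not part of the original answer:

```shell
# Hypothetical helper: print every run of characters drawn from the
# bracket-expression contents CHARS, one run per line.
# Usage: extract_words CHARS FILE
extract_words() {
    chars="$1"
    file="$2"
    # "[${chars}]+" becomes e.g. '[[:alnum:]]+' or '[[:alnum:]-]+'
    grep -o -E "[${chars}]+" "$file"
}
```

Calling extract_words '[:alnum:]' temp would treat everything except letters and digits as a delimiter, while extract_words '[:alnum:]-' temp would also keep hyphens inside words.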

If you want to print each word only once, disregarding case, you can use sort:

-u prints each word only once

-f tells sort to ignore case when comparing words

% grep -o -E '\w+' temp | sort -u -f
adipiscing
amet
brown
consectetur
dog
dolor
elit
example
examples
for
fox
ipsum
jumped
lazy
Lorem
over
quick
rather
sit
Some
text
than
The
use
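
The question also asks about listing only words of at least N characters, which the commands above do not cover. One possible approach, sketched under the assumption that GNU grep is in use, is an interval expression such as \w{N,}; the helper name words_min_len is hypothetical:

```shell
# Hypothetical helper: list unique words of at least N characters,
# ignoring case when removing duplicates.
# Usage: words_min_len N FILE
words_min_len() {
    n="$1"
    file="$2"
    # \w{N,} matches runs of N or more word characters; shorter words
    # cannot match, so they are dropped from the output.
    grep -o -E "\\w{${n},}" "$file" | sort -u -f
}
```

For instance, words_min_len 5 temp would keep only the words with five or more characters.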

You can also use the tr command:

echo the quick brown fox jumped over the lazydog | tr -cs 'a-zA-Z0-9' '\n'
the
quick
brown
fox
jumped
over
the
lazydog

-c takes the complement of the specified character set; -s squeezes each run of repeated replacement characters into a single one; 'a-zA-Z0-9' is the set of alphanumerics, and if you add a character to it, the input will no longer be split on that character (see the next example); '\n' is the replacement character (newline).

echo the quick brown fox jumped over the lazy-dog | tr -cs 'a-zA-Z0-9-' '\n'
the
quick
brown
fox
jumped
over
the
lazy-dog

Because we added '-' to the set of non-delimiter characters, lazy-dog was printed as a single word. Otherwise the output is:

echo the quick brown fox jumped over the lazy-dog | tr -cs 'a-zA-Z0-9' '\n'
the
quick
brown
fox
jumped
over
the
lazy
dog

Summary for tr: any character that is not in the set passed to tr -c acts as a delimiter. I hope this solves your delimiter problem too.
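
The tr approach can be combined with the uniqueness and minimum-length requirements from the question. The following is only a sketch, assuming standard tr, awk, and sort; the helper name unique_words and its parameters are hypothetical:

```shell
# Hypothetical helper: split standard input on every character NOT in
# the KEEP set, then list unique words of at least MIN characters.
# Usage: unique_words KEEP MIN < input
unique_words() {
    keep="$1"
    min="$2"
    # tr turns each run of delimiter characters into one newline;
    # awk keeps lines of at least MIN characters (this also drops any
    # empty line produced by leading delimiters when MIN >= 1);
    # sort -u -f removes duplicates, ignoring case.
    tr -cs "$keep" '\n' | awk -v min="$min" 'length($0) >= min' | sort -u -f
}
```

For example, unique_words 'a-zA-Z0-9-' 4 < temp would keep hyphenated words intact and drop words shorter than four characters.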
