How to create a frequency list of every word in a file?
Problem description
I have a file like this:
This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.
I would like to generate a two-column list. The first column shows what words appear, the second column shows how often they appear, for example:
this@1
is@1
a@1
file@1
with@1
many@1
words@3
some@2
of@2
the@2
only@1
appear@2
more@1
than@1
one@1
once@1
time@1
- To make this work simpler, prior to processing the list, I will remove all punctuation, and change all text to lowercase letters.
- Unless there is a simple solution around it, "words" and "word" can count as two separate words.
So far, I have this:
sed -i "s/ /\n/g" ./file1.txt # put all words on a new line
while read line
do
count="$(grep -c $line file1.txt)"
echo $line"@"$count >> file2.txt # add word and frequency to file
done < ./file1.txt
sort -u -d # remove duplicate lines
For some reason, this is only showing "0" after each word.
How can I generate a list of every word that appears in a file, along with frequency information?
Recommended answer
Not sed and grep, but tr, sort, uniq, and awk:
% (tr ' ' '\n' | sort | uniq -c | awk '{print $2"@"$1}') <<EOF
This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.
EOF
a@1
appear@2
file@1
is@1
many@1
more@1
of@2
once.@1
one@1
only@1
Some@2
than@1
the@2
This@1
time.@1
with@1
words@2
words.@1
In most cases you also want to remove numbers and punctuation, convert everything to lowercase (otherwise "THE", "The" and "the" are counted separately), and suppress the entry for a zero-length word. For ASCII text you can do all of this with the following modified command:
sed -e 's/[^A-Za-z]/ /g' text.txt | tr 'A-Z' 'a-z' | tr ' ' '\n' | grep -v '^$' | sort | uniq -c | sort -rn
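The cleaned-up command prints each count before its word, as uniq -c does, rather than the word@count layout asked for in the question. One way to get that layout back is to end the pipeline with the awk step from the first command. A minimal sketch, assuming ASCII text on standard input; the function name word_freq is a hypothetical wrapper, not something from the original answer:

```shell
#!/bin/sh
# word_freq: hypothetical helper name (an assumption for illustration).
# Reads ASCII text on stdin and prints one "word@count" line per distinct word.
word_freq() {
    sed -e 's/[^A-Za-z]/ /g' |   # replace every non-letter with a space
    tr 'A-Z' 'a-z' |             # fold everything to lowercase
    tr ' ' '\n' |                # put each word on its own line
    grep -v '^$' |               # drop the empty lines left by repeated spaces
    sort |                       # group identical words together
    uniq -c |                    # prefix each distinct word with its count
    awk '{print $2 "@" $1}'      # reorder "count word" into "word@count"
}

# Usage with the sample text from the question.
word_freq <<'EOF'
This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.
EOF
```

For the sample text this lists the words alphabetically, with words@3, some@2 and the@2 among the entries. Appending | sort -t@ -k2,2nr to the call should put the most frequent words first instead.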