Merging word counts with Bash and Unix
Question
I made a Bash script that extracts words from a text file with grep and sed, sorts them with sort, counts the repetitions with wc, and then sorts again by frequency. The example output looks like this:
12 the
7 code
7 with
7 add
5 quite
3 do
3 well
1 quick
1 can
1 pick
1 easy
Now I'd like to merge all words with the same frequency into one line, like this:
12 the
7 code with add
5 quite
3 do well
1 quick can pick easy
Is there any way to do that with Bash and the standard Unix toolset? Or would I have to write a script or program in some more sophisticated scripting language?
Recommended answer
Use awk:
$ echo "12 the
7 code
7 with
7 add
5 quite
3 do
3 well
1 quick
1 can
1 pick
1 easy" | awk '{cnt[$1]=cnt[$1] ? cnt[$1] OFS $2 : $2} END {for (e in cnt) print e, cnt[e]} ' | sort -nr
12 the
7 code with add
5 quite
3 do well
1 quick can pick easy
You can do something similar with Bash 4 associative arrays, but awk is easier and is specified by POSIX, so use that.
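For comparison, here is a minimal sketch of the Bash 4 associative-array approach mentioned above. It is not part of the original answer; the function name merge_counts is hypothetical, and the sketch assumes Bash 4 or later and input lines of the form "count word".

```shell
#!/usr/bin/env bash
# Sketch only: merge words that share a count using a Bash 4+ associative array.
# merge_counts is a hypothetical helper name, not something from the answer.
merge_counts() {
    local -A words=()                 # maps count -> space-separated words
    local count word
    while read -r count word; do
        [[ -n $count ]] || continue   # skip blank lines
        if [[ -n ${words[$count]+set} ]]; then
            words[$count]+=" $word"   # count seen before: append the word
        else
            words[$count]=$word       # first word for this count
        fi
    done
    # Like awk arrays, Bash associative arrays are unordered,
    # so the result is sorted by count afterwards.
    for count in "${!words[@]}"; do
        printf '%s %s\n' "$count" "${words[$count]}"
    done | sort -nr
}

printf '%s\n' '7 code' '12 the' '7 with' '3 do' '7 add' '3 well' | merge_counts
```

The loop does the same bookkeeping as the awk one-liner, only with more ceremony, which is why awk remains the more convenient choice here.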
Explanation:
- awk splits each line into fields on the separator in FS, in this case the default of horizontal whitespace;
- $1 is the first field, the count. Use it as the key of an associative array, cnt[$1], to collect the items that share the same count;
- cnt[$1]=cnt[$1] ? cnt[$1] OFS $2 : $2 is a ternary assignment. If cnt[$1] has no value yet, just assign the second field $2 to it (the right-hand side of the :). If it already has a value, append $2 to it, separated by the value of OFS (the left-hand side of the :);
- At the end, print out the contents of the associative array.
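To make the ternary easier to follow, the same logic can be written out as an explicit if/else. This is only an illustrative sketch: the function name merge_counts_verbose is hypothetical, and it uses the membership test ($1 in cnt) rather than testing the stored value, which is a small deliberate change from the one-liner.

```shell
# Sketch: the one-liner with the ternary expanded into if/else.
# merge_counts_verbose is a hypothetical name used for illustration.
merge_counts_verbose() {
    awk '
    {
        if ($1 in cnt)                    # count seen before: append the word
            cnt[$1] = cnt[$1] OFS $2
        else                              # first word for this count
            cnt[$1] = $2
    }
    END {
        for (e in cnt) print e, cnt[e]    # traversal order is unspecified
    }' "$@" | sort -nr
}

printf '%s\n' '3 do' '1 quick' '3 well' '1 can' | merge_counts_verbose
```

The function reads standard input when called without arguments, or the files named as arguments otherwise.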
Since awk associative arrays are unordered, you need to sort the output again by the numeric value of the first column. gawk can sort internally, but it is just as easy to call sort. The input to awk does not need to be sorted by frequency, so you can eliminate that part of your pipeline.
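As a rough sketch of how the whole pipeline might then look, the function below goes from raw text to merged counts with only one frequency sort at the end. The asker's script is not shown, so tr and uniq -c are assumed stand-ins for its grep, sed and wc steps, and word_freq_merged is a hypothetical name.

```shell
# Sketch under assumptions: tr and uniq -c replace the unseen grep/sed/wc steps.
# word_freq_merged is a hypothetical name, not part of the question or answer.
word_freq_merged() {
    tr -cs '[:alpha:]' '\n' |            # put each word on its own line
    grep -v '^$' |                       # drop an empty first line, if any
    tr '[:upper:]' '[:lower:]' |         # fold case so "The" and "the" match
    sort | uniq -c |                     # "count word" pairs, not sorted by count
    awk '{cnt[$1] = cnt[$1] ? cnt[$1] OFS $2 : $2}
         END {for (e in cnt) print e, cnt[e]}' |
    sort -nr                             # the only sort by frequency
}

printf 'the cat and the dog and the bird\n' | word_freq_merged
```

uniq -c pads its counts with leading spaces, which the default awk field splitting ignores, so no extra cleanup is needed before the merge step.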
If you want the digits to be right justified (as you have in your example):
$ awk '{cnt[$1]=cnt[$1] ? cnt[$1] OFS $2 : $2}
END {for (e in cnt) printf "%3s %s\n", e, cnt[e]} '
If you want gawk to do the sorting itself, in descending numeric order of the counts (the array indices), you can set PROCINFO["sorted_in"]="@ind_num_desc" before traversing the array:
$ gawk '{cnt[$1]=cnt[$1] ? cnt[$1] OFS $2 : $2}
END {PROCINFO["sorted_in"]="@ind_num_desc"
for (e in cnt) printf "%3s %s\n", e, cnt[e]} '