Merging word counts with Bash and Unix


Question

I made a Bash script that extracts words from a text file with grep and sed, sorts them with sort, counts the repetitions with wc, and then sorts again by frequency. The example output looks like this:

12 the
 7 code
 7 with
 7 add
 5 quite
 3 do
 3 well
 1 quick
 1 can
 1 pick
 1 easy

Now I'd like to merge all words with the same frequency into one line, like this:

12 the
 7 code with add
 5 quite
 3 do well
 1 quick can pick easy

Is there any way to do that with Bash and the standard Unix toolset? Or would I have to write a script or program in a more sophisticated scripting language?

Recommended answer

Use awk:

$ echo "12 the
 7 code
 7 with
 7 add
 5 quite
 3 do
 3 well
 1 quick
 1 can
 1 pick
 1 easy" | awk '{cnt[$1]=cnt[$1] ? cnt[$1] OFS $2 : $2} END {for (e in cnt) print e, cnt[e]} ' | sort -nr
12 the
7 code with add
5 quite
3 do well
1 quick can pick easy

You can do something similar with Bash 4 associative arrays, but awk is easier and is specified by POSIX, so use that.
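For comparison, here is a rough sketch of what the Bash 4 associative-array route might look like. It is not part of the original answer; the function name merge_counts is made up for illustration, and it assumes Bash 4 or later and input lines of the form "count word".

```bash
#!/usr/bin/env bash
# Sketch only: group words by count with a Bash 4+ associative array.
# merge_counts is a hypothetical helper name, not from the original answer.
merge_counts() {
    local -A words=()      # maps count -> space-separated list of words
    local count word
    # "|| [[ -n $count ]]" also handles a last line without a trailing newline
    while read -r count word || [[ -n $count ]]; do
        [[ -n $count && -n $word ]] || continue   # skip blank or incomplete lines
        if [[ -n ${words[$count]+set} ]]; then
            words[$count]+=" $word"               # count seen before: append
        else
            words[$count]=$word                   # first word for this count
        fi
    done
    # Bash does not define the iteration order of the keys, so sort afterwards
    for count in "${!words[@]}"; do
        printf '%s %s\n' "$count" "${words[$count]}"
    done | sort -nr
}

printf '%s\n' ' 7 code' '12 the' ' 7 with' ' 1 quick' | merge_counts
```

It needs noticeably more code than the awk one-liner for the same result, which fits the advice to prefer awk.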

Explanation:

  1. awk splits each line into fields using the separator in FS, in this case the default of horizontal whitespace;
  2. $1 is the first field, the count - use it to collect items with the same count in an associative array keyed by the count, cnt[$1];
  3. cnt[$1]=cnt[$1] ? cnt[$1] OFS $2 : $2 is a ternary assignment - if cnt[$1] has no value, just assign the second field $2 to it (the right-hand side of the :). If it does have a previous value, append $2 to it, separated by the value of OFS (the left-hand side of the :);
  4. At the end, print out each key of the associative array together with its value.
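The steps above can also be written with an explicit if/else, which some readers may find easier to follow than the ternary. This is a sketch rather than the answer's own code. The wrapper name group_words is made up for illustration, and the NF >= 2 guard and the ($1 in cnt) membership test are additions. The membership test checks whether the key exists instead of testing the truth value of cnt[$1], so a stored word such as 0 should not be mistaken for "no value".

```sh
# Sketch only: the same grouping logic with an explicit if/else.
# group_words is a hypothetical wrapper name.
group_words() {
    awk 'NF >= 2 {                      # ignore blank or incomplete lines
        if ($1 in cnt)                  # count seen before: append the word
            cnt[$1] = cnt[$1] OFS $2
        else                            # first word for this count
            cnt[$1] = $2
    }
    END { for (e in cnt) print e, cnt[e] }' "$@" | sort -nr
}

printf '%s\n' '3 do' '3 well' '5 quite' | group_words
```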

Since awk associative arrays are unordered, you need to sort the output again by the numeric value of the first column. gawk can sort internally, but it is just as easy to call sort. The input to awk does not need to be sorted, so you can drop the sort-by-frequency step that currently ends your pipeline.

If you want the digits to be right justified (as you have in your example):

$ awk '{cnt[$1]=cnt[$1] ? cnt[$1] OFS $2 : $2} 
     END {for (e in cnt) printf "%3s %s\n", e, cnt[e]} '


If you want gawk to sort numerically by descending values, you can add PROCINFO["sorted_in"]="@ind_num_desc" prior to traversing the array:

$ gawk '{cnt[$1]=cnt[$1] ? cnt[$1] OFS $2 : $2} 
            END {PROCINFO["sorted_in"]="@ind_num_desc"
               for (e in cnt) printf "%3s %s\n", e, cnt[e]} '
