How to count number of duplicate rows and populate a new tab separated column


Problem description

I have the following dataset:

1   22469318    22469539
1   22469318    22469539
16  222825  223026
16  222825  223026
16  222825  223026
16  222825  223026
16  222825  223026
16  222825  223026
16  222825  223026
16  222825  223026
16  222825  223026
16  222825  223026
16  222825  223026
16  222825  223026
16  222825  223026
16  222825  223026
16  222825  223026
16  222825  223026
16  222825  223026
16  222825  223026
16  222825  223026
16  222825  223026
16  222825  223026
17  79989511    79989692
19  18717331    18717680
19  18717331    18717680
2   131355420   131355575
2   131355420   131355575
22  51135971    51136163

I would like to create a data frame with no duplicated lines, plus a value indicating how often each line occurred. For the above dataset I used:

sort file | uniq -c > new_file

and got the following output:

      2 1   22469318    22469539
     21 16  222825  223026
      1 17  79989511    79989692
      2 19  18717331    18717680
      2 2   131355420   131355575
      4 22  51135971    51136163
      1 5   70240464    70240600
      4 9   140513423   140513521
     22 X   153792513   153793281

But I would like the count in a separate column at the end of each line, separated by a tab, for example:

1       22469318        22469539        2
16      222825          223026          21
17      79989511        79989692        1
19      18717331        18717680        2
2       131355420       131355575       2
22      51135971        51136163        4
5       70240464        70240600        1
9       140513423       140513521       4
X       153792513       153793281       22

Recommended answer

With your shown samples, please try the following awk code.

awk 'BEGIN{FS=OFS="\t"} {arr[$0]++} END{for(i in arr){print i,arr[i]}}' Input_file

Explanation: a detailed walk-through of the above code. The line is printed first and its count last, joined by OFS (a tab), to match the requested layout. The for (i in arr) loop visits keys in an unspecified order, so pipe the output through sort if the order matters.

awk '               ##Start of the awk program.
BEGIN{              ##BEGIN block, runs before any input is read.
  FS=OFS="\t"       ##Set the input and output field separators to a tab.
}
{
  arr[$0]++         ##Use the whole line as the array key and increment its count.
}
END{                ##END block, runs after all input has been read.
  for(i in arr){    ##Loop over every distinct line stored in arr.
    print i,arr[i]  ##Print the line, then OFS (a tab), then its count.
  }
}
' Input_file        ##Name of the input file.
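
If you would rather keep the sort | uniq -c pipeline from the question, a small awk step can move the count to the end of each line. The following is a minimal sketch and not part of the original answer: count_dups is a hypothetical helper name, and it assumes uniq -c writes the count (possibly padded with leading spaces) followed by a single space and then the line.

```shell
#!/bin/sh
# count_dups (hypothetical helper name): read lines on stdin, print each
# distinct line followed by a tab and the number of times it occurred.
count_dups() {
  # LC_ALL=C gives a byte-wise, locale-independent sort order.
  LC_ALL=C sort | uniq -c | awk '{
    count = $1              # uniq -c puts the count in the first field
    sub(/^ *[0-9]+ /, "")   # strip the padding, the count and one separator space
    print $0 "\t" count     # remaining original line, a tab, then the count
  }'
}

# Usage with a few tab-separated rows resembling the sample data.
printf '1\t22469318\t22469539\n1\t22469318\t22469539\n17\t79989511\t79989692\n' | count_dups
```

Because sort runs before uniq, the result comes out in sorted order, whereas the pure awk version makes no promise about order.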
