Print duplicate count without removing duplicates in Terminal


Problem Description

I am new to working with the Terminal on Mac and have a large .tsv file that consists of a list of items with two values next to each. I would like to print the number of duplicates next to the first occurrence of each item, without removing any other data.

I am aware of cut -f 1 | sort | uniq -c but this removes a lot of valuable data I would like to keep for analysis. I'm reading about awk and grep but I think I could use a little help.

This is an example of the file I'm trying to process:

fruit   number  reference
apple   12  342
apple   13  345
apple   43  4772
banana  19  234
banana  73  3242
peach   131 53423
peach   234 3266
peach   242 324
peach   131 56758
peaches 29  2434

Ideally, the output would look something like this:

fruit   number  reference   fruit_count
apple   12  342 3
apple   13  345 
apple   43  4772    
banana  19  234 2
banana  73  3242    
peach   131 53423   4
peach   234 3266    
peach   242 324 
peach   131 56758   
peaches 29  2434    1

Is something like this even possible? I can get the desired output in Excel using formulas, but the file is too large and keeps crashing on me. Any help would be appreciated.

Adding my current solution (which does not meet my requirements):

cut -f 1 fruitsample.txt | sort | uniq -c | sed -e 's/ *//' -e "s/ /$(printf '\t')/"

This gives me the intended counts, replacing uniq -c's standard count-plus-space output with a tab character, but it also sorts the header row in with the data and removes the second and third columns.
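On a small sample (recreated here as a hypothetical fruitsample.txt) the problem is easy to see. This is a sketch of the same pipeline; the tab for the second sed expression is produced with printf, since the \t escape is not portable to the BSD sed that ships with macOS:

```shell
# Sketch: reproduce the partial pipeline on a 3-row sample.
# Note how the header word "fruit" gets sorted in with the data
# and the number/reference columns are lost.
cd "$(mktemp -d)"
printf 'fruit\tnumber\treference\napple\t12\t342\napple\t13\t345\nbanana\t19\t234\n' > fruitsample.txt

TAB=$(printf '\t')
cut -f 1 fruitsample.txt | sort | uniq -c | sed -e 's/ *//' -e "s/ /$TAB/"
```

The first sed expression strips uniq -c's leading padding; the second turns the remaining space before the key into a tab.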

In Excel, I can use the formula =IF(COUNTIF(A$2:A2,A2)=1,COUNTIF(A:A,A2),"") and fill it down. The file I'm working with is nearly 680K rows of data, and Excel chokes trying to calculate that many rows.

As I mentioned, I am a beginner looking for guidance. I'm just not as familiar with awk or grep. Thanks again!

Recommended Answer

awk to the rescue!

awk 'NR==FNR {a[$1]++; next} 
     FNR==1  {print $0, "fruit_count"; next} 
     $1 in a {$(NF+1)=a[$1]; delete a[$1]}1' file{,} | 
column -t

fruit    number  reference  fruit_count
apple    12      342        3
apple    13      345
apple    43      4772
banana   19      234        2
banana   73      3242
peach    131     53423      4
peach    234     3266
peach    242     324
peach    131     56758
peaches  29      2434       1
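The two-pass approach above reads the ~680K-row file twice (that is what file{,} does). When the rows are already grouped by fruit, as in the sample, a single pass is enough: buffer each group and flush it with the group size appended to its first line. This is a sketch, not part of the original answer, and fruitsample.tsv is an assumed filename:

```shell
# Single-pass variant (sketch): assumes rows are grouped by column 1.
cd "$(mktemp -d)"
printf 'fruit\tnumber\treference\napple\t12\t342\napple\t13\t345\napple\t43\t4772\nbanana\t19\t234\nbanana\t73\t3242\npeach\t131\t53423\npeach\t234\t3266\npeach\t242\t324\npeach\t131\t56758\npeaches\t29\t2434\n' > fruitsample.tsv

awk -F'\t' -v OFS='\t' '
  NR == 1    { print $0, "fruit_count"; next } # keep the header on top
  $1 != prev { flush() }                       # new fruit: emit buffered group
             { buf[++n] = $0; prev = $1 }      # buffer the current line
  END        { flush() }
  function flush(i) {
    if (n) {
      print buf[1], n                          # count next to first occurrence
      for (i = 2; i <= n; i++) print buf[i]    # remaining lines unchanged
    }
    n = 0
  }' fruitsample.tsv | tee with_counts.tsv
```

The trade-off is memory proportional to the largest group rather than to the whole key set, and no second read of the file.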

For an explanation of the main idea, I'll use a simpler structure without a header and with unsorted data:

$ cat file
apple
banana
apple
apple
cherry
banana

$ awk 'NR==FNR {a[$1]++; next}         # first pass: save per-key counts
       $1 in a {$(NF+1)=a[$1];         # key still in map: append count as last column
                delete a[$1]}          # then drop the key, so later dups get no count
       1                               # print every line (modified or not)
      ' file{,} |                      # bash brace-expansion shorthand for: file file
  column -t                            # pretty-print columns


apple   3
banana  2
apple
apple
cherry  1
banana

For the simplified example, you can achieve the same with the Unix toolchain:

join -a1 -11 -22 -o1.2,2.1 <(cat -n file) <(cat -n file | sort -k2 | uniq -c -f1)
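To unpack that one-liner, here is a sketch using the same six-line file (the process substitutions need bash or zsh): cat -n tags each line with its original line number, sort -k2 | uniq -c -f1 counts duplicates while keeping the line number of each key's first occurrence, and join stitches the counts back onto the numbered original, with -a1 keeping the uncounted lines:

```shell
# Breakdown of the join pipeline on a scratch copy of the sample.
cd "$(mktemp -d)"
printf 'apple\nbanana\napple\napple\ncherry\nbanana\n' > file

# Right-hand input: count per key, first line number, key
cat -n file | sort -k2 | uniq -c -f1
# roughly:  3   1  apple   (3 occurrences, first seen on line 1)

# Join on the line number to restore original order;
# -o picks "word, count" as the output columns.
join -a1 -11 -22 -o1.2,2.1 <(cat -n file) <(cat -n file | sort -k2 | uniq -c -f1)
```

Joining on the line number is what restores the file's original order after the sort needed by uniq.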

Adding the header requires more juggling; that's where awk shines.
