Print duplicate count without removing duplicates in Terminal
Question
I am new to working with the Terminal on mac and have a large .tsv file that consists of a list of items, and two values next to it. I would like to be able to print the number of duplicates next to the first occurrence of the item Without removing additional data.
I am aware of cut -f 1 | sort | uniq -c but this removes a lot of valuable data I would like to keep for analysis. I'm reading about awk and grep but I think I could use a little help.
This is an example of the file I'm trying to process:
fruit number reference
apple 12 342
apple 13 345
apple 43 4772
banana 19 234
banana 73 3242
peach 131 53423
peach 234 3266
peach 242 324
peach 131 56758
peaches 29 2434
Ideally, the output would look something like this:
fruit number reference fruit_count
apple 12 342 3
apple 13 345
apple 43 4772
banana 19 234 2
banana 73 3242
peach 131 53423 4
peach 234 3266
peach 242 324
peach 131 56758
peaches 29 2434 1
Is something like this even possible? I can get the desired output in Excel using formulas, but the file is too large and keeps crashing on me. Any help would be appreciated.
Adding my current solution (which does not meet my requirements):
cut -f 1 fruitsample.txt | sort | uniq -c | sed -e 's/ *//' -e $'s/ /\t/'
This gives me the intended counts, replacing the standard count + space output from uniq -c with a tab character, but it also sorts the header row and removes the second and third columns.
On Excel, I can use the formula =IF(COUNTIF(A$2:A2,A2)=1,COUNTIF(A:A,A2),"")
and fill it down. The file I'm working with is nearly 680K rows of data, and Excel chokes trying to calculate that many rows.
As I mentioned, I am a beginner looking for guidance. I'm just not as familiar with awk or grep. Thanks again!
Answer
awk to the rescue!
awk 'NR==FNR {a[$1]++; next}
FNR==1 {print $0, "fruit_count"; next}
$1 in a {$(NF+1)=a[$1]; delete a[$1]}1' file{,} |
column -t
fruit number reference fruit_count
apple 12 342 3
apple 13 345
apple 43 4772
banana 19 234 2
banana 73 3242
peach 131 53423 4
peach 234 3266
peach 242 324
peach 131 56758
peaches 29 2434 1
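Since the real file is nearly 680K rows, reading it twice may be a concern. A single-pass variant (a sketch, not part of the original answer) buffers the lines in memory and trades RAM for the second read; `fruits.txt` here is a small stand-in for the real .tsv:

```shell
# Single-pass sketch: count keys while buffering every line, then replay
# the buffer in original order, attaching the count to each key's first
# occurrence.
printf '%s\n' 'fruit number reference' \
              'apple 12 342' \
              'apple 13 345' \
              'banana 19 234' > fruits.txt

awk 'FNR==1 {print $0, "fruit_count"; next}   # pass the header through
     {cnt[$1]++; line[FNR]=$0; key[FNR]=$1}   # count keys, buffer lines
     END {for (i=2; i<=FNR; i++)              # replay in original order
            if (key[i] in seen) print line[i]
            else {seen[key[i]]; print line[i], cnt[key[i]]}}' fruits.txt
```

The two-pass `file{,}` version above stays simpler; this variant only helps when reading the input twice is genuinely expensive.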
For an explanation of the main idea, I'll use a simpler structure with no header and unsorted data:
$ cat file
apple
banana
apple
apple
cherry
banana
$ awk 'NR==FNR {a[$1]++; next}     # in the first pass, save key counts
       $1 in a {$(NF+1)=a[$1];     # if the key is in the map, add the count as a last column
                delete a[$1]}      # remove the key from the map
       1                           # print
      ' file{,} |                  # bash shorthand for: file file
  column -t                        # pretty print columns
apple 3
banana 2
apple
apple
cherry 1
banana
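The `file{,}` at the end is ordinary bash brace expansion, so awk is really handed the same file twice (once to count, once to print):

```shell
# Brace expansion happens before awk ever runs; the shell duplicates the word.
echo file{,}       # prints: file file
echo fruits.tsv{,} # prints: fruits.tsv fruits.tsv
```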
For the simplified example, you can achieve the same with the unix toolchain:
join -a1 -11 -22 -o1.2,2.1 <(cat -n file) <(cat -n file | sort -k2 | uniq -c -f1)
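A stage-by-stage view of that pipeline may help (a sketch on the same six-line sample; the alignment of the intermediate outputs is approximate):

```shell
# Recreate the simplified sample from above.
printf '%s\n' apple banana apple apple cherry banana > file

# Stage 1: number every line so the original order can be recovered later.
cat -n file            # e.g. "1 apple", "2 banana", ...

# Stage 2: sort by the key (field 2) and count each group; uniq -f1 skips
# the line number when comparing, so each group keeps its count and the
# line number of its first occurrence, e.g. "3 1 apple".
cat -n file | sort -k2 | uniq -c -f1

# Stage 3: match file1's line number (join field 1) against field 2 of the
# counted list; -o1.2,2.1 emits the key and the count, and -a1 keeps
# unmatched lines so later duplicates print with no count.
join -a1 -11 -22 -o1.2,2.1 <(cat -n file) <(cat -n file | sort -k2 | uniq -c -f1)
```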
Adding the header would require more juggling; this is where awk shines.