Command line to sum frequency in concatenated file


Problem description

I need to sum the frequency column of several large tab-separated files. An example of the content in a file is:

Blue    table   3
Blue    chair   2
Big     cat     1
Small   cat     2

After concatenating the files, the trouble is the following:

Column 2 is essentially a frequency count of the number of times the combination of Column 0 and Column 1 was seen together.

I need to add up the Column 2 frequencies of all identical combinations in the concatenated file.

For instance: If in File A the contents are as follows:

Blue    table   3
Blue    chair   2
Big     cat     1
Small   cat     2

and in File B the contents are as follows:

Blue    table   3
Blue    chair   2
Big     cat     1
Small   cat     2

the contents in the concatenated File C are as follows:

Blue    table   3
Blue    chair   2
Big     cat     1
Small   cat     2
Blue    table   3
Blue    chair   2
Big     cat     1
Small   cat     2

I want to sum the frequencies for each identical combination of Column 0 and Column 1 and write the following result to a File D:

Blue    table   6
Blue    chair   4
Big     cat     2
Small   cat     4

I tried to sort and count the info with the following command:

 sort <input_file> | uniq -c <output_file>

but the result is the following:

  2 Big     cat     1
  2 Blue    chair   2
  2 Blue    table   3
  2 Small   cat     2

Can anyone suggest a terminal command that produces my desired result?

Thank you in advance for any help.

Solution

You're close; you have all the numbers you need. The total for each row is the count of rows that you got from uniq (column 1) times the frequency count (column 4). You can calculate that with awk:

sort input.txt | uniq -c | awk '{ print $2 "\t" $3 "\t" $1*$4 }'
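
Note that the pipeline above depends on duplicate lines being identical, including the count: if a combination has 3 in one file and 5 in another, uniq treats them as different lines and they come out as two separate rows. As a rough sketch of an alternative (not part of the original answer), awk can sum the third column per combination directly. This assumes tab-separated input with exactly three columns; the file names a.tsv and b.tsv and the sample data are made up for illustration.

```shell
# Hypothetical sample files standing in for File A and File B (tab-separated).
# "Blue table" deliberately has a different count in each file.
tmpdir=$(mktemp -d)
printf 'Blue\ttable\t3\nBlue\tchair\t2\nBig\tcat\t1\nSmall\tcat\t2\n' > "$tmpdir/a.tsv"
printf 'Blue\ttable\t5\nBlue\tchair\t2\nBig\tcat\t1\nSmall\tcat\t2\n' > "$tmpdir/b.tsv"

# Sum column 3 for each (column 1, column 2) pair across all input files.
# awk's for-in iteration order is unspecified, so the output is piped
# through sort to make it stable.
result=$(awk -F'\t' -v OFS='\t' '
    { sum[$1 OFS $2] += $3 }
    END { for (key in sum) print key, sum[key] }
' "$tmpdir/a.tsv" "$tmpdir/b.tsv" | sort)

printf '%s\n' "$result"
rm -rf "$tmpdir"
```

Because awk reads the files itself, no separate concatenation or sort step is needed before summing, and "Blue table" is reported once with a total of 8.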
