Command line to sum frequency in concatenated file
Problem description
I need to sum the frequency column of several large tab-separated files. An example of the content of one file:
Blue table 3
Blue chair 2
Big cat 1
Small cat 2
After concatenating the files, the trouble is the following:
Column 2 is essentially a frequency count of the number of times the combination of Column 0 and Column 1 was seen together.
I need to add the frequency of all of the identical combinations in Column 2 of the concatenated file.
For instance: If in File A the contents are as follows:
Blue table 3
Blue chair 2
Big cat 1
Small cat 2
and in File B the contents are as follows:
Blue table 3
Blue chair 2
Big cat 1
Small cat 2
the contents in the concatenated File C are as follows:
Blue table 3
Blue chair 2
Big cat 1
Small cat 2
Blue table 3
Blue chair 2
Big cat 1
Small cat 2
I want to sum the frequencies of all identical combos in Column 0 and Column 1 in a File D to get the following results:
Blue table 6
Blue chair 4
Big cat 2
Small cat 4
I tried to sort and count the info with the following command:
sort <input_file> | uniq -c <output_file>
but the result is the following:
2 Big cat 1
2 Blue chair 2
2 Blue table 3
2 Small cat 2
Does anyone have a suggestion of a terminal command that can produce my desired results?
Thank you in advance for any help.
Solution
You're close; you have all the numbers you need. The total for each row is the number of identical lines reported by uniq -c (column 1) times the frequency count (column 4). You can calculate that with awk:
sort input.txt | uniq -c | awk ' { print $2 "\t" $3 "\t" $1*$4 } '
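Note that the pipeline above multiplies, so it gives one total per combination only when every occurrence of that combination carries the same count. If the counts may differ between files, one possible alternative is to let awk add up the third field per combination directly. The following is a minimal sketch under stated assumptions: the file names file_a.tsv, file_b.tsv and file_d.tsv are hypothetical, and the count of 5 for "Blue table" in the second file is invented purely to show differing counts being summed.

```shell
# Hypothetical sample files standing in for File A and File B (tab-separated).
tmp=$(mktemp -d)
printf 'Blue\ttable\t3\nBlue\tchair\t2\nBig\tcat\t1\nSmall\tcat\t2\n' > "$tmp/file_a.tsv"
# Assumption for illustration: "Blue table" has a different count here (5, not 3).
printf 'Blue\ttable\t5\nBlue\tchair\t2\nBig\tcat\t1\nSmall\tcat\t2\n' > "$tmp/file_b.tsv"

# Concatenate, add up field 3 for every (field 1, field 2) pair,
# then sort because awk's "for (k in sum)" visits keys in unspecified order.
cat "$tmp/file_a.tsv" "$tmp/file_b.tsv" |
  awk -F'\t' -v OFS='\t' '{ sum[$1 OFS $2] += $3 } END { for (k in sum) print k, sum[k] }' |
  LC_ALL=C sort > "$tmp/file_d.tsv"

cat "$tmp/file_d.tsv"
```

Because awk keeps only one running total per combination, this does not need the input to be sorted first, which may matter for large files.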