Deduplicate a two-column file based on the minimum value in column 2 (AWK / BASH)
Question
I have a file that looks like this (tab-delimited, 2 fields):
denovo0 90.2
denovo1 97.7
denovo1 97.7
denovo1 96.9
denovo10 93.8
denovo10 92.2
denovo10 91.5
denovo100 95.3
denovo100 95.3
denovo100 94.6
I would like to keep one line per unique string in the first field, namely the line with the lowest value in the second column, to get:
denovo0 90.2
denovo1 96.9
denovo10 91.5
denovo100 94.6
As can be seen in the example above, some rows in the file may be complete duplicates of other rows; I am not sure how that would affect solutions.
I have looked up similar solutions on StackOverflow, e.g. "Uniq in awk; removing duplicate values in a column using awk", but was not able to adapt them.
I would be happy if someone could help.
I'd prefer using AWK, but BASH would also be an option. I am working with Mac OS X Yosemite.
I would be really happy if someone could help out.
Thank you and kind regards,
Paul
Recommended answer
To keep the first value listed for each key, you can use:
awk '{if (!($1 in a)) a[$1] = $2} END { for (key in a) print key, a[key] }'
Output:
denovo0 90.2
denovo1 97.7
denovo10 93.8
denovo100 95.3
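Note that awk does not specify the order in which `for (key in a)` visits keys, so the lines above may come out in a different order on another awk. If input order matters, a common idiom for "first line per key" is sketched below. The function name `first_per_key` is a hypothetical helper, not part of the original answer.

```shell
# first_per_key: hypothetical helper name, not part of the original answer.
# Reads two-column input on stdin and prints the first line seen for each
# key in column 1, preserving input order.
first_per_key() {
    # seen[$1]++ evaluates to 0 (false) the first time a key appears, so the
    # negation is true only for that line, and the default action prints it.
    awk '!seen[$1]++'
}

# Example with a few rows from the question
printf 'denovo1\t97.7\ndenovo1\t96.9\ndenovo10\t93.8\ndenovo10\t92.2\n' | first_per_key
```

Because no output separator is involved, each kept line is printed exactly as it was read, tabs included.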
For the results described (minimum value in column 2 for each key in column 1), you can use:
awk '{ if (!($1 in a)) a[$1] = $2; else if (a[$1] > $2) a[$1] = $2 }
END { for (key in a) print key, a[key] }'
Output:
denovo0 90.2
denovo1 96.9
denovo10 91.5
denovo100 94.6
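The minimum-value command prints with a space between the fields and in an unspecified key order. A slightly more defensive variant, assuming the input really is tab-delimited and the output should be too, might look like the sketch below. The name `min_per_key` is hypothetical, and the `+ 0` coercion and the final `sort` are additions, not part of the original answer.

```shell
# min_per_key: hypothetical helper name, not part of the original answer.
# Reads tab-delimited "key<TAB>value" lines on stdin and prints the smallest
# value per key, tab-separated and sorted by line in byte order.
min_per_key() {
    awk -F'\t' -v OFS='\t' '
        NF < 2 { next }   # skip blank or malformed lines
        # Adding 0 forces a numeric comparison on both sides.
        !($1 in min) || $2 + 0 < min[$1] + 0 { min[$1] = $2 }
        END { for (key in min) print key, min[key] }
    ' | LC_ALL=C sort
}

# Example with rows from the question
printf 'denovo1\t97.7\ndenovo1\t97.7\ndenovo1\t96.9\ndenovo10\t93.8\ndenovo10\t91.5\n' | min_per_key
```

Fully duplicated rows need no special handling here: a repeated value is never smaller than the stored minimum, so it is simply ignored.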
You can also find the maximum value for each key; in the sample data, the maximum happens to be the first value listed for each key, so the output matches that of the first command.
awk '{ if (!($1 in a)) a[$1] = $2; else if (a[$1] < $2) a[$1] = $2 }
END { for (key in a) print key, a[key] }'
Output:
denovo0 90.2
denovo1 97.7
denovo10 93.8
denovo100 95.3
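Since the question mentions that a shell solution would also be acceptable, the minimum per key can alternatively be obtained by sorting first and then keeping the first line for each key. This is a sketch of a different technique from the array-based commands above, assuming tab-delimited input; the name `min_per_key_sorted` is hypothetical.

```shell
# min_per_key_sorted: hypothetical helper name, not from the original answer.
# Reads tab-delimited "key<TAB>value" lines on stdin.
min_per_key_sorted() {
    # Sort by key (field 1), then numerically ascending by value (field 2),
    # so the smallest value for each key comes first. awk then keeps only
    # the first line per key. LC_ALL=C makes sort treat "." as the decimal
    # point regardless of the user locale.
    LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k2,2n | awk -F'\t' '!seen[$1]++'
}

# Example with rows from the question
printf 'denovo10\t93.8\ndenovo10\t91.5\ndenovo1\t97.7\ndenovo1\t96.9\n' | min_per_key_sorted
```

Changing `-k2,2n` to `-k2,2nr` would keep the maximum per key instead. The trade-off against the single awk pass is the cost of sorting the whole file, which may matter for very large inputs.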