Deduplicate a two-column file based on the minimum value in column 2 (AWK / BASH)


Question

I have a tab-delimited file with two fields that looks like this:

    denovo0  90.2
    denovo1  97.7
    denovo1  97.7
    denovo1  96.9
    denovo10     93.8
    denovo10     92.2
    denovo10     91.5
    denovo100    95.3
    denovo100    95.3
    denovo100    94.6

I would like to keep one line per unique string in the first field, namely the one with the lowest value in the second column, to get:

    denovo0  90.2
    denovo1  96.9
    denovo10     91.5
    denovo100    94.6

As can be seen in the example above, some rows in the file may be complete duplicates of other rows; I am not sure how that would influence solutions.

I have looked up similar solutions on StackOverflow, e.g. "Uniq in awk; removing duplicate values in a column using awk", but was not able to adapt them.

I would be happy if someone could help.

I'd prefer using AWK, but BASH would also be an option. I am working on Mac OS X Yosemite.

Thanks and kind regards,

Paul

Recommended answer

To keep the first value seen for each key, you can use:

    awk '{if (!($1 in a)) a[$1] = $2} END { for (key in a) print key, a[key] }'

Output:

    denovo0 90.2
    denovo1 97.7
    denovo10 93.8
    denovo100 95.3
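The same first-per-key result can also be sketched with the common `!seen[$1]++` awk idiom, which prints each line as soon as its key is first encountered and therefore keeps the input order and the original tab-separated lines. This is an illustrative variant rather than part of the original answer; the temporary file and the names `input`, `seen`, and `first_per_key` are assumptions made for the example.

```shell
# Sample data from the question, written to a temporary file (tab-delimited).
input=$(mktemp)
printf '%s\t%s\n' \
    denovo0 90.2 \
    denovo1 97.7 denovo1 97.7 denovo1 96.9 \
    denovo10 93.8 denovo10 92.2 denovo10 91.5 \
    denovo100 95.3 denovo100 95.3 denovo100 94.6 > "$input"

# seen[$1]++ evaluates to 0 (false) the first time a key appears, so the
# negated pattern is true only for the first line of each key, and the
# default action prints that line unchanged.
first_per_key=$(awk -F'\t' '!seen[$1]++' "$input")

printf '%s\n' "$first_per_key"
rm -f "$input"
```

Because lines are printed as they are read, nothing depends on the traversal order of an awk array here.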

For the results described (minimum value in column 2 for each key in column 1), you can use:

    awk '{ if (!($1 in a)) a[$1] = $2; else if (a[$1] > $2) a[$1] = $2 }
         END { for (key in a) print key, a[key] }'

Output:

    denovo0 90.2
    denovo1 96.9
    denovo10 91.5
    denovo100 94.6
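Two details of the minimum-per-key command may matter on real data: awk does not specify the order in which `for (key in a)` visits keys, and `print key, a[key]` joins the fields with a space rather than a tab. The following is a minimal sketch of one way to handle both, assuming tab-delimited input; the array names `min` and `order`, the counter `n`, and the temporary file are hypothetical names chosen for the example, not part of the original answer.

```shell
# Sample data from the question, written to a temporary file (tab-delimited).
input=$(mktemp)
printf '%s\t%s\n' \
    denovo0 90.2 \
    denovo1 97.7 denovo1 97.7 denovo1 96.9 \
    denovo10 93.8 denovo10 92.2 denovo10 91.5 \
    denovo100 95.3 denovo100 95.3 denovo100 94.6 > "$input"

min_per_key=$(awk -F'\t' -v OFS='\t' '
    {
        if (!($1 in min)) {
            order[++n] = $1          # remember keys in first-seen order
            min[$1] = $2             # keep the original text of the value
        } else if ($2 + 0 < min[$1] + 0) {
            min[$1] = $2             # "+ 0" forces a numeric comparison
        }
    }
    END {
        for (i = 1; i <= n; i++)     # print keys in first-seen order
            print order[i], min[order[i]]
    }
' "$input")

printf '%s\n' "$min_per_key"
rm -f "$input"
```

Complete duplicate rows need no special handling in this sketch: a repeated value is never smaller than the stored minimum, so it is simply ignored.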

You can also look for the maximum value for each key; on the sample data this gives the same output as the first command, because the maximum value happens to be the first for each key.

    awk '{ if (!($1 in a)) a[$1] = $2; else if (a[$1] < $2) a[$1] = $2 }
         END { for (key in a) print key, a[key] }'

Output:

    denovo0 90.2
    denovo1 97.7
    denovo10 93.8
    denovo100 95.3
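Since the question also allows a shell solution, the minimum per key could alternatively be obtained by sorting first and then keeping the first line of each key. This is a sketch of a different technique from the one in the answer above, assuming tab-delimited input and a `sort` that supports `-t` and `-k` (as the POSIX utility does); the names `input`, `tab`, and `sorted_min` are hypothetical.

```shell
# Sample data from the question, written to a temporary file (tab-delimited).
input=$(mktemp)
printf '%s\t%s\n' \
    denovo0 90.2 \
    denovo1 97.7 denovo1 97.7 denovo1 96.9 \
    denovo10 93.8 denovo10 92.2 denovo10 91.5 \
    denovo100 95.3 denovo100 95.3 denovo100 94.6 > "$input"

tab=$(printf '\t')   # literal tab character to use as the field separator

# Sort by key (field 1), then numerically ascending by value (field 2), so
# the smallest value of each key comes first; awk then keeps only the first
# line per key. LC_ALL=C makes sort treat "." as the decimal point.
sorted_min=$(LC_ALL=C sort -t "$tab" -k1,1 -k2,2n "$input" |
    awk -F'\t' '!seen[$1]++')

printf '%s\n' "$sorted_min"
rm -f "$input"
```

Unlike the array-based commands, this variant emits keys in sorted order rather than input order, and it needs to sort the whole file, which may be slower on very large inputs.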
