Filtering file according to the highest value in a column of each line


Problem description

I have the following file:

gene.100079.0.5.p3  transcript:OIS96097 82.2    169 30  0   1   169 4   172 1.3e-75 283.1   84.9
gene.100079.0.3.p1  transcript:OIS96097 82.2    169 30  0   1   169 4   172 1.3e-75 283.1   84.9
gene.100079.0.0.p1  transcript:OIS96097 82.2    169 30  0   1   169 4   172 1.3e-75 283.1   86.7
gene.100080.0.3.p1  transcript:OIS96097 82.2    169 30  0   1   169 4   172 1.3e-75 283.1   99.9
gene.100080.0.0.p1  transcript:OIS96097 82.2    169 30  0   1   169 4   172 1.3e-75 283.1   99.9
chr11_pilon3.g3568.t1   transcript:OIS96097 82.2    169 30  0   1   169 4   172 1.3e-75 283.1   74.9
chr11_pilon3.g3568.t2   transcript:OIS96097 82.2    169 30  0   1   169 4   172 1.3e-75 283.1   76.7

The above file has some IDs that are similar:

gene.100079.0.5.p3
gene.100079.0.3.p1
gene.100079.0.0.p1

If only gene.100079 is kept, the IDs become identical. I would like to filter the above file in the following way:

  • chr11_pilon3.g3568.t1 = 74.9 and chr11_pilon3.g3568.t2 = 76.7. chr11_pilon3.g3568.t2 has the highest value, so it should be in the output.
  • gene.100079.0.0.p1 = 86.7, while gene.100079.0.5.p3 = 84.9 and gene.100079.0.3.p1 = 84.9. gene.100079.0.0.p1 has the highest value, so it should be in the output.
  • gene.100080.0.3.p1 = 99.9 and gene.100080.0.0.p1 = 99.9. Both IDs have the same value, so both should be in the output.

However, this awk script from @RavinderSingh13 and @anubhava returns the wrong results:

awk '{
   if (/^gene\./) {
      split($1, a, /\./)
      k = a[1] "." a[2]
    }
    else
       k = $1
}
!(k in max) || $13 >= max[k] {
   if(!(k in max))
      ord[++n] = k
   else if (max[k] == $13) {
      print
      next
   }
   max[k] = $13
   rec[k] = $0
}
END {
   for (i=1; i<=n; ++i)
      print rec[ord[i]]
}' file

The above script produces the wrong output:

chr11_pilon3.g3568.t1   transcript:OIS96097 82.2    169 30  0   1   169 4   172 1.3e-75 283.1   74.9
chr11_pilon3.g3568.t2   transcript:OIS96097 82.2    169 30  0   1   169 4   172 1.3e-75 283.1   76.7
gene.100079.0.0.p1  transcript:OIS96097 82.2    169 30  0   1   169 4   172 1.3e-75 283.1   86.7
gene.100079.0.3.p1  transcript:OIS96097 82.2    169 30  0   1   169 4   172 1.3e-75 283.1   84.9
gene.100080.0.0.p1  transcript:OIS96097 82.2    169 30  0   1   169 4   172 1.3e-75 283.1   99.9
gene.100080.0.3.p1  transcript:OIS96097 82.2    169 30  0   1   169 4   172 1.3e-75 283.1   99.9

As output, I would like to get:

chr11_pilon3.g3568.t2   transcript:OIS96097 82.2    169 30  0   1   169 4   172 1.3e-75 283.1   76.7
gene.100079.0.0.p1  transcript:OIS96097 82.2    169 30  0   1   169 4   172 1.3e-75 283.1   86.7
gene.100080.0.3.p1  transcript:OIS96097 82.2    169 30  0   1   169 4   172 1.3e-75 283.1   99.9
gene.100080.0.0.p1  transcript:OIS96097 82.2    169 30  0   1   169 4   172 1.3e-75 283.1   99.9

I also tried to fix it as shown below, but it didn't work:

awk '{
   if (/^gene\./) {
      split($1, a, /\./)
      k = a[1] "." a[2]
    }
    else
       k = $1
}
!(k in max) || $13 > max[k] {
   max[k]=$13; 
   line[k]=$0
}
END {
   for(i in line) 
      print line[i]
}'

Thank you in advance.

Recommended answer

This seems to work correctly, assuming that the data is ordered so that all the lines with the same first two name components are grouped together in the data file. The order of the lines within a group doesn't matter.

#!/bin/sh

awk '
    function dump_memo()
    {
        if (memo_num > 0)
        {
            for (i = 0; i < memo_num; i++)
                print memo_line[i]
        }
    }
    {
        split($1, a, ".")
        key = a[1] "." a[2]
        val = $NF
        # print "# " key " = " val " (memo_key = " memo_key ", memo_val = " memo_val ")"
        if (memo_key == key)
        {
            if (memo_val == val)
            {
                memo_line[memo_num++] = $0
            }
            else if (memo_val < val)
            {
                memo_val = val
                memo_num = 0
                memo_line[memo_num++] = $0
            }
        }
        else
        {
            dump_memo()
            memo_num = 0
            memo_line[memo_num++] = $0
            memo_key = key
            memo_val = val
        }
    }
    END { dump_memo() }' "$@"

When run on the data file shown in the question, the output is:

gene.100079.0.0.p1  transcript:OIS96097 82.2    169 30  0   1   169 4   172 1.3e-75 283.1   86.7
gene.100080.0.3.p1  transcript:OIS96097 82.2    169 30  0   1   169 4   172 1.3e-75 283.1   99.9
gene.100080.0.0.p1  transcript:OIS96097 82.2    169 30  0   1   169 4   172 1.3e-75 283.1   99.9
chr11_pilon3.g3568.t2   transcript:OIS96097 82.2    169 30  0   1   169 4   172 1.3e-75 283.1   76.7

The main difference between this and what you requested is the sort order. If you need the data in sorted order, pipe the output of the script through sort.
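
The script above relies on lines with the same key being adjacent in the file. If that assumption might not hold, one possible alternative is a two-pass approach: read the file once to find the maximum of the last column per key, then read it again and print every line that matches its key's maximum. The following is only a sketch, not part of the original answer. It assumes the input is a regular file that can be read twice (not a pipe), and the names filter_max and key_of are made up for illustration.

```shell
#!/bin/sh
# Two-pass sketch: the same file is passed to awk twice, so the input
# does not need to be grouped by key. Names here are illustrative only.
filter_max() {
    awk '
        # Build the grouping key from the first two dot-separated parts of the ID.
        function key_of(id,    parts) {
            split(id, parts, ".")
            return parts[1] "." parts[2]
        }
        # Pass 1 (NR == FNR only while reading the first copy of the file):
        # remember the largest last-column value seen for each key.
        NR == FNR {
            k = key_of($1)
            v = $NF + 0
            if (!(k in max) || v > max[k])
                max[k] = v
            next
        }
        # Pass 2: print every line whose value equals the maximum of its key,
        # so ties are all kept.
        ($NF + 0) == max[key_of($1)]
    ' "$1" "$1"
}
```

Calling filter_max file prints the matching lines in their original input order. As with the script above, the result can be piped through sort if a sorted order is needed.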
