Filtering file according to the highest value in a column of each line
Question
I have the following file:
gene.100079.0.5.p3 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 84.9
gene.100079.0.3.p1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 84.9
gene.100079.0.0.p1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 86.7
gene.100080.0.3.p1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 99.9
gene.100080.0.0.p1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 99.9
chr11_pilon3.g3568.t1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 74.9
chr11_pilon3.g3568.t2 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 76.7
The above file has some IDs that are similar:
gene.100079.0.5.p3
gene.100079.0.3.p1
gene.100079.0.0.p1
If only the gene.100079 part is kept, the IDs become identical. I would like to filter the above file in the following way:
- chr11_pilon3.g3568.t1 = 74.9 && chr11_pilon3.g3568.t2 = 76.7. chr11_pilon3.g3568.t2 has the highest value and therefore it should be in the output.
- gene.100079.0.0.p1 = 86.7 && gene.100079.0.5.p3 = 84.9 == gene.100079.0.3.p1 = 84.9. gene.100079.0.0.p1 has the highest value and therefore it should be in the output.
- gene.100080.0.3.p1 = 99.9 == gene.100080.0.0.p1 = 99.9. Both IDs have the same value and therefore both should be in the output.
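To make the grouping concrete, here is a minimal sketch of how a group key could be derived from each ID by keeping only its first two dot-separated components. The function name group_key is hypothetical and not part of the original post, and applying the split to every ID (not only those starting with gene.) is an assumption.

```shell
# group_key: hypothetical helper name, used only for illustration.
group_key() {
    awk '{
        split($1, a, ".")              # split the ID on literal dots
        print $1, "->", a[1] "." a[2]  # the first two components form the key
    }'
}

# Show the key derived for a few of the IDs from the file above.
printf '%s\n' gene.100079.0.5.p3 gene.100079.0.0.p1 chr11_pilon3.g3568.t2 | group_key
```

IDs that map to the same key belong to the same group, and only the lines with the highest last-column value in each group should survive.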
However, this awk script from @RavinderSingh13 and @anubhava returns the wrong results:
awk '{
   if (/^gene\./) {
      split($1, a, /\./)
      k = a[1] "." a[2]
   }
   else
      k = $1
}
!(k in max) || $13 >= max[k] {
   if (!(k in max))
      ord[++n] = k
   else if (max[k] == $13) {
      print
      next
   }
   max[k] = $13
   rec[k] = $0
}
END {
   for (i=1; i<=n; ++i)
      print rec[ord[i]]
}' file
The script above produces the wrong output:
chr11_pilon3.g3568.t1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 74.9
chr11_pilon3.g3568.t2 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 76.7
gene.100079.0.0.p1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 86.7
gene.100079.0.3.p1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 84.9
gene.100080.0.0.p1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 99.9
gene.100080.0.3.p1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 99.9
As output, I would like to get:
chr11_pilon3.g3568.t2 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 76.7
gene.100079.0.0.p1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 86.7
gene.100080.0.3.p1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 99.9
gene.100080.0.0.p1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 99.9
I also tried to fix it as shown below, but it didn't work:
awk '{
   if (/^gene\./) {
      split($1, a, /\./)
      k = a[1] "." a[2]
   }
   else
      k = $1
}
!(k in max) || $13 > max[k] {
   max[k]=$13;
   line[k]=$0
}
END {
   for(i in line)
      print line[i]
}'
Thank you in advance.
Recommended answer
This seems to work correctly, assuming that the data is ordered so that all lines sharing the same first two name components are grouped together in the data file. The order of the lines within a group doesn't matter.
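#!/bin/sh
awk '
# Print every line remembered for the current group.
function dump_memo()
{
    if (memo_num > 0)
    {
        for (i = 0; i < memo_num; i++)
            print memo_line[i]
    }
}
{
    # Group key: the first two dot-separated components of the ID.
    split($1, a, ".")
    key = a[1] "." a[2]
    val = $NF
    # print "# " key " = " val " (memo_key = " memo_key ", memo_val = " memo_val ")"
    if (memo_key == key)
    {
        if (memo_val == val)
        {
            # Tie with the current maximum: keep this line as well.
            memo_line[memo_num++] = $0
        }
        else if (memo_val < val)
        {
            # New maximum: discard the remembered lines and start over.
            memo_val = val
            memo_num = 0
            memo_line[memo_num++] = $0
        }
    }
    else
    {
        # A new group begins: flush the previous group, then reset.
        dump_memo()
        memo_num = 0
        memo_line[memo_num++] = $0
        memo_key = key
        memo_val = val
    }
}
END { dump_memo() }' "$@"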
When run on the data file shown in the question, the output is:
gene.100079.0.0.p1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 86.7
gene.100080.0.3.p1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 99.9
gene.100080.0.0.p1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 99.9
chr11_pilon3.g3568.t2 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 76.7
The main difference between this and what you requested is the order of the output lines. If you need the data in sorted order, pipe the output of the script through sort.
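If the input cannot be assumed to be grouped, one possible alternative (a sketch, not the method used in the answer above) is to read the file twice: the first pass records the maximum of the last column per key, and the second pass prints every line that matches its group's maximum. The names filter_max and key are hypothetical, and the sketch assumes the input is a regular file that can be read twice and that the value of interest is in the last column.

```shell
# filter_max: hypothetical helper name; expects one regular file as its argument.
filter_max() {
    awk '
    # Build the group key from the first two dot-separated components of an ID.
    function key(id,    parts) {
        split(id, parts, ".")
        return parts[1] "." parts[2]
    }
    # Pass 1: record the largest last-column value seen for each key.
    NR == FNR {
        k = key($1)
        if (!(k in max) || $NF + 0 > max[k])
            max[k] = $NF + 0
        next
    }
    # Pass 2: print every line whose value equals its group maximum.
    $NF + 0 == max[key($1)]
    ' "$1" "$1"
}
```

Called as filter_max file, this keeps the lines in their original input order and retains ties; piping the result through sort, as suggested above, gives sorted output.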