awk/bash: remove lines with a unique ID and keep the lines that have the max/min value in a column under the same ID
Problem description
Given the input below, I would like to do two things. First, if a cpd_number ($2) occurs only once in the file, remove that whole row. In this case, the line with "cpd-6666666" should be removed.
Second, if several lines share the same cpd_number, print only the two lines that have the max and min log_ratio ($17).
targetID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
49,cpd-7788990,1212,2323, IC50 ,,100,uM,1334,1331,Ki,,10,uM,,10,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,5555,6666, IC50 ,>,150,uM,1334,1331,Ki,,10,uM,>,15,-1.176091259,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,8888,9999, IC50 ,,200,uM,1334,1331,Ki,,10,uM,,20,-1.301029996,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-6666666,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.602059991,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
The desired output is:
targetID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
49,cpd-7788990,1212,2323, IC50 ,,100,uM,1334,1331,Ki,,10,uM,,10,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,8888,9999, IC50 ,,200,uM,1334,1331,Ki,,10,uM,,20,-1.301029996,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
I tried counting occurrences in awk, but it didn't work well. Could any guru kindly offer some suggestions? Thanks!
While not as concise as the perl answer, here's an executable awk file:
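#!/usr/bin/awk -f
BEGIN { FS = "," }
NR == 1 { print; next }    # print the header row unchanged
{
    a[$2, $17] = $0        # whole line, keyed by ID and log_ratio
    h = high[$2]
    # "+ 0" forces a numeric comparison; the stored value keeps its original text
    high[$2] = (h == "" || $17 + 0 > h + 0) ? $17 : h
    l = low[$2]
    low[$2] = (l == "" || $17 + 0 < l + 0) ? $17 : l
}
END {
    for (i in high) {
        # an ID seen only once has identical high and low, so it is skipped
        if (low[i] != high[i]) {
            print a[i, high[i]]
            print a[i, low[i]]
        }
    }
}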
which:
- Prints the header row.
- Stores whole lines in a, along with the high and low for each key.
- In the END block, walks the high array and, where the high and low differ, prints the matching lines retrieved from a.
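One edge case worth noting: because the script tells unique IDs apart by comparing the high and low values, a group whose rows all share the same log_ratio is dropped as well. Below is a minimal sketch of a variant that counts rows per ID instead, which is closer to what the question attempted. The function name filter_minmax is hypothetical, and the column positions ($2 and $17) are taken from the question. It also prints groups in first-seen order rather than the unspecified order of for (i in high).

```shell
#!/usr/bin/env bash
# Sketch only: filter_minmax is a hypothetical helper name.
# Drops IDs that occur once, then prints the max and min log_ratio rows per ID.
filter_minmax() {
  awk -F',' '
    NR == 1 { print; next }                  # header row passes through
    {
      id = $2; v = $17 + 0                   # force a numeric log_ratio
      if (!(id in count)) order[++n] = id    # remember first-seen order of IDs
      count[id]++
      # ">" keeps the first row among equal maxima
      if (count[id] == 1 || v > hi[id])  { hi[id] = v; hiline[id] = $0 }
      # "<=" keeps the last row among equal minima, so ties yield two distinct rows
      if (count[id] == 1 || v <= lo[id]) { lo[id] = v; loline[id] = $0 }
    }
    END {
      for (k = 1; k <= n; k++) {
        id = order[k]
        if (count[id] < 2) continue          # ID seen only once: drop the row
        print hiline[id]
        print loline[id]
      }
    }
  ' "$@"
}
```

It can be called as filter_minmax input.csv, where input.csv is a placeholder file name, or fed from a pipe. Storing the whole line next to each running max and min avoids the second lookup through a, at the cost of keeping two lines per ID in memory.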