awk/bash: remove lines with a unique ID and keep the lines that have the max/min value in a column under the same ID

Problem description

We have the following input. First, if a cpd_number ($2) occurs only once in the file, the whole row should be removed. In this case, the line with "cpd-6666666" should be removed.
Second, if multiple lines share the same cpd_number, only the two lines with the max and min log_ratio ($17) should be printed.

targetID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
49,cpd-7788990,1212,2323, IC50 ,,100,uM,1334,1331,Ki,,10,uM,,10,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,5555,6666, IC50 ,>,150,uM,1334,1331,Ki,,10,uM,>,15,-1.176091259,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,8888,9999, IC50 ,,200,uM,1334,1331,Ki,,10,uM,,20,-1.301029996,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-6666666,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.602059991,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme

The ideal output should be

targetID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
49,cpd-7788990,1212,2323, IC50 ,,100,uM,1334,1331,Ki,,10,uM,,10,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,8888,9999, IC50 ,,200,uM,1334,1331,Ki,,10,uM,,20,-1.301029996,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme

I tried using a counter in awk, but it doesn't seem to work well. Could anyone offer some advice? Thanks!

Solution

While not as concise as the perl answer, here's an executable awk file:

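#!/usr/bin/awk -f

BEGIN { FS = "," }

# Print the header row unchanged.
NR == 1 { print; next }

{
  # Store the whole line, keyed by cpd_number ($2) and log_ratio ($17).
  a[$2,$17] = $0

  # Track the highest log_ratio seen so far for this cpd_number.
  h = high[$2]
  high[$2] = ($17 > h || h == "") ? $17 : h

  # Track the lowest log_ratio seen so far for this cpd_number.
  l = low[$2]
  low[$2] = ($17 < l || l == "") ? $17 : l
}

END {
  for (i in high) {
    # A cpd_number that occurs only once has high == low, so it is skipped.
    if (low[i] != high[i]) {
      print a[i,high[i]]
      print a[i,low[i]]
    }
  }
}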

which:

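  • Prints the header row
  • Stores whole lines in a, and the high and low log_ratio for each cpd_number in high and low
  • In the END block, walks the high array and, for each key whose high and low differ, prints the two matching lines retrieved from a. A cpd_number that appears only once has identical high and low values, so its line is never printed. Note that awk does not guarantee the order in which for (i in high) visits the keys.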
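
To see the steps above end to end, here is a minimal sketch that runs the same logic inline against the sample data from the question. The variable names `input`, `output`, and `row` are chosen only for this illustration, and the explicit `in` test with `+ 0` is a stylistic substitute for the answer's ternary expression, not a change in approach.

```shell
#!/bin/sh
# Hypothetical temporary file holding the sample input from the question.
input=$(mktemp)

cat > "$input" <<'EOF'
targetID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
49,cpd-7788990,1212,2323, IC50 ,,100,uM,1334,1331,Ki,,10,uM,,10,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,5555,6666, IC50 ,>,150,uM,1334,1331,Ki,,10,uM,>,15,-1.176091259,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,8888,9999, IC50 ,,200,uM,1334,1331,Ki,,10,uM,,20,-1.301029996,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-6666666,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.602059991,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
EOF

output=$(awk -F, '
NR == 1 { print; next }                  # header row passes through
{
  row[$2, $17] = $0                      # remember the full line
  # "+ 0" forces a numeric comparison of the log_ratio values
  if (!($2 in high) || $17 + 0 > high[$2] + 0) high[$2] = $17
  if (!($2 in low)  || $17 + 0 < low[$2] + 0)  low[$2]  = $17
}
END {
  for (id in high)
    if (low[id] != high[id]) {           # skip IDs seen with only one value
      print row[id, high[id]]
      print row[id, low[id]]
    }
}' "$input")

printf '%s\n' "$output"
rm -f "$input"
```

With this sample, only cpd-7788990 has more than one row, so the output is the header followed by that ID's max row and then its min row. With several qualifying IDs, the order of the groups would depend on the awk implementation.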
