awk: separate rows based on $2 and $17 and average $17


Problem description

We have an input here:

cpdID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
49,cpd-7788990,1212,2323, IC50 ,,100,uM,1334,1331,Ki,,10,uM,,10,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,5555,6666, IC50 ,>,150,uM,1334,1331,Ki,,10,uM,>,15,-2,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,8888,9999, IC50 ,,200,uM,1334,1331,Ki,,10,uM,,20,-3,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-6666666,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.2,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.3,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme

We would like to split this input.csv into two files:

If rows share the same $2 and the max minus min of $17 is <= 1, average $17 and write the result to file "a".

If rows share the same $2 and the max minus min of $17 is > 1, average $17 and write the result to file "b".

Note: if a $2 value occurs only once, we want to keep that row as well (cpd-6666666 is an example).

Note: for cpd-1111, ($17 max - min) = -1 - (-1.3) = 0.3 < 1.

a: where ($17 max - min) <= 1. The new $17 for cpd-1111 ($2) is the average of (-1, -1.1, -1.2, -1.3) = -1.15.

cpdID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
49,cpd-6666666,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.15,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme

b: where ($17 max - min) > 1. The new $17 for cpd-7788990 ($2) is the average of (-1, -2, -3) = -2.

cpdID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
49,cpd-7788990,1212,2323, IC50 ,,100,uM,1334,1331,Ki,,10,uM,,10,-2,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
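The two worked figures in the question (range 0.3 and mean -1.15 for cpd-1111, range 2 and mean -2 for cpd-7788990) can be double-checked with a few lines of awk. This is only a sketch: the shell function name `range_and_mean` is made up for the example, and it reads one $17 value per line rather than the full CSV.

```sh
# range_and_mean: read one number per line on stdin and print "<max-min> <mean>".
# (Hypothetical helper name, not part of the original post.)
range_and_mean() {
  awk '
    NR == 1 { min = max = $1 }                  # seed min/max with the first value
    {
      if ($1 < min) min = $1
      if ($1 > max) max = $1
      sum += $1
    }
    END { if (NR) printf "%g %g\n", max - min, sum / NR }
  '
}

printf '%s\n' -1 -1.1 -1.2 -1.3 | range_and_mean   # cpd-1111: prints 0.3 -1.15
printf '%s\n' -1 -2 -3 | range_and_mean            # cpd-7788990: prints 2 -2
```

The first number decides the target file (<= 1 goes to a, > 1 goes to b) and the second is the value that replaces $17.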

Here is my attempt, which separates the input into a and b but does not compute the average yet.

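#!/usr/bin/awk -f

BEGIN { FS=","; f1="a"; f2="b" }

# copy the header line to both output files
FNR==1 { print $0 > f1; print $0 > f2; next }

# $2 changed: flush the rows collected for the previous group
$2!=last_id && FNR > 2 { handleBlock() }

# collect the current row and its $17 value
{ a[++cnt]=$0; m[cnt]=$17; last_id=$2 }

END { handleBlock() }

function handleBlock() {
  # first row minus last row of the group; this equals max-min only
  # when each group is sorted by descending $17, as in the sample input
  if( m[1]-m[cnt]<=1 ) fname = f1
  else fname = f2
  for( i=1;i<=cnt;i++ ) { print a[i] > fname }
  cnt=0
}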

Is there any way to compute the average in a and b? Thanks.

Solution

You can get the averages in the output files by altering handleBlock() as follows:

function handleBlock() {
  if( m[1]-m[cnt]<=1 ) fname = f1
  else fname = f2
  # compute the sum of the $17 fields for the group
  for( i=1;i<=cnt;i++ ) { sum+=m[i] }
  # compute the average
  avg = cnt > 0 ? sum/cnt : sum
  # use the first line of the group (the max line in the sample data)
  # for the output, split into an output array: oarr
  fcnt = split( a[1], oarr )
  # modify the 17th field of the output array
  oarr[17]=avg
  # write the updated array to the desired file one field at a time
  for( i=1;i<=fcnt;i++ ) {
    printf( "%s%s", oarr[i], i==fcnt ? "\n" : FS ) > fname
  }
  # reset the per-group state before the next group starts
  cnt=0; sum=0
}

Check here for comments on the original script.
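One caveat about handleBlock() above: the test m[1]-m[cnt] compares the first and last rows of a group, so it matches "max minus min" only when each group is already sorted by descending $17, as the sample happens to be. The following is a sketch of a variant that tracks the real minimum and maximum while reading, so row order within a group does not matter. It still assumes that rows with the same $2 are adjacent. The wrapper name `split_by_range`, its argument order, and the variables `first`, `min` and `max` are made up for this example and are not part of the original answer.

```sh
# split_by_range INPUT FILE_A FILE_B   (hypothetical wrapper for this sketch)
# Writes one averaged row per $2 group to FILE_A (range <= 1) or FILE_B (range > 1).
split_by_range() {
  awk -v f1="$2" -v f2="$3" '
    BEGIN { FS = "," }
    FNR == 1 { print > f1; print > f2; next }   # copy the header to both files
    cnt && $2 != last_id { handleBlock() }      # $2 changed: flush previous group
    {
      if (!cnt++) { first = $0; min = max = $17 + 0 }   # first row of a new group
      if ($17 + 0 < min) min = $17 + 0
      if ($17 + 0 > max) max = $17 + 0
      sum += $17
      last_id = $2
    }
    END { if (cnt) handleBlock() }

    # The extra parameters are local variables (the usual awk idiom).
    function handleBlock(   fname, n, i, out) {
      fname = (max - min <= 1) ? f1 : f2
      n = split(first, out, FS)                 # reuse the first row of the group
      out[17] = sum / cnt                       # replace $17 with the group mean
      for (i = 1; i <= n; i++)
        printf "%s%s", out[i], (i == n ? "\n" : FS) > fname
      cnt = 0; sum = 0
    }
  ' "$1"
}

# Usage with the sample data from the question, written to a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/input.csv" <<'EOF'
cpdID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
49,cpd-7788990,1212,2323, IC50 ,,100,uM,1334,1331,Ki,,10,uM,,10,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,5555,6666, IC50 ,>,150,uM,1334,1331,Ki,,10,uM,>,15,-2,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,8888,9999, IC50 ,,200,uM,1334,1331,Ki,,10,uM,,20,-3,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-6666666,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.2,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.3,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
EOF
split_by_range "$workdir/input.csv" "$workdir/a" "$workdir/b"
cat "$workdir/a" "$workdir/b"
```

Keeping only a running sum, minimum and maximum per group means the rows themselves no longer need to be stored in an array, at the cost of always emitting the first row of the group as the template for the averaged line.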
