awk: separate rows if "$2 is the same and max minus min <= 1" and "$2 is the same and max minus min > 1"


Problem description

Suppose we have an input file, input.csv:

cpdID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
49,cpd-7788990,1212,2323, IC50 ,,100,uM,1334,1331,Ki,,10,uM,,10,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,5555,6666, IC50 ,>,150,uM,1334,1331,Ki,,10,uM,>,15,-2,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,8888,9999, IC50 ,,200,uM,1334,1331,Ki,,10,uM,,20,-3,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-6666666,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.2,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.3,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme

We would like to split input.csv into two files, so that the next step ("average the rows that share the same $2 when the max minus min of $17 is <= 1") can be run on the first one.

If rows share the same $2 and the max minus min of $17 is <= 1, put them into the first file.

  • Note: if a $2 value occurs only once, we would like to keep it here as well (cpd-6666666, for example).

  • Note: for cpd-1111, max minus min of $17 = -1 - (-1.3) = 0.3 < 1.

outputfile1.csv

cpdID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
49,cpd-6666666,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.2,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.3,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme

If rows share the same $2 and the max minus min of $17 is > 1, put them into another file.

outfile2.csv (where max minus min of $17 = -1 - (-3) = 2 > 1)

cpdID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
49,cpd-7788990,1212,2323, IC50 ,,100,uM,1334,1331,Ki,,10,uM,,10,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,5555,6666, IC50 ,>,150,uM,1334,1331,Ki,,10,uM,>,15,-2,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,8888,9999, IC50 ,,200,uM,1334,1331,Ki,,10,uM,,20,-3,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme

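Here is my attempt, modified from the answer to the following question:

awk/bash remove lines with an unique id and keep the lines that has the max/min value in a column under the same ID (http://stackoverflow.com/questions/24543486/awk-bash-remove-lines-with-an-unique-id-and-keep-the-lines-that-has-the-max-min)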

#!/usr/bin/awk -f

BEGIN { FS="," }

NR==1 {print; next}

{
  a[$2,$17]=$0

  h=high[$2]
  high[$2]=$17>h || h=="" ? $17 : h

  m=mid[$2]
  mid[$2]=l<$17<h || m=="" ? $17 : m

  l=low[$2]
  low[$2]=$17<l || l=="" ? $17 : l
}

END {
  for(i in high) {
    if(high[i]-low[i]<=1) {
      print a[i,high[i]]
      print a[i,mid[i]]
      print a[i,low[i]]
    }
  }
}

Output:

cpdID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.3,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-6666666,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-6666666,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-6666666,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme

For reasons I don't understand, this script does not print the middle-range value(s) correctly. Does anyone have comments or a solution?

Solution

Take a look at this, which is an example of processing each group as its id changes:

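#!/usr/bin/awk -f

# Comma-separated input; f1 and f2 are the two output file names.
BEGIN {FS=","; f1="a"; f2="b"}

# Copy the header line to both output files.
FNR==1 { print $0 > f1; print $0 > f2; next }

# The id changed (and this is not the first data row): flush the previous group.
$2!=last_id && FNR > 2 { handleBlock() }

# Buffer the current row and its $17 value; remember the current id.
{ a[++cnt]=$0; m[cnt]=$17; last_id=$2 }

# Flush the final group at end of input.
END { handleBlock() }

# Assumes the group's max is its first row and its min is its last row.
function handleBlock() {
  if( m[1]-m[cnt]<=1 ) fname = f1
  else fname = f2
  for( i=1;i<=cnt;i++ ) { print a[i] > fname }
  cnt=0
}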

This is an executable awk file. Put it into a file called awko, run chmod +x awko, and it can then be run as awko data for an input file called "data" (or ./awko data if the current directory is not on your PATH).

The script I wrote for the other question assumed that the input order of the file's rows was unknown: the $2 fields could appear in any order, and only the min and max values mattered. In this question, the OP would like to send all rows for a given $2 value to one file or the other, based on the min/max values.

The input file for this question has the following properties, which this script depends on:

  • The header is on the first line
  • The $2 fields are grouped
  • The max value is the first element of the group
  • The min value is the last value of the group
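
If the last two properties cannot be guaranteed, one option is a two-pass variant that computes the actual minimum and maximum per $2 instead of relying on row order. The following is only a minimal sketch under stated assumptions, not the answer's script: the sample file is a reduced, hypothetical three-column stand-in with deliberately unsorted rows, the output names within.csv and beyond.csv are made up for illustration, and the key and value columns are passed in as k and v (they would be 2 and 17 for the full input.csv).

```sh
#!/bin/sh
# Hypothetical reduced sample: cpdID, cpd_number, log_ratio (rows unsorted on purpose).
cat > sample.csv <<'EOF'
cpdID,cpd_number,log_ratio
49,cpd-1111,-1.2
49,cpd-7788990,-3
49,cpd-6666666,-1
49,cpd-1111,-1
49,cpd-7788990,-1
49,cpd-1111,-1.3
49,cpd-7788990,-2
49,cpd-1111,-1.1
EOF

# The same file is read twice: pass 1 collects min/max per key,
# pass 2 routes every row. k = key column, v = value column.
awk -F, -v k=2 -v v=3 -v f1=within.csv -v f2=beyond.csv '
NR == FNR {                        # pass 1 (first reading of the file)
    if (FNR == 1) next             # skip the header
    x = $v + 0                     # force a numeric value
    if (!($k in max) || x > max[$k]) max[$k] = x
    if (!($k in min) || x < min[$k]) min[$k] = x
    next
}
FNR == 1 { print > f1; print > f2; next }        # pass 2: header to both files
{ print > ((max[$k] - min[$k] <= 1) ? f1 : f2) } # route the row by its group range
' sample.csv sample.csv
```

Because the range is known before any row is written, rows keep their input order in each output file and the groups do not need to be contiguous. The cost is reading the input twice.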

When a resource list is sorted by resource id, a common way to minimize loading is to load the data only when the resource id changes. The same idea works for processing grouped entries here. Take an example like:

a
a
a
b <- this is a good place to process all the prior "a" entries
b
c <- process "b" entries here
c
EOF <- the end of the file.  process the last group ( the "c" entries here )
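
The pattern in this example can be tried in isolation. Below is a minimal sketch that just counts the entries in each group; the function name handleGroup and the output file counts.txt are hypothetical names chosen for the illustration.

```sh
#!/bin/sh
# Feed the a/a/a/b/b/c/c example, one id per line, and count each group.
printf '%s\n' a a a b b c c |
awk '
NR > 1 && $1 != last { handleGroup() }   # id changed: process the previous group
{ n++; last = $1 }                       # accumulate the current line
END { if (n) handleGroup() }             # end of input: process the last group
function handleGroup() { print last, n; n = 0 }
' > counts.txt
cat counts.txt
```

Each group is reported exactly once: either when the next id shows up, or in the END block for the final group.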

With that in mind, here's a breakdown of the script:

  • Set FS and the output file names in the BEGIN block ("a" and "b" for my testing).
  • The first line is the header: put it in each file, f1 and f2.
  • If $2 != last_id (and FNR > 2, so that the first data row does not trigger it), call the handleBlock() function to process the group collected so far.
  • Store the whole line in array a and $17 in array m, and set last_id=$2 (the array names are horrible).
  • The cnt variable indicates how many entries are in each group (what I called a block).
  • handleBlock() only gets called when the $2 id changes, or in the END block to catch the last group at the end of the file.
  • handleBlock() tests the OP's condition using m (max is m[1] and min is m[cnt]) to determine the output file name, and then prints all elements from a to the chosen file.
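
The question names averaging as the follow-up step for the rows that land in the first file. A minimal sketch of that step, assuming a plain arithmetic mean of the value column per $2 is what is wanted; the reduced three-column sample, the file names and the two-decimal output format are all assumptions made for the illustration (k and v would be 2 and 17 for the real file).

```sh
#!/bin/sh
# Hypothetical reduced stand-in for the first output file: cpdID, cpd_number, log_ratio.
cat > within_sample.csv <<'EOF'
cpdID,cpd_number,log_ratio
49,cpd-6666666,-1
49,cpd-1111,-1
49,cpd-1111,-1.1
49,cpd-1111,-1.2
49,cpd-1111,-1.3
EOF

awk -F, -v k=2 -v v=3 '
FNR == 1 { next }                        # skip the header
{ sum[$k] += $v; n[$k]++ }               # accumulate sum and count per key
END { for (id in sum) printf "%s,%.2f\n", id, sum[id] / n[id] }
' within_sample.csv > averages.csv
cat averages.csv
```

The order in which `for (id in sum)` visits the keys is unspecified in awk, so pipe the result through sort if a stable order matters.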
