AWK: filtering data based on two columns

Question

I am post-processing a multi-column CSV file laid out as follows:

ID, POP, dG
1, 10, -5.6200
2, 4, -5.4900
3, 1, -5.3000
4, 4, -5.1600
5, 4, -4.8800
6, 3, -4.7600
7, 2, -4.4900
8, 5, -4.4500
9, 2, -4.4400
10, 8, -4.1400
11, 1, -4.1200
12, 2, -4.0900
13, 5, -4.0100
14, 1, -3.9500
15, 3, -3.9200
16, 10, -3.8800
17, 1, -3.8700
18, 3, -3.8300
19, 1, -3.8200
20, 3, -3.8000

Previously I used the following AWK solution to process the input log twice, detect pop(MAX), and save the lines that matched $2 > (.8 * max):

awk -F ', ' 'NR == 1 {next} FNR==NR {if (max < $2) {max=$2; n=FNR+1} next} FNR <= 2 || (FNR == n && $2 > (.4*max)) || $2 > (.8 * max)' input.csv{,} > output.csv

That reduced the input log to just the two lines with the highest POP:

ID, POP, dG
1, 10, -5.6200
16, 10, -3.8800

Now I need to change the search algorithm to take both the 2nd (POP) and 3rd (dG) columns into account:
i) always take the first line as the reference, since it always has the most negative number in the 3rd column (dG);
ii) find the line with the biggest number in the second column, pop(MAX);
iii) keep all lines between (i) and (ii) that match the following rules, applied to BOTH columns:
a) the (negative) number in the 3rd column must satisfy $3 <= (.5 * $3(min)), where $3(min) is the dG of the first line (always the most negative);
b) additionally, the line must match the old rule for the second column with a lowered threshold: $2 >= (.5 * max), where max is pop(MAX).
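
To make the two cut-offs concrete, here is a minimal sketch that only computes them, assuming (as stated in the question) that the first data row holds the most negative dG. The function name `thresholds` is made up for illustration, and the input is just three rows taken from the sample table.

```shell
# thresholds: hypothetical helper that reads the CSV on stdin and prints both cut-offs
thresholds() {
  awk -F ', ' '
    NR == 1 { next }                    # skip the header
    NR == 2 { minD = $3 }               # first data row: assumed most negative dG
    $2 + 0 > maxP { maxP = $2 + 0 }     # track max(POP)
    END { printf "POP >= %.2f, dG <= %.4f\n", 0.5 * maxP, 0.5 * minD }
  '
}

printf '%s\n' 'ID, POP, dG' '1, 10, -5.6200' '8, 5, -4.4500' '16, 10, -3.8800' | thresholds
# prints: POP >= 5.00, dG <= -2.8100
```

With max(POP) = 10 and min(dG) = -5.62, a row passes only if its POP is at least 5 and its dG is at most -2.81.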

So the expected output should be:

ID, POP, dG
1, 10, -5.6200   # this is the first line, with the most negative dG
8, 5, -4.4500    # POP (5) and dG (-4.4500) match both rules
10, 8, -4.1400   # POP (8) and dG (-4.1400) match both rules
16, 10, -3.8800  # this is pop(MAX), the line with the highest POP

ADDED 8-04:

Consider the case where the first line has a very low POP (one that does not match the rule $2 >= (.5 * maxPop)):

ID, POP, dG
1, 5, -5.5600
2, 7, -5.3300
3, 7, -5.1900
4, 1, -4.6800
5, 1, -4.5800
6, 5, -4.5600
7, 3, -4.4700
8, 4, -4.4300
9, 9, -4.4200
10, 4, -4.4200
11, 2, -4.3800
12, 4, -4.3400
13, 25, -4.3000
14, 6, -4.2900
15, 8, -4.2600
16, 3, -4.2300
17, 1, -4.1800
18, 3, -4.1300
19, 1, -4.1300
20, 1, -4.1200
21, 27, -4.0800
22, 2, -4.0300

Here the output should not contain the first line either, while its dG value is still used as the reference for the second condition ($3 <= (.5 * minD)), which is applied when selecting the other lines of the output:

13, 25, -4.3000
21, 27, -4.0800
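
One way to get this behaviour is to stop printing the first data row unconditionally and let every row pass through both rules. The following is only a sketch under assumptions: `filter_both` and `sample.csv` are made-up names, the sample holds just a few rows of the table above, the header is still printed, and the "between (i) and (ii)" restriction is not implemented (only the two thresholds are applied).

```shell
# sample.csv is a made-up file name holding a few rows of the second table
cat > sample.csv <<'EOF'
ID, POP, dG
1, 5, -5.5600
2, 7, -5.3300
13, 25, -4.3000
21, 27, -4.0800
22, 2, -4.0300
EOF

# filter_both: hypothetical wrapper; $1 is the CSV path, read twice by awk
filter_both() {
  awk -F ', ' '
    FNR == 1 { if (NR != FNR) print; next }   # header: skip in pass 1, print in pass 2
    NR == FNR {                               # pass 1: collect the reference values
      if ($2 + 0 > maxP) maxP = $2 + 0        # max(POP)
      if (FNR == 2) minD = $3 + 0             # first data row: assumed most negative dG
      next
    }
    $2 >= 0.5 * maxP && $3 <= 0.5 * minD      # pass 2: both rules must hold
  ' "$1" "$1"
}

filter_both sample.csv
```

On this sample the thresholds are POP >= 13.5 and dG <= -2.78, so only the header and rows 13 and 21 are printed. On the first table the first row would still appear, because it satisfies both rules on its own.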

Recommended answer

You may use this awk solution:

awk -F ', ' 'NR == 1 {next} FNR==NR {if (maxP < $2) maxP=$2; if (minD=="" || minD > $3) minD=$3; next} FNR <= 2 || ($2 >= (.5 * maxP) && $3 <= (.5 * minD))' file{,}

Output for the first sample:

ID, POP, dG
1, 10, -5.6200
8, 5, -4.4500
10, 8, -4.1400
13, 5, -4.0100
16, 10, -3.8800

To make it more readable:

awk -F ', ' '
NR == 1 {next}                   # skip 1st record 1st time
FNR == NR {
   if (maxP < $2)                # compute max(POP)
      maxP = $2
   if (minD == "" || minD > $3)  # compute min(dG)
      minD = $3
   next
}
# print if 1st 2 lines OR "$2 >= .5 * max(POP) && $3 <= .5 * min(dG)"
FNR <= 2 || ($2 >= (.5 * maxP) && $3 <= (.5 * minD))
' file{,}
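
A note on the invocation: `file{,}` relies on brace expansion, which shells such as bash expand to `file file`, so awk reads the same file twice. In a shell without brace expansion the file name can simply be written twice. The sketch below, using a made-up file `demo.csv`, shows how `FNR == NR` tells the two passes apart.

```shell
# demo.csv is a made-up file name with a header and one data row
printf '%s\n' 'ID, POP, dG' '1, 10, -5.6200' > demo.csv

# FNR restarts at 1 for each file argument while NR keeps counting,
# so FNR == NR is true only while the first copy is being read.
awk '{ pass = (FNR == NR) ? 1 : 2; print "pass " pass ": " $0 }' demo.csv demo.csv
```

Each line of `demo.csv` is printed twice, first labelled "pass 1" and then "pass 2".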
