AWK: filtering data based on information from two columns

Views: 62  Published: 2021/5/9 20:49:02  Tags: csv, awk

Problem description

I am post-processing a multi-column CSV file in the following format:

ID, POP, dG
1, 10, -5.6200
2, 4, -5.4900
3, 1, -5.3000
4, 4, -5.1600
5, 4, -4.8800
6, 3, -4.7600
7, 2, -4.4900
8, 5, -4.4500
9, 2, -4.4400
10, 8, -4.1400
11, 1, -4.1200
12, 2, -4.0900
13, 5, -4.0100
14, 1, -3.9500
15, 3, -3.9200
16, 10, -3.8800
17, 1, -3.8700
18, 3, -3.8300
19, 1, -3.8200
20, 3, -3.8000

Previously I used the following AWK solution to process the input log twice, detect pop(MAX), and save the lines that matched $2 > (.8 * max):

awk -F ', ' 'NR == 1 {next} FNR==NR {if (max < $2) {max=$2; n=FNR+1} next} FNR <= 2 || (FNR == n && $2 > (.4*max)) || $2 > (.8 * max)' input.csv{,} > output.csv

This reduces the input log, keeping just the two lines with the highest POP:

ID, POP, dG
1, 10, -5.6200
16, 10, -3.8800
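
The command above uses awk's two-pass idiom: in bash, `input.csv{,}` expands to `input.csv input.csv`, so the file is read twice, and `FNR==NR` holds only during the first read. As a minimal sketch of that idiom alone (it leaves out the `FNR == n` clause, and the temporary file with a handful of rows from the sample is only for illustration):

```shell
# Illustrative input: header plus four rows taken from the sample above.
csv=$(mktemp)
cat > "$csv" <<'EOF'
ID, POP, dG
1, 10, -5.6200
2, 4, -5.4900
10, 8, -4.1400
16, 10, -3.8800
EOF

result=$(awk -F ', ' '
    FNR == 1  { if (NR != FNR) print; next }             # header: printed on the second read only
    NR == FNR { if ($2 + 0 > max) max = $2 + 0; next }   # first read: find the largest POP
    $2 > 0.8 * max                                       # second read: keep rows above 80% of it
' "$csv" "$csv")
printf '%s\n' "$result"
rm -f "$csv"
```

With these four rows, only IDs 1 and 16 pass, because a POP of 8 is not strictly greater than 0.8 * 10.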

Now I need to change the search algorithm to take both the 2nd (POP) and 3rd (dG) columns into account:

i) always take the first line as the reference; it always has the most negative number in the 3rd column (dG);
ii) find the line with the largest number in the 2nd column, pop(MAX);
iii) keep all lines between (i) and (ii) that match the following rules, applied to BOTH columns:
   a) the (negative) number in the 3rd column must satisfy $3 <= (.5 * minD), where minD is the dG of the first line (always the most negative);
   b) in addition, the line must match the old rule for the 2nd column with a lower threshold: $2 >= (.5 * max), where max is pop(MAX).

So the expected output should be:

ID, POP, dG
1, 10, -5.6200   # the first line, with the most negative dG
8, 5, -4.4500    # POP (5) and dG (-4.4500) match both rules
10, 8, -4.1400   # POP (8) and dG (-4.1400) match both rules
16, 10, -3.8800  # pop max, the highest POP
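
A literal reading of the rules can be sketched as a two-pass awk filter. This is not an answer from the original thread; it rests on several assumptions: `minD` is the dG of the first data row, `maxPop` is the largest POP, ties for `maxPop` resolve to the last such row (so that "between (i) and (ii)" reaches ID 16), and both thresholds are inclusive. The temporary file stands in for `input.csv`.

```shell
# The first sample from the question, written to a temporary file.
csv=$(mktemp)
cat > "$csv" <<'EOF'
ID, POP, dG
1, 10, -5.6200
2, 4, -5.4900
3, 1, -5.3000
4, 4, -5.1600
5, 4, -4.8800
6, 3, -4.7600
7, 2, -4.4900
8, 5, -4.4500
9, 2, -4.4400
10, 8, -4.1400
11, 1, -4.1200
12, 2, -4.0900
13, 5, -4.0100
14, 1, -3.9500
15, 3, -3.9200
16, 10, -3.8800
17, 1, -3.8700
18, 3, -3.8300
19, 1, -3.8200
20, 3, -3.8000
EOF

result=$(awk -F ', ' '
    FNR == 1 { if (NR != FNR) print; next }    # header: printed on the second read only
    NR == FNR {                                # first read: collect the reference values
        if (FNR == 2) minD = $3                # dG of the first data row
        if ($2 + 0 >= maxPop) { maxPop = $2 + 0; last = FNR }   # last row holding the largest POP
        next
    }
    # second read: rows up to the pop(MAX) row that satisfy both rules
    FNR <= last && $2 >= 0.5 * maxPop && $3 <= 0.5 * minD
' "$csv" "$csv")
printf '%s\n' "$result"
rm -f "$csv"
```

On this sample the sketch keeps IDs 1, 8, 10, 13 and 16. ID 13 (POP 5, dG -4.0100) passes both rules as written even though the expected output above omits it, so the intended rule may be stricter than this reading.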

ADDED 8-04:

For the case where the first line has a very low POP (which does not match the rule $2 >= (.5 * maxPop)):

ID, POP, dG
1, 5, -5.5600
2, 7, -5.3300
3, 7, -5.1900
4, 1, -4.6800
5, 1, -4.5800
6, 5, -4.5600
7, 3, -4.4700
8, 4, -4.4300
9, 9, -4.4200
10, 4, -4.4200
11, 2, -4.3800
12, 4, -4.3400
13, 25, -4.3000
14, 6, -4.2900
15, 8, -4.2600
16, 3, -4.2300
17, 1, -4.1800
18, 3, -4.1300
19, 1, -4.1300
20, 1, -4.1200
21, 27, -4.0800
22, 2, -4.0300

the output should not contain the first line either, while still using its dG value as the reference for the second condition ($3 <= (.5 * minD)), which should be applied when selecting the other lines of the output:

13, 25, -4.3000
21, 27, -4.0800
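
This case needs no special handling if the first row is tested like any other row and only its dG is kept as the reference. The sketch below shows that on the second sample with a single-pass variant that buffers the rows in memory instead of reading the file twice. The array names (`row`, `pop`, `dg`) and the temporary file are illustrative, and the header is left out to match the output listed above.

```shell
# The second sample from the question, written to a temporary file.
csv=$(mktemp)
cat > "$csv" <<'EOF'
ID, POP, dG
1, 5, -5.5600
2, 7, -5.3300
3, 7, -5.1900
4, 1, -4.6800
5, 1, -4.5800
6, 5, -4.5600
7, 3, -4.4700
8, 4, -4.4300
9, 9, -4.4200
10, 4, -4.4200
11, 2, -4.3800
12, 4, -4.3400
13, 25, -4.3000
14, 6, -4.2900
15, 8, -4.2600
16, 3, -4.2300
17, 1, -4.1800
18, 3, -4.1300
19, 1, -4.1300
20, 1, -4.1200
21, 27, -4.0800
22, 2, -4.0300
EOF

result=$(awk -F ', ' '
    NR == 1 { next }                 # skip the header
    NR == 2 { minD = $3 + 0 }        # dG of the first data row is the reference
    {
        row[NR] = $0; pop[NR] = $2 + 0; dg[NR] = $3 + 0
        if (pop[NR] >= maxPop) { maxPop = pop[NR]; last = NR }   # last row with the largest POP
    }
    END {
        # the first data row is tested like any other, so a low POP drops it
        for (i = 2; i <= last; i++)
            if (pop[i] >= 0.5 * maxPop && dg[i] <= 0.5 * minD)
                print row[i]
    }
' "$csv")
printf '%s\n' "$result"
rm -f "$csv"
```

Here maxPop is 27, so the POP threshold is 13.5 and only IDs 13 and 21 remain. Buffering trades memory for a single read, which also lets the filter take its input from a pipe.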
