AWK: simple math operations on multi-column data with subsequent data conversion


Problem description

As part of my Bash routine, I am working with a directory consisting of many folders that follow this naming pattern:

1000_cne_lig1, 1000_cne_lig2, 1000_cne_lig3, 1000_cne_lig4, 1000_cne_lig5  ... 1000_cne_ligN
2000_cne_lig1, 2000_cne_lig2, 2000_cne_lig3, 2000_cne_lig4, 2000_cne_lig5  ... 2000_cne_ligN
3000_cne_lig1, 3000_cne_lig2, 3000_cne_lig3, 3000_cne_lig4, 3000_cne_lig5  ... 3000_cne_ligN
7000_cne_lig1, 7000_cne_lig2, 7000_cne_lig3, 7000_cne_lig4, 7000_cne_lig5  ... 7000_cne_ligN
...
xxxx_cne_lig1, xxxx_cne_lig2, xxxx_cne_lig3, xxxx_cne_lig4, xxxx_cne_lig5  ... xxxx_cne_ligN

Note that all of the folders can be grouped into X categories (1000, 2000 ... xxxx) according to the name of the system, which is defined by the pattern occurring before the first underscore "_" in the name of each folder. Inside each folder there is a CSV file containing data arranged in a multi-line format:

ID, POP, dG
1, 40, -5.7600
2, 2, -5.4000
3, 8, -5.3300
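
The grouping described above depends only on the folder name. As a minimal sketch, assuming every name follows the <system>_cne_<sample> pattern, the two parts can be split with POSIX parameter expansion, without calling external tools (the variable names here are illustrative):

```shell
#!/bin/sh
# Hypothetical folder name following the <system>_cne_<sample> pattern.
folder_name="1000_cne_lig3"

# Strip the longest suffix that starts at the first "_" -> system name.
syst_name=${folder_name%%_*}
# Strip the longest prefix that ends at the last "_" -> sample name.
sample_name=${folder_name##*_}

echo "system=${syst_name} sample=${sample_name}"
```

This avoids two subshells per folder, which may matter when the directory holds many folders.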

I need to loop over the folders belonging to each distinct system (e.g. the 5 folders of 1000, then the 5 folders of 2000, etc.) and detect the CSV file in each of them. Then I need to carry out some simple math on each CSV log for that system: calculate the mean of the negative numbers in the third column of the CSV and save it in a new file named after the system (e.g. 1000.csv), one line per folder, containing the name of the particular folder and the mean value. For example, for system 1000, 1000.csv should be:

# system 1000; dG(mean)
lig1: -5.555
lig2: -6.003
lig3: -3.031
lig4: -3.222
lig5: -10.300
ligN: -NN.NNN

Note that I removed 1000_cne_ from each line (taken from the name of the original folder) and put the system name in the header of the CSV instead.
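
The per-file calculation described above can be sketched on its own before wiring it into the loop. This is only an illustration under stated assumptions: the file name sample.csv and the fourth data row (a positive value, added to show the filter) are not from the question, and fields are assumed to be separated by a comma followed by optional spaces.

```shell
#!/bin/sh
# Build a throwaway CSV in the format shown in the question.
# The file name and the positive value in row 4 are illustrative only.
tmpdir=$(mktemp -d)
cat > "${tmpdir}/sample.csv" <<'EOF'
ID, POP, dG
1, 40, -5.7600
2, 2, -5.4000
3, 8, -5.3300
4, 1, 0.2500
EOF

# Split on a comma plus optional spaces, skip the header (NR > 1),
# and average only the negative values found in column 3.
mean=$(awk -F', *' '
    NR > 1 && $3 < 0 { sum += $3; n++ }
    END {
        if (n) printf "%.3f", sum / n
        else   printf "NA"
    }
' "${tmpdir}/sample.csv")

# One output line in the requested "<sample>: <mean>" layout.
echo "lig1: ${mean}"
rm -rf "${tmpdir}"
```

The `if (n)` guard keeps awk from dividing by zero when a file has no negative values; printing "NA" in that case is an arbitrary choice for this sketch.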

Finally, for X systems, the script should produce X new CSV files (1000.csv, 2000.csv, XXXX.csv, etc.), each containing N lines, one per folder.

Here is the current implementation of the Bash routine. It already classifies the folders; it should be completed with AWK, which will do all the math and write the computed mean values to the new CSV:

#!/bin/bash
home=$PWD
# folder with the folders to analyse
storage="${home}/results"
# folder with the outputs
rescore="${home}/rescore"

# iterate over all folders; folders of one system share the same prefix
for folder in "${storage}"/*; do
    # name of the current folder, e.g. 1000_cne_lig1
    folder_name=$(basename "$folder")
    # name of the system (X): the characters before the first _ >> this is the name of the output CSV
    syst_name=$(echo "$folder_name" | cut -d'_' -f 1)
    # name of the sample (N): the entry after the last _ >> the label of the line in the new CSV
    sample_name=$(echo "$folder_name" | cut -d'_' -f 3)
    # enter the folder itself (the original used the undefined variable $syst)
    pushd "$folder" > /dev/null || continue
    # a simple example of the output format w/o calculations
    echo "${sample_name}: dG(mean)" >> "${rescore}/${syst_name}.csv"
    # apply AWK on each CSV to calculate the mean of the numbers in the third column
    # add the mean for each processed CSV in >> ${rescore}/${syst_name}.csv
    popd > /dev/null
done

Recommended answer

You can use this single awk script, which is also POSIX-compliant:

awk 'FNR==1 {if (n) mean[suffix] = s/n; prefix=suffix=FILENAME; sub(/_.*/, "", prefix); sub(/^.*_/, "", suffix); s=n=0} FNR > 1 {s+=$3; ++n} END {mean[suffix] = s/n; print "# system", prefix, "; dG(mean)"; for (i in mean) print i ":", mean[i]}' 1000_*

# system 1000 ; dG(mean)
lig1: -5.49667
lig2: -6.76333

Expanded form:

awk 'FNR==1 {
   if (n)
      mean[suffix] = s/n
   prefix=suffix=FILENAME
   sub(/_.*/, "", prefix)
   sub(/^.*_/, "", suffix)
   s=n=0
}
FNR > 1 {
   s+=$3
   ++n
}
END {
   mean[suffix] = s/n
   print "# system", prefix, "; dG(mean)"
   for (i in mean)
      print i ":", mean[i]
}' 1000_*
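
The script above handles one system per invocation (1000_*), averages every value in column 3, and prints the samples in whatever order `for (i in mean)` yields, which awk leaves unspecified. The following is a sketch only, not the answer's method: it assumes the layout from the question (results/<system>_cne_<sample>/ with a CSV inside), uses the hypothetical file name data.csv and made-up values, restricts the mean to negative numbers as the question asks, and writes one file per system.

```shell
#!/bin/sh
# Scratch layout; directory names, data.csv and all values are illustrative.
work=$(mktemp -d)
storage="${work}/results"
rescore="${work}/rescore"
mkdir -p "${rescore}"
for d in 1000_cne_lig1 1000_cne_lig2 2000_cne_lig1; do
    mkdir -p "${storage}/${d}"
done
printf 'ID, POP, dG\n1, 40, -5.76\n2, 2, -5.40\n3, 8, -5.33\n' > "${storage}/1000_cne_lig1/data.csv"
printf 'ID, POP, dG\n1, 10, -6.00\n2, 5, -7.00\n' > "${storage}/1000_cne_lig2/data.csv"
printf 'ID, POP, dG\n1, 3, -2.00\n2, 9, 1.00\n' > "${storage}/2000_cne_lig1/data.csv"

# The glob expands in sorted order, so samples are appended in a stable order.
for folder in "${storage}"/*/; do
    folder_name=$(basename "${folder}")
    syst_name=${folder_name%%_*}      # text before the first "_"
    sample_name=${folder_name##*_}    # text after the last "_"
    out="${rescore}/${syst_name}.csv"

    # Write the header once, when the system's file does not exist yet.
    [ -f "${out}" ] || echo "# system ${syst_name}; dG(mean)" > "${out}"

    # Mean of the negative values in column 3 of the folder's CSV file(s);
    # nothing is printed if no negative value is found.
    awk -F', *' -v name="${sample_name}" '
        FNR > 1 && $3 < 0 { sum += $3; n++ }
        END { if (n) printf "%s: %.3f\n", name, sum / n }
    ' "${folder}"*.csv >> "${out}"
done

result_1000=$(cat "${rescore}/1000.csv")
result_2000=$(cat "${rescore}/2000.csv")
echo "${result_1000}"
echo "${result_2000}"
rm -rf "${work}"
```

Running awk once per folder costs more processes than the single-invocation script, but it keeps the header and sample bookkeeping in the shell and makes the output order follow the sorted folder names.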
