AWK: statistics operations on multi-column CSV data

Problem Description

With the aim of performing some statistical analysis of multi-column data, I am analyzing a big number of CSV files using the following bash + AWK routine:

#!/bin/bash
home="$PWD"
# folder with the outputs
rescore="${home}"/rescore 
# folder with the folders to analyse
storage="${home}"/results
#cd "${home}"/results
cd "${storage}" || exit 1   # quote the path; abort if the folder is missing
csv_pattern='*_filt.csv'


while read -r d; do
awk -v rescore="$rescore" '
FNR==1 {
   if (n)
      mean[suffix] = s/n
   prefix=suffix=FILENAME
   sub(/_.*/, "", prefix)
   sub(/\/[^\/]+$/, "", suffix)
   sub(/^.*_/, "", suffix)
   s=n=0
}
FNR > 1 {
   s += $3
   ++n
}
END {
   out = rescore "/" prefix ".csv"
   mean[suffix] = s/n
   print prefix ":", "dG(mean)" > out
   for (i in mean)
      printf "%s: %.2f\n", i, mean[i] >> out
   close(out)
}' "${d}_"*/${csv_pattern} #> "${rescore}/"${d%%_*}".csv"
done < <(find . -maxdepth 1 -type d -name '*_*_*' | awk -F '[_/]' '!seen[$2]++ {print $2}')
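
For reference, the final find | awk pipeline feeds the loop with one unique prefix per group of directories; a minimal demonstration (the 20V1 names are made up for illustration):

printf '%s\n' ./10V1_cne_lig1001 ./10V1_cne_lig1002 ./20V1_cne_lig1001 |
awk -F '[_/]' '!seen[$2]++ {print $2}'
# -> 10V1
#    20V1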

Basically, the script takes an ensemble of CSV files belonging to the same prefix (defined as the naming pattern occurring at the beginning of the directory containing the CSV, for example 10V1 from 10V1_cne_lig1) and calculates the mean of the numbers in the third column:

# input *_filt.csv located in the folder 10V1_cne_lig1001
ID, POP, dG
1, 142, -5.6500
2, 10, -5.5000
3, 2, -4.9500

and appends one line to 10V1.csv, which is organized in a two-column format: i) the suffix of the folder containing the initial CSV; ii) the mean value calculated over all numbers in the third column (dG) of the input CSV:

# this is two column format of output.csv: 10V1.csv
10V1: dG(mean)
lig1001: -5.37
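
For clarity, this is how the three sub() calls in the FNR==1 block derive prefix and suffix from FILENAME; a minimal sketch (the file name output_filt.csv is made up to match the *_filt.csv pattern):

echo | awk '{
   prefix = suffix = "10V1_cne_lig1001/output_filt.csv"
   sub(/_.*/, "", prefix)        # prefix = "10V1": strip from the first "_"
   sub(/\/[^\/]+$/, "", suffix)  # suffix = "10V1_cne_lig1001": drop the file name
   sub(/^.*_/, "", suffix)       # suffix = "lig1001": keep text after the last "_"
   print prefix, suffix          # -> 10V1 lig1001
}'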

In this way, for 100 CSV files such an output.csv should contain 100 lines with the mean values, etc.

I need to introduce a small modification to the AWK part of my routine that would add a third column to the output CSV, holding the RMSD value (as a measure of the spread of the initial dG values) of the data used to calculate the mean. In AWK syntax, given a particular mean value, the RMSD could be expressed as:

# mean of the 3rd column, skipping the CSV header line
mean=$(awk -F ',' 'FNR > 1 {sum+=$3; ++n} END{if (n) printf "%.2f", sum/n}' "$csv")
# RMSD of the 3rd column around that mean
rmsd=$(awk -F ',' -v mean="$mean" 'FNR > 1 {++n; sum+=($3-mean)^2} END{if (n) printf "%.2f", sqrt(sum/n)}' "$csv")
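
Note that the two passes can be fused into one: with s = Σ dG and s2 = Σ dG², the same quantity is

rmsd = sqrt( s2/n - (s/n)^2 )

which allows the mean and the RMSD to be computed in a single pass over each file.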

Here is the expected output, with 5 means and 5 RMSD values calculated for 5 CSV logs (the first one corresponds to my example above):

10V1: dG(mean): RMSD (error)
lig1001 -5.37 0.30
lig1002 -8.53 0.34
lig1003 -6.57 0.25
lig1004 -9.53 0.00 # rmsd=0 since initial csv has only 1 line: no data variance 
lig1005 -8.11 0.39

How could this addition be incorporated into my main bash-AWK code, so that the third RMSD column (for each processed CSV, and hence for each calculated mean) is added to the output.csv?

Answer

You can calculate both the mean and the RMSD within the awk code. Would you please try the following:

awk -v rescore="$rescore" '
FNR==1 {
   if (n) {                     # calculate the results of previous file
      m = s / n                 # mean
      var = s2 / n - m * m      # variance
      if (var < 0) var = 0      # avoid an exception due to round-off error
      mean[suffix] = m          # store the mean in an array
      rmsd[suffix] = sqrt(var)
   }
   prefix=suffix=FILENAME
   sub(/_.*/, "", prefix)
   sub(/\/[^\/]+$/, "", suffix)
   sub(/^.*_/, "", suffix)
   s = 0                        # sum of $3
   s2 = 0                       # sum of $3 ** 2
   n = 0                        # count of samples
}
FNR > 1 {
   s += $3
   s2 += $3 * $3
   ++n
}
END {
   out = rescore "/" prefix ".csv"
   if (n) {                     # guard: the last file may hold only a header
      m = s / n
      var = s2 / n - m * m
      if (var < 0) var = 0
      mean[suffix] = m
      rmsd[suffix] = sqrt(var)
   }

   print prefix ":", "dG(mean)", "dG(rmsd)" > out
   for (i in mean)
      printf "%s: %.2f %.2f\n", i, mean[i], rmsd[i] >> out
   close(out)
}' "${d}_"*/${csv_pattern}
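
Dropped into the original while-loop in place of the previous awk block, this produces, for the 10V1 group, output such as (values taken from the expected output above):

# rescore/10V1.csv
10V1: dG(mean) dG(rmsd)
lig1001: -5.37 0.30
lig1002: -8.53 0.34
...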

Here is a version that also prints the lowest value of dG:

awk -v rescore="$rescore" '
FNR==1 {
   if (n) {                     # calculate the results of previous file
      m = s / n                 # mean
      var = s2 / n - m * m      # variance
      if (var < 0) var = 0      # avoid an exception due to round-off error
      mean[suffix] = m          # store the mean in an array
      rmsd[suffix] = sqrt(var)
      lowest[suffix] = min
   }
   prefix=suffix=FILENAME
   sub(/_.*/, "", prefix)
   sub(/\/[^\/]+$/, "", suffix)
   sub(/^.*_/, "", suffix)
   s = 0                        # sum of $3
   s2 = 0                       # sum of $3 ** 2
   n = 0                        # count of samples
   min = 0                      # lowest value of $3
}
FNR > 1 {
   s += $3
   s2 += $3 * $3
   ++n
   if ($3 < min) min = $3       # update the lowest value
}
END {
   if (n) {                     # just to avoid division by zero
      m = s / n
      var = s2 / n - m * m
      if (var < 0) var = 0
      mean[suffix] = m
      rmsd[suffix] = sqrt(var)
      lowest[suffix] = min
   }
   out = rescore "/" prefix ".csv"
   print prefix ":", "dG(mean)", "dG(rmsd)", "dG(lowest)" > out
   for (i in mean)
      printf "%s: %.2f %.2f %.2f\n", i, mean[i], rmsd[i], lowest[i] > out
}' file_*.csv

  • I've assumed all dG values are negative. If there is any chance a value is greater than zero, modify the line min = 0 so that the variable is initialized to a considerably big value (10,000 or whatever); see also the sketch after this list.
  • Please apply your modifications regarding the filenames, if needed.
  • The suggestions by Ed Morton are also included, although the results will be the same.
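
As an aside, a sign-agnostic alternative (not part of the original answer) is to let the first data row of each file seed min, instead of relying on a sentinel constant:

FNR > 1 {
   s += $3
   s2 += $3 * $3
   ++n
   if (n == 1 || $3 < min) min = $3   # the first sample seeds min, so any sign works
}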