awk/sed: post-processing of multi-column file(s)


Problem description

I am using the following bash function, which operates on CSV files: for each of them it executes AWK code that applies some math to the column data and eventually saves the processed CSV in a new file.

home="$PWD"
# folder with the outputs
rescore="${home}"/rescore 
# folder with the folders to analyse
storage="${home}"/results_bench
cd "${storage}"
# pattern of the csv file located inside each of sub-directory of "${storage}"
str='*str1.csv'

rescore_data2 () {
str_name=$(basename "${str}" .csv)
printf >&2 'Dataset for %s is being rescored...  ' "${str_name}"; sleep 0.1 
mkdir "${rescore}"/"${str_name}"
# Apply the following AWK code for rescoring and final data collecting
while read -r d; do
awk -F', *' -v OFS=', ' '
    FNR==1 {
        path=FILENAME
        sub(/\/[^/]+$/,"",path)
        prefix=suffix=FILENAME
        sub(/_.*/, "", prefix)
        sub(/\/[^\/]+$/, "", suffix); sub(/^.*_/, "", suffix)
        print suffix,"dG(rescored)"
        next
    }
    {
        print $1, sqrt((($3+12)/12)^2+(($2-240)/240)^2)
    }
'  "${d}_"*/${str} > "${rescore}/${str_name}/${d%%_*}.csv"
done < <(find . -maxdepth 1 -type d -name '*_*_*' | awk -F '[_/]' '!seen[$2]++ {print $2}')
}
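
As a side note, the find|awk pipeline feeding the while-loop deduplicates the PREFIX part of the directory names. A minimal standalone sketch of that idiom (the directory names below are made up for illustration, and `sort` is added only to make the order deterministic):

```shell
# create a few hypothetical result folders named PREFIX_METHOD_LIG
mkdir -p /tmp/seen_demo/10V1_cne_lig12 /tmp/seen_demo/10V1_cne_lig40 /tmp/seen_demo/7dt_cne_lig12
# split each "./PREFIX_METHOD_LIG" path on "_" and "/"; with that split,
# $2 is PREFIX, and !seen[$2]++ keeps only its first occurrence
(cd /tmp/seen_demo && find . -maxdepth 1 -type d -name '*_*_*' | sort |
 awk -F '[_/]' '!seen[$2]++ {print $2}')
# prints each distinct prefix once:
# 10V1
# 7dt
```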

Basically, each processed CSV has the following format:

# input CSV located in the folder 10V1_cne_lig12
ID, POP, dG
1, 142, -5.6500 
2, 10, -5.5000
3, 2, -4.9500
4, 150, -4.1200

My awk code converts it to a 2-column format (by applying the math equation to the 2nd and 3rd columns):

# output.csv
lig12, dG(rescored)
1, 0.596625
2, 1.05873
3, 1.11285
4, 0.697402

Note that lig12 in the first line is the suffix (used as the ID of the csv), which my AWK code extracts from the name of the FOLDER containing this CSV; 10V1 is the prefix (it defines the type of the csv).

I need to pipe my AWK script into something like sed or AWK that will further modify the obtained output.csv, converting it to a one-line format containing: the suffix (lig12), the minimal value detected in the second column of the output (here it is 0.596625), and its corresponding ID number from the first column (1):

lig12, 0.596625 (1)

Here is a one-line AWK solution, which does the job for just one csv:

 awk -F ', ' 'NR==1 { coltitle=$1 } NR==2 { min=$2; id=$1 } NR>2 && $2<min { min=$2; id=$1 } END { print coltitle FS min" ("id")" }'
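
A quick standalone check of that one-liner on the sample output shown earlier (the third pattern is written as NR>2 so that row 3 is also compared):

```shell
# feed the 2-column output.csv shown above into the min-picking one-liner
printf 'lig12, dG(rescored)\n1, 0.596625\n2, 1.05873\n3, 1.11285\n4, 0.697402\n' |
awk -F ', ' '
    NR==1 { coltitle = $1 }                  # remember the suffix from the header
    NR==2 { min = $2; id = $1 }              # seed min with the first data row
    NR>2 && $2 < min { min = $2; id = $1 }   # keep the smallest dG and its ID
    END   { print coltitle FS min " (" id ")" }'
# prints: lig12, 0.596625 (1)
```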

Can it be piped correctly into the first AWK code inside rescore_data2(), which is applied to the many CSVs processed by my bash function? The expected output stored in "${rescore}/${str_name}/${d%%_*}.csv" should then contain as many lines (each with the dG(min) of one CSV) as there are processed CSVs.

# expected output for 10 processed CSVs belonging to the prefix 10V1
# currently it does not print dGmin correctly for different CSVs.
    name: 10V1, dG(min)     # the header with the prefix should be at the top!
    lig199, 0.946749 (1)
    lig211, 0.946749 (1)
    lig278, 0.756155 (2)
    lig40, 0.756155 (2)
    lig576, 0.594778 (1)
    lig619, 0.594778 (1)
    lig697, 0.594778 (1)
    lig800, 0.594778 (1)
    lig868, 0.594778 (1)
    lig868, 0.594778 (1)

Answer

I have extracted the awk script as follows (with minor modifications):

awk -F', *' -v OFS=', ' '
    FNR==1 {
        path=FILENAME
        sub(/\/[^/]+$/,"",path)
        prefix=suffix=FILENAME
        sub(/_.*/, "", prefix)
        sub(/\/[^\/]+$/, "", suffix); sub(/^.*_/, "", suffix)
        print suffix,"dG(rescored)"
        next
    }
    {
        print $1, sqrt((($3+12)/12)^2+(($2-240)/240)^2)
    }
' 10V1_cne_lig12/foo_str3a.csv

The output is as follows:

lig12, dG(rescored)
1, 0.668396
2, 1.10082
3, 1.15263
4, 0.756198
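
As a quick sanity check (a standalone sketch, not part of the original script), the first value can be recomputed from the rescoring formula with the first data row of the sample csv (POP=142, dG=-5.65):

```shell
awk 'BEGIN {
    POP = 142; dG = -5.65      # first data row of the sample csv
    printf "%.6f\n", sqrt(((dG + 12) / 12)^2 + ((POP - 240) / 240)^2)
}'
# prints 0.668396
```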

Although the values slightly differ from the provided result, please let me go on as is.
Then modify the awk script as follows:

awk -F', *' -v OFS=', ' '
    FNR==1 {
        dGmin = ""                              # initialize the min value
        path=FILENAME
        sub(/\/[^/]+$/,"",path)
        prefix=suffix=FILENAME
        sub(/_.*/, "", prefix)
        sub(/\/[^\/]+$/, "", suffix); sub(/^.*_/, "", suffix)
        print suffix,"dG(rescored)"
        next
    }
    {
        dG = sqrt((($3+12)/12)^2+(($2-240)/240)^2)
        if (dGmin == "" || dG < dGmin) {
            dGmin = dG                          # update the min dG value
            dGminid = $1                        # update the ID with the min dG
        }
    }
    END {
        print suffix, dGmin " (" dGminid ")"    # report the results
    }
' 10V1_cne_lig12/foo_str3a.csv

Output:

lig12, dG(rescored)
lig12, 0.668396 (1)

You'll see that the first record is picked along with its ID. The awk script above assumes there is just one input file. If you want to process multiple csv files at once, you will need to put the "report the results" line not only in the END{} block but also at the start of the FNR==1{} block (which runs whenever one file's processing is done and the next begins).

[UPDATE]
You would replace your function with the following rescore_data3():

rescore_data3 () {
str_name=$(basename "${str}" .csv)
printf >&2 'Dataset for %s is being rescored...  ' "${str_name}"; sleep 0.1
mkdir -p "${rescore}"/"${str_name}"
# Apply the following AWK code for rescoring and final data collecting
while read -r d; do
awk -F', *' -v OFS=', ' '
    FNR==1 {
        if (suffix)                             # suppress the empty line
            print suffix, dGmin " (" dGminid ")"
                                                # report the results
        dGmin = ""                              # initialize the min value
        path=FILENAME
        sub(/\/[^/]+$/,"",path)
        prefix=suffix=FILENAME
        sub(/_.*/, "", prefix)
        sub(/\/[^\/]+$/, "", suffix); sub(/^.*_/, "", suffix)
        if (FNR==NR)
            print prefix                        # print the header line
        next
    }
    {
        dG = sqrt((($3+12)/12)^2+(($2-240)/240)^2)
        if (dGmin == "" || dG < dGmin) {
            dGmin = dG                          # update the min dG value
            dGminid = $1                        # update the ID with the min dG
        }
    }
    END {
        print suffix, dGmin " (" dGminid ")"    # report the results
    }
' "${d}_"*/${str} > "${rescore}/${str_name}/${d%%_*}.csv"
done < <(find . -maxdepth 1 -type d -name '*_*_*' | awk -F '[_/]' '!seen[$2]++ {print $2}')
}

  • As mentioned before, you need to put a condition such as if (suffix) ... in the FNR==1{} block to suppress the empty line at the beginning of the result file.
  • I'm sorry, I made a typo: dgmin = "" should be dGmin = "" in my previous answer.
  • Better to pass the -p option to mkdir so you avoid the mkdir: cannot create directory: File exists error.
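
To make the per-file reporting concrete, here is a self-contained toy run of the FNR==1/END trick (paths, file names, and data are made up for illustration; the header line is omitted for brevity):

```shell
# two tiny csv files in hypothetical result folders
mkdir -p /tmp/rescore_demo/10V1_cne_lig12 /tmp/rescore_demo/10V1_cne_lig40
printf 'ID, POP, dG\n1, 142, -5.6500\n2, 10, -5.5000\n' > /tmp/rescore_demo/10V1_cne_lig12/str1.csv
printf 'ID, POP, dG\n1, 150, -4.1200\n2, 2, -4.9500\n'  > /tmp/rescore_demo/10V1_cne_lig40/str1.csv
awk -F', *' -v OFS=', ' '
    FNR==1 {                                       # a new input file starts here
        if (suffix)                                # empty only for the very first file,
            print suffix, dGmin " (" dGminid ")"   # so no blank line is printed
        dGmin = ""                                 # reset the minimum for this file
        suffix = FILENAME
        sub(/\/[^\/]+$/, "", suffix); sub(/^.*_/, "", suffix)
        next
    }
    {
        dG = sqrt((($3+12)/12)^2 + (($2-240)/240)^2)
        if (dGmin == "" || dG < dGmin) { dGmin = dG; dGminid = $1 }
    }
    END { print suffix, dGmin " (" dGminid ")" }   # report the last file
' /tmp/rescore_demo/*/str1.csv
# prints:
# lig12, 0.668396 (1)
# lig40, 0.756198 (1)
```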
