awk/sed: post-processing of multi-column files
Question
I am using the following bash function, which operates on CSV files, executing for each one an AWK script that performs some math operations on the column data and eventually saves the processed CSV in a new file:
home="$PWD"
# folder with the outputs
rescore="${home}"/rescore
# folder with the folders to analyse
storage="${home}"/results_bench
cd "${storage}"
# pattern of the csv file located inside each of sub-directory of "${storage}"
str='*str1.csv'
rescore_data2 () {
str_name=$(basename "${str}" .csv)
printf >&2 'Dataset for %s is being rescored... ' "${str_name}"; sleep 0.1
mkdir "${rescore}"/"${str_name}"
# Apply the following AWK code for rescoring and final data collecting
while read -r d; do
awk -F', *' -v OFS=', ' '
FNR==1 {
path=FILENAME
sub(/\/[^/]+$/,"",path)
prefix=suffix=FILENAME
sub(/_.*/, "", prefix)
sub(/\/[^\/]+$/, "", suffix); sub(/^.*_/, "", suffix)
print suffix,"dG(rescored)"
next
}
{
print $1, sqrt((($3+12)/12)^2+(($2-240)/240)^2)
}
' "${d}_"*/${str} > "${rescore}/${str_name}/${d%%_*}.csv"
done < <(find . -maxdepth 1 -type d -name '*_*_*' | awk -F '[_/]' '!seen[$2]++ {print $2}')
}
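As an aside, the find ... | awk pipeline feeding the while loop deduplicates the prefix part of directory names like 10V1_cne_lig12 (splitting on "_" and "/", field 2 is the prefix). A minimal sketch with hypothetical directory names:

```shell
# Hypothetical illustration of the prefix-deduplication pipeline used above:
# each "./PREFIX_..._..." path is split on "_" and "/", so $2 is the prefix,
# and !seen[$2]++ prints each prefix only the first time it is encountered.
printf './10V1_cne_lig12\n./10V1_cne_lig40\n./7KB_abc_lig5\n' |
awk -F '[_/]' '!seen[$2]++ {print $2}'
# prints:
# 10V1
# 7KB
```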
Basically each processed CSV has the following format:
# input CSV located in the folder 10V1_cne_lig12
ID, POP, dG
1, 142, -5.6500
2, 10, -5.5000
3, 2, -4.9500
4, 150, -4.1200
My awk code converts it to a 2-column format (by applying a math equation to the 2nd and 3rd columns):
# output.csv
lig12, dG(rescored)
1, 0.596625
2, 1.05873
3, 1.11285
4, 0.697402
Note that lig12 in the first line is the suffix (used as the ID of the CSV), extracted by my AWK code from the name of the folder containing this CSV, and 10V1 is the prefix (which defines the type of the CSV).
I need to pipe my AWK script into something like sed or AWK that will further modify the obtained output.csv, converting it to a one-line format containing the suffix (lig12), the minimal value detected in the second column of the output (here 0.596625), and its corresponding ID number from the first column (1):
lig12, 0.596625 (1)
Here is a one-line AWK solution, which does the job for just one CSV:
awk -F ', ' ' NR==1 { coltitle=$1 } NR==2 { min=$2; id=$1 } NR>2 && $2<min { min=$2; id=$1 } END { print coltitle FS min" ("id")" }'
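Feeding the sample output.csv from above through this one-liner (a minimal sketch; note it uses NR>2 so that the third data row is also compared, whereas NR>3 would silently skip row 3):

```shell
# Pipe the 2-column rescored CSV through the min-finding one-liner.
printf 'lig12, dG(rescored)\n1, 0.596625\n2, 1.05873\n3, 1.11285\n4, 0.697402\n' |
awk -F ', ' 'NR==1 { coltitle=$1 }          # remember the suffix from the header
             NR==2 { min=$2; id=$1 }        # seed min with the first data row
             NR>2 && $2<min { min=$2; id=$1 }
             END { print coltitle FS min " (" id ")" }'
# prints: lig12, 0.596625 (1)
```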
Can it be piped correctly into the first AWK code inside rescore_data2(), which is applied to the many CSVs processed by my bash function? The expected output stored in "${rescore}/${str_name}/${d%%_*}.csv" should then contain one line (with the dG(min) of each CSV) per processed CSV.
# expected output for 10 processed CSVs belonged to the prefix 10V1
# currently it does not print dGmin correctly for different CSVs.
name: 10V1, dG(min) # header with prefix should be at the top!
lig199, 0.946749 (1)
lig211, 0.946749 (1)
lig278, 0.756155 (2)
lig40, 0.756155 (2)
lig576, 0.594778 (1)
lig619, 0.594778 (1)
lig697, 0.594778 (1)
lig800, 0.594778 (1)
lig868, 0.594778 (1)
lig868, 0.594778 (1)
Answer
I have extracted the awk script as follows (with minor modifications):
awk -F', *' -v OFS=', ' '
FNR==1 {
path=FILENAME
sub(/\/[^/]+$/,"",path)
prefix=suffix=FILENAME
sub(/_.*/, "", prefix)
sub(/\/[^\/]+$/, "", suffix); sub(/^.*_/, "", suffix)
print suffix,"dG(rescored)"
next
}
{
print $1, sqrt((($3+12)/12)^2+(($2-240)/240)^2)
}
' 10V1_cne_lig12/foo_str3a.csv
The output is as follows:
lig12, dG(rescored)
1, 0.668396
2, 1.10082
3, 1.15263
4, 0.756198
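For what it's worth, a quick numeric check (a minimal awk sketch; the constants 12 and 240 are taken from the script above) confirms that plugging the first sample row (POP=142, dG=-5.65) into the formula indeed yields 0.668396, not the 0.596625 shown in the question:

```shell
# Evaluate the rescoring formula for the first data row of the sample CSV.
awk 'BEGIN {
    POP = 142; dG = -5.65
    print sqrt(((dG + 12) / 12)^2 + ((POP - 240) / 240)^2)
}'
# prints: 0.668396
```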
Although the values differ slightly from the provided result, let me proceed as-is. Then modify the awk script as follows:
awk -F', *' -v OFS=', ' '
FNR==1 {
dGmin = "" # initialize the min value
path=FILENAME
sub(/\/[^/]+$/,"",path)
prefix=suffix=FILENAME
sub(/_.*/, "", prefix)
sub(/\/[^\/]+$/, "", suffix); sub(/^.*_/, "", suffix)
print suffix,"dG(rescored)"
next
}
{
dG = sqrt((($3+12)/12)^2+(($2-240)/240)^2)
if (dGmin == "" || dG < dGmin) {
dGmin = dG # update the min dG value
dGminid = $1 # update the ID with the min dG
}
}
END {
print suffix, dGmin " (" dGminid ")" # report the results
}
' 10V1_cne_lig12/foo_str3a.csv
Output:
lig12, dG(rescored)
lig12, 0.668396 (1)
You'll see the 1st record is picked along with its ID. The awk script above assumes there is just one input file. If you want to process multiple CSV files at once, you will need to put the "report the results" line not only in the END{} block but also at the beginning of the FNR==1{} block (which fires whenever processing of one file has finished and the next begins).
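The per-file reporting pattern can be sketched in isolation (a minimal example with hypothetical two-column files and no header lines, so the minimum is tracked from the first record of each file):

```shell
# On each new file (FNR==1), flush the previous file's result, then reset state.
printf '1 5\n2 3\n' > /tmp/a.txt
printf '1 9\n2 4\n' > /tmp/b.txt
awk '
FNR==1 { if (fname) print fname ": min=" min   # report the finished file
         fname=FILENAME; min="" }              # reset state for the new file
min=="" || $2<min { min=$2 }                   # track the per-file minimum
END    { print fname ": min=" min }            # report the last file
' /tmp/a.txt /tmp/b.txt
# prints:
# /tmp/a.txt: min=3
# /tmp/b.txt: min=4
```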
[Update]
Replace your rescore_data3() function with:
rescore_data3 () {
str_name=$(basename "${str}" .csv)
printf >&2 'Dataset for %s is being rescored... ' "${str_name}"; sleep 0.1
mkdir -p "${rescore}"/"${str_name}"
# Apply the following AWK code for rescoring and final data collecting
while read -r d; do
awk -F', *' -v OFS=', ' '
FNR==1 {
if (suffix) # suppress the empty line
print suffix, dGmin " (" dGminid ")"
# report the results
dGmin = "" # initialize the min value
path=FILENAME
sub(/\/[^/]+$/,"",path)
prefix=suffix=FILENAME
sub(/_.*/, "", prefix)
sub(/\/[^\/]+$/, "", suffix); sub(/^.*_/, "", suffix)
if (FNR==NR)
print "name: " prefix, "dG(min)" # print the header line
next
}
{
dG = sqrt((($3+12)/12)^2+(($2-240)/240)^2)
if (dGmin == "" || dG < dGmin) {
dGmin = dG # update the min dG value
dGminid = $1 # update the ID with the min dG
}
}
END {
print suffix, dGmin " (" dGminid ")" # report the results
}
' "${d}_"*/${str} > "${rescore}/${str_name}/${d%%_*}.csv"
done < <(find . -maxdepth 1 -type d -name '*_*_*' | awk -F '[_/]' '!seen[$2]++ {print $2}')
}
- As mentioned before, you need to put a condition such as if (suffix) ... in the FNR==1{} block to suppress the empty line at the beginning of the result file.
- I'm sorry, I made a typo: dgmin = "" should be dGmin = "" in my previous answer.
- It is better to pass the -p option to mkdir so you can avoid the mkdir: cannot create directory: File exists error.