bash looping and extracting of the fragment of txt file
Question
I am dealing with the analysis of a large number of dlg text files located within the workdir. Each file has a table (usually located at a different position in the log) in the following format:
File 1:
CLUSTERING HISTOGRAM
____________________
________________________________________________________________________________
| | | | |
Clus | Lowest | Run | Mean | Num | Histogram
-ter | Binding | | Binding | in |
Rank | Energy | | Energy | Clus| 5 10 15 20 25 30 35
_____|___________|_____|___________|_____|____:____|____:____|____:____|____:___
1 | -5.78 | 11 | -5.78 | 1 |#
2 | -5.53 | 13 | -5.53 | 1 |#
3 | -5.47 | 17 | -5.44 | 2 |##
4 | -5.43 | 20 | -5.43 | 1 |#
5 | -5.26 | 19 | -5.26 | 1 |#
6 | -5.24 | 3 | -5.24 | 1 |#
7 | -5.19 | 4 | -5.19 | 1 |#
8 | -5.14 | 16 | -5.14 | 1 |#
9 | -5.11 | 9 | -5.11 | 1 |#
10 | -5.07 | 1 | -5.07 | 1 |#
11 | -5.05 | 14 | -5.05 | 1 |#
12 | -4.99 | 12 | -4.99 | 1 |#
13 | -4.95 | 8 | -4.95 | 1 |#
14 | -4.93 | 2 | -4.93 | 1 |#
15 | -4.90 | 10 | -4.90 | 1 |#
16 | -4.83 | 15 | -4.83 | 1 |#
17 | -4.82 | 6 | -4.82 | 1 |#
18 | -4.43 | 5 | -4.43 | 1 |#
19 | -4.26 | 7 | -4.26 | 1 |#
_____|___________|_____|___________|_____|______________________________________
The aim is to loop over all the dlg files and take from each table the single line corresponding to the widest cluster (the one with the largest number of # marks in the Histogram column). In the example above this is the third line:
3 | -5.47 | 17 | -5.44 | 2 |##
Then I need to add this line to final_log.txt together with the name of the log file (which should be placed before the line). So in the end I should have something in the following format (for 3 different log files):
"Name of the file 1": 3 | -5.47 | 17 | -5.44 | 2 |##
"Name_of_the_file_2": 1 | -5.99 | 13 | -5.98 | 16 |################
"Name_of_the_file_3": 2 | -4.78 | 19 | -4.44 | 3 |###
A possible model of my BASH workflow would be:
#!/bin/bash
for f in "$PWD"/*.dlg; do
    file_name2=$(basename "$f")
    file_name="${file_name2%.dlg}"
    echo "Processing $f..."
    # take the name of the file and save it in the log
    echo "$file_name" >> "$PWD/final_results.log"
    # find the beginning of the table inside each file and save it after the name
    grep 'CLUSTERING HISTOGRAM' "$f" >> "$PWD/final_results.log"
done
# check whether it works
gedit "$PWD/final_results.log"
Here I need to replace the combination of echo and grep with something that extracts the selected part of the table.
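As a sketch of that missing piece (an assumption about the intended logic, not code from the question): a small helper, here given the made-up name best_row, prints the histogram row with the most # marks from a single file, assuming the table rows are the only lines ending in #.

```shell
#!/bin/sh
# Hypothetical helper: print the row with the longest '#' run in one file.
# Assumes the histogram rows are the only lines ending in '#'.
best_row() {
  awk -F'|' '/#$/ && length($NF) > n { n = length($NF); row = $0 }
             END { if (row != "") print row }' "$1"
}

# Tiny self-contained check with throwaway sample data.
tmp=$(mktemp -d)
printf '%s\n' ' 1 | -5.78 |  11 |   -5.78 |   1 |#' \
              ' 3 | -5.47 |  17 |   -5.44 |   2 |##' > "$tmp/a.dlg"
best_row "$tmp/a.dlg"
```

Calling best_row inside the loop instead of the grep line would append one table row per file to the log.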
Answer
You can use this one, which should be fast enough. Extra lines in your files, besides the tables, are not a problem.
grep "#$" *.dlg | sort -rk11 | awk '!seen[$1]++'
grep fetches all the histogram lines, which are then sorted in reverse order by the last field, so the lines with the most # characters come out on top; finally, awk removes the duplicates, keeping only the first line seen for each file. Note that when grep parses more than one file, it enables -H by default to print the filename at the beginning of every line, so if you test it on a single file, use grep -H explicitly.
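To see the pipeline end to end, here is a self-contained run on two fabricated .dlg files (the names a.dlg/b.dlg and their rows are made up), writing the combined result to a final_log.txt as the question asks:

```shell
#!/bin/sh
# Scratch directory with two fabricated .dlg files.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf '%s\n' ' 1 | -5.78 |  11 |   -5.78 |   1 |#' \
              ' 3 | -5.47 |  17 |   -5.44 |   2 |##' > a.dlg
printf '%s\n' ' 1 | -5.99 |  13 |   -5.98 |  16 |####' \
              ' 2 | -4.78 |  19 |   -4.44 |   3 |###' > b.dlg

# The answer's pipeline, redirected into the final log:
# grep tags each row with its filename, sort puts the widest
# histogram first, awk keeps one row per file.
grep "#$" *.dlg | sort -rk11 | awk '!seen[$1]++' > final_log.txt
cat final_log.txt
```

Each file contributes exactly one line, prefixed with its name, widest cluster first across files.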
The result should look like this:
file1.dlg: 3 | -5.47 | 17 | -5.44 | 2 |##########
file2.dlg: 3 | -5.47 | 17 | -5.44 | 2 |####
file3.dlg: 3 | -5.47 | 17 | -5.44 | 2 |#######
Here is a modification to get the first appearance in case a file contains several lines tied for the maximum:
grep "#$" *.dlg | sort -s -rk11 | awk '!seen[$1]++'
The -s flag makes the sort stable: lines with equal keys keep their input order (without it, GNU sort breaks ties with a last-resort comparison of the whole line), so the first of the tied maximum lines stays on top and awk keeps that one.
Second solution
Using awk only:
awk -F"|" '/#$/ && $NF > max[FILENAME] {max[FILENAME]=$NF; row[FILENAME]=$0}
END {for (i in row) print i ":" row[i]}' *.dlg
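A quick check of the awk-only variant on a throwaway sample file (the name and rows are fabricated):

```shell
#!/bin/sh
# One fabricated .dlg file in a scratch directory.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf '%s\n' ' 1 | -5.78 |  11 |   -5.78 |   1 |#' \
              ' 3 | -5.47 |  17 |   -5.44 |   2 |##' > a.dlg

# Track the widest histogram row per input file, then print one line per file.
awk -F"|" '/#$/ && $NF > max[FILENAME] {max[FILENAME]=$NF; row[FILENAME]=$0}
           END {for (i in row) print i ":" row[i]}' *.dlg > final_log.txt
cat final_log.txt
```

The string comparison $NF > max[FILENAME] works here because a longer run of # compares greater than any of its prefixes.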
Update: if you execute it from a different directory and want to keep only the basename of each file, remove the path prefix:
awk -F"|" '/#$/ && $NF > max[FILENAME] {max[FILENAME]=$NF; row[FILENAME]=$0}
END {for (i in row) {f=i; sub(".*/","",f); print f ":" row[i]}}' *.dlg
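A sanity check of the basename-stripping version run from outside the data directory (the dlgs/ subfolder is made up); the stripped name goes into a separate variable so the array lookup still uses the original key:

```shell
#!/bin/sh
# One fabricated .dlg file inside a made-up subdirectory.
tmp=$(mktemp -d)
mkdir "$tmp/dlgs"
printf '%s\n' ' 2 | -4.78 |  19 |   -4.44 |   3 |###' > "$tmp/dlgs/b.dlg"
cd "$tmp" || exit 1

# sub() strips the leading path from a copy of the key before printing.
awk -F"|" '/#$/ && $NF > max[FILENAME] {max[FILENAME]=$NF; row[FILENAME]=$0}
           END {for (i in row) {f=i; sub(".*/","",f); print f ":" row[i]}}' \
    dlgs/*.dlg > final_log.txt
cat final_log.txt
```

The output line starts with b.dlg rather than dlgs/b.dlg.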