bash looping and extracting of the fragment of a txt file


Problem description

I am dealing with the analysis of a large number of dlg text files located within the workdir. Each file has a table (usually located at a different position in the log) in the following format:

File 1:

    CLUSTERING HISTOGRAM
    ____________________


________________________________________________________________________________
     |           |     |           |     |
Clus | Lowest    | Run | Mean      | Num | Histogram
-ter | Binding   |     | Binding   | in  |
Rank | Energy    |     | Energy    | Clus|    5    10   15   20   25   30   35
_____|___________|_____|___________|_____|____:____|____:____|____:____|____:___
   1 |     -5.78 |  11 |     -5.78 |   1 |#
   2 |     -5.53 |  13 |     -5.53 |   1 |#
   3 |     -5.47 |  17 |     -5.44 |   2 |##
   4 |     -5.43 |  20 |     -5.43 |   1 |#
   5 |     -5.26 |  19 |     -5.26 |   1 |#
   6 |     -5.24 |   3 |     -5.24 |   1 |#
   7 |     -5.19 |   4 |     -5.19 |   1 |#
   8 |     -5.14 |  16 |     -5.14 |   1 |#
   9 |     -5.11 |   9 |     -5.11 |   1 |#
  10 |     -5.07 |   1 |     -5.07 |   1 |#
  11 |     -5.05 |  14 |     -5.05 |   1 |#
  12 |     -4.99 |  12 |     -4.99 |   1 |#
  13 |     -4.95 |   8 |     -4.95 |   1 |#
  14 |     -4.93 |   2 |     -4.93 |   1 |#
  15 |     -4.90 |  10 |     -4.90 |   1 |#
  16 |     -4.83 |  15 |     -4.83 |   1 |#
  17 |     -4.82 |   6 |     -4.82 |   1 |#
  18 |     -4.43 |   5 |     -4.43 |   1 |#
  19 |     -4.26 |   7 |     -4.26 |   1 |#
_____|___________|_____|___________|_____|______________________________________

The aim is to loop over all the dlg files and take the single line from the table corresponding to the widest cluster (the one with the largest number of # characters in the Histogram column). In the example table above this is the third line:

   3 |     -5.47 |  17 |     -5.44 |   2 |##

Then I need to append this line to final_log.txt together with the name of the log file (which should be given before the line). So in the end I should have something in the following format (for 3 different log files):

"Name of the file 1": 3 |     -5.47 |  17 |     -5.44 |   2 |##
"Name_of_the_file_2": 1 |     -5.99 |  13 |     -5.98 |  16 |################
"Name_of_the_file_3": 2 |     -4.78 |  19 |     -4.44 |   3 |###

A possible model of my BASH workflow would be:

#!/bin/bash
for f in "$PWD"/*.dlg; do
  file_name2=$(basename "$f")
  file_name="${file_name2/.dlg}"
  echo "Processing $f..."
  # take the name of the file and save it in the log
  echo "$file_name" >> "$PWD/final_results.log"
  # search for the beginning of the table inside each file and save it after the name
  grep 'CLUSTERING HISTOGRAM' "$f" >> "$PWD/final_results.log"
  # check whether it works
  gedit "$PWD/final_results.log"
done

Here I need to replace the combination of echo and grep with something that extracts the selected part of the table.

Recommended answer

You can use this one-liner; it should be fast enough. Extra lines in your files besides the tables are not expected to be a problem:

grep "#$" *.dlg | sort -rk11 | awk '!seen[$1]++'

grep fetches all the histogram lines, which are then sorted in reverse order by the last field, meaning the lines with the most # characters come first, and finally awk removes the duplicates, keeping the first line seen for each file. Note that when grep parses more than one file, it enables -H by default to print the filename at the beginning of each line, so if you test it on a single file, use grep -H.

The result should look like this:

file1.dlg:   3 |     -5.47 |  17 |     -5.44 |   2 |##########
file2.dlg:   3 |     -5.47 |  17 |     -5.44 |   2 |####
file3.dlg:   3 |     -5.47 |  17 |     -5.44 |   2 |#######
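To sanity-check the pipeline end to end, it can be run against two tiny made-up .dlg files (the file names and rows below are invented for illustration):

```shell
# Create two throwaway .dlg-like files in a scratch directory (invented data)
tmp=$(mktemp -d)
cd "$tmp" || exit 1

printf '%s\n' \
  '   1 |     -5.78 |  11 |     -5.78 |   1 |#' \
  '   3 |     -5.47 |  17 |     -5.44 |   2 |##' > a.dlg
printf '%s\n' \
  '   2 |     -4.90 |  10 |     -4.90 |   4 |####' \
  '   5 |     -4.20 |   6 |     -4.20 |   1 |#' > b.dlg

# grep keeps only histogram lines (ending in #) and prefixes each with its
# filename; sort orders them by the histogram field, widest first; awk keeps
# the first line seen per filename.
result=$(grep "#$" *.dlg | sort -rk11 | awk '!seen[$1]++')
printf '%s\n' "$result"
```

For each file, the line with the widest histogram survives: the `|##` row from a.dlg and the `|####` row from b.dlg.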


Here is a modification that gets the first appearance in case a file contains several lines with the same maximal width:

grep "#$" *.dlg | sort -k11 | tac | awk '!seen[$1]++'

We replaced the -r (reverse) flag of sort with the tac command, which reverses the stream, so now for any equal lines the initial order is preserved.
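As a minimal illustration of tac's role in the pipeline (the sample lines below are made up), it simply flips the order of the sorted stream so the widest histograms come out first:

```shell
# tac reverses the order of the lines in its input
rev=$(printf '%s\n' 'a |#' 'b |##' 'c |###' | tac)
printf '%s\n' "$rev"
```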

Second solution

Using only awk:

awk -F"|" '/#$/ && $NF > max[FILENAME] {max[FILENAME]=$NF; row[FILENAME]=$0}
           END {for (i in row) print i ":" row[i]}' *.dlg
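A quick check of the awk-only variant on the same kind of made-up sample data (file names and rows invented for illustration):

```shell
# Two throwaway files with histogram-style rows (invented data)
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf '%s\n' \
  '   1 |     -5.78 |  11 |     -5.78 |   1 |#' \
  '   3 |     -5.47 |  17 |     -5.44 |   2 |##' > a.dlg
printf '%s\n' \
  '   2 |     -4.90 |  10 |     -4.90 |   4 |####' > b.dlg

# With FS="|" the last field is the run of # characters; string comparison
# on runs of the same character ranks longer runs higher, so the widest
# row per FILENAME wins.
out=$(awk -F"|" '/#$/ && $NF > max[FILENAME] {max[FILENAME]=$NF; row[FILENAME]=$0}
                 END {for (i in row) print i ":" row[i]}' *.dlg)
printf '%s\n' "$out"
```

Note that the order of the output lines is unspecified, since awk's for-in iteration order over array indices is implementation-defined.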


Update: if you execute it from a different directory and want to keep only the basename of every file, remove the path prefix:

awk -F"|" '/#$/ && $NF > max[FILENAME] {max[FILENAME]=$NF; row[FILENAME]=$0}
           END {for (i in row) {sub(".*/","",i); print i ":" row[i]}}' *.dlg
