Bash looping and extracting a fragment of a txt file


Question

I am dealing with the analysis of a large number of dlg text files located within the workdir. Each file has a table (usually located at a different position in the log) in the following format:

File 1:

    CLUSTERING HISTOGRAM
    ____________________


________________________________________________________________________________
     |           |     |           |     |
Clus | Lowest    | Run | Mean      | Num | Histogram
-ter | Binding   |     | Binding   | in  |
Rank | Energy    |     | Energy    | Clus|    5    10   15   20   25   30   35
_____|___________|_____|___________|_____|____:____|____:____|____:____|____:___
   1 |     -5.78 |  11 |     -5.78 |   1 |#
   2 |     -5.53 |  13 |     -5.53 |   1 |#
   3 |     -5.47 |  17 |     -5.44 |   2 |##
   4 |     -5.43 |  20 |     -5.43 |   1 |#
   5 |     -5.26 |  19 |     -5.26 |   1 |#
   6 |     -5.24 |   3 |     -5.24 |   1 |#
   7 |     -5.19 |   4 |     -5.19 |   1 |#
   8 |     -5.14 |  16 |     -5.14 |   1 |#
   9 |     -5.11 |   9 |     -5.11 |   1 |#
  10 |     -5.07 |   1 |     -5.07 |   1 |#
  11 |     -5.05 |  14 |     -5.05 |   1 |#
  12 |     -4.99 |  12 |     -4.99 |   1 |#
  13 |     -4.95 |   8 |     -4.95 |   1 |#
  14 |     -4.93 |   2 |     -4.93 |   1 |#
  15 |     -4.90 |  10 |     -4.90 |   1 |#
  16 |     -4.83 |  15 |     -4.83 |   1 |#
  17 |     -4.82 |   6 |     -4.82 |   1 |#
  18 |     -4.43 |   5 |     -4.43 |   1 |#
  19 |     -4.26 |   7 |     -4.26 |   1 |#
_____|___________|_____|___________|_____|______________________________________

The aim is to loop over all the dlg files and take from each table the single line corresponding to the widest cluster (the one with the largest number of # characters in the Histogram column). In the example table above this is the third line:

   3 |     -5.47 |  17 |     -5.44 |   2 |##

Then I need to add this line to final_log.txt together with the name of the log file (which should come before the line). So in the end I should have something in the following format (for 3 different log files):

"Name of the file 1": 3 |     -5.47 |  17 |     -5.44 |   2 |##
"Name_of_the_file_2": 1 |     -5.99 |  13 |     -5.98 |  16 |################
"Name_of_the_file_3": 2 |     -4.78 |  19 |     -4.44 |   3 |###

A possible model of my BASH workflow would be:

#!/bin/bash
# loop over all dlg files in the working directory
for f in "$PWD"/*.dlg; do
  file_name2=$(basename "$f")
  file_name="${file_name2/.dlg}"
  echo "Processing of $f..."
  # take the name of the file and save it in the log
  echo "$file_name" >> "$PWD/final_results.log"
  # search for the beginning of the table inside each file and save it after the name
  grep 'CLUSTERING HISTOGRAM' "$f" >> "$PWD/final_results.log"
done
# check whether it works
gedit "$PWD/final_results.log"

Here I need to substitute the combination of echo and grep with something that extracts the selected part of the table.
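One way to fill that gap is to let awk pick the widest row inside the loop. Below is a minimal, self-contained sketch on made-up sample data (the file names, table rows, and the final_results.log name are assumptions; the real tables are expected to be the only lines ending in #):

```shell
#!/bin/bash
# Sketch: for each dlg file, keep the table row whose Histogram column
# (the last '|'-separated field) is longest, and prepend the basename.
# The sample data below is hypothetical.
cd "$(mktemp -d)"
printf '   1 |  -5.78 |  11 |  -5.78 |   1 |#\n   3 |  -5.47 |  17 |  -5.44 |   2 |##\n' > file1.dlg
printf '   1 |  -5.99 |  13 |  -5.98 |  16 |####\n   2 |  -4.78 |  19 |  -4.44 |   3 |###\n' > file2.dlg

for f in ./*.dlg; do
  name=$(basename "$f" .dlg)
  # keep the row with the longest run of '#' at the end of the line
  best=$(awk -F'|' '/#$/ && length($NF) > m {m = length($NF); line = $0}
                    END {print line}' "$f")
  printf '%s: %s\n' "$name" "$best" >> final_results.log
done
cat final_results.log
```

With real data the two printf lines go away and the loop simply runs over the existing workdir files.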

Answer

You can use this one-liner, which should be fast enough. Extra lines in your files, besides the tables, are not expected to be a problem:

grep "#$" *.dlg | sort -rk11 | awk '!seen[$1]++'

grep fetches all the histogram lines, which sort then orders in reverse by the last field, so the lines with the most # characters end up on top; finally awk keeps only the first line seen per file, removing the duplicates. Note that grep prints the file name at the beginning of each line only when it parses more than one file, so if you test this against a single file, use grep -H to force the file name prefix.

The result should look like this:

file1.dlg:   3 |     -5.47 |  17 |     -5.44 |   2 |##########
file2.dlg:   3 |     -5.47 |  17 |     -5.44 |   2 |####
file3.dlg:   3 |     -5.47 |  17 |     -5.44 |   2 |#######
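The whole pipeline can be tried end-to-end on throwaway files (a sketch; the sample rows below are made up and final_log.txt is an assumed output name):

```shell
#!/bin/bash
# Demo of the grep | sort | awk pipeline on hypothetical sample files.
cd "$(mktemp -d)"
printf '   1 |  -5.78 |  11 |  -5.78 |   1 |#\n   3 |  -5.47 |  17 |  -5.44 |   2 |##\n' > file1.dlg
printf '   2 |  -4.78 |  19 |  -4.44 |   3 |###\n   1 |  -5.99 |  13 |  -5.98 |  16 |####\n' > file2.dlg

# grep prefixes each matching line with "filename:"; sort -rk11 puts
# the longest '#' run first; awk keeps the first line per file name.
grep "#$" *.dlg | sort -rk11 | awk '!seen[$1]++' > final_log.txt
cat final_log.txt
```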


Here is a modification that keeps the first appearance in case a file has many equal max lines:

grep "#$" *.dlg | sort -k11 | tac | awk '!seen[$1]++'

Here the -r (reverse) flag of sort is replaced by the tac command, which reverses the stream after sorting, so now for any equal lines the initial order is preserved.

Second solution

Here is one using only awk:

awk -F"|" '/#$/ && $NF > max[FILENAME] {max[FILENAME]=$NF; row[FILENAME]=$0}
           END {for (i in row) print i ":" row[i]}' *.dlg
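Note that with -F"|" the last field $NF is the Histogram column itself (e.g. ##), so the > here is a string comparison; that still works because a longer run of the same character compares lexicographically greater. A quick sanity check:

```shell
# Longer runs of the same character compare greater in awk string comparison.
awk 'BEGIN { if ("###" > "##") print "longer run compares greater" }'
```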


Update: if you execute it from a different directory and want to keep only the basename of every file, remove the path prefix:

awk -F"|" '/#$/ && $NF > max[FILENAME] {max[FILENAME]=$NF; row[FILENAME]=$0}
           END {for (i in row) {name=i; sub(".*/","",name); print name ":" row[i]}}' *.dlg
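For a runnable check of this variant, generated files in a temporary subdirectory can stand in for the workdir (the directory and file names are made up; the loop key is copied into name before sub() so that row[i] is still looked up with the original key):

```shell
#!/bin/bash
# Demo: run the awk variant against files in a subdirectory and confirm
# only basenames appear in the output (hypothetical sample data).
cd "$(mktemp -d)"
mkdir data
printf '   1 |  -5.78 |  11 |  -5.78 |   1 |#\n   3 |  -5.47 |  17 |  -5.44 |   2 |##\n' > data/file1.dlg

awk -F"|" '/#$/ && $NF > max[FILENAME] {max[FILENAME]=$NF; row[FILENAME]=$0}
           END {for (i in row) {name=i; sub(".*/","",name); print name ":" row[i]}}' data/*.dlg
```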
