Splitting a large, complex one-column file into several columns with awk


Problem description

I have a text file produced by some commercial software that looks like the sample below. It consists of bracket-delimited sections, each containing several million elements, but the exact number varies from one case to another.

(1
 2
 3
...
)
(11
22
33
...
)
(111
222
333
...
)

I need to produce the following output:

 1;  11;   111
 2;  22;   222
 3;  33;   333
...  ...  ...

I found a complicated way to do it:

  • run sed operations to get

1
2
3
...
#
11
22
33
...
#
111
222
333
...

  • use awk as follows to split the file into several sub-files

    awk -v RS="#" '{print > ("splitted-" NR ".txt")}'
    

  • remove the blank lines from the sub-files, again with sed

    sed -i '/^[[:space:]]*$/d' splitted*.txt
    

  • join everything together:

    paste splitted*.txt > out.txt
    

  • add a field separator (defined in my bash script)

    awk -v sep="$my_sep" 'BEGIN{OFS=sep}{$1=$1; print }' out.txt > formatted.txt
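
As a side note on the last two steps, here is a minimal sketch of how they might be merged: `paste -d` sets the output delimiter directly, so the extra awk pass may not be needed. This assumes `my_sep` holds a single character, since `paste -d` takes a list of one-character delimiters. `join_sections` is a hypothetical helper name, not part of the original script.

```shell
#!/bin/sh
# Assumption: the separator is a single character.
my_sep=';'

# Hypothetical helper: merge the per-section files line by line,
# using $my_sep between columns instead of the default tab.
# "$@" are the section files, in the desired column order.
join_sections() {
    paste -d "$my_sep" "$@"
}
```

It would be called as `join_sections splitted-1.txt splitted-2.txt splitted-3.txt > formatted.txt`. Listing the files explicitly also avoids a pitfall of `splitted*.txt`: the glob expands in lexical order, so with ten or more sections `splitted-10.txt` sorts before `splitted-2.txt`.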
    

I feel this is crude, as I loop over millions of lines several times. Even though the run time is acceptable (~80 s), I'd like to find a pure awk solution but can't get there. Something like:

    awk 'BEGIN{RS="(\\n)"; OFS=";"} { print something } '
    

I found some related questions, especially "row to column conversion with awk", but that one assumes a constant number of lines between brackets, which I can't rely on.

Any help would be appreciated.

Recommended answer

With GNU awk, for its multi-character RS and true multidimensional arrays:

    $ cat tst.awk
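    BEGIN {
        # A record separator is any run of brackets plus surrounding whitespace
        RS  = "(\\s*[()]\\s*)+"
        OFS = ";"
    }
    # Record 1 is the empty text before the first "("; skip it
    NR>1 {
        cell[NR][1]           # force cell[NR] to be a subarray
        split($0,cell[NR])    # one element per input value
    }
    END {
        # NF is the field count of the last section, so this assumes
        # every section holds the same number of values
        for (rowNr=1; rowNr<=NF; rowNr++) {
            for (colNr=2; colNr<=NR; colNr++) {
                printf "%6s%s", cell[colNr][rowNr], (colNr<NR ? OFS : ORS)
            }
        }
    }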
    
    $ awk -f tst.awk file
         1;    11;   111
         2;    22;   222
         3;    33;   333
       ...;   ...;   ...
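
Where GNU awk is not available, the same transposition can probably be done in POSIX awk. The sketch below is not the answerer's method: it reads the file line by line and emulates the two-dimensional array with a `cell[col, row]` subscript. It assumes one value per line and that a `(` marks the start of a section, as in the sample. `sections_to_columns` is a hypothetical helper name, and the separator is passed as its first argument.

```shell
#!/bin/sh
# Hypothetical helper: turn "(...)"-delimited sections into columns.
# Usage: sections_to_columns SEP < input
sections_to_columns() {
    awk -v sep="$1" '
    {
        line = $0
        if (index(line, "(")) ncols++        # "(" opens a new section
        gsub(/[() \t\r]/, "", line)          # drop brackets and whitespace
        if (line == "") next                 # nothing left, e.g. a lone ")"
        cell[ncols, ++nrows[ncols]] = line   # store by (column, row)
        if (nrows[ncols] > maxrows) maxrows = nrows[ncols]
    }
    END {
        # Print row by row; cells missing from shorter sections stay empty
        for (r = 1; r <= maxrows; r++) {
            out = ""
            for (c = 1; c <= ncols; c++)
                out = out (c > 1 ? sep : "") cell[c, r]
            print out
        }
    }'
}
```

For example, `printf '(1\n 2\n)\n(11\n22\n)\n' | sections_to_columns ';'` prints `1;11` and `2;22`. Unlike the `printf "%6s"` version above, this sketch does not pad the columns. Like the gawk version, it holds the whole file in memory before printing.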
    
