Which AWK program can do this manipulation?


Problem description


Given a file containing a structure arranged like the following (with fields separated by SP or HT)

4 5 6 2 9 8 4 8
m d 6 7 9 5 4 g
t 7 4 2 4 2 5 3
h 5 6 2 5 s 3 4
r 5 7 1 2 2 4 1
4 1 9 0 5 6 d f
x c a 2 3 4 5 9
0 0 3 2 1 4 q w


Which AWK program do I need to get the following output?

  4 5
  m d
  t 7
  h 5
  r 5
  4 1
  x c
  0 0
  6 2
  6 7
  4 2
  6 2
  7 1
  9 0
  a 2
  3 2
  9 8
  9 5
  4 2
  5 s
  2 2
  5 6
  3 4
  1 4
  4 8
  4 g
  5 3
  3 4
  4 1
  d f
  5 9
  q w


Thanks in advance for any and all help.

Postscript

Please bear in mind:


  1. My input file is much larger than the one depicted in this question.

  2. My computer science skills are seriously limited.

  3. This task has been imposed on me.

Recommended answer

awk -v n=4 '
    function join(start, end,    result, i) {
        for (i=start; i<=end; i++)
            result = result $i (i==end ? ORS : FS)
        return result
    }
    {
        c=0
        for (i=1; i<NF; i+=n) {
            c++
            col[c] = col[c] join(i, i+n-1)
        }
    }
    END {
        for (i=1; i<=c; i++)
            printf "%s", col[i]  # the value already ends with newline
    }
' file
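To see what the program does, here is a minimal run on the question's 8x8 sample. The file names are made up for the demo, and n=2 is used (rather than the n=4 shown above) because the expected output in the question groups the input into 2-column slices:

```shell
# Build the sample input from the question
cat > sample.txt <<'EOF'
4 5 6 2 9 8 4 8
m d 6 7 9 5 4 g
t 7 4 2 4 2 5 3
h 5 6 2 5 s 3 4
r 5 7 1 2 2 4 1
4 1 9 0 5 6 d f
x c a 2 3 4 5 9
0 0 3 2 1 4 q w
EOF

# Same program as the answer; each row is cut into slices of n columns,
# slice k of every row is appended to col[k], and END prints the slices
# in order: columns 1..n of all rows first, then n+1..2n, and so on.
awk -v n=2 '
    function join(start, end,    result, i) {
        for (i=start; i<=end; i++)
            result = result $i (i==end ? ORS : FS)
        return result
    }
    {
        c=0
        for (i=1; i<NF; i+=n) {
            c++
            col[c] = col[c] join(i, i+n-1)
        }
    }
    END {
        for (i=1; i<=c; i++)
            printf "%s", col[i]
    }
' sample.txt > sliced.txt

cat sliced.txt
```

With 8 columns and n=2 this yields 4 slices of 8 lines each (32 lines total), matching the output listed in the question.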

The awk info page has a short primer on awk, so read that too.


  1. create an input file with 1,000,000 columns and 8 rows (as specified by OP)

#!/usr/bin/perl
use strict;
use warnings;

my $cols = 2**20; # 1,048,576 -- close to the 1,000,000 columns called for
my $rows = 8;
my @alphabet = ( 'a'..'z', 0..9 );
my $size = scalar @alphabet;

for (my $r = 1; $r <= $rows; $r++) {
    for (my $c = 1; $c <= $cols; $c++) {
        my $idx = int rand $size;
        printf "%s ", $alphabet[$idx];
    }
    printf "\n";
}

$ perl createfile.pl > input.file
$ wc input.file
       8  8388608 16777224 input.file
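The wc numbers line up with the generator. A quick sketch of the arithmetic: 8 rows of 2^20 = 1,048,576 single-character fields, each written as a character plus a trailing space (2 bytes), plus one newline per row:

```shell
# word count: 8 rows x 1,048,576 fields per row
echo $(( 8 * 1048576 ))             # -> 8388608

# byte count: 2 bytes per field, plus 1 newline per row
echo $(( 8 * (1048576 * 2 + 1) ))   # -> 16777224
```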


  • time various implementations: I use the fish shell, so the timing output is different from bash's


    • my awk

    $ time awk -f columnize.awk -v n=4 input.file > output.file
    
    ________________________________________________________
    Executed in    3.62 secs   fish           external
       usr time    3.49 secs    0.24 millis    3.49 secs
       sys time    0.11 secs    1.96 millis    0.11 secs
    
    $ wc output.file
     2097152  8388608 16777216 output.file
    


  • Timur's perl:

    $ time perl -lan columnize.pl input.file > output.file
    
    ________________________________________________________
    Executed in    3.25 secs   fish           external
       usr time    2.97 secs    0.16 millis    2.97 secs
       sys time    0.27 secs    2.87 millis    0.27 secs
    


  • Ravinder's awk

    $ time awk -f columnize.ravinder input.file > output.file
    
    ________________________________________________________
    Executed in    4.01 secs   fish           external
       usr time    3.84 secs    0.18 millis    3.84 secs
       sys time    0.15 secs    3.75 millis    0.14 secs
    


  • kvantour's awk, first version

    $ time awk -f columnize.kvantour -v n=4 input.file > output.file
    
    ________________________________________________________
    Executed in    3.84 secs   fish           external
       usr time    3.71 secs  166.00 micros    3.71 secs
       sys time    0.11 secs  1326.00 micros    0.11 secs
    


  • kvantour's second awk version: interrupted with Ctrl-C after a few minutes

    $ time awk -f columnize.kvantour2 -v n=4 input.file > output.file
    ^C
    ________________________________________________________
    Executed in  260.80 secs   fish           external
       usr time  257.39 secs    0.13 millis  257.39 secs
       sys time    1.68 secs    2.72 millis    1.67 secs
    
    $ wc output.file
     9728 38912 77824 output.file
    


    The $0=a[j] line is pretty expensive, as it has to parse the string into fields each time.
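A minimal illustration of why (not from the answer itself): assigning to $0 forces awk to re-split the whole record into fields every time, whereas split() parses a string once into an ordinary array with no record rebuild:

```shell
# assigning to $0 re-runs field splitting: NF and $1..$NF are rebuilt
printf 'x\n' | awk '{ $0 = "a b c d"; print NF, $3 }'            # -> 4 c

# split() parses once into an array, leaving $0 untouched
printf 'x\n' | awk '{ n = split("a b c d", f); print n, f[3] }'  # -> 4 c
```

Doing the $0 assignment once per output row, over millions of rows, is where the time goes.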

  • Doug's Python: killed by timeout after 60 seconds

    $ timeout 60s fish -c 'time python3 columnize.py input.file 4 > output.file'
    [... 60 seconds later ...]
    $ wc output.file
     2049  8196 16392 output.file
    

  • another interesting data point: using different awk implementations. I'm on a Mac with GNU awk and mawk installed via homebrew


    • with many columns, few rows

    $ time gawk -f columnize.awk -v n=4 input.file > output.file
    
    ________________________________________________________
    Executed in    3.78 secs   fish           external
       usr time    3.62 secs  174.00 micros    3.62 secs
       sys time    0.13 secs  1259.00 micros    0.13 secs
    

    $ time /usr/bin/awk -f columnize.awk -v n=4 input.file > output.file
    
    ________________________________________________________
    Executed in   17.73 secs   fish           external
       usr time   14.95 secs    0.20 millis   14.95 secs
       sys time    2.72 secs    3.45 millis    2.71 secs
    

    $ time mawk -f columnize.awk -v n=4 input.file > output.file
    
    ________________________________________________________
    Executed in    2.01 secs   fish           external
       usr time  1892.31 millis    0.11 millis  1892.21 millis
       sys time   95.14 millis    2.17 millis   92.97 millis
    


  • with many rows, few columns, this test took over half an hour on a MacBook Pro, 6 core Intel cpu, 16GB ram

    $ time mawk -f columnize.awk -v n=4 input.file > output.file
    
    ________________________________________________________
    Executed in   32.30 mins   fish           external
       usr time   23.58 mins    0.15 millis   23.58 mins
       sys time    8.63 mins    2.52 millis    8.63 mins
    
