awk super slow processing many rows but not many columns


Question

While looking into this question, the challenge was to take this matrix:

4 5 6 2 9 8 4 8
m d 6 7 9 5 4 g
t 7 4 2 4 2 5 3
h 5 6 2 5 s 3 4
r 5 7 1 2 2 4 1
4 1 9 0 5 6 d f
x c a 2 3 4 5 9
0 0 3 2 1 4 q w

and turn it into:

4 5
m d
t 7
h 5
r 5
4 1
x c
0 0
6 2       # top of next 2 columns...
6 7
4 2
... each N elements from each row of the matrix -- in this example, N=2...
3 4
4 1
d f
5 9
q w      # last element is lower right of matrix

The OP stated the input was 'much bigger' than the example without specifying the shape of the actual input (millions of rows? millions of columns? or both?)

I assumed (mistakenly) that the file had millions of rows (it was later specified to have millions of columns)

BUT the interesting thing is that most of the awks written ran at perfectly acceptable speed IF the shape of the data was millions of columns.

Example: @glennjackman posted a perfectly usable awk so long as the long end was in columns, not in rows.

Here, you can use his Perl to generate an example matrix of rows X columns. Here is that Perl:

perl -E '
my $cols = 2**20;    # 1,048,576 columns - the long end
my $rows = 2**3;     # 8 rows
my @alphabet = ("a".."z", 0..9);   # double quotes here keep the surrounding shell single quotes intact
my $size = scalar @alphabet;

for ($r=1; $r <= $rows; $r++) {
    for ($c = 1; $c <= $cols; $c++) {
        my $idx = int rand $size;
        printf "%s ", $alphabet[$idx];
    }
    printf "\n";
}' >file
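
To sanity-check the shape of the generated file, something like this one-line awk pass should do it (it just reports the field count of the first row and the total row count):

awk 'NR==1{print "columns:", NF} END{print "rows:", NR}' file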

Here are some candidate scripts that turn file (from that Perl script) into the output of 2 columns taken from the front of each row:

This Python script is the speed champ regardless of the shape of the input:

$ cat col.py
import sys

cols = int(sys.argv[2])    # how many columns per output group
offset = 0
delim = "\t"

# slurp the whole file into memory as a list of token lists, one per row
with open(sys.argv[1], "r") as f:
    dat = [line.split() for line in f]

# walk the column groups left to right, printing every row's slice before moving on
while offset <= len(dat[0]) - cols:
    for sl in dat:
        print(delim.join(sl[offset:offset + cols]))
    offset += cols
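
Usage is the input file and the group width as positional arguments, the same invocation the timing runs below use:

python3 col.py file 2 >file3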

Here is a Perl that is also quick enough regardless of the shape of the data:

$ cat col.pl
# run via perl -lan: -n loops over the input lines, -a autosplits each into @F
push @rows, [@F];
END {
    my $delim = "\t";
    my $cols_per_group = 2;
    my $col_start = 0;
    while ( 1 ) {
        for my $row ( @rows ) {
            print join $delim, @{$row}[ $col_start .. ($col_start + $cols_per_group - 1) ];
        }
        $col_start += $cols_per_group;
        # in END, $#F is still the last field index of the final input line
        last if ($col_start + $cols_per_group - 1) > $#F;
    }
}

Here is an alternate awk that is slower but runs at a consistent speed (the number of lines in the file needs to be pre-calculated and passed in as nl):

$ cat col3.awk
function join(start, end,    result, i) {
    for (i=start; i<=end; i++)
        result = result $i (i==end ? ORS : FS)
    return result
}

{   col_offset=0
    for(i=1;i<=NF;i+=cols) {
        col[NR+col_offset*nl]=join(i,i+cols-1)
        col_offset++
        ++cnt
    }
}
END {
    for(i=1;i<=cnt;i++) printf "%s", col[i]
}
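
The nl value has to be supplied on the command line. The timing harness below computes it with a separate awk pass; something like wc -l should work equally well:

nl=$(wc -l < file)
awk -f col3.awk -v nl="$nl" -v cols=2 file >file4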

And Glenn Jackman's awk (not to pick on him since ALL the awks had the same bad result with many rows):

function join(start, end,    result, i) {
    for (i=start; i<=end; i++)
        result = result $i (i==end ? ORS : FS)
    return result
}
{
    c=0
    for (i=1; i<NF; i+=n) {
        c++
        col[c] = col[c] join(i, i+n-1)
    }
}
END {
    for (i=1; i<=c; i++)
        printf "%s", col[i]  # the value already ends with newline
}

Here are the timings with many columns (i.e., in the Perl script that generates file above, my $cols = 2**20 and my $rows = 2**3):

echo 'glenn jackman awk'
time awk -f col1.awk -v n=2 file >file1

echo 'glenn jackman gawk'
time gawk -f col1.awk -v n=2 file >file5 

echo 'perl'
time perl -lan columnize.pl file >file2

echo 'dawg Python'
time python3 col.py file 2 >file3

echo 'dawg awk'
time awk -f col3.awk -v nl=$(awk '{cnt++} END{print cnt}' file) -v cols=2 file >file4

That prints:

# 2**20 COLUMNS; 2**3 ROWS

glenn jackman awk
real    0m4.460s
user    0m4.344s
sys 0m0.113s

glenn jackman gawk    
real    0m4.493s
user    0m4.379s
sys 0m0.109s

perl    
real    0m3.005s
user    0m2.774s
sys 0m0.230s

dawg Python    
real    0m2.871s
user    0m2.721s
sys 0m0.148s

dawg awk    
real    0m11.356s
user    0m11.038s
sys 0m0.312s

But transpose the shape of the data by setting my $cols = 2**3 and my $rows = 2**20 and run the same timings:

# 2**3 COLUMNS; 2**20 ROWS

glenn jackman awk
real    23m15.798s
user    16m39.675s
sys 6m35.972s

glenn jackman gawk
real    21m49.645s
user    16m4.449s
sys 5m45.036s

perl    
real    0m3.605s
user    0m3.348s
sys 0m0.228s

dawg Python    
real    0m3.157s
user    0m3.065s
sys 0m0.080s

dawg awk    
real    0m11.117s
user    0m10.710s
sys 0m0.399s


So the question:

What would cause the first awk to be 100x slower if the data are transposed to millions of rows vs millions of columns?

The same number of elements is processed and the total amount of data is the same; the join function is called the same number of times.

Answer

String concatenation being saved in a variable is one of the slowest operations in awk (IIRC it's slower than I/O), as you're constantly having to find a new memory location to hold the result of the concatenation, and the strings being appended to grow as more rows are read, so it's probably all of the string concatenation in the posted solutions that's causing the slowdown.
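
As a minimal sketch of the difference (an illustration, not one of the posted solutions): the first pattern keeps appending to a string that grows with every input row, while the second drops each piece into its own array slot and defers all printing to END:

# grows one string for the whole run -- every append rewrites an ever-longer value
{ out = out $1 FS $2 ORS }
END { printf "%s", out }

# stays flat -- each row lands in its own array element
{ vals[NR] = $1 FS $2 }
END { for (i=1; i<=NR; i++) print vals[i] }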

Something like this should be fast and shouldn't be dependent on how many fields there are vs how many records:

$ cat tst.awk
{
    for (i=1; i<=NF; i++) {
        vals[++numVals] = $i
    }
}
END {
    for (i=1; i<=numVals; i+=2) {
        valNr = i + ((i-1) * NF)        # <- not correct, fix it!
        print vals[valNr], vals[valNr+1]
    }
}

I don't have time right now to figure out the correct math to calculate the index for the single-loop approach above (see the comment in the code), so here's a working version with 2 loops that doesn't require as much thought and shouldn't run much, if any, slower:

$ cat tst.awk
{
    for (i=1; i<=NF; i++) {
        vals[++numVals] = $i
    }
}
END {
    inc = NF - 1
    for (i=0; i<NF; i+=2) {
        for (j=1; j<=NR; j++) {
            valNr = i + j + ((j-1) * inc)
            print vals[valNr], vals[valNr+1]
        }
    }
}

$ awk -f tst.awk file
4 5
m d
t 7
h 5
r 5
4 1
x c
0 0
6 2
6 7
4 2
6 2
7 1
9 0
a 2
3 2
9 8
9 5
4 2
5 s
2 2
5 6
3 4
1 4
4 8
4 g
5 3
3 4
4 1
d f
5 9
q w
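
For what it's worth, here is one candidate for the index math left as an exercise in the first version above (a sketch only, checked just against the 8x8 sample): treat each printed pair as pair number p, recover its source row and 2-column group from p, and index into vals from there:

END {
    for (i=1; i<=numVals; i+=2) {
        p = (i-1)/2                       # 0-based counter of output pairs
        row = (p % NR) + 1                # input row this pair comes from
        grp = int(p / NR)                 # which 2-column group (0-based)
        valNr = (row-1)*NF + grp*2 + 1    # position in vals[] of the pair's first value
        print vals[valNr], vals[valNr+1]
    }
}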
