Fastest way to read every 30th byte of large binary file?


Question

What is the fastest way to read every 30th byte of a large binary file (2-3 GB)? I've read there are performance problems with fseek because of I/O buffers, but I don't want to read 2-3 GB of data into memory before grabbing every 30th byte either.

Recommended answer

Performance test. If you want to use it yourself, note that the integrity check (printing total) only works if "step" divides BUFSZ, and MEGS is small enough that you don't read off the end of the file. This is due to (a) laziness, (b) desire not to obscure the real code. rand1.data is a few GB copied from /dev/urandom using dd.

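#include <stdio.h>
#include <stdlib.h>

/* Build with -DMEGS=<n> plus either -DSEEK or -DBUFSZ=<bytes>. */
const long long size = 1024LL*1024*MEGS;
const int step = 32;

int main() {
    FILE *in = fopen("/cygdrive/c/rand1.data", "rb");
    if (in == NULL) {
        perror("fopen");
        return EXIT_FAILURE;
    }
    /* unsigned, so the checksum wraps instead of overflowing a signed int */
    unsigned total = 0;
    #ifdef SEEK
        /* read one byte, then seek past the next step-1 bytes */
        long long i = 0;
        char buf[1];
        while (i < size) {
            fread(buf, 1, 1, in);
            total += (unsigned char) buf[0];
            fseek(in, step - 1, SEEK_CUR);
            i += step;
        }
    #endif
    #ifdef BUFSZ
        /* read BUFSZ bytes at a time and pick every step-th byte from the buffer */
        long long i = 0;
        char buf[BUFSZ];
        while (i < size) {
            fread(buf, BUFSZ, 1, in);
            i += BUFSZ;
            for (int j = 0; j < BUFSZ; j += step) 
                total += (unsigned char) buf[j];
        }
    #endif
    /* printed as int (wraps under gcc) so the totals match the runs below */
    printf("%d\n", (int) total);
    fclose(in);
}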
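
The caveat that "step" must divide BUFSZ can be lifted by carrying the leftover offset from one chunk into the next. This is only a sketch and not part of the benchmark above: `sum_every_nth` and its parameters are hypothetical names, and it was not timed here. It also uses the count returned by fread, so a short final chunk is handled.

```c
#include <stdio.h>

/* Hypothetical helper (not from the benchmark): sums every step-th byte of
   `in`, starting at offset 0. `step` does not have to divide `bufsz`. */
unsigned long long sum_every_nth(FILE *in, size_t step,
                                 unsigned char *buf, size_t bufsz)
{
    unsigned long long total = 0;
    size_t next = 0;   /* index, within the next chunk, of the next wanted byte */
    size_t got;

    if (step == 0 || bufsz == 0)
        return 0;

    /* fread returns the number of bytes actually read; 0 means EOF or error */
    while ((got = fread(buf, 1, bufsz, in)) > 0) {
        size_t j = next;
        for (; j < got; j += step)
            total += buf[j];
        next = j - got;   /* carry the remaining distance into the next chunk */
    }
    return total;
}
```

With step set to 30 this answers the question as asked, for any buffer size, at the cost of one extra subtraction per chunk.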

Results:

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32*1024 -DMEGS=20 && time ./buff2
83595817

real    0m1.391s
user    0m0.030s
sys     0m0.030s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32 -DMEGS=20 && time ./buff2
83595817

real    0m0.172s
user    0m0.108s
sys     0m0.046s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32*1024 -DMEGS=20 && time ./buff2
83595817

real    0m0.031s
user    0m0.030s
sys     0m0.015s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32 -DMEGS=20 && time ./buff2
83595817

real    0m0.141s
user    0m0.140s
sys     0m0.015s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DSEEK -DMEGS=20 && time ./buff2
83595817

real    0m20.797s
user    0m1.733s
sys     0m9.140s

Summary:

I'm using 20MB of data initially, which of course fits in cache. The first time I read it (using a 32KB buffer) takes 1.4s, bringing it into cache. The second time (using a 32 byte buffer) takes 0.17s. The third time (back with the 32KB buffer again) takes 0.03s, which is too close to the granularity of my timer to be meaningful. fseek takes over 20s, even though the data is already in disk cache.

At this point I'm pulling fseek out of the ring so the other two can continue:

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32*1024 -DMEGS=1000 && time ./buff2
-117681741

real    0m33.437s
user    0m0.749s
sys     0m1.562s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32 -DMEGS=1000 && time ./buff2
-117681741

real    0m6.078s
user    0m5.030s
sys     0m0.484s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32*1024 -DMEGS=1000 && time ./buff2
-117681741

real    0m1.141s
user    0m0.280s
sys     0m0.500s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32 -DMEGS=1000 && time ./buff2
-117681741

real    0m6.094s
user    0m4.968s
sys     0m0.640s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32*1024 -DMEGS=1000 && time ./buff2
-117681741

real    0m1.140s
user    0m0.171s
sys     0m0.640s

1000MB of data also appears to be substantially cached. A 32KB buffer is roughly 5 times faster than a 32 byte buffer (about 1.1s vs. 6.1s), but the difference is all user time, not time spent blocked on disk I/O. Now, 8000MB is much more than I have RAM, so I can avoid caching:

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32*1024 -DMEGS=8000 && time ./buff2
-938074821

real    3m25.515s
user    0m5.155s
sys     0m12.640s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32 -DMEGS=8000 && time ./buff2
-938074821

real    3m59.015s
user    1m11.061s
sys     0m10.999s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32*1024 -DMEGS=8000 && time ./buff2
-938074821

real    3m42.423s
user    0m5.577s
sys     0m14.484s

Ignore the first of those three; it benefited from the first 1000MB of the file already being in RAM.

Now, the 32KB version is only slightly faster in wall clock time (and I can't be bothered to re-run, so let's ignore it for now), but look at the difference in user+sys time: 20s vs. 82s. I think that my OS's speculative read-ahead disk caching has saved the 32-byte buffer's bacon here: while the 32 byte buffer is being slowly refilled, the OS is loading the next few disk sectors even though nobody has asked for them. Without that I suspect it would have been a minute (20%) slower than the 32KB buffer, which spends less time in user-land before requesting the next read.

Moral of the story: standard I/O buffering doesn't cut it in my implementation, and the performance of fseek is atrocious, as the questioner says. When the file is cached in the OS, buffer size is a big deal. When the file is not cached in the OS, buffer size doesn't make a whole lot of difference to wall clock time, but my CPU is busier with the small buffer.
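
One plausible reason the small buffer costs so much user time is simply the number of fread calls it makes. The helper below only does that arithmetic. `fread_calls` is a hypothetical name, and attributing the extra user time to per-call overhead is my assumption, not something the timings prove.

```c
/* Hypothetical helper, not part of the benchmark: number of fread calls the
   BUFSZ branch makes for `megs` megabytes with a `bufsz`-byte buffer. */
long long fread_calls(long long megs, long long bufsz)
{
    long long size = 1024LL * 1024 * megs;   /* same formula as `size` in the test program */
    return size / bufsz;                     /* assumes bufsz divides size evenly */
}
```

For MEGS=1000 that is 32,768,000 calls with a 32 byte buffer against 32,000 calls with a 32KB buffer, a factor of 1024.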

incrediman's fundamental suggestion to use a read buffer is vital, since fseek is appalling. Arguing over whether the buffer should be a few KB or a few hundred KB is most likely pointless on my machine, probably because the OS has done a job of ensuring that the operation is tightly I/O bound. But I'm pretty sure this is down to OS disk read-ahead, not standard I/O buffering, because if it was the latter then fseek would be better than it is. Actually, it could be that the standard I/O is doing the read ahead, but a too-simple implementation of fseek is discarding the buffer every time. I haven't looked into the implementation (and I couldn't follow it across the boundary into the OS and filesystem drivers if I did).
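
One way to probe the "fseek discards the buffer" guess would be to skip the unwanted bytes with getc instead of fseek, so that every byte is served from the stdio buffer. The sketch below is untested for speed and was not part of the measurements above. `sum_every_nth_getc` is a hypothetical name.

```c
#include <stdio.h>

/* Hypothetical variant of the SEEK branch: sums every step-th byte but skips
   the bytes in between with getc, so stdio never has to reposition the file. */
unsigned long long sum_every_nth_getc(FILE *in, size_t step)
{
    unsigned long long total = 0;
    int c;

    if (step == 0)
        return 0;

    while ((c = getc(in)) != EOF) {
        total += (unsigned char) c;
        /* discard the next step-1 bytes; stop early at end of file */
        for (size_t k = 1; k < step; k++)
            if (getc(in) == EOF)
                return total;
    }
    return total;
}
```

If this ran much faster than the fseek loop on the same cached file, that would point at fseek throwing the stdio buffer away rather than at the OS.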
