getline while reading a file vs reading whole file and then splitting based on newline character


Question

I want to process each line of a file on a hard disk. Is it better to load the file as a whole and then split it on the newline character (using boost), or is it better to use getline()? My question is: does getline() read a single line when called (resulting in multiple hard-disk accesses), or does it read the whole file and hand it out line by line?

Answer

getline will call read() as a system call somewhere deep in the guts of the C library. Exactly how many times it is called, and how it is called, depends on the C library design. But most likely there is no distinct difference between reading a line at a time and reading the whole file, because the OS at the bottom layer will read (at least) one disk block at a time, and most likely at least a "page" (4KB), if not more.
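
To see that buffering at work, here is a small sketch (my illustration, not part of the original answer; the file name is just a placeholder) that reads one line and then asks the stream's buffer how many characters it already holds:

#include <fstream>
#include <iostream>
#include <string>

using namespace std;

int main()
{
    ifstream f("somefile.txt");   // placeholder file name
    string line;

    if (getline(f, line))
    {
        // in_avail() reports how many characters are already sitting in the
        // stream's get area after reading just one line - usually far more
        // than one line's worth, because the underlying read() pulled in a
        // whole buffer at once.
        cout << "first line length: " << line.size()
             << ", characters already buffered: "
             << f.rdbuf()->in_avail() << endl;
    }
}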

Further, unless you do nearly nothing with your string after you have read it (e.g. you are writing something like "grep", so mostly just reading the input to find a string), it is unlikely that the overhead of reading a line at a time is a large part of the time you spend.

But "load the whole file in one go" has several distinct problems:


  1. Processing cannot start until you have read the whole file.

  2. You need enough memory to hold the entire file in memory - what if the file is a few hundred GB in size? Does your program then fail? (A sketch of a size check for this case follows below.)
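
To illustrate the second point, one defensive option (my sketch, not from the original answer; the file name and the size limit are arbitrary, and it assumes C++17 for std::filesystem) is to check the file size first and only slurp the file when it is reasonably small, falling back to getline otherwise:

#include <cstdint>
#include <filesystem>
#include <fstream>
#include <iostream>
#include <string>
#include <system_error>

using namespace std;

int main()
{
    const char *fullname = "bigfile_example";      // placeholder file name
    const uintmax_t limit = 256 * 1024 * 1024;     // arbitrary 256MB cut-off

    error_code ec;
    uintmax_t size = filesystem::file_size(fullname, ec);

    if (ec || size > limit)
    {
        // Too big (or size unknown): process one line at a time instead of
        // trying to hold the whole file in memory.
        ifstream f(fullname);
        string line;
        int lines = 0;
        while (getline(f, line))
            lines++;
        cout << "Lines=" << lines << endl;
    }
    else
    {
        // Small enough: read the whole file into one string in a single go,
        // then split it on '\n'.
        ifstream f(fullname, ios::binary);
        string contents(size, '\0');
        f.read(&contents[0], static_cast<streamsize>(size));
        cout << "Bytes=" << contents.size() << endl;
    }
}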

Don't try to optimise something unless you have used profiling to prove that it's part of why your code is running slowly. You are just causing more problems for yourself.

Edit: So, I wrote a program to measure this, since I think it's quite interesting.

And the results are definitely interesting. To make the comparison fair, I created three large files of 1297984192 bytes each (by copying all the source files in a directory with about a dozen different source files into one file, then copying this file several times over to "multiply" it, until it took over 1.5 seconds to run the test, which is how long I think you need to run things to make sure the timing isn't too susceptible to a random "network packet came in" or some other outside influence taking time away from the process).
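
For reference, a test file like that can be built by appending a small "seed" file to itself until a target size is reached; a rough sketch (the file names and the target size here are made up, not the ones used in the answer):

#include <fstream>
#include <iterator>
#include <string>

using namespace std;

int main()
{
    // Read the "seed" data (e.g. a handful of source files concatenated
    // together) into memory once.
    ifstream seed("seed.txt", ios::binary);        // placeholder seed file
    string data((istreambuf_iterator<char>(seed)), istreambuf_iterator<char>());
    if (data.empty())
        return 1;

    // Append it repeatedly until the output file is big enough that the
    // benchmark runs well over a second.
    ofstream out("bigfile_test", ios::binary);     // placeholder output name
    const size_t target = size_t(1) << 30;         // roughly 1GB, arbitrary
    for (size_t written = 0; written < target; written += data.size())
        out.write(data.data(), data.size());
}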

I also decided to measure the user and system time used by the process.
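
The timing harness itself is not shown in the answer; one plausible way to get wall-clock, user and system time on a POSIX system is getrusage() combined with a steady clock. This is a sketch with an assumed time_it() helper, not necessarily what the original benchmark used:

#include <sys/resource.h>
#include <sys/time.h>
#include <chrono>
#include <cstdio>

// Run fn() once and print wall-clock, user and system time in roughly the
// same format as the output below. Sketch only; the original harness is not
// part of the posted code.
template <typename Func>
void time_it(const char *label, Func fn)
{
    rusage before{}, after{};
    getrusage(RUSAGE_SELF, &before);
    auto start = std::chrono::steady_clock::now();

    fn();

    auto stop = std::chrono::steady_clock::now();
    getrusage(RUSAGE_SELF, &after);

    // Convert a pair of timevals into elapsed seconds as a double.
    auto secs = [](timeval a, timeval b) {
        return (b.tv_sec - a.tv_sec) + (b.tv_usec - a.tv_usec) / 1e6;
    };

    std::chrono::duration<double> wall = stop - start;
    printf("Wallclock time for %s is %.2f (user:%.3g system: %.3g)\n",
           label, wall.count(),
           secs(before.ru_utime, after.ru_utime),
           secs(before.ru_stime, after.ru_stime));
}

It could then be invoked as, for example, time_it("getline", [] { func_getline("getline"); });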

$ ./bigfile
Lines=24812608
Wallclock time for mmap is 1.98 (user:1.83 system: 0.14)
Lines=24812608
Wallclock time for getline is 2.07 (user:1.68 system: 0.389)
Lines=24812608
Wallclock time for readwhole is 2.52 (user:1.79 system: 0.723)
$ ./bigfile
Lines=24812608
Wallclock time for mmap is 1.96 (user:1.83 system: 0.12)
Lines=24812608
Wallclock time for getline is 2.07 (user:1.67 system: 0.392)
Lines=24812608
Wallclock time for readwhole is 2.48 (user:1.76 system: 0.707)

Here are the three different functions that read the file (there is some code to measure time and such as well, of course, but to reduce the size of this post I chose not to include all of it - and I played around with the ordering to see whether that made any difference, so the results above are not in the same order as the functions here)

// Read the whole file into a heap buffer in one go, then wrap the buffer in a
// stringstream and count the lines with getline.
void func_readwhole(const char *name)
{
    string fullname = string("bigfile_") + name;
    ifstream f(fullname.c_str());

    if (!f) 
    {
        cerr << "could not open file for " << fullname << endl;
        exit(1);
    }

    f.seekg(0, ios::end);
    streampos size = f.tellg();

    f.seekg(0, ios::beg);

    char* buffer = new char[size];
    f.read(buffer, size);
    if (f.gcount() != size)
    {
        cerr << "Read failed ...\n";
        exit(1);
    }

    stringstream ss;
    ss.rdbuf()->pubsetbuf(buffer, size);

    int lines = 0;
    string str;
    while(getline(ss, str))
    {
        lines++;
    }

    f.close();


    cout << "Lines=" << lines << endl;

    delete [] buffer;
}

// Read the file a line at a time straight from the ifstream with getline.
void func_getline(const char *name)
{
    string fullname = string("bigfile_") + name;
    ifstream f(fullname.c_str());

    if (!f) 
    {
        cerr << "could not open file for " << fullname << endl;
        exit(1);
    }

    string str;
    int lines = 0;

    while(getline(f, str))
    {
        lines++;
    }

    cout << "Lines=" << lines << endl;

    f.close();
}

// Map the whole file into memory with mmap, then wrap the mapping in a
// stringstream and count the lines with getline.
void func_mmap(const char *name)
{
    char *buffer;

    string fullname = string("bigfile_") + name;
    int f = open(fullname.c_str(), O_RDONLY);

    off_t size = lseek(f, 0, SEEK_END);

    lseek(f, 0, SEEK_SET);

    buffer = (char *)mmap(NULL, size, PROT_READ, MAP_PRIVATE, f, 0);


    stringstream ss;
    ss.rdbuf()->pubsetbuf(buffer, size);

    int lines = 0;
    string str;
    while(getline(ss, str))
    {
        lines++;
    }

    munmap(buffer, size);
    cout << "Lines=" << lines << endl;
}
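
The answer does not include the driver. Assuming the three functions above, the time_it() sketch from earlier, and the headers they need (<iostream>, <fstream>, <sstream>, <string>, <fcntl.h>, <unistd.h>, <sys/mman.h>, plus using namespace std), a minimal main() might look like this reconstruction (not the original code):

int main()
{
    // The label passed to time_it() is also the "bigfile_<name>" suffix each
    // function opens. The order here is arbitrary; the answer notes that the
    // ordering was shuffled between runs.
    time_it("mmap",      [] { func_mmap("mmap"); });
    time_it("getline",   [] { func_getline("getline"); });
    time_it("readwhole", [] { func_readwhole("readwhole"); });
    return 0;
}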
