Read a big file by lines in C++


Problem description

I have a big file, nearly 800 MB, and I want to read it line by line.

At first I wrote the program in Python, using linecache.getlines:

lines = linecache.getlines(fname)

It takes about 1.2 seconds.

Now I want to port the program to C++.

I wrote this code:

    #include <fstream>
    #include <string>
    #include <vector>

    std::ifstream DATA(fname);
    std::string line;
    std::vector<std::string> lines;

    while (std::getline(DATA, line)) {
        lines.push_back(line);
    }

But it's slow (it takes minutes). How can I improve it?

  • Joachim Pileborg mentioned mmap(); on Windows, CreateFileMapping() will work.

My code runs under VS2013. When I use "Debug" mode, it takes 162 seconds;

when I use "Release" mode, it takes only 7 seconds!

(Many thanks to @DietmarKühl and @Andrew.)

Answer

First of all, you should probably make sure you are compiling with optimizations enabled. This might not matter for such a simple algorithm, but it really depends on your vector/string library implementations.

As suggested by @angew, std::ios_base::sync_with_stdio(false) makes a big difference for routines like the one you have written.

Another, lesser optimization would be to use lines.reserve() to preallocate the vector so that push_back() doesn't trigger repeated reallocations and copies. However, this is most useful if you happen to know in advance approximately how many lines you are likely to receive.
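The two optimizations above can be combined into one reading routine. This is a minimal sketch (the function name read_lines and the expected_lines parameter are made up for this example); it also moves each line into the vector instead of copying it.

```cpp
#include <fstream>
#include <ios>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: read all lines with both optimizations applied.
std::vector<std::string> read_lines(const std::string& fname,
                                    std::size_t expected_lines = 0) {
    // Stop synchronizing C++ streams with C stdio; on some standard
    // library implementations this alone speeds up getline() noticeably.
    std::ios_base::sync_with_stdio(false);

    std::vector<std::string> lines;
    if (expected_lines != 0)
        lines.reserve(expected_lines);  // avoid repeated reallocations

    std::ifstream in(fname);
    std::string line;
    while (std::getline(in, line))
        lines.push_back(std::move(line));  // move, don't copy, each line
    return lines;
}
```

If you only have a rough estimate of the line count, reserving slightly high is usually harmless; reserving far too low just falls back to normal vector growth.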

Using the optimizations suggested above, I get the following results for reading an 800 MB text stream:

 20 seconds ## if average line length = 10 characters
  3 seconds ## if average line length = 100 characters
  1 second  ## if average line length = 1000 characters

As you can see, the speed is dominated by per-line overhead, and that overhead occurs primarily inside the std::string class.

Any approach based on storing a large number of std::string objects is likely to be suboptimal in terms of memory-allocation overhead. On a 64-bit system, std::string imposes a minimum of 16 bytes of overhead per string. In practice the overhead is likely to be significantly greater than that, and you may find that memory allocation (inside std::string) becomes a significant bottleneck.

For optimal memory use and performance, consider writing your own routine that reads the file in large blocks rather than using getline(). You could then apply something similar to the flyweight pattern to manage the indexing of the individual lines with a custom string class.
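One way to sketch that idea: slurp the whole file into a single buffer and keep only an index of (offset, length) pairs, materializing a line as a std::string only when it is actually needed. The class name LineIndex is made up for this example, and reading the entire 800 MB file into one buffer assumes you have the memory for it; a production version would read and index block by block instead.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>
#include <utility>
#include <vector>

// Hypothetical flyweight-style line index: one contiguous buffer for the
// whole file plus (offset, length) pairs, instead of one std::string per line.
struct LineIndex {
    std::string buffer;
    std::vector<std::pair<std::size_t, std::size_t>> lines;

    explicit LineIndex(const std::string& fname) {
        std::ifstream in(fname, std::ios::binary);
        buffer.assign(std::istreambuf_iterator<char>(in),
                      std::istreambuf_iterator<char>());
        std::size_t start = 0;
        for (std::size_t i = 0; i < buffer.size(); ++i) {
            if (buffer[i] == '\n') {            // record [start, i) as a line
                lines.emplace_back(start, i - start);
                start = i + 1;
            }
        }
        if (start < buffer.size())              // last line without '\n'
            lines.emplace_back(start, buffer.size() - start);
    }

    // Materialize line n on demand; only now is a std::string allocated.
    std::string line(std::size_t n) const {
        return buffer.substr(lines[n].first, lines[n].second);
    }
};
```

The whole file costs one large allocation plus 16 bytes of index per line, rather than a separate heap-allocated std::string per line.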

P.S. Another relevant factor is physical disk I/O, which may or may not be bypassed by caching.

