Read a big file by lines in C++
Problem Description
I have a big file, nearly 800 MB, and I want to read it line by line.
At first I wrote my program in Python, using linecache.getlines:
lines = linecache.getlines(fname)
It takes about 1.2 seconds.

Now I want to port my program to C++.
I wrote this code:

#include <fstream>
#include <string>
#include <vector>

std::ifstream DATA(fname);
std::string line;
std::vector<std::string> lines;
while (std::getline(DATA, line)) {
    lines.push_back(line);
}
But it's slow (it takes minutes). How can I improve it?
- Joachim Pileborg mentioned mmap(), and on Windows CreateFileMapping() will work.
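The mmap() route mentioned above can be sketched as follows. This is a minimal POSIX-only illustration (on Windows the CreateFileMapping()/MapViewOfFile() pair plays the same role); the function name `map_lines` and its out-parameters are illustrative, not from the original post, and it uses C++17's std::string_view, which the asker's VS2013 would not have:

```cpp
#include <fcntl.h>       // open (POSIX)
#include <sys/mman.h>    // mmap, munmap (POSIX)
#include <sys/stat.h>    // fstat
#include <unistd.h>      // close
#include <cstring>       // memchr
#include <string_view>
#include <vector>

// Map the file read-only and build one string_view per line -- no per-line
// copies at all. The caller must munmap(base, len) when done; the views
// dangle after that.
std::vector<std::string_view> map_lines(const char* path,
                                        const char*& base, std::size_t& len) {
    std::vector<std::string_view> lines;
    base = nullptr;
    len = 0;
    int fd = open(path, O_RDONLY);
    if (fd < 0) return lines;
    struct stat st;
    if (fstat(fd, &st) == 0 && st.st_size > 0) {
        len = static_cast<std::size_t>(st.st_size);
        void* m = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
        if (m != MAP_FAILED) {
            base = static_cast<const char*>(m);
            const char* p = base;
            const char* end = base + len;
            while (p < end) {
                const char* nl =
                    static_cast<const char*>(memchr(p, '\n', end - p));
                const char* stop = nl ? nl : end;
                lines.emplace_back(p, static_cast<std::size_t>(stop - p));
                p = stop + 1;  // skip past the newline (or past 'end')
            }
        }
    }
    close(fd);  // the mapping remains valid after the fd is closed
    return lines;
}
```

Because the lines are just views into the mapping, no string data is ever copied; the trade-off is that their lifetime is tied to the mapping.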
My code runs under VS2013. When I use "DEBUG" mode it takes 162 seconds; when I use "RELEASE" mode, only 7 seconds! (Many thanks to @DietmarKühl and @Andrew.)
Recommended Answer
First of all, you should probably make sure you are compiling with optimizations enabled. This might not matter for such a simple algorithm, but it really depends on your vector/string library implementations.
As suggested by @angew, std::ios_base::sync_with_stdio(false) makes a big difference on routines like the one you have written.
Another, lesser, optimization would be to use lines.reserve() to preallocate your vector so that push_back() doesn't result in huge copy operations. However, this is most useful if you happen to know in advance approximately how many lines you are likely to receive.
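The two suggestions above can be combined into a single reading routine. This is a sketch, not the answerer's exact code: `read_lines` is a hypothetical helper, and the expected line count is passed in as a parameter because it is an estimate the caller has to supply (e.g. ~800 MB / ~100 bytes per line):

```cpp
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

// Read all lines of 'fname'; 'expected_lines' is the caller's rough guess
// used to preallocate the vector and avoid repeated reallocation-copies.
std::vector<std::string> read_lines(const std::string& fname,
                                    std::size_t expected_lines) {
    std::ios_base::sync_with_stdio(false);  // detach iostreams from C stdio

    std::vector<std::string> lines;
    lines.reserve(expected_lines);          // preallocate up front

    std::ifstream in(fname);
    std::string line;
    while (std::getline(in, line)) {
        lines.push_back(std::move(line));   // move; getline reuses 'line'
    }
    return lines;
}
```

Overshooting the estimate only wastes a little vector capacity, while undershooting merely brings back a few reallocations, so a rough guess is enough.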
Using the optimizations suggested above, I get the following results for reading an 800 MB text stream:
20 seconds ## if average line length = 10 characters
3 seconds ## if average line length = 100 characters
1 second ## if average line length = 1000 characters
As you can see, the speed is dominated by per-line overhead. This overhead occurs primarily inside the std::string class.
It is likely that any approach based on storing a large quantity of std::string will be suboptimal in terms of memory allocation overhead. On a 64-bit system, std::string will require a minimum of 16 bytes of overhead per string. In fact, it is very possible that the overhead will be significantly greater than that -- and you could find that memory allocation (inside of std::string) becomes a significant bottleneck.
For optimal memory use and performance, consider writing your own routine that reads the file in large blocks rather than using getline(). Then you could apply something similar to the flyweight pattern to manage the indexing of the individual lines using a custom string class.
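One way such a flyweight-style index could look is sketched below. The names `LineIndex` and `index_file` are illustrative, not from the original answer: the whole file is read into a single buffer in one block, and each "line" is just an (offset, length) pair into that buffer, so there is one big allocation instead of millions of small ones:

```cpp
#include <cstring>       // memchr
#include <fstream>
#include <string>
#include <string_view>   // C++17
#include <utility>
#include <vector>

// Flyweight-style line index: one shared buffer, lightweight per-line handles.
struct LineIndex {
    std::string buffer;                                       // entire file
    std::vector<std::pair<std::size_t, std::size_t>> lines;   // (offset, length)

    // Materialize line i as a view into the shared buffer (no copy).
    std::string_view line(std::size_t i) const {
        return std::string_view(buffer).substr(lines[i].first, lines[i].second);
    }
};

LineIndex index_file(const std::string& fname) {
    LineIndex idx;
    std::ifstream in(fname, std::ios::binary | std::ios::ate);
    if (!in) return idx;
    const std::size_t size = static_cast<std::size_t>(in.tellg());
    idx.buffer.resize(size);
    in.seekg(0);
    in.read(&idx.buffer[0], static_cast<std::streamsize>(size));

    // One pass over the buffer: record where each line starts and how long it is.
    const char* begin = idx.buffer.data();
    const char* p = begin;
    const char* end = begin + size;
    while (p < end) {
        const char* nl = static_cast<const char*>(memchr(p, '\n', end - p));
        const char* stop = nl ? nl : end;   // last line may lack a newline
        idx.lines.emplace_back(p - begin, stop - p);
        p = stop + 1;
    }
    return idx;
}
```

Storing offsets rather than pointers or views keeps the index valid even if the LineIndex object is moved; views are only created on demand via line().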
P.S. Another relevant factor will be the physical disk I/O, which might or might not be bypassed by caching.