Fast textfile reading in C++


Question



I am currently writing a program in C++ which involves reading lots of large text files. Each has ~400,000 lines with, in extreme cases, 4,000 or more characters per line. Just for testing, I read one of the files using ifstream and the implementation offered by cplusplus.com. It took around 60 seconds, which is way too long. Now I was wondering: is there a straightforward way to improve reading speed?

edit: The code I am using is more or less this:

string tmpString;
ifstream txtFile(path);
if(txtFile.is_open())
{
    while(txtFile.good())
    {
        m_numLines++;
        getline(txtFile, tmpString);
    }
    txtFile.close();
}

edit 2: The file I read is only 82 MB. I mainly mentioned that a line could reach 4,000 characters because I thought it might be necessary to know for buffering.
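(Aside, on buffering: one knob that is sometimes tried is handing the stream a larger buffer before opening the file; whether it actually helps is implementation-dependent. A minimal sketch, not from the original question; the helper name open_buffered and the 1 MiB size are arbitrary choices:)

#include <fstream>
#include <string>
#include <vector>

// hypothetical helper: open 'path' through a caller-owned 1 MiB stream buffer
// (the buffer must outlive the stream)
void open_buffered(std::ifstream& txtFile, std::vector<char>& streamBuf, const std::string& path)
{
    streamBuf.resize(1 << 20);
    txtFile.rdbuf()->pubsetbuf(streamBuf.data(), streamBuf.size()); // must precede open() to take effect
    txtFile.open(path);
}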

edit 3: Thank you all for your answers, but it seems like there is not much room for improvement given my problem. I have to use getline, since I want to count the number of lines. Instantiating the ifstream as binary didn't make reading any faster either. I will try to parallelize it as much as I can; that should at least work.
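(For reference, a line count can also be obtained without a getline loop; a minimal sketch, not from the original question, counting '\n' bytes, which may differ by one from a getline-based count if the file lacks a trailing newline:)

#include <algorithm>   // for std::count
#include <cstdint>     // for uintmax_t
#include <fstream>
#include <iterator>    // for std::istreambuf_iterator
#include <string>

// hypothetical helper: counts '\n' bytes in the file named by 'path'
uintmax_t count_lines(const std::string& path)
{
    std::ifstream txtFile(path, std::ios::binary);
    return std::count(std::istreambuf_iterator<char>(txtFile),
                      std::istreambuf_iterator<char>(), '\n');
}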

edit 4: So apparently there are some things I can do. Big thank you to sehe for putting so much time into this, I appreciate it a lot! =)

Solution

Updates: Be sure to check the (surprising) updates below the initial answer


Memory mapped files have served me well [1]:

#include <boost/iostreams/device/mapped_file.hpp> // for boost::iostreams::mapped_file
#include <cstdint>    // for uintmax_t
#include <cstring>    // for memchr
#include <iostream>   // for std::cout

int main()
{
    boost::iostreams::mapped_file mmap("input.txt", boost::iostreams::mapped_file::readonly);
    auto f = mmap.const_data();
    auto l = f + mmap.size();

    uintmax_t m_numLines = 0;
    while (f && f!=l)
        if ((f = static_cast<const char*>(memchr(f, '\n', l-f))))
            m_numLines++, f++;

    std::cout << "m_numLines = " << m_numLines << "\n";
}

This should be rather quick.

Update

In case it helps you test this approach, here's a version using mmap directly instead of using Boost: see it live on Coliru

#include <cstdint>    // for uintmax_t
#include <cstdio>     // for perror
#include <cstdlib>    // for exit
#include <cstring>    // for memchr
#include <iostream>   // for std::cout

// for mmap:
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>

const char* map_file(const char* fname, size_t& length);

int main()
{
    size_t length;
    auto f = map_file("test.cpp", length);
    auto l = f + length;

    uintmax_t m_numLines = 0;
    while (f && f!=l)
        if ((f = static_cast<const char*>(memchr(f, '\n', l-f))))
            m_numLines++, f++;

    std::cout << "m_numLines = " << m_numLines << "\n";
}

void handle_error(const char* msg) {
    perror(msg); 
    exit(255);
}

const char* map_file(const char* fname, size_t& length)
{
    int fd = open(fname, O_RDONLY);
    if (fd == -1)
        handle_error("open");

    // obtain file size
    struct stat sb;
    if (fstat(fd, &sb) == -1)
        handle_error("fstat");

    length = sb.st_size;

    const char* addr = static_cast<const char*>(mmap(NULL, length, PROT_READ, MAP_PRIVATE, fd, 0u));
    if (addr == MAP_FAILED)
        handle_error("mmap");

    // TODO close fd at some point in time, call munmap(...)
    return addr;
}
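For completeness, the cleanup the TODO refers to could look roughly like this (a sketch, not part of the original answer): the descriptor can be closed right after mmap() succeeds, since the mapping stays valid without it (close() needs <unistd.h>), and the mapping itself is released with a small companion to map_file:

// hypothetical companion to map_file: releases the mapping once the caller is done with it
void unmap_file(const char* addr, size_t length)
{
    if (munmap(const_cast<char*>(addr), length) == -1)
        handle_error("munmap");
}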


Update

The last bit of performance I could squeeze out of this I found by looking at the source of GNU coreutils wc. To my surprise, the following (greatly simplified) code adapted from wc runs in about 84% of the time taken by the memory-mapped file version above:

static uintmax_t wc(char const *fname)
{
    static const auto BUFFER_SIZE = 16*1024;
    int fd = open(fname, O_RDONLY);
    if(fd == -1)
        handle_error("open");

    /* Advise the kernel of our access pattern.  */
    posix_fadvise(fd, 0, 0, 1);  // FDADVICE_SEQUENTIAL

    char buf[BUFFER_SIZE + 1];
    uintmax_t lines = 0;

    while(size_t bytes_read = read(fd, buf, BUFFER_SIZE))
    {
        if(bytes_read == (size_t)-1)
            handle_error("read failed");
        if (!bytes_read)
            break;

        for(char *p = buf; (p = (char*) memchr(p, '\n', (buf + bytes_read) - p)); ++p)
            ++lines;
    }

    return lines;
}
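The snippet reuses handle_error from the listing above and needs <fcntl.h> and <unistd.h> (for open(), posix_fadvise() and read()) as well as <cstring> and <cstdint>. A hypothetical driver, just to show the call (it also needs <iostream>):

int main(int argc, char** argv)
{
    // count lines of the file named on the command line, defaulting to "test.cpp" as above
    std::cout << "lines = " << wc(argc > 1 ? argv[1] : "test.cpp") << "\n";
}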


[1] See e.g. the benchmark here: How to parse space-separated floats in C++ quickly?
