Fast textfile reading in C++
Problem description
I am currently writing a program in C++ which includes reading lots of large text files. Each has ~400,000 lines with, in extreme cases, 4000 or more characters per line. Just for testing, I read one of the files using ifstream and the implementation offered by cplusplus.com. It took around 60 seconds, which is way too long. Is there a straightforward way to improve reading speed?
edit: The code I am using is more or less this:
string tmpString;
ifstream txtFile(path);
if (txtFile.is_open())
{
    // Test getline itself so a failed final read is not counted as a line
    while (getline(txtFile, tmpString))
    {
        m_numLines++;
    }
    txtFile.close();
}
edit 2: The file I read is only 82 MB. I mainly mentioned that lines could reach 4000 characters because I thought it might be necessary to know in order to do buffering.
edit 3: Thank you all for your answers, but it seems like there is not much room to improve given my problem. I have to use getline, since I want to count the number of lines. Instantiating the ifstream as binary didn't make reading any faster either. I will try to parallelize it as much as I can; that should work at least.
edit 4: So apparently there are some things I can do. Big thank you to sehe for putting so much time into this, I appreciate it a lot! =)
Updates: Be sure to check the (surprising) updates below the initial answer
Memory mapped files have served me well [1]:
#include <boost/iostreams/device/mapped_file.hpp> // for mmap
#include <cstdint>  // for std::uintmax_t
#include <cstring>  // for memchr
#include <iostream> // for std::cout

int main()
{
    boost::iostreams::mapped_file mmap("input.txt", boost::iostreams::mapped_file::readonly);
    auto f = mmap.const_data();
    auto l = f + mmap.size();

    std::uintmax_t m_numLines = 0;
    // Jump from newline to newline; memchr returns null once none is left
    while (f && f != l)
        if ((f = static_cast<const char*>(memchr(f, '\n', l - f))))
            m_numLines++, f++;

    std::cout << "m_numLines = " << m_numLines << "\n";
}
This should be rather quick.
Update
In case it helps you test this approach, here's a version using mmap directly instead of using Boost (see it live on Coliru):
#include <cstdint>  // for std::uintmax_t
#include <cstdio>   // for perror
#include <cstdlib>  // for exit
#include <cstring>  // for memchr
#include <iostream>

// for mmap:
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>

const char* map_file(const char* fname, size_t& length);

int main()
{
    size_t length;
    auto f = map_file("test.cpp", length);
    auto l = f + length;

    std::uintmax_t m_numLines = 0;
    // Jump from newline to newline; memchr returns null once none is left
    while (f && f != l)
        if ((f = static_cast<const char*>(memchr(f, '\n', l - f))))
            m_numLines++, f++;

    std::cout << "m_numLines = " << m_numLines << "\n";
}

void handle_error(const char* msg) {
    perror(msg);
    exit(255);
}

const char* map_file(const char* fname, size_t& length)
{
    int fd = open(fname, O_RDONLY);
    if (fd == -1)
        handle_error("open");

    // obtain file size
    struct stat sb;
    if (fstat(fd, &sb) == -1)
        handle_error("fstat");

    length = sb.st_size;

    const char* addr = static_cast<const char*>(mmap(NULL, length, PROT_READ, MAP_PRIVATE, fd, 0u));
    if (addr == MAP_FAILED)
        handle_error("mmap");

    // TODO close fd at some point in time, call munmap(...)
    return addr;
}
Update
The last bit of performance I could squeeze out of this I found by looking at the source of GNU coreutils wc. To my surprise, the following (greatly simplified) code adapted from wc runs in about 84% of the time taken with the memory mapped file above:
// Needs <fcntl.h>, <unistd.h>, <cstring>, <cstdint> and handle_error() from above
static std::uintmax_t wc(char const *fname)
{
    static const auto BUFFER_SIZE = 16*1024;
    int fd = open(fname, O_RDONLY);
    if (fd == -1)
        handle_error("open");

    /* Advise the kernel of our access pattern. */
    // Use the named constant; its numeric value is platform-defined
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    char buf[BUFFER_SIZE + 1];
    std::uintmax_t lines = 0;

    // read() returns 0 at end of file, which ends the loop
    while (size_t bytes_read = read(fd, buf, BUFFER_SIZE))
    {
        if (bytes_read == (size_t)-1)
            handle_error("read failed");

        // Count the newlines in this chunk
        for (char *p = buf; (p = (char*) memchr(p, '\n', (buf + bytes_read) - p)); ++p)
            ++lines;
    }

    close(fd);
    return lines;
}
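For platforms without POSIX read or mmap, the same chunked idea can be sketched with std::ifstream alone. This is an illustration, not something I benchmarked against the versions above; count_lines_stream and its buffer_size parameter are hypothetical names, and the 16 KiB default simply mirrors BUFFER_SIZE from the wc listing.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical portable variant: read fixed-size chunks with ifstream::read
// and count the newline bytes in each chunk.
std::uintmax_t count_lines_stream(const std::string& path,
                                  std::size_t buffer_size = 16 * 1024) {
    std::ifstream in(path, std::ios::binary);
    if (!in)
        throw std::runtime_error("cannot open " + path);

    std::vector<char> buf(buffer_size);
    std::uintmax_t lines = 0;

    while (true) {
        in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        const std::streamsize got = in.gcount();  // bytes actually read
        if (got <= 0)
            break;  // end of file or read error
        lines += static_cast<std::uintmax_t>(
            std::count(buf.data(), buf.data() + got, '\n'));
    }
    return lines;
}
```

Like the loops above, this counts newline characters, so a final line without a trailing newline is not counted. It avoids the per-line string handling that getline does, which is the part of the original loop most likely to be worth measuring.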
[1] See e.g. the benchmark here: How to parse space-separated floats in C++ quickly?