Fastest way to find the number of lines in a text (C++)

Question

I need to get the number of lines in a file before doing some operations on that file. I tried reading the file and incrementing a line_count variable on each iteration until I reached EOF, but that was not fast enough in my case. I used both ifstream and fgets, and both were slow. Is there a hacky way to do this, of the kind used by, for instance, BSD, the Linux kernel, or Berkeley DB (maybe using bitwise operations)?

As I said, there are millions of lines in that file and it keeps getting larger; each line has about 40 or 50 characters. I'm using Linux.

Note: I'm sure some people will say "use a DB, idiot". But in my case I can't use a DB.

Recommended answer

The only way to find the line count is to read the whole file and count the line-end characters. The fastest way to do this is probably to read the whole file into a large buffer with one read operation and then go through the buffer counting the '\n' characters.
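A minimal sketch of that single-read idea, assuming the file fits comfortably in memory. The helper name count_newlines_whole_file and the -1 error convention are illustrative choices, not part of the original answer:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <fstream>
#include <ios>
#include <string>
#include <vector>

// Hypothetical helper: returns the number of '\n' bytes in the file at `path`,
// or -1 if the file cannot be opened or read.
long long count_newlines_whole_file(const std::string& path) {
    // Open at the end so tellg() reports the file size.
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) return -1;

    const std::streamoff size = in.tellg();
    if (size < 0) return -1;
    if (size == 0) return 0;

    // One buffer for the whole file, filled by a single read() call.
    std::vector<char> buf(static_cast<std::size_t>(size));
    in.seekg(0, std::ios::beg);
    if (!in.read(buf.data(), static_cast<std::streamsize>(buf.size()))) return -1;
    assert(in.gcount() == static_cast<std::streamsize>(buf.size()));

    return std::count(buf.begin(), buf.end(), '\n');
}
```

Calling count_newlines_whole_file("lines.dat") would return the newline count for the benchmark file used later in this answer. Memory use grows with the file size, which is the main drawback of this variant.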

As your current file size appears to be about 60 MB, reading it all in one operation is not an attractive option. You can get some of the speed by not reading the whole file at once but reading it in chunks, say of size 1 MB. You also say that a database is out of the question, but it really does look like the best long-term solution.

Edit: I just ran a small benchmark on this, and the buffered approach (buffer size 1024 KB) seems to be a bit more than twice as fast as reading a line at a time with getline(). Here's the code; my tests were done with g++ at the -O2 optimisation level:

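#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <ctime>
using namespace std;

// Read up to buff.size() bytes into buff; return the number of bytes actually read.
streamsize FileRead( istream & is, vector <char> & buff ) {
    is.read( &buff[0], buff.size() );
    return is.gcount();
}

// Count the '\n' characters in the first sz bytes of buff.
// A final line with no trailing '\n' is not counted here, whereas getline() counts it.
unsigned int CountLines( const vector <char> & buff, streamsize sz ) {
    unsigned int newlines = 0;
    const char * p = &buff[0];
    for ( streamsize i = 0; i < sz; i++ ) {
        if ( p[i] == '\n' ) {
            newlines++;
        }
    }
    return newlines;
}

int main( int argc, char * argv[] ) {
    time_t now = time(0);
    if ( argc == 1 ) {
        // No arguments: count lines one at a time with getline().
        cout << "lines\n";
        ifstream ifs( "lines.dat" );
        int n = 0;
        string s;
        while( getline( ifs, s ) ) {
            n++;
        }
        cout << n << endl;
    }
    else {
        // Any argument: read 1 MB chunks and count the newlines in each.
        cout << "buffer\n";
        const int SZ = 1024 * 1024;
        vector <char> buff( SZ );
        ifstream ifs( "lines.dat" );
        int n = 0;
        while( streamsize cc = FileRead( ifs, buff ) ) {
            n += CountLines( buff, cc );
        }
        cout << n << endl;
    }
    // Elapsed wall-clock time; time() has a resolution of whole seconds.
    cout << time(0) - now << endl;
}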
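The buffered path above counts '\n' bytes, so a file whose last line lacks a trailing newline reports one line fewer than the getline() path. If the two counts need to agree, one possible adjustment is sketched below. The name count_lines_chunked and its chunk_size parameter are assumptions made for illustration, and std::count replaces the hand-written loop:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <ios>
#include <istream>
#include <sstream>
#include <vector>

// Hypothetical helper: counts lines the way getline() does, so a final line
// without a trailing '\n' still counts as one line.
std::size_t count_lines_chunked(std::istream& in, std::size_t chunk_size = 1 << 20) {
    assert(chunk_size > 0);
    std::vector<char> buf(chunk_size);
    std::size_t lines = 0;
    char last = '\n';  // so an empty input yields 0 lines

    while (true) {
        in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        const std::streamsize got = in.gcount();
        if (got <= 0) break;  // nothing more to read
        lines += static_cast<std::size_t>(
            std::count(buf.data(), buf.data() + got, '\n'));
        last = buf[static_cast<std::size_t>(got) - 1];
    }

    if (last != '\n') ++lines;  // unterminated final line
    return lines;
}
```

The function takes any std::istream, so it can be tried on a std::istringstream before being pointed at an std::ifstream opened on the real file.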
