How to parse space-separated floats in C++ quickly?

Question

I have a file with millions of lines, each line has 3 floats separated by spaces. It takes a lot of time to read the file, so I tried to read them using memory mapped files only to find out that the problem is not with the speed of IO but with the speed of the parsing.

My current parsing takes the stream (called file) and does the following:

float x,y,z;
file >> x >> y >> z;

Someone on Stack Overflow recommended using Boost.Spirit, but I couldn't find any simple tutorial that explains how to use it.

I'm trying to find a simple and efficient way to parse a line that looks like this:

"134.32 3545.87 3425"

I will really appreciate some help. I wanted to use strtok to split it, but I don't know how to convert strings to floats, and I'm not quite sure it's the best way.

I don't mind if the solution will be Boost or not. I don't mind if it won't be the most efficient solution ever, but I'm sure that it is possible to double the speed.

Thanks in advance.

Recommended answer

If the conversion is the bottleneck (which is quite possible), you should start by using the different possibilities in the standard. Logically, one would expect them to be very close, but practically, they aren't always:


  • You've already determined that std::ifstream is too slow.

Converting your memory mapped data to an std::istringstream is almost certainly not a good solution; you'll first have to create a string, which will copy all of the data.

Writing your own streambuf to read directly from the memory, without copying (or using the deprecated std::istrstream) might be a solution, although if the problem really is the conversion... this still uses the same conversion routines.
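A custom streambuf of that kind can be quite small. The following is only a minimal sketch, assuming the mapped data is available as a pointer and a length; `MemoryBuf` and `sumFloats` are hypothetical names made up for the illustration. It points the get area straight at the existing bytes, so nothing is copied, but the `>>` operator still goes through the usual istream conversion code.

```cpp
#include <cstddef>
#include <istream>
#include <streambuf>

// Read-only stream buffer over an existing character range; no copy is made.
// (MemoryBuf is a hypothetical name used only in this sketch.)
class MemoryBuf : public std::streambuf {
public:
    MemoryBuf(char const* data, std::size_t size) {
        // setg takes non-const pointers, but a buffer that is only read
        // from never writes through them.
        char* p = const_cast<char*>(data);
        setg(p, p, p + size);
    }
};

// Sums every number in [data, data + size) using the ordinary >> operator.
double sumFloats(char const* data, std::size_t size) {
    MemoryBuf buf(data, size);
    std::istream in(&buf);
    double total = 0.0;
    float x;
    while (in >> x) {
        total += x;
    }
    return total;
}
```

Because the default `underflow()` reports end of file once the get area is exhausted, the data does not need a trailing '\0' with this approach.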

You can always try fscanf, or scanf on your memory mapped stream. Depending on the implementation, they might be faster than the various istream implementations.
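For data that is already in memory, the closest relative is `sscanf`. Here is a rough sketch, assuming the buffer is NUL-terminated and holds three values per record; `Point` and `parseWithSscanf` are names invented for the example. The `%n` directive stores how many characters have been consumed, which is what lets the loop advance.

```cpp
#include <cstdio>
#include <vector>

struct Point { float x, y, z; };   // hypothetical record type for this sketch

// Parses triples from a NUL-terminated buffer. %n stores the number of
// characters consumed so far, so the caller can move the cursor forward.
std::vector<Point> parseWithSscanf(char const* text) {
    std::vector<Point> points;
    Point p;
    int consumed = 0;
    // sscanf returns the number of converted fields; %n is not counted.
    while (std::sscanf(text, "%f %f %f%n", &p.x, &p.y, &p.z, &consumed) == 3) {
        points.push_back(p);
        text += consumed;
    }
    return points;
}
```

One caveat worth measuring rather than assuming: some C library implementations reportedly determine the length of the remaining string on every `sscanf` call, which can make this loop very slow on a large buffer.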

Probably faster than any of these is to use strtod. No need to tokenize for this: strtod skips leading white space (including '\n'), and has an out parameter where it puts the address of the first character not read. The end condition is a bit tricky, your loop should probably look a bit like:


    char* begin;    //  Set to point to the mmap'ed data...
                    //  You'll also have to arrange for a '\0'
                    //  to follow the data.  This is probably
                    //  the most difficult issue.
    char* end;
    errno = 0;
    double tmp = strtod( begin, &end );
    while ( errno == 0 && end != begin ) {
        //  do whatever with tmp...
        begin = end;
        tmp = strtod( begin, &end );
    }
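To make the loop concrete for the three-values-per-line case in the question, here is a small sketch built on the same idea. It assumes the data is NUL-terminated (for example, copied once into a buffer with a trailing '\0'), and `Vec3` and `parseTriples` are hypothetical names. It uses `strtof`, the float counterpart of `strtod`.

```cpp
#include <cerrno>
#include <cstdlib>
#include <vector>

struct Vec3 { float x, y, z; };   // hypothetical record type for this sketch

// Parses whitespace-separated triples from a NUL-terminated buffer.
// Stops at the first token that is not a number, at a range error,
// or at the end of the data. An incomplete trailing triple is dropped.
std::vector<Vec3> parseTriples(char const* begin) {
    std::vector<Vec3> result;
    float v[3];
    int filled = 0;
    char* end = nullptr;
    errno = 0;
    for (;;) {
        float value = std::strtof(begin, &end);
        if (end == begin || errno != 0) {
            break;                      // nothing converted, or out of range
        }
        v[filled++] = value;
        if (filled == 3) {
            result.push_back({v[0], v[1], v[2]});
            filled = 0;
        }
        begin = end;                    // continue right after the last number
    }
    return result;
}
```

No tokenizing is needed because `strtof` skips the leading spaces and newlines itself.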

If none of these are fast enough, you'll have to consider the actual data. It probably has some sort of additional constraints, which means that you can potentially write a conversion routine which is faster than the more general ones; e.g. strtod has to handle both fixed and scientific, and it has to be 100% accurate even if there are 17 significant digits. It also has to be locale specific. All of this is added complexity, which means added code to execute. But beware: writing an efficient and correct conversion routine, even for a restricted set of input, is non-trivial; you really do have to know what you are doing.

EDIT:

Just out of curiosity, I've run some tests. In addition to the aforementioned solutions, I wrote a simple custom converter, which only handles fixed point (no scientific), with at most five digits after the decimal, and the value before the decimal must fit in an int:

double
convert( char const* source, char const** endPtr )
{
    char* end;
    int left = strtol( source, &end, 10 );
    double results = left;
    if ( *end == '.' ) {
        char* start = end + 1;
        int right = strtol( start, &end, 10 );
        static double const fracMult[] 
            = { 0.0, 0.1, 0.01, 0.001, 0.0001, 0.00001 };
        results += right * fracMult[ end - start ];
    }
    if ( endPtr != nullptr ) {
        *endPtr = end;
    }
    return results;
}

(If you actually use this, you should definitely add some error handling. This was just knocked up quickly for experimental purposes, to read the test file I'd generated, and nothing else.)

The interface is exactly that of strtod, to simplify coding.

I ran the benchmarks in two environments (on different machines, so the absolute values of any times aren't relevant). I got the following results:

Under Windows 7, compiled with VC 11 (/O2):

Testing Using fstream directly (5 iterations)...
    6.3528e+006 microseconds per iteration
Testing Using fscan directly (5 iterations)...
    685800 microseconds per iteration
Testing Using strtod (5 iterations)...
    597000 microseconds per iteration
Testing Using manual (5 iterations)...
    269600 microseconds per iteration

Under Linux 2.6.18, compiled with g++ 4.4.2 (-O2, IIRC):

Testing Using fstream directly (5 iterations)...
    784000 microseconds per iteration
Testing Using fscanf directly (5 iterations)...
    526000 microseconds per iteration
Testing Using strtod (5 iterations)...
    382000 microseconds per iteration
Testing Using strtof (5 iterations)...
    360000 microseconds per iteration
Testing Using manual (5 iterations)...
    186000 microseconds per iteration

In all cases, I'm reading 554000 lines, each with 3 randomly generated floating point values in the range [0...10000).
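The harness behind these numbers isn't shown, so the following is only a guess at its general shape: a sketch of how a "microseconds per iteration" figure could be measured with `std::chrono`. The name `microsecondsPerIteration` is hypothetical, and `body` stands for whichever parsing routine is under test.

```cpp
#include <chrono>

// Runs body() the given number of times and returns the average wall-clock
// time per run in microseconds. (Hypothetical helper, not the original harness.)
template <typename F>
double microsecondsPerIteration(F&& body, int iterations) {
    auto const start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        body();
    }
    auto const stop = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::micro> const elapsed = stop - start;
    return elapsed.count() / iterations;
}
```

When timing parsers this way, it helps to do something with the parsed values (sum them, for instance) so the compiler cannot optimize the work away.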

The most striking thing is the enormous difference between fstream and fscanf under Windows (and the relatively small difference between fscanf and strtod). The second thing is just how much the simple custom conversion function gains, on both platforms. The necessary error handling would slow it down a little, but the difference is still significant. I expected some improvement, since it doesn't handle a lot of things that the standard conversion routines do (like scientific format, very, very small numbers, Inf and NaN, i18n, etc.), but not this much.
