Why does changing 0.1f to 0 slow down performance by 10x?

Problem Description


Why does this bit of code,

const float x[16] = {  1.1,   1.2,   1.3,     1.4,   1.5,   1.6,   1.7,   1.8,
                       1.9,   2.0,   2.1,     2.2,   2.3,   2.4,   2.5,   2.6};
const float z[16] = {1.123, 1.234, 1.345, 156.467, 1.578, 1.689, 1.790, 1.812,
                     1.923, 2.034, 2.145,   2.256, 2.367, 2.478, 2.589, 2.690};
float y[16];
for (int i = 0; i < 16; i++)
{
    y[i] = x[i];
}

for (int j = 0; j < 9000000; j++)
{
    for (int i = 0; i < 16; i++)
    {
        y[i] *= x[i];
        y[i] /= z[i];
        y[i] = y[i] + 0.1f; // <--
        y[i] = y[i] - 0.1f; // <--
    }
}

run more than 10 times faster than the following bit (identical except where noted)?

const float x[16] = {  1.1,   1.2,   1.3,     1.4,   1.5,   1.6,   1.7,   1.8,
                       1.9,   2.0,   2.1,     2.2,   2.3,   2.4,   2.5,   2.6};
const float z[16] = {1.123, 1.234, 1.345, 156.467, 1.578, 1.689, 1.790, 1.812,
                     1.923, 2.034, 2.145,   2.256, 2.367, 2.478, 2.589, 2.690};
float y[16];
for (int i = 0; i < 16; i++)
{
    y[i] = x[i];
}

for (int j = 0; j < 9000000; j++)
{
    for (int i = 0; i < 16; i++)
    {
        y[i] *= x[i];
        y[i] /= z[i];
        y[i] = y[i] + 0; // <--
        y[i] = y[i] - 0; // <--
    }
}

when compiling with Visual Studio 2010 SP1. (I haven't tested with other compilers.)

Solution

Welcome to the world of denormalized floating-point! They can wreak havoc on performance!!!

Denormal (or subnormal) numbers are kind of a hack to get some extra values very close to zero out of the floating point representation. Operations on denormalized floating-point can be tens to hundreds of times slower than on normalized floating-point. This is because many processors can't handle them directly and must trap and resolve them using microcode.
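
As a quick refresher (a sketch of my own, not from the original answer): a float that falls below FLT_MIN (about 1.17549e-38) can no longer be stored with an implicit leading 1 bit, so it becomes a subnormal with reduced precision. This can be observed directly:

#include <cfloat>   // FLT_MIN
#include <cmath>    // std::fpclassify, FP_SUBNORMAL
#include <cstdio>
#include <limits>   // std::numeric_limits

int main() {
    float tiny = FLT_MIN / 4.0f;  // below the smallest normal float -> subnormal
    std::printf("FLT_MIN    = %g\n", FLT_MIN);
    std::printf("tiny       = %g, subnormal? %d\n",
                tiny, std::fpclassify(tiny) == FP_SUBNORMAL);   // prints 1
    std::printf("denorm_min = %g\n",
                std::numeric_limits<float>::denorm_min());      // ~1.4e-45
    return 0;
}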

If you print out the numbers after 10,000 iterations, you will see that they have converged to different values depending on whether 0 or 0.1 is used.

Here's the test code compiled on x64:

#include <omp.h>      // for omp_get_wtime()
#include <cstdlib>    // for system()
#include <iostream>
using namespace std;

int main() {

    double start = omp_get_wtime();

    const float x[16]={1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,2.0,2.1,2.2,2.3,2.4,2.5,2.6};
    const float z[16]={1.123,1.234,1.345,156.467,1.578,1.689,1.790,1.812,1.923,2.034,2.145,2.256,2.367,2.478,2.589,2.690};
    float y[16];
    for(int i=0;i<16;i++)
    {
        y[i]=x[i];
    }
    for(int j=0;j<9000000;j++)
    {
        for(int i=0;i<16;i++)
        {
            y[i]*=x[i];
            y[i]/=z[i];
#ifdef FLOATING
            y[i]=y[i]+0.1f;
            y[i]=y[i]-0.1f;
#else
            y[i]=y[i]+0;
            y[i]=y[i]-0;
#endif

            if (j > 10000)
                cout << y[i] << "  ";
        }
        if (j > 10000)
            cout << endl;
    }

    double end = omp_get_wtime();
    cout << end - start << endl;

    system("pause");
    return 0;
}
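
To reproduce both outputs, the program can be built twice, once with FLOATING defined and once without (e.g. /DFLOATING with MSVC or -DFLOATING with g++), and with OpenMP enabled (/openmp or -fopenmp) so that omp_get_wtime is available.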

Output:

#define FLOATING
1.78814e-007  1.3411e-007  1.04308e-007  0  7.45058e-008  6.70552e-008  6.70552e-008  5.58794e-007  3.05474e-007  2.16067e-007  1.71363e-007  1.49012e-007  1.2666e-007  1.11759e-007  1.04308e-007  1.04308e-007
1.78814e-007  1.3411e-007  1.04308e-007  0  7.45058e-008  6.70552e-008  6.70552e-008  5.58794e-007  3.05474e-007  2.16067e-007  1.71363e-007  1.49012e-007  1.2666e-007  1.11759e-007  1.04308e-007  1.04308e-007

//#define FLOATING
6.30584e-044  3.92364e-044  3.08286e-044  0  1.82169e-044  1.54143e-044  2.10195e-044  2.46842e-029  7.56701e-044  4.06377e-044  3.92364e-044  3.22299e-044  3.08286e-044  2.66247e-044  2.66247e-044  2.24208e-044
6.30584e-044  3.92364e-044  3.08286e-044  0  1.82169e-044  1.54143e-044  2.10195e-044  2.45208e-029  7.56701e-044  4.06377e-044  3.92364e-044  3.22299e-044  3.08286e-044  2.66247e-044  2.66247e-044  2.24208e-044

Note how in the second run the numbers are very close to zero.

Denormalized numbers are generally rare and thus most processors don't try to handle them efficiently.
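
Since the slowdown is easy to check in isolation, here is a minimal microbenchmark sketch (my own construction; the iteration count and factors are arbitrary). The two factors multiply to roughly 1, so the value keeps its magnitude and stays in whichever regime it started in; without FTZ/DAZ enabled, the subnormal run is typically many times slower on x86:

#include <chrono>
#include <cstdio>

// Repeatedly multiply v by 1.25f and 0.8f; their product is ~1, so v
// stays normal or subnormal (depending on its start) for the whole loop.
static double time_loop(float v) {
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < 20000000; i++) {
        v *= 1.25f;
        v *= 0.8f;
    }
    auto t1 = std::chrono::steady_clock::now();
    std::printf("  final value: %g\n", v);  // also prevents dead-code elimination
    return std::chrono::duration<double>(t1 - t0).count();
}

int main() {
    double t_normal    = time_loop(1.0f);    // stays in the normal range
    double t_subnormal = time_loop(1e-40f);  // stays in the subnormal range
    std::printf("normal:    %f s\nsubnormal: %f s\n", t_normal, t_subnormal);
    return 0;
}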


To demonstrate that this has everything to do with denormalized numbers, if we flush denormals to zero by adding this to the start of the code:

#include <xmmintrin.h>  // at the top of the file, for _MM_SET_FLUSH_ZERO_MODE

_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);

Then the version with 0 is no longer 10x slower and actually becomes faster. (This requires that the code be compiled with SSE enabled.)

This means that rather than computing with these weird lower-precision almost-zero values, we simply round them to zero instead.
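
A small sketch of the effect (my own; it assumes SSE code generation, which is the default on x64): with FTZ on, an operation whose exact result would be subnormal produces zero instead:

#include <cstdio>
#include <xmmintrin.h>  // _MM_SET_FLUSH_ZERO_MODE

int main() {
    volatile float tiny = 1e-40f;  // a subnormal; volatile blocks constant folding
    std::printf("default : %g\n", tiny * 0.5f);  // ~5e-41, still subnormal

    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    std::printf("with FTZ: %g\n", tiny * 0.5f);  // result flushed to 0
    return 0;
}

For completeness (not part of the original answer): SSE3-capable CPUs also offer a denormals-are-zero (DAZ) mode, _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON) from <pmmintrin.h>, which treats subnormal inputs as zero, whereas FTZ only affects subnormal results.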

Timings: Core i7 920 @ 3.5 GHz:

//  Don't flush denormals to zero.
0.1f: 0.564067
0   : 26.7669

//  Flush denormals to zero.
0.1f: 0.587117
0   : 0.341406

In the end, this really has nothing to do with whether it's an integer or floating-point. The 0 or 0.1f is converted/stored into a register outside of both loops. So that has no effect on performance.
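
To make that last point concrete with a sketch of my own: the 0.1f helps through rounding during the add, not through its value. Anything far below ulp(0.1f), which is roughly 7.45e-9, is absorbed when added to 0.1f, so subtracting 0.1f afterwards lands on a clean zero (or a small normal number) instead of drifting into the subnormal range:

#include <cstdio>

int main() {
    float tiny = 1e-38f;        // near the bottom of the normal float range
    float sum  = tiny + 0.1f;   // tiny << ulp(0.1f), so this rounds to exactly 0.1f
    float r    = sum - 0.1f;
    std::printf("%g\n", r);     // prints 0: tiny was absorbed, not preserved
    return 0;
}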
