Floating multiplication performing slower depending on operands in C


Problem Description


I am performing a stencil computation on a matrix I previously read from a file. I use two different kinds of matrices (NonZero type and Zero type). Both types share the boundary values (usually 1000), whilst the rest of the elements are 0 for the Zero type and 1 for the NonZero type.

The code stores the matrix from the file in two allocated matrices of the same size. Then it performs an operation on every element of one matrix using its own value and the values of its neighbours (add x 4 and mul x 1), and stores the result in the second matrix. Once the computation is finished, the pointers to the matrices are swapped and the same operation is performed a fixed number of times. Here is the core code:

#define GET(I,J) rMat[(I)*cols + (J)]
#define PUT(I,J) wMat[(I)*cols + (J)]

for (cur_time=0; cur_time<timeSteps; cur_time++) {
    for (i=1; i<rows-1; i++) {
        for (j=1; j<cols-1; j++) {
            PUT(i,j) = 0.2f*(GET(i-1,j) + GET(i,j-1) + GET(i,j) + GET(i,j+1) + GET(i+1,j));
        }
    }
    // Change pointers for next iteration
    auxP = wMat;
    wMat = rMat;
    rMat = auxP;
}

The case I am describing uses a fixed amount of 500 timeSteps (outer iterations) and a matrix size of 8192 rows and 8192 columns, but the problem persists when changing the number of timeSteps or the matrix size. Note that I only measure the time of this concrete part of the algorithm, so neither reading the matrix from the file nor anything else affects the time measurement.

What happens is that I get different times depending on which type of matrix I use, obtaining much worse performance when using the Zero type (every other matrix performs the same as the NonZero type; I have also tried generating a matrix full of random values).

I am certain it is the multiplication operation, because if I remove it and leave only the adds, they perform the same. Note that with the Zero matrix type, most of the time the result of the sum will be 0, so the operation will be "0.2*0".

This behaviour is certainly weird to me, as I thought that floating point operations were independent of the values of the operands, which does not look like the case here. I have also tried to capture and show SIGFPE exceptions in case that was the problem, but I obtained no results.

In case it helps, I am using an Intel Nehalem processor and gcc 4.4.3.

Solution

The problem has already mostly been diagnosed, but I will write up exactly what happens here.

Essentially, the questioner is modeling diffusion; an initial quantity on the boundary diffuses into the entirety of a large grid. At each time step t, the value at the leading edge of the diffusion will be 0.2^t (ignoring effects at the corners).

The smallest normalized single-precision value is 2^-126; when cur_time = 55, the value at the frontier of the diffusion is 0.2^55, which is a bit smaller than 2^-127. From this time step forward, some of the cells in the grid will contain denormal values. On the questioner's Nehalem, operations on denormal data are about 100 times slower than the same operation on normalized floating point data, explaining the slowdown.
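A quick way to see this threshold for yourself (a small self-contained check added here for illustration; it is not part of the original question or answer) is to repeatedly multiply by 0.2f and ask fpclassify where the value first becomes subnormal, which should report t = 55, matching the analysis above:

#include <stdio.h>
#include <math.h>
#include <float.h>

int main(void) {
    float v = 1.0f;
    int t;
    for (t = 1; t <= 60; t++) {
        v *= 0.2f;   /* value at the diffusion front after t time steps */
        if (fpclassify(v) == FP_SUBNORMAL) {
            printf("first subnormal at t = %d: %g (FLT_MIN = %g)\n", t, v, FLT_MIN);
            break;
        }
    }
    return 0;
}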

When the grid is initially filled with constant data of 1.0, the data never gets too small, and so the denormal stall is avoided.

Note that changing the data type to double would delay, but not alleviate the issue. If double precision is used for the computation, denormal values (now smaller than 2^-1022) will first arise in the 441st iteration.
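To spell out the arithmetic behind both figures: the diffusion front enters the denormal range once 0.2^t drops below the smallest normal value, i.e. once t*log2(5) exceeds the exponent limit. In single precision that means t > 126/log2(5) ≈ 54.3, so the first affected step is t = 55; in double precision it means t > 1022/log2(5) ≈ 440.2, hence the 441st iteration.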

At the cost of precision at the leading edge of the diffusion, you could fix the slowdown by enabling "Flush to Zero", which causes the processor to produce zero instead of denormal results in arithmetic operations. This is done by toggling a bit in the FPSCR or MXCSR, preferably via the functions defined in the <fenv.h> header in the C library.
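As a concrete sketch of that fix on x86 (my own illustration: it uses GCC's SSE intrinsic headers rather than <fenv.h>, since standard fenv.h does not portably expose the flush-to-zero bit), setting FTZ, and optionally DAZ, in the MXCSR looks like this:

#include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE */
#include <pmmintrin.h>   /* _MM_SET_DENORMALS_ZERO_MODE */

static void enable_ftz_daz(void) {
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);         /* denormal results become 0 */
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON); /* denormal inputs treated as 0 */
}

Calling enable_ftz_daz() once before the cur_time loop keeps the stencil arithmetic in the fast path, at the cost of losing the tiny values at the frontier.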

Another (hackier, less good) "fix" would be to fill the matrix initially with very small non-zero values (0x1.0p-126f, the smallest normal number). This would also prevent denormals from arising in the computation.
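A minimal sketch of that workaround, reusing the row-major layout from the question (the function name is mine; the fill value is FLT_MIN, i.e. 0x1.0p-126f):

#include <float.h>

static void fill_interior_with_tiny(float *mat, int rows, int cols) {
    int i, j;
    for (i = 1; i < rows - 1; i++)
        for (j = 1; j < cols - 1; j++)
            mat[i*cols + j] = FLT_MIN;   /* smallest normal float instead of 0.0f */
}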
