Floating-point number vs fixed-point number: speed on Intel I5 CPU


Question

I have a C/C++ program which involves intensive 32-bit floating-point matrix math computations such as addition, subtraction, multiplication, division, etc.

Can I speed up my program by converting 32-bit floating-point numbers into 16-bit fixed-point numbers? How much speed gain can I get?

Currently I'm working on an Intel I5 CPU. I'm using OpenBLAS to perform the matrix calculations. How should I re-implement OpenBLAS functions such as cblas_dgemm to perform fixed-point calculations?

I know that SSE (Streaming SIMD Extensions) operates on 4x32 = 8x16 = 128 bits of data at a time, i.e., four 32-bit floating-point values or eight 16-bit fixed-point values. I guess that after converting from 32-bit floating point to 16-bit fixed point, my program would be twice as fast.
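
For illustration, a minimal sketch (the function names are mine, purely illustrative) of how one 128-bit register holds four floats versus eight 16-bit integers:

```cpp
#include <emmintrin.h>  // SSE2
#include <cstdint>

// One 128-bit XMM register holds 4 float lanes or 8 int16_t lanes,
// so each add below is a single instruction, but the 16-bit
// version covers twice as many elements.
void add_floats(const float* a, const float* b, float* out) {
    __m128 va = _mm_loadu_ps(a);                      // 4 x 32-bit float
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}

void add_fixed16(const int16_t* a, const int16_t* b, int16_t* out) {
    __m128i va = _mm_loadu_si128((const __m128i*)a);  // 8 x 16-bit int
    __m128i vb = _mm_loadu_si128((const __m128i*)b);
    _mm_storeu_si128((__m128i*)out, _mm_add_epi16(va, vb));
}
```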

Answer

Summary: Modern FPU hardware is hard to beat with fixed-point, even if you have twice as many elements per vector.

Modern BLAS libraries are typically very well tuned for cache performance (with cache blocking / loop tiling) as well as for instruction throughput. That makes them very hard to beat. Especially DGEMM has lots of room for this kind of optimization, because it does O(N^3) work on O(N^2) data, so it's worth transposing just a cache-sized chunk of one input, and stuff like that.
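
To make the cache-blocking idea concrete, here is a minimal loop-tiling sketch for C += A*B. This is not OpenBLAS's actual implementation; the BLOCK size and loop order are illustrative assumptions, and real libraries add packing, SIMD, and multiple blocking levels on top:

```cpp
#include <cstddef>

// Cache-blocking (loop tiling) sketch for C += A * B on n x n
// row-major matrices. BLOCK is a tuning assumption: pick it so a
// tile of each matrix fits in cache.
constexpr std::size_t BLOCK = 64;

void gemm_tiled(const float* A, const float* B, float* C, std::size_t n) {
    for (std::size_t ii = 0; ii < n; ii += BLOCK)
    for (std::size_t kk = 0; kk < n; kk += BLOCK)
    for (std::size_t jj = 0; jj < n; jj += BLOCK)
        // One cache-sized tile at a time: each loaded element of
        // A and B gets reused ~BLOCK times before eviction.
        for (std::size_t i = ii; i < ii + BLOCK && i < n; ++i)
            for (std::size_t k = kk; k < kk + BLOCK && k < n; ++k) {
                float a = A[i * n + k];
                for (std::size_t j = jj; j < jj + BLOCK && j < n; ++j)
                    C[i * n + j] += a * B[k * n + j];
            }
}
```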

What might help is reducing memory bottlenecks by storing your floats in 16-bit half-float format. There is no hardware support for doing math on them in that format, just a couple of instructions to convert between that format and normal 32-bit-element float vectors while loading/storing: VCVTPH2PS (__m256 _mm256_cvtph_ps(__m128i)) and VCVTPS2PH (__m128i _mm256_cvtps_ph(__m256 m1, const int imm8_rounding_control)). These two instructions comprise the F16C extension, first supported by AMD Bulldozer and Intel IvyBridge.
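
A minimal sketch of how those intrinsics get used (the helper function and its name are hypothetical). Note the actual math still happens on 32-bit floats; F16C only converts at the load/store boundary:

```cpp
#include <immintrin.h>  // AVX + F16C; compile with -mavx -mf16c
#include <cstdint>
#include <cstddef>

// Hypothetical helper: scale an array stored as 16-bit half floats.
void scale_halves(uint16_t* data, std::size_t n, float factor) {
    __m256 vf = _mm256_set1_ps(factor);
    for (std::size_t i = 0; i + 8 <= n; i += 8) {
        __m128i h = _mm_loadu_si128((const __m128i*)(data + i));
        __m256  f = _mm256_cvtph_ps(h);   // VCVTPH2PS: 8 halves -> 8 floats
        f = _mm256_mul_ps(f, vf);         // the real math is done in FP32
        __m128i r = _mm256_cvtps_ph(f,
            _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
        _mm_storeu_si128((__m128i*)(data + i), r);  // VCVTPS2PH: store halves
    }
    // (scalar handling of the n % 8 tail omitted for brevity)
}
```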

IDK if any BLAS libraries support that format.

SSE/AVX doesn't have any integer division instructions. If you're only dividing by constants, you might not need a real div instruction, though. So that's one major stumbling block for fixed point.
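
For example, division by a compile-time constant can be replaced by a multiply-high and a shift. A minimal sketch for unsigned 16-bit division by 10 (the helper is illustrative; the constant is the standard fixed-point-reciprocal trick):

```cpp
#include <emmintrin.h>  // SSE2

// x / 10 == (x * 52429) >> 19 for all 16-bit unsigned x,
// where 52429 = ceil(2^19 / 10).
// PMULHUW already gives (x * m) >> 16, so only >> 3 remains.
__m128i div10_u16(__m128i x) {
    const __m128i m = _mm_set1_epi16((short)52429);  // bit pattern 0xCCCD
    __m128i hi = _mm_mulhi_epu16(x, m);              // (x * 52429) >> 16
    return _mm_srli_epi16(hi, 3);                    // total shift: >> 19
}
```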

Another big downside of fixed point is the extra cost of shifting to correct the position of the decimal (binary?) point after multiplies. That will eat into any gain you could get from having twice as many elements per vector with 16-bit fixed point.
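
A minimal sketch of what that correction costs, assuming a hypothetical Q8.8 layout (8 integer bits, 8 fraction bits): the multiply itself takes two instructions, plus three more just to put the binary point back:

```cpp
#include <emmintrin.h>  // SSE2

// Q8.8 multiply: the full 32-bit product has 16 fraction bits,
// so the result must be shifted right by 8 to restore the
// binary point. A plain FP multiply needs none of this.
__m128i mul_q8_8(__m128i a, __m128i b) {
    __m128i lo = _mm_mullo_epi16(a, b);   // product bits  0..15
    __m128i hi = _mm_mulhi_epi16(a, b);   // product bits 16..31
    // Take bits 8..23 of each product: (a*b) >> 8, wrapping on overflow.
    return _mm_or_si128(_mm_srli_epi16(lo, 8),
                        _mm_slli_epi16(hi, 8));
}
```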

SSE/AVX actually has quite a good selection of packed 16-bit multiplies (better than for any other element size). There's packed multiply producing the low half, high half (signed or unsigned), and even one that takes 16 bits from 2 bits below the top, with rounding (PMULHRSW). Skylake runs those at two per clock, with 5 cycle latency. There are also integer multiply-add instructions, but they do a horizontal add between pairs of multiply results. (See Agner Fog's insn tables, and also the x86 tag wiki for performance links.) Haswell and previous don't have as many integer-vector add and multiply execution units. Often code bottlenecks on total uop throughput, not on a specific execution port anyway. (But a good BLAS library might even have hand-tuned asm.)
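
If the format is Q15 specifically (an assumption: value = int16 / 2^15), PMULHRSW folds the binary-point correction into the multiply itself:

```cpp
#include <tmmintrin.h>  // SSSE3 for PMULHRSW

// PMULHRSW computes ((a*b >> 14) + 1) >> 1, i.e. round(a*b / 2^15),
// which is exactly a Q15 x Q15 -> Q15 multiply in one instruction.
// (Sole edge case: -1.0 * -1.0 wraps back to -1.0, since +1.0 is
// not representable in Q15.)
__m128i mul_q15(__m128i a, __m128i b) {
    return _mm_mulhrs_epi16(a, b);
}
```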

If your inputs and outputs are integer, it's often faster to work with integer vectors, instead of converting to floats. (e.g. see my answer on Scaling byte pixel values (y=ax+b) with SSE2 (as floats)?, where I used 16-bit fixed-point to deal with 8-bit integers).

But if you're really working with floats, and have a lot of multiplying and dividing to do, just use the hardware FPUs. They're amazingly powerful in modern CPUs, and have made fixed-point mostly obsolete for many tasks. As @Iwill points out, FMA instructions are another big boost for FP throughput (and sometimes latency).

Integer add/subtract/compare instructions (but not multiply) are also lower latency than their FP counterparts.
