Integer vs double arithmetic performance?


Problem Description


I'm writing a C# class that performs 2D separable convolution using integers, to obtain better performance than the double counterpart. The problem is that I don't see a real performance gain.

This is the X filter code (it is valid both for int and double cases):

foreach (pixel)   // pseudocode: iterate over every pixel index
{
      int value = 0;
      for (int k = 0; k < filterOffsetsX.Length; k++)
      {
          value += InputImage[index + filterOffsetsX[k]] * filterValuesX[k];  // index is relative to the current pixel position
      }
      tempImage[index] = value;
}

In the integer case, "value", "InputImage" and "tempImage" are of types "int", "Image<byte>" and "Image<int>".
In the double case, "value", "InputImage" and "tempImage" are of types "double", "Image<double>" and "Image<double>".
(filterValues is int[] in each case.)
(The class Image<T> is part of an external DLL. It should be similar to the .NET Drawing Image class.)

My goal is to achieve faster performance thanks to int += (byte * int) vs double += (double * int).

The following times are the mean of 200 repetitions (in seconds):
Filter size 9 = 0.031 (double) 0.027 (int)
Filter size 13 = 0.042 (double) 0.038 (int)
Filter size 25 = 0.078 (double) 0.070 (int)
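For reference, numbers of this kind can be gathered with a standalone harness along these lines (a sketch, assuming a flat byte[] image buffer and Stopwatch timing; the buffer size, filter weights and variable names are illustrative, not taken from the asker's Image<T> code):

```csharp
using System;
using System.Diagnostics;

class ConvolutionBenchmark
{
    static void Main()
    {
        const int width = 512, height = 512, reps = 200;
        var rand = new Random(42);
        byte[] input = new byte[width * height];
        rand.NextBytes(input);

        // Illustrative 9-tap X filter: small integer weights, offsets within one row.
        int[] filterValues  = { 1, 2, 4, 8, 16, 8, 4, 2, 1 };
        int[] filterOffsets = { -4, -3, -2, -1, 0, 1, 2, 3, 4 };

        int[] tempInt = new int[width * height];
        double[] tempDouble = new double[width * height];

        // int += (byte * int) accumulation
        var sw = Stopwatch.StartNew();
        for (int r = 0; r < reps; r++)
            for (int i = 4; i < input.Length - 4; i++)
            {
                int value = 0;
                for (int k = 0; k < filterOffsets.Length; k++)
                    value += input[i + filterOffsets[k]] * filterValues[k];
                tempInt[i] = value;
            }
        sw.Stop();
        Console.WriteLine($"int:    {sw.Elapsed.TotalSeconds / reps:F4} s per pass");

        // double += (double * int) accumulation
        sw.Restart();
        for (int r = 0; r < reps; r++)
            for (int i = 4; i < input.Length - 4; i++)
            {
                double value = 0;
                for (int k = 0; k < filterOffsets.Length; k++)
                    value += input[i + filterOffsets[k]] * filterValues[k];
                tempDouble[i] = value;
            }
        sw.Stop();
        Console.WriteLine($"double: {sw.Elapsed.TotalSeconds / reps:F4} s per pass");
    }
}
```

Run it in Release mode without the debugger attached, otherwise the JIT keeps optimizations off and the comparison is meaningless.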

The performance gain is minimal. Could this be caused by pipeline stalls or suboptimal code?

EDIT: simplified the code by deleting unimportant vars.

EDIT2: I don't think I have a cache-miss-related problem, because "index" iterates through adjacent memory cells (in row-after-row fashion). Moreover, "filterOffsetsX" contains only small offsets, relative to pixels on the same row and at a maximum distance of filter size / 2. The problem could be present in the second separable filter (the Y filter), but the times are not so different.

Solution

It seems like you are saying you are only running that inner loop 5000 times, even in your longest case. The FPU, last I checked (admittedly a long time ago), only took about 5 more cycles to perform a multiply than the integer unit. So by using integers you would be saving about 25,000 CPU cycles. That's assuming no cache misses or anything else that would cause the CPU to sit and wait in either event.

Assuming a modern Intel Core CPU clocked in the neighborhood of 2.5 GHz, you could expect to have saved about 10 microseconds of runtime by using the integer unit. Kinda paltry. I do realtime programming for a living, and we wouldn't sweat that much CPU wastage here, even if we were missing a deadline somewhere.

digEmAll makes a very good point in the comments though. If the compiler and optimizer are doing their jobs, the entire thing is pipelined. That means that in actuality the entire inner loop will take 5 cycles longer to run with the FPU than the integer unit, not each operation in it. If that were the case, your expected time savings would be so small it would be tough to measure them.

If you really are doing enough floating-point ops to make the entire shebang take a very long time, I'd suggest looking into doing one or more of the following:

  1. Parallelize your algorithm and run it on every core your processor makes available.
  2. Don't run it on the CLR (use native C++, or Ada or Fortran or something).
  3. Rewrite it to run on the GPU. GPUs are essentially array processors and are designed to do massively parallel math on arrays of floating-point values.
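Suggestion 1 can be sketched with the Task Parallel Library: splitting the image by rows keeps each thread working on contiguous memory, which suits the row-after-row access pattern the asker describes. This is a sketch under assumptions; the flat byte[] buffer and the FilterX signature are illustrative stand-ins for the asker's Image<T> type:

```csharp
using System;
using System.Threading.Tasks;

class ParallelConvolution
{
    // X-direction pass of a separable filter, parallelized over rows.
    static void FilterX(byte[] input, int[] output, int width, int height,
                        int[] filterOffsets, int[] filterValues)
    {
        int radius = filterOffsets.Length / 2;
        Parallel.For(0, height, y =>
        {
            int rowStart = y * width;
            // Skip the row borders so every offset stays inside the row.
            for (int x = radius; x < width - radius; x++)
            {
                int index = rowStart + x;
                int value = 0;
                for (int k = 0; k < filterOffsets.Length; k++)
                    value += input[index + filterOffsets[k]] * filterValues[k];
                output[index] = value;
            }
        });
    }

    static void Main()
    {
        int width = 8, height = 4;
        byte[] input = new byte[width * height];
        for (int i = 0; i < input.Length; i++) input[i] = (byte)i;

        int[] output = new int[width * height];
        FilterX(input, output, width, height,
                new[] { -1, 0, 1 }, new[] { 1, 2, 1 });  // simple 3-tap blur
        Console.WriteLine(output[width + 3]);  // interior sample: 10 + 2*11 + 12 = 44
    }
}
```

Rows are independent in the X pass, so no locking is needed; each Parallel.For iteration writes only to its own row of the output buffer.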
