Integer vs double arithmetic performance?


Question

I'm writing a C# class to perform 2D separable convolution using integers, hoping for better performance than the double counterpart. The problem is that I don't get a real performance gain.

This is the X filter code (it is valid both for int and double cases):

foreach (pixel)
{
    int value = 0;
    for (int k = 0; k < filterOffsetsX.Length; k++)
    {
        // index is relative to the current pixel position
        value += InputImage[index + filterOffsetsX[k]] * filterValuesX[k];
    }
    tempImage[index] = value;
}

In the integer case, "value", "InputImage" and "tempImage" are of types "int", "Image<byte>" and "Image<int>".
In the double case, "value", "InputImage" and "tempImage" are of types "double", "Image<double>" and "Image<double>".
("filterValues" is int[] in each case.)
(The class Image<T> is part of an external DLL. It should be similar to the .NET Drawing Image class.)

My goal is to achieve faster performance thanks to int += (byte * int) vs double += (double * int).

The following times are the mean of 200 repetitions.
Filter size 9 = 0.031 (double) 0.027 (int)
Filter size 13 = 0.042 (double) 0.038 (int)
Filter size 25 = 0.078 (double) 0.070 (int)

The performance gain is minimal. Can this be caused by pipeline stalls and suboptimal code?

EDIT: Simplified the code by deleting unimportant variables.

EDIT2: I don't think I have a cache-miss problem, because "index" iterates through adjacent memory cells (row by row). Moreover, "filterOffsetsX" contains only small offsets relative to pixels on the same row, at a maximum distance of filter size / 2. The problem may be present in the second separable filter (the Y filter), but the times are not so different.

Recommended answer

It seems like you are saying you are only running that inner loop 5000 times in even your longest case. The FPU last I checked (admittedly a long time ago) only took about 5 more cycles to perform a multiply than the integer unit. So by using integers you would be saving about 25,000 CPU cycles. That's assuming no cache misses or anything else that would cause the CPU to sit and wait in either event.

Assuming a modern Intel Core CPU clocked in the neighborhood of 2.5 GHz, you could expect to have saved about 10 microseconds of runtime by using the integer unit. Kinda paltry. I do realtime programming for a living, and we wouldn't sweat that much CPU wastage here, even if we were missing a deadline somewhere.
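Spelled out, the estimate takes the figures above as given (about 5000 inner-loop iterations, about 5 extra cycles per floating-point multiply, a 2.5 GHz clock). These are rough assumptions rather than measurements:

```latex
\[
5000 \ \text{iterations} \times 5 \ \frac{\text{cycles}}{\text{iteration}} = 25{,}000 \ \text{cycles},
\qquad
\frac{25{,}000 \ \text{cycles}}{2.5 \times 10^{9} \ \text{cycles/s}} = 1 \times 10^{-5} \ \text{s} = 10 \ \mu\text{s}.
\]
```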

digEmAll makes a very good point in the comments though. If the compiler and optimizer are doing their jobs, the entire thing is pipelined. That means that in actuality the entire inner loop will take 5 cycles longer to run with the FPU than with the integer unit, not each operation in it. If that were the case, your expected time savings would be so small that it would be tough to measure them.

If you really are doing enough floating-point ops to make the entire shebang take a very long time, I'd suggest looking into doing one or more of the following:

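
  1. Parallelize the algorithm and run it on every CPU available from your processor.

  2. Don't run it on the CLR (use native C++, or Ada, or Fortran, or something).

  3. Rewrite it to run on the GPU. GPUs are essentially array processors, designed to do massively parallel math on arrays of floating-point values.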
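
As a rough illustration of the first suggestion, here is a minimal sketch of the X filter parallelized over rows with Parallel.For. It assumes the image is a flat, row-major byte[] rather than the external Image<T> class, and the names ParallelRowFilter and FilterX are hypothetical. Pixels whose filter taps would fall outside their row are simply left at 0, which may not match the border handling of the original code.

```csharp
using System;
using System.Threading.Tasks;

public static class ParallelRowFilter
{
    // Hypothetical helper (not from the original post): applies a 1D horizontal
    // filter to a row-major image stored in a flat array.
    public static int[] FilterX(byte[] input, int width, int height,
                                int[] offsets, int[] values)
    {
        var output = new int[input.Length];

        // Largest distance any tap reaches from the current pixel.
        int maxReach = 0;
        foreach (int o in offsets) maxReach = Math.Max(maxReach, Math.Abs(o));

        // Rows are independent and write to disjoint output ranges,
        // so each row can be processed on a different core without locking.
        Parallel.For(0, height, y =>
        {
            int rowStart = y * width;
            // Skip border pixels whose taps would leave the row (assumption).
            for (int x = maxReach; x < width - maxReach; x++)
            {
                int index = rowStart + x;
                int value = 0;
                for (int k = 0; k < offsets.Length; k++)
                {
                    value += input[index + offsets[k]] * values[k];
                }
                output[index] = value;
            }
        });

        return output;
    }
}
```

Whether this helps depends on the image size: for the small per-call times reported in the question, the overhead of scheduling the parallel work could outweigh the gain, so it would need to be measured.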
