Cast performance from size_t to double

Question

TL;DR: Why is multiplying/casting data in size_t slow and why does this vary per platform?

I'm having some performance issues that I don't fully understand. The context is a camera frame grabber where a 128x128 uint16_t image is read and post-processed at a rate of several 100 Hz.

In the post-processing I generate a histogram frame->histo, an array of uint32_t with thismaxval = 2^16 elements; basically I tally all intensity values. Using this histogram I calculate the sum and squared sum:

double sum=0, sumsquared=0;
size_t thismaxval = 1 << 16;

for(size_t i = 0; i < thismaxval; i++) {
    sum += (double)i * frame->histo[i];
    sumsquared += (double)(i * i) * frame->histo[i];
}

Profiling the code with profile I got the following (samples, percentage, code):

 58228 32.1263 :  sum += (double)i * frame->histo[i];
116760 64.4204 :  sumsquared += (double)(i * i) * frame->histo[i];

or, the first line takes up 32% of CPU time, the second line 64%.

I did some benchmarking and it seems to be the datatype/casting that's problematic. When I change the code to

uint_fast64_t isum=0, isumsquared=0;

for(uint_fast32_t i = 0; i < thismaxval; i++) {
    isum += i * frame->histo[i];
    isumsquared += (i * i) * frame->histo[i];
}

it runs ~10x faster. However, this performance hit also varies per platform. On the workstation, a Core i7 CPU 950 @ 3.07GHz, the code is 10x faster. On my Macbook8,1, which has an Intel Core i7 Sandy Bridge 2.7 GHz (2620M), the code is only 2x faster.

Now I am wondering:

  1. Why is the original code so slow, and so easily sped up?
  2. Why does this vary so much per platform?

Update:

I compiled the above code with

g++ -O3  -Wall cast_test.cc -o cast_test

UPDATE2:

I ran the optimized codes through a profiler (Instruments on Mac, like Shark) and found two things:

1) The looping itself takes a considerable amount of time in some cases. thismaxval is of type size_t.

  1. for(size_t i = 0; i < thismaxval; i++) takes 17% of my total runtime
  2. for(uint_fast32_t i = 0; i < thismaxval; i++) takes 3.5%
  3. for(int i = 0; i < thismaxval; i++) does not show up in the profiler, I assume it's less than 0.1%

2) The datatypes and casting matter as follows:

  1. sumsquared += (double)(i * i) * histo[i]; 15% (with size_t i)
  2. sumsquared += (double)(i * i) * histo[i]; 36% (with uint_fast32_t i)
  3. isumsquared += (i * i) * histo[i]; 13% (with uint_fast32_t i, uint_fast64_t isumsquared)
  4. isumsquared += (i * i) * histo[i]; 11% (with int i, uint_fast64_t isumsquared)

Surprisingly, int is faster than uint_fast32_t?

UPDATE4:

I ran some more tests with different datatypes and different compilers, on one machine. The results are as follows.

For tests 0--2 the relevant code is

    for(loop_t i = 0; i < thismaxval; i++)
        sumsquared += (double)(i * i) * histo[i];

with sumsquared a double, and loop_t size_t, uint_fast32_t and int for tests 0, 1 and 2.

For tests 3--5 the code is

    for(loop_t i = 0; i < thismaxval; i++)
        isumsquared += (i * i) * histo[i];

with isumsquared of type uint_fast64_t and loop_t again size_t, uint_fast32_t and int for tests 3, 4 and 5.

The compilers I used are gcc 4.2.1, gcc 4.4.7, gcc 4.6.3 and gcc 4.7.0. The timings are percentages of the total CPU time of the code, so they show relative performance, not absolute (although the runtime was quite constant at 21s). The CPU time covers both lines together, because I'm not quite sure the profiler separated the two lines of code correctly.


gcc:    4.2.1  4.4.7  4.6.3  4.7.0
----------------------------------
test 0: 21.85  25.15  22.05  21.85
test 1: 21.9   25.05  22     22
test 2: 26.35  25.1   21.95  19.2
test 3: 7.15   8.35   18.55  19.95
test 4: 11.1   8.45   7.35   7.1
test 5: 7.1    7.8    6.9    7.05

Based on this, it seems that casting is expensive, regardless of what integer type I use.

Also, it seems gcc 4.6 and 4.7 are not able to optimize loop 3 (size_t and uint_fast64_t) properly.

Recommended answer

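For your original questions:

  1. The code is slow because it involves a conversion from integer to floating-point data types. That is why it is easily sped up when you also use an integer data type for the sum variables: the float conversion is no longer required.

  2. The difference is the result of several factors. For example, it depends on how efficiently a platform can perform an int->float conversion. Furthermore, this conversion could also interfere with processor-internal optimizations in the program flow and prediction engine, caches, ... and the internal parallelization features of the processor can also have a huge influence on such calculations.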

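For the additional questions:

  • "Surprisingly int is faster than uint_fast32_t"? What are sizeof(size_t) and sizeof(int) on your platform? One guess I can make is that both are probably 64-bit, so a cast to 32-bit not only can give you calculation errors but also incurs a penalty for casting between different sizes.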

In general, avoid both visible and hidden casts wherever they are not really necessary. For example, find out which actual data type is behind size_t in your environment (gcc) and use that type for the loop variable. In your example, the square of an unsigned integer is always an integer, so there is no need to use double here. Stick to integer types to achieve maximum performance.
