Cast performance from size_t to double

Views: 108  Posted: 2016/8/19 16:14:24  Tags: c performance casting

Problem description

TL;DR: Why is multiplying/casting data in size_t slow, and why does this vary per platform?

I'm having some performance issues that I don't fully understand. The context is a camera frame grabber where a 128x128 uint16_t image is read and post-processed at a rate of several hundred Hz.

In the post-processing I generate a histogram frame->histo, an array of uint32_t with thismaxval = 2^16 elements, in which I tally all intensity values. Using this histogram I calculate the sum and the sum of squares:

double sum=0, sumsquared=0;
size_t thismaxval = 1 << 16;

for(size_t i = 0; i < thismaxval; i++) {
    sum += (double)i * frame->histo[i];
    sumsquared += (double)(i * i) * frame->histo[i];
}
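For readers who want to reproduce the measurement, here is a minimal self-contained sketch of the histogram and the two sums described above. The names frame_t, HISTO_BINS, build_histogram and histogram_sums are hypothetical stand-ins introduced for illustration; only frame->histo and the loop body come from the post.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed bin count: one bin per possible uint16_t intensity (2^16). */
#define HISTO_BINS ((size_t)1 << 16)

/* Hypothetical stand-in for the frame structure; only histo is from the post. */
typedef struct {
    uint32_t histo[HISTO_BINS];
} frame_t;

/* Clear the histogram, then tally every pixel intensity. */
static void build_histogram(frame_t *frame, const uint16_t *image, size_t npixels)
{
    for (size_t i = 0; i < HISTO_BINS; i++)
        frame->histo[i] = 0;
    for (size_t p = 0; p < npixels; p++)
        frame->histo[image[p]]++;
}

/* Same arithmetic as the loop above: weighted sum and weighted sum of squares. */
static void histogram_sums(const frame_t *frame, double *sum, double *sumsquared)
{
    double s = 0.0, s2 = 0.0;
    for (size_t i = 0; i < HISTO_BINS; i++) {
        s  += (double)i * frame->histo[i];
        s2 += (double)(i * i) * frame->histo[i];
    }
    *sum = s;
    *sumsquared = s2;
}
```

A frame_t is 256 KiB, so it is probably best declared static or allocated on the heap rather than on the stack.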

Profiling the code with profile I got the following (samples, percentage, code):

 58228 32.1263 :  sum += (double)i * frame->histo[i];
116760 64.4204 :  sumsquared += (double)(i * i) * frame->histo[i];

That is, the first line takes up 32% of CPU time and the second line 64%.

I did some benchmarking and it seems to be the datatype/casting that's problematic. When I change the code to

uint_fast64_t isum=0, isumsquared=0;

for(uint_fast32_t i = 0; i < thismaxval; i++) {
    isum += i * frame->histo[i];
    isumsquared += (i * i) * frame->histo[i];
}
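One caveat when comparing the integer variants: the width of the multiplication depends on the operand types, not on the 64-bit accumulator. Assuming a 32-bit int, a 32-bit i makes (i * i) * histo[i] a 32-bit product that can wrap for large bins, and a signed int i overflows in i * i once i reaches 46341. The sketch below widens explicitly so every loop type computes the same value; histogram_sums_u64 and its parameters are hypothetical names, not from the post.

```c
#include <stddef.h>
#include <stdint.h>

/* Integer sums with both operands widened to 64 bits before multiplying. */
static void histogram_sums_u64(const uint32_t *histo, size_t nbins,
                               uint64_t *isum, uint64_t *isumsquared)
{
    uint64_t s = 0, s2 = 0;
    for (size_t i = 0; i < nbins; i++) {
        const uint64_t v = (uint64_t)i;      /* bin index, widened */
        const uint64_t count = histo[i];     /* bin count, widened */
        s  += v * count;
        s2 += v * v * count;
    }
    *isum = s;
    *isumsquared = s2;
}
```

For a 128x128 frame the largest possible sum of squares is 65535^2 * 16384, which is below 2^46, so a uint64_t accumulator has ample headroom.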

it runs ~10x faster. However, the size of this speedup varies per platform. On the workstation, a Core i7 CPU 950 @ 3.07GHz, the code is 10x faster. On my Macbook8,1, which has an Intel Core i7 Sandy Bridge 2.7 GHz (2620M), the code is only 2x faster.

Now I'm wondering:

Why is the original code so slow, and so easily sped up?

Why does this vary so much per platform?

Update:

I compiled the above code with

g++ -O3  -Wall cast_test.cc -o cast_test

Update 2:

I ran the optimized code through a profiler (Instruments on Mac, similar to Shark) and found two things:

1) The looping itself takes a considerable amount of time in some cases. thismaxval is of type size_t.

for(size_t i = 0; i < thismaxval; i++) takes 17% of my total runtime
for(uint_fast32_t i = 0; i < thismaxval; i++) takes 3.5%
for(int i = 0; i < thismaxval; i++) does not show up in the profiler, I assume it's less than 0.1%

2) The datatypes and casting matter as follows:

sumsquared += (double)(i * i) * histo[i]; 15% (with size_t i)
sumsquared += (double)(i * i) * histo[i]; 36% (with uint_fast32_t i)
isumsquared += (i * i) * histo[i]; 13% (with uint_fast32_t i, uint_fast64_t isumsquared)
isumsquared += (i * i) * histo[i]; 11% (with int i, uint_fast64_t isumsquared)

Surprisingly, int is faster than uint_fast32_t?
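One thing worth checking when comparing these loop types across machines is how wide each type actually is, since the C standard only requires uint_fast32_t to be at least 32 bits and leaves the exact width to the implementation. The sketch below reports the widths; loop_type_widths and print_loop_type_widths are hypothetical helper names, and whether a width difference explains the timings above is an open assumption, not an established result.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Fill out[0..2] with the widths in bits of size_t, uint_fast32_t and int. */
static void loop_type_widths(size_t out[3])
{
    out[0] = sizeof(size_t) * CHAR_BIT;
    out[1] = sizeof(uint_fast32_t) * CHAR_BIT;
    out[2] = sizeof(int) * CHAR_BIT;
}

/* Print the widths so they can be compared between machines and compilers. */
static void print_loop_type_widths(void)
{
    size_t w[3];
    loop_type_widths(w);
    printf("size_t:        %zu bits\n", w[0]);
    printf("uint_fast32_t: %zu bits\n", w[1]);
    printf("int:           %zu bits\n", w[2]);
}
```

Calling print_loop_type_widths() on both the workstation and the Macbook would show whether the same source line is being compiled with different operand widths on the two platforms.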

Update 4:

I ran some more tests with different datatypes and different compilers, all on one machine. The results are as follows.

For tests 0--2 the relevant code is

    for(loop_t i = 0; i < thismaxval; i++)
        sumsquared += (double)(i * i) * histo[i];

with sumsquared a double, and loop_t being size_t, uint_fast32_t and int for tests 0, 1 and 2 respectively.

For tests 3--5 the code is

    for(loop_t i = 0; i < thismaxval; i++)
        isumsquared += (i * i) * histo[i];

with isumsquared of type uint_fast64_t and loop_t again being size_t, uint_fast32_t and int for tests 3, 4 and 5 respectively.

The compilers I used are gcc 4.2.1, gcc 4.4.7, gcc 4.6.3 and gcc 4.7.0. The timings are percentages of the total CPU time of the code, so they show relative performance, not absolute (although the runtime was quite constant at 21s). The CPU time is for both lines of the loop combined, because I'm not sure the profiler separated the two lines of code correctly.


gcc:    4.2.1  4.4.7  4.6.3  4.7.0
----------------------------------
test 0: 21.85  25.15  22.05  21.85
test 1: 21.9   25.05  22     22
test 2: 26.35  25.1   21.95  19.2
test 3: 7.15   8.35   18.55  19.95
test 4: 11.1   8.45   7.35   7.1
test 5: 7.1    7.8    6.9    7.05

Based on this, it seems that casting is expensive, regardless of which integer type I use.

Also, it seems gcc 4.6 and 4.7 are not able to optimize test 3 (size_t and uint_fast64_t) properly.
