Slow heap array performance

Question

I'm experiencing a strange memory access performance problem. Any ideas?

int* pixel_ptr = somewhereFromHeap;

int local_ptr[307200]; //local

//this is very slow
for(int i=0;i<307200;i++){
  pixel_ptr[i] = someCalculatedVal ;
}

//this is very slow
for(int i=0;i<307200;i++){
  pixel_ptr[i] = 1 ; //constant
}

//this is fast
for(int i=0;i<307200;i++){
  int val = pixel_ptr[i];
  local_ptr[i] = val;
}

//this is fast
for(int i=0;i<307200;i++){
  local_ptr[i] = someCalculatedVal ;
}

I tried consolidating the values into a local scanline buffer first:

int scanline[640]; // local

//this is very slow
for(int i=xMin;i<xMax;i++){
  int screen_pos = sy*screen_width+i;
  int val = scanline[i];
  pixel_ptr[screen_pos] = val ;
}

//this is fast
for(int i=xMin;i<xMax;i++){
  int screen_pos = sy*screen_width+i;
  int val = scanline[i];
  pixel_ptr[screen_pos] = 1 ; //constant
}

//this is fast
for(int i=xMin;i<xMax;i++){
  int screen_pos = sy*screen_width+i;
  int val = i; //or a constant
  pixel_ptr[screen_pos] = val ;
}

//this is slow
for(int i=xMin;i<xMax;i++){
  int screen_pos = sy*screen_width+i;
  int val = scanline[0];
  pixel_ptr[screen_pos] = val ;
}

Any ideas? I'm using MinGW with CFLAGS -O1 -std=c++11 -fpermissive.

Update 4: I should add that these are snippets from my program, with heavy code/functions running before and after them. The scanline block ran at the end of a function, just before it returned.

Now with a proper test program, thanks to @Iwillnotexist.

#include <stdio.h>
#include <unistd.h>
#include <sys/time.h>

#define SIZE 307200
#define SAMPLES 1000

double local_test(){
    int local_array[SIZE];

    timeval start, end;
    long cpu_time_used_sec,cpu_time_used_usec;
    double cpu_time_used;

    gettimeofday(&start, NULL);
    for(int i=0;i<SIZE;i++){
        local_array[i] = i;
    }
    gettimeofday(&end, NULL);
    cpu_time_used_sec = end.tv_sec- start.tv_sec;
    cpu_time_used_usec = end.tv_usec- start.tv_usec;
    cpu_time_used = cpu_time_used_sec*1000 + cpu_time_used_usec/1000.0;

    return cpu_time_used;
}

double heap_test(){
    int* heap_array=new int[SIZE];

    timeval start, end;
    long cpu_time_used_sec,cpu_time_used_usec;
    double cpu_time_used;

    gettimeofday(&start, NULL);
    for(int i=0;i<SIZE;i++){
        heap_array[i] = i;
    }
    gettimeofday(&end, NULL);
    cpu_time_used_sec = end.tv_sec- start.tv_sec;
    cpu_time_used_usec = end.tv_usec- start.tv_usec;
    cpu_time_used = cpu_time_used_sec*1000 + cpu_time_used_usec/1000.0;

    delete[] heap_array;

    return cpu_time_used;
}


double heap_test2(){
    static int* heap_array = NULL;

    if(heap_array==NULL){
        heap_array = new int[SIZE];
    }

    timeval start, end;
    long cpu_time_used_sec,cpu_time_used_usec;
    double cpu_time_used;

    gettimeofday(&start, NULL);
    for(int i=0;i<SIZE;i++){
        heap_array[i] = i;
    }
    gettimeofday(&end, NULL);
    cpu_time_used_sec = end.tv_sec- start.tv_sec;
    cpu_time_used_usec = end.tv_usec- start.tv_usec;
    cpu_time_used = cpu_time_used_sec*1000 + cpu_time_used_usec/1000.0;

    return cpu_time_used;
}


int main (int argc, char** argv){
    double cpu_time_used = 0;

    for(int i=0;i<SAMPLES;i++)
        cpu_time_used+=local_test();

    printf("local: %f ms\n",cpu_time_used);

    cpu_time_used = 0;

    for(int i=0;i<SAMPLES;i++)
        cpu_time_used+=heap_test();

    printf("heap_: %f ms\n",cpu_time_used);

    cpu_time_used = 0;

    for(int i=0;i<SAMPLES;i++)
        cpu_time_used+=heap_test2();

    printf("heap2: %f ms\n",cpu_time_used);

}

Compiled with no optimization.

local: 577.201000 ms
heap_: 826.802000 ms
heap2: 686.401000 ms

The first heap test, with new and delete, is about 1.4x slower. (Paging, as suggested?)

The second heap test, which reuses the heap array, is still 1.2x slower. But I guess the second test is not that practical, since there tends to be other code running before and after, at least in my case. In my program, pixel_ptr is of course allocated only once, during program initialization.

But if anyone has solutions or ideas for speeding things up, please reply!

I'm still perplexed why heap writes are so much slower than writes to the stack segment. Surely there must be some tricks to make the heap more CPU/cache friendly.

Final update:

I looked at the disassembly again, and this time I suddenly realized why some of my breakpoints don't activate. The program looks suspiciously short, so I suspect the compiler removed the redundant dummy code I put in, which would explain why the local array is magically many times faster.
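One common way to make it harder for an optimizer to discard a benchmark loop is to make the loop's result observable. The following is only a sketch, not the poster's code: `fill_and_sum` is a hypothetical helper that writes the array and then returns a checksum the caller is expected to use (print it, for example). It is not a guarantee, since an aggressive optimizer may still simplify the work, so checking the disassembly remains the definitive test.

```cpp
#include <cstddef>

// Hypothetical helper (not from the original program): fills buf[0..n) with
// each element's index, then returns the sum of what was written. Because the
// return value depends on the stored data, the stores feed an observable result.
long long fill_and_sum(int* buf, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        buf[i] = static_cast<int>(i);
    }
    long long sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        sum += buf[i];
    }
    return sum;
}
```

In a benchmark, the timed region would call `fill_and_sum` and the returned value would be printed after the timing ends, so that the work cannot simply be dropped as unused.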

Recommended answer

I was a bit curious so I did the test, and indeed I could measure a difference between stack and heap access.

The first guess would be that the generated assembly is different, but after taking a look, it is actually identical for heap and stack (which makes sense: the instructions that access memory do not distinguish between the two).

If the assembly is the same, then the difference must come from the paging mechanism. My guess is that on the stack the pages are already allocated, whereas on the heap the first access causes a page fault and a page allocation (invisible to the program; it all happens at kernel level). To verify this, I ran the same test but accessed the heap once before measuring. That test gave identical times for stack and heap. To be sure, I also ran a test in which I first accessed the heap only every 4096 bytes (every 1024 ints), and then only every 8192 bytes, because a page is usually 4096 bytes long. Pre-accessing every 4096 bytes again gave the same time for heap and stack, while pre-accessing every 8192 bytes left a difference, though a smaller one than with no previous access at all. This is because only half of the pages had been accessed and allocated beforehand.
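The pre-access experiment described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the answerer's actual test code: `touch_pages` and `kPageBytes` are hypothetical names, and the 4096-byte page size is the typical value mentioned in the answer rather than something queried from the OS.

```cpp
#include <cstddef>

// Assumed page size; 4096 bytes is typical but not guaranteed on every system.
constexpr std::size_t kPageBytes = 4096;

// Hypothetical helper: writes one int every stride_bytes across buf[0..n),
// so that the pages containing those elements are faulted in before any
// timed loop runs. Returns the number of writes performed.
// Note: buf need not be page-aligned, so a trailing partial page may be missed.
std::size_t touch_pages(int* buf, std::size_t n,
                        std::size_t stride_bytes = kPageBytes) {
    std::size_t step = stride_bytes / sizeof(int);
    if (step == 0) {
        step = 1;  // guard against strides smaller than one int
    }
    std::size_t touched = 0;
    for (std::size_t i = 0; i < n; i += step) {
        buf[i] = 0;
        ++touched;
    }
    return touched;
}
```

Calling `touch_pages(heap_array, SIZE)` before starting the timer corresponds to the every-4096-bytes variant, and passing `2 * kPageBytes` as the stride corresponds to the every-8192-bytes variant, which pre-faults only about half of the pages.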

So the answer is that on the stack the memory pages are already allocated, but on the heap the pages are allocated on the fly. This depends on the OS paging policy, but all major PC operating systems probably have a similar one.

For all the tests I used Windows, with the MS compiler targeting x64.

Edit: For the test, I measured a single, larger loop, so there was only one access to each memory location. Deleting the array and measuring the same loop multiple times should give similar times for stack and heap, because deleting memory probably does not deallocate the pages, so they are already allocated for the next loop (if the next new allocates in the same space).
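To compare a first pass over freshly allocated memory with a later pass over the same memory, a small timing helper is enough. The sketch below is an assumption-laden illustration rather than the answerer's code: `time_fill_ms` is a hypothetical name, and it uses `std::chrono::steady_clock` instead of the `gettimeofday` calls in the question.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical helper: times one pass that writes buf[i] = i over buf[0..n)
// and returns the elapsed wall-clock time in milliseconds.
double time_fill_ms(int* buf, std::size_t n) {
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; ++i) {
        buf[i] = static_cast<int>(i);
    }
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}
```

Calling it twice in a row on the same buffer obtained from `new int[n]` separates the two cases: if the paging explanation holds, the first call includes the cost of the page faults and the second call should not.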
