Using C/Intel assembly, what is the fastest way to test if a 128-byte memory block contains all zeros?


Question


Continuing on from my first question, I am trying to optimize a memory hotspot found via VTune profiling of a 64-bit C program.


In particular, I'd like to find the fastest way to test if a 128-byte block of memory contains all zeros. You may assume any desired memory alignment for the memory block; I used 64-byte alignment.


I am using a PC with an Intel Ivy Bridge Core i7 3770 processor, 32 GB of memory, and the free version of the Microsoft Visual Studio 2010 C compiler.

My first attempt was:

const char* bytevecM;    // 4 GB block of memory, 64-byte aligned
size_t* psz;             // size_t is 64-bits
// ...
// "m7 & 0xffffff80" selects the 128 byte block to test for all zeros
psz = (size_t*)&bytevecM[(unsigned int)m7 & 0xffffff80];
if (psz[0]  == 0 && psz[1]  == 0
&&  psz[2]  == 0 && psz[3]  == 0
&&  psz[4]  == 0 && psz[5]  == 0
&&  psz[6]  == 0 && psz[7]  == 0
&&  psz[8]  == 0 && psz[9]  == 0
&&  psz[10] == 0 && psz[11] == 0
&&  psz[12] == 0 && psz[13] == 0
&&  psz[14] == 0 && psz[15] == 0) continue;
// ...


VTune profiling of the corresponding assembly follows:

cmp    qword ptr [rax],      0x0       0.171s
jnz    0x14000222                     42.426s
cmp    qword ptr [rax+0x8],  0x0       0.498s
jnz    0x14000222                      0.358s
cmp    qword ptr [rax+0x10], 0x0       0.124s
jnz    0x14000222                      0.031s
cmp    qword ptr [rax+0x18], 0x0       0.171s
jnz    0x14000222                      0.031s
cmp    qword ptr [rax+0x20], 0x0       0.233s
jnz    0x14000222                      0.560s
cmp    qword ptr [rax+0x28], 0x0       0.498s
jnz    0x14000222                      0.358s
cmp    qword ptr [rax+0x30], 0x0       0.140s
jnz    0x14000222
cmp    qword ptr [rax+0x38], 0x0       0.124s
jnz    0x14000222
cmp    qword ptr [rax+0x40], 0x0       0.156s
jnz    0x14000222                      2.550s
cmp    qword ptr [rax+0x48], 0x0       0.109s
jnz    0x14000222                      0.124s
cmp    qword ptr [rax+0x50], 0x0       0.078s
jnz    0x14000222                      0.016s
cmp    qword ptr [rax+0x58], 0x0       0.078s
jnz    0x14000222                      0.062s
cmp    qword ptr [rax+0x60], 0x0       0.093s
jnz    0x14000222                      0.467s
cmp    qword ptr [rax+0x68], 0x0       0.047s
jnz    0x14000222                      0.016s
cmp    qword ptr [rax+0x70], 0x0       0.109s
jnz    0x14000222                      0.047s
cmp    qword ptr [rax+0x78], 0x0       0.093s
jnz    0x14000222                      0.016s


I was able to improve on that via Intel intrinsics:

const char* bytevecM;                        // 4 GB block of memory
__m128i* psz;                                // __m128i is 128-bits
__m128i one = _mm_set1_epi32(0xffffffff);    // all bits one
// ...
psz = (__m128i*)&bytevecM[(unsigned int)m7 & 0xffffff80];
if (_mm_testz_si128(psz[0], one) && _mm_testz_si128(psz[1], one)
&&  _mm_testz_si128(psz[2], one) && _mm_testz_si128(psz[3], one)
&&  _mm_testz_si128(psz[4], one) && _mm_testz_si128(psz[5], one)
&&  _mm_testz_si128(psz[6], one) && _mm_testz_si128(psz[7], one)) continue;
// ...


VTune profiling of the corresponding assembly follows:

movdqa xmm0, xmmword ptr [rax]         0.218s
ptest  xmm0, xmm2                     35.425s
jnz    0x14000ddd                      0.700s
movdqa xmm0, xmmword ptr [rax+0x10]    0.124s
ptest  xmm0, xmm2                      0.078s
jnz    0x14000ddd                      0.218s
movdqa xmm0, xmmword ptr [rax+0x20]    0.155s
ptest  xmm0, xmm2                      0.498s
jnz    0x14000ddd                      0.296s
movdqa xmm0, xmmword ptr [rax+0x30]    0.187s
ptest  xmm0, xmm2                      0.031s
jnz    0x14000ddd
movdqa xmm0, xmmword ptr [rax+0x40]    0.093s
ptest  xmm0, xmm2                      2.162s
jnz    0x14000ddd                      0.280s
movdqa xmm0, xmmword ptr [rax+0x50]    0.109s
ptest  xmm0, xmm2                      0.031s
jnz    0x14000ddd                      0.124s
movdqa xmm0, xmmword ptr [rax+0x60]    0.109s
ptest  xmm0, xmm2                      0.404s
jnz    0x14000ddd                      0.124s
movdqa xmm0, xmmword ptr [rax+0x70]    0.093s
ptest  xmm0, xmm2                      0.078s
jnz    0x14000ddd                      0.016s


As you can see, this version has fewer assembly instructions and also proved to be faster in timing tests.


Since I am quite weak in the area of Intel SSE/AVX instructions, I welcome advice on how they might be better employed to speed up this code.
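One direction worth sketching (an assumption on my part, not the poster's code; it uses only baseline SSE2, so no SSE4.1 `ptest` is needed): OR-reduce the eight 16-byte vectors of the block and do a single compare at the end, which removes seven of the eight conditional branches.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Sketch: returns 1 if the 128-byte block at p (16-byte aligned) is all
   zeros, 0 otherwise.  ORing the eight vectors first means only one
   compare-and-branch at the end instead of eight. */
static int is_zero_128(const char *p)
{
    const __m128i *v = (const __m128i *)p;
    __m128i acc = _mm_or_si128(v[0], v[1]);
    acc = _mm_or_si128(acc, _mm_or_si128(v[2], v[3]));
    acc = _mm_or_si128(acc, _mm_or_si128(v[4], v[5]));
    acc = _mm_or_si128(acc, _mm_or_si128(v[6], v[7]));
    /* acc is zero iff every byte of the block was zero */
    __m128i eq = _mm_cmpeq_epi8(acc, _mm_setzero_si128());
    return _mm_movemask_epi8(eq) == 0xFFFF;
}
```

Whether the branch-free reduction wins depends on how often a block is nonzero early; when most blocks fail on the first qword, the original early-exit branches may still be faster.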


Though I scoured the hundreds of available intrinsics, I may have missed the ideal ones. In particular, I was unable to employ _mm_cmpeq_epi64() effectively; I looked for a "not equal" version of this intrinsic (which seems better suited to this problem) but came up dry. Though the code below "works":

if (_mm_testz_si128(_mm_andnot_si128(_mm_cmpeq_epi64(psz[7], _mm_andnot_si128(_mm_cmpeq_epi64(psz[6], _mm_andnot_si128(_mm_cmpeq_epi64(psz[5], _mm_andnot_si128(_mm_cmpeq_epi64(psz[4], _mm_andnot_si128(_mm_cmpeq_epi64(psz[3], _mm_andnot_si128(_mm_cmpeq_epi64(psz[2], _mm_andnot_si128(_mm_cmpeq_epi64(psz[1], _mm_andnot_si128(_mm_cmpeq_epi64(psz[0], zero), one)), one)), one)), one)), one)), one)), one)), one), one)) continue;


it is borderline unreadable and (unsurprisingly) proved to be way slower than the two versions given above. I feel sure there must be a more elegant way to employ _mm_cmpeq_epi64() and welcome advice on how that might be achieved.
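For what it's worth, a flatter way to use element-wise equality compares (a sketch of mine, using SSE2's _mm_cmpeq_epi32 since _mm_cmpeq_epi64 requires SSE4.1): compare each vector against zero, AND the eight all-ones/all-zeros masks together, and branch once on the combined mask.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Sketch: each _mm_cmpeq_epi32 yields all-ones lanes where the input lane
   was zero; ANDing the masks leaves all ones iff every lane of every
   vector was zero.  p must point to a 16-byte-aligned 128-byte block. */
static int is_zero_128_cmpeq(const char *p)
{
    const __m128i *v = (const __m128i *)p;
    const __m128i zero = _mm_setzero_si128();
    __m128i m = _mm_cmpeq_epi32(v[0], zero);
    m = _mm_and_si128(m, _mm_cmpeq_epi32(v[1], zero));
    m = _mm_and_si128(m, _mm_cmpeq_epi32(v[2], zero));
    m = _mm_and_si128(m, _mm_cmpeq_epi32(v[3], zero));
    m = _mm_and_si128(m, _mm_cmpeq_epi32(v[4], zero));
    m = _mm_and_si128(m, _mm_cmpeq_epi32(v[5], zero));
    m = _mm_and_si128(m, _mm_cmpeq_epi32(v[6], zero));
    m = _mm_and_si128(m, _mm_cmpeq_epi32(v[7], zero));
    return _mm_movemask_epi8(m) == 0xFFFF;
}
```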


In addition to using intrinsics from C, raw Intel assembly language solutions to this problem are also welcome.

Answer


The main problem, as others have pointed out, is that the 128-byte data you are checking is missing the data cache and/or the TLB and going to DRAM, which is slow. VTune is telling you this

cmp    qword ptr [rax],      0x0       0.171s
jnz    0x14000222                     42.426s


You have another, smaller, hotspot half-way down

cmp    qword ptr [rax+0x40], 0x0       0.156s
jnz    0x14000222                      2.550s


The 42.4 + 2.5 seconds attributed to the JNZ instructions are really a stall caused by the previous load from memory... the processor is sitting around doing nothing for 45 seconds total over the time you profiled the program... waiting on DRAM.


You might ask what the second hotspot half-way down is all about. Well, you are accessing 128 bytes and cache lines are 64 bytes; the CPU started prefetching for you as soon as it read the first 64 bytes... but you didn't do enough work with the first 64 bytes to totally overlap the latency of going to memory.


The memory bandwidth of Ivy Bridge is very high (it depends on your system, but I'd guess over 10 GB/sec). Your block of memory is 4 GB; you should be able to zip through it in less than a second if you access it sequentially and let the CPU prefetch the data ahead for you.


My guess is you are thwarting the CPU data prefetcher by accessing the 128-byte blocks in a non-contiguous fashion.


Change your access pattern to be sequential and you'll be surprised how much faster it runs. You can then worry about the next level of optimization, which will be making sure the branch prediction works well.
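As an illustration of the access-pattern point (the names here are hypothetical; this assumes a batch of m7 values can be collected up front), sorting the block offsets before touching memory turns the random walk into a roughly sequential sweep that the hardware prefetcher can follow:

```c
#include <stdlib.h>  /* qsort */

/* Hypothetical sketch: reduce each m7 index to its 128-byte block offset,
   then sort, so the blocks are visited in ascending address order. */
static int cmp_u32(const void *a, const void *b)
{
    unsigned int x = *(const unsigned int *)a;
    unsigned int y = *(const unsigned int *)b;
    return (x > y) - (x < y);
}

static void sort_block_offsets(unsigned int *m7_batch, size_t n)
{
    for (size_t i = 0; i < n; i++)
        m7_batch[i] &= 0xffffff80u;  /* same block mask as the question */
    qsort(m7_batch, n, sizeof m7_batch[0], cmp_u32);
}
```

This only helps if the surrounding algorithm tolerates processing the blocks out of their original order, which the question doesn't specify.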


Another thing to consider is TLB misses. Those are costly, especially in a 64-bit system. Rather than using 4KB pages consider using 2MB huge pages. See this link for Windows support for these: Large-Page Support (Windows)


If you must access the 4GB of data in a somewhat random fashion, but you know the sequence of m7 values (your index) ahead of time, then you can pipeline the memory fetching explicitly ahead of your use (it needs to be several hundred CPU cycles ahead of when you will use the data to be effective). See:

  • _mm_prefetch
  • usage of "_mm_prefetch(...)"
  • MM_PREFETCH
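A sketch of what that pipelining might look like (the function and parameter names are illustrative, not from the original program): while processing block i, issue prefetches for the block that will be needed AHEAD iterations later, so the DRAM latency overlaps with useful work.

```c
#include <xmmintrin.h>  /* _mm_prefetch */
#include <stddef.h>

#define AHEAD 8  /* tuning parameter: must give several hundred cycles of lead */

/* Hypothetical sketch: scan 128-byte blocks at the given offsets, counting
   the nonzero ones, while prefetching both 64-byte cache lines of the
   block AHEAD iterations in the future. */
static size_t count_nonzero_blocks(const char *base,
                                   const unsigned int *offsets, size_t n)
{
    size_t nonzero = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + AHEAD < n) {
            const char *future = base + offsets[i + AHEAD];
            _mm_prefetch(future, _MM_HINT_T0);       /* first 64-byte line  */
            _mm_prefetch(future + 64, _MM_HINT_T0);  /* second 64-byte line */
        }
        const char *p = base + offsets[i];
        int zero = 1;
        for (int j = 0; j < 128; j++)
            if (p[j]) { zero = 0; break; }
        nonzero += !zero;
    }
    return nonzero;
}
```

The inner byte loop is a placeholder for whatever per-block test you settle on; the point is the prefetch distance, which you would tune empirically.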


Here are some links that might be helpful in general on the subject of memory optimizations


What Every Programmer Should Know About Memory by Ulrich Drepper

http://www.akkadia.org/drepper/cpumemory.pdf


Machine Architecture: Things Your Programming Language Never Told You, by Herb Sutter

http://www.gotw.ca/publications/concurrency-ddj.htm

http://nwcpp.org/static/talks/2007/Machine_Architecture_-_NWCPP.pdf

http://video.google.com/videoplay?docid=-4714369049736584770#
