AVX: data alignment: store crash, storeu, load, loadu doesn't


Question

I am modifying RNNLM, a neural net, to study language models. However, given the size of my corpus, it is running really slowly. I tried to optimize the matrix*vector routine, which accounts for 63% of total time on a small data set (I would expect it to be worse on larger sets). Right now I am stuck with intrinsics.

    for (b=0; b<(to-from)/8; b++) 
    {
        val = _mm256_setzero_ps();
        for (a=from2; a<to2; a++) 
        {
            t1 = _mm256_set1_ps (srcvec.ac[a]);
            t2 = _mm256_load_ps(&(srcmatrix[a+(b*8+from+0)*matrix_width].weight));
            //val =_mm256_fmadd_ps (t1, t2, t3)
            t3 = _mm256_mul_ps(t1,t2);
            val = _mm256_add_ps (val, t3);
        }
        t4 = _mm256_load_ps(&(dest.ac[b*8+from+0]));
        t4 = _mm256_add_ps(t4,val);
        _mm256_store_ps (&(dest.ac[b*8+from+0]), t4);
    }

This example crashes at:

    _mm256_store_ps (&(dest.ac[b*8+from+0]), t4);

However, if I change it to

    _mm256_storeu_ps (&(dest.ac[b*8+from+0]), t4);

(with u for unaligned, I suppose) everything works as intended. My question is: why does the load work (it is not supposed to if the data is unaligned) while the store doesn't? Both operate on the same address.

dest.ac has been allocated using

void *_aligned_calloc(size_t nelem, size_t elsize, size_t alignment=64)
{
    size_t max_size = (size_t)-1;

    // Watch out for overflow
    if(elsize == 0 || nelem >= max_size/elsize)
        return NULL;

    size_t size = nelem * elsize;
    void *memory = _mm_malloc(size+64, alignment);
    if(memory != NULL)
        memset(memory, 0, size);
    return memory;
}

and it is at least 50 elements long. (BTW, with VS2012 I get an illegal instruction on some random assignment, so I use Linux.)

Thank you in advance, Arkantus.

Recommended answer

TL;DR: in optimized code, loads will fold into memory operands for other operations, which have no alignment requirements in AVX. Stores won't.

Your sample code doesn't compile by itself, so I can't easily check which instruction _mm256_load_ps compiles to.

I tried a small experiment with gcc 4.9, and it doesn't generate a vmovaps at all for _mm256_load_ps, since I only used the result of the load as an input to one other instruction. It generates that instruction with a memory operand. AVX (VEX-encoded) instructions have no alignment requirements for their memory operands, apart from the explicitly aligned moves such as vmovaps. (There is a performance hit for crossing a cache line, and a bigger hit for crossing a page boundary, but your code still works.)
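As a rough illustration of that folding, here is a minimal sketch. The function name add8_aligned is hypothetical and not from the question, and the scalar fallback exists only so the snippet builds without AVX enabled. Whether the load is actually folded depends on the compiler and the optimization level, so inspect the generated assembly rather than taking the comment at face value.

```cpp
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Hypothetical helper: adds the 8 floats at p to the 8 floats at acc.
// p must be 32-byte aligned, as _mm256_load_ps formally requires.
void add8_aligned(float* acc, const float* p)
{
#if defined(__AVX__)
    __m256 a = _mm256_loadu_ps(acc);
    // The result of the load feeds exactly one instruction, so an
    // optimizing compiler may emit something like `vaddps ymm0, ymm0, [p]`
    // with a memory operand and no separate vmovaps.
    __m256 sum = _mm256_add_ps(a, _mm256_load_ps(p));
    _mm256_storeu_ps(acc, sum);
#else
    // Scalar fallback so the sketch still builds without AVX.
    for (std::size_t i = 0; i < 8; ++i)
        acc[i] += p[i];
#endif
}
```

Compiling with something like `g++ -O2 -mavx -S` and searching the output for vmovaps shows whether the load was folded in a particular build.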

The store, on the other hand, does generate a vmov... instruction. Since you used the alignment-required version, it faults on unaligned addresses. Simply use the unaligned version; it will be just as fast when the address is aligned, and it will still work when it isn't.
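A minimal sketch of that advice, mirroring the last three lines of the loop in the question. The helper name add8_unaligned is hypothetical, and the scalar branch is only there so the snippet builds without AVX enabled.

```cpp
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Hypothetical helper: dst[0..7] += val[0..7], with no alignment
// requirement on either pointer.
void add8_unaligned(float* dst, const float* val)
{
#if defined(__AVX__)
    __m256 t4 = _mm256_loadu_ps(dst);   // unaligned load
    __m256 v  = _mm256_loadu_ps(val);
    t4 = _mm256_add_ps(t4, v);
    _mm256_storeu_ps(dst, t4);          // unaligned store, no fault
#else
    // Scalar fallback so the sketch still builds without AVX.
    for (std::size_t i = 0; i < 8; ++i)
        dst[i] += val[i];
#endif
}
```

Calling it as `add8_unaligned(buf + 3, val)` on a plain float array works even though `buf + 3` is not 32-byte aligned.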

I didn't check your code carefully to see whether all the accesses should be aligned. I assume not, since you only asked why you weren't also getting faults for unaligned loads. As I said, your code probably just didn't compile to any vmovaps load instructions, or else even "aligned" AVX loads don't fault on unaligned addresses.
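One way to check whether a given access is aligned is to test the address directly. The sketch below uses a hypothetical helper, is_aligned, which is not part of the question's code. It rests on the assumption that from is not always a multiple of 8, which the question does not state: even with a 64-byte-aligned dest.ac, &dest.ac[b*8+from] is 32-byte aligned only when b*8+from is a multiple of 8.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if address p is a multiple of `alignment` bytes.
bool is_aligned(const void* p, std::size_t alignment)
{
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

For example, asserting `is_aligned(&dest.ac[b*8+from], 32)` just before the store would show whether the aligned store is safe for the current values of b and from.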

Are you running AVX (without AVX2 or FMA) on a Sandy Bridge/Ivy Bridge CPU? I assume that's why your FMA intrinsic is commented out.
