Segmentation fault when trying to use intrinsics specifically _mm256_storeu_pd()

Problem description

I seem to have fixed it myself by type-casting the cij2 pointer inside the _mm256 call:

so _mm256_storeu_pd((double *)cij2,vecC);

I have no idea why this changed anything...

I'm writing some code and trying to take advantage of Intel's manual vectorization. But whenever I run the code I get a segmentation fault when trying to use my double *cij2.

if( q == 0)
{
    __m256d vecA;
    __m256d vecB;
    __m256d vecC;
    for (int i = 0; i < M; ++i)
      for (int j = 0; j < N; ++j)
      {
        double cij = C[i+j*lda];
        double *cij2 = (double *)malloc(4*sizeof(double));
        for (int k = 0; k < K; k+=4)
        {
          vecA = _mm256_load_pd(&A[i+k*lda]);
          vecB = _mm256_load_pd(&B[k+j*lda]);
          vecC = _mm256_mul_pd(vecA,vecB);
          _mm256_storeu_pd(cij2, vecC);
          for (int x = 0; x < 4; x++)
          {
            cij += cij2[x];
          }
        }
        C[i+j*lda] = cij;
      }
}

I've pinpointed the problem to the cij2 pointer. If I comment out the two lines that include that pointer, the code runs fine; it doesn't work like it should, but it does actually run.

My question is: why would I get a segmentation fault here? I know I've allocated the memory correctly, and that the memory is a 256-bit vector of doubles, each 64 bits in size.

After reading the comments I've come to add some clarification. The first thing I did was change the _mm_malloc to a normal allocation using malloc. It shouldn't matter either way, but theoretically gives me some more breathing room.

Second, the problem isn't coming from a null return on the allocation. I added a couple of loops to step through the array and make sure I could modify the memory without it crashing, so I'm relatively sure that isn't the problem. The problem seems to stem from storing the data from vecC into the array.

Lastly, I cannot use BLAS calls. This is for a parallelism class. I know it would be much simpler to call something way smarter than I am, but unfortunately I'll get a 0 if I try that.

Recommended answer

You dynamically allocate double *cij2 = (double *)malloc(4*sizeof(double)); but you never free it. This is just silly. Use double cij2[4], especially if you're not going to bother to align it. You never need more than one scratch buffer at once, and it's a small fixed size, so just use automatic storage.

In C++11, you'd use alignas(32) double cij2[4] so you could use _mm256_store_pd instead of storeu. (Or just to make sure storeu isn't slowed down by an unaligned address.)
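
For illustration, a minimal sketch of that suggestion inside the j loop of the question's code, assuming a C++11 compiler with AVX enabled (the surrounding loop structure and variable names are taken from the code above):

    alignas(32) double cij2[4];      // automatic storage, 32-byte aligned (C++11 alignas)
    _mm256_store_pd(cij2, vecC);     // aligned store is safe now; no malloc/free needed
    for (int x = 0; x < 4; x++)
        cij += cij2[x];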

If you actually want to debug your original, use a debugger to catch it when it segfaults, and look at the pointer value. Make sure it's something sensible.

Your methods for testing that the memory was valid (like looping over it, or commenting stuff out) sound like they could lead to a lot of your loop being optimized away, so the problem wouldn't happen.

When your program crashes, you can also look at the asm instructions. Vector intrinsics map fairly directly to x86 asm (except when the compiler sees a more efficient way).

Your implementation would suck a lot less if you pulled the horizontal sum out of the loop over k. Instead of storing each multiply result and horizontally adding it, use a vector add into a vector accumulator. hsum it outside the loop over k.

    __m256d cij_vec = _mm256_setzero_pd();
    for (int k = 0; k < K; k+=4) {
      vecA = _mm256_load_pd(&A[i+k*lda]);
      vecB = _mm256_load_pd(&B[k+j*lda]);
      vecC = _mm256_mul_pd(vecA,vecB);
      cij_vec = _mm256_add_pd(cij_vec, vecC);  // TODO: use multiple accumulators to keep multiple VADDPD or VFMAPD instructions in flight.
    }
    C[i+j*lda] = hsum256_pd(cij_vec);  // put the horizontal sum in an inline function

For good hsum256_pd implementations (other than storing to memory and using a scalar loop), see Fastest way to do horizontal float vector sum on x86 (I included an AVX version there. It should be easy to adapt the pattern of shuffling to 256b double-precision.) This will help your code a lot, since you still have O(N^2) horizontal sums (but not O(N^3) with this change).
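
As a concrete example, here is one common way to write such an hsum256_pd: a sketch adapting the extract-and-add pattern to 256-bit doubles (not necessarily the exact code from the linked answer):

    #include <immintrin.h>

    static inline double hsum256_pd(__m256d v) {
        __m128d lo = _mm256_castpd256_pd128(v);        // lanes 0 and 1 (free, just a cast)
        __m128d hi = _mm256_extractf128_pd(v, 1);      // lanes 2 and 3
        lo = _mm_add_pd(lo, hi);                       // (v0+v2, v1+v3)
        __m128d high64 = _mm_unpackhi_pd(lo, lo);      // duplicate the upper element
        return _mm_cvtsd_f64(_mm_add_sd(lo, high64));  // (v0+v2) + (v1+v3)
    }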

Ideally you could accumulate results for 4 i values in parallel, and not need horizontal sums.
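
A hedged sketch of that idea, assuming column-major storage with leading dimension lda (as the question's indexing suggests) and M a multiple of 4: each __m256d accumulates C[i..i+3][j], so no horizontal sum is needed at all:

    for (int i = 0; i < M; i += 4)
      for (int j = 0; j < N; ++j) {
        __m256d c = _mm256_loadu_pd(&C[i + j*lda]);        // C[i..i+3][j]
        for (int k = 0; k < K; ++k) {
          __m256d a = _mm256_loadu_pd(&A[i + k*lda]);      // A[i..i+3][k]
          __m256d b = _mm256_broadcast_sd(&B[k + j*lda]);  // B[k][j] broadcast to all four lanes
          c = _mm256_add_pd(c, _mm256_mul_pd(a, b));       // or _mm256_fmadd_pd(a, b, c) with FMA
        }
        _mm256_storeu_pd(&C[i + j*lda], c);
      }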

VADDPD has a latency of 3 to 4 clocks, and a throughput of one per 1 to 0.5 clocks, so you need from 3 to 8 vector accumulators to saturate the execution units. Or with FMA, up to 10 vector accumulators (e.g. on Haswell where FMA...PD has 5c latency and one per 0.5c throughput). See Agner Fog's instruction tables and optimization guides to learn more about that. Also the x86 tag wiki.
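
A sketch of the multiple-accumulator idea applied to the k loop above, keeping the question's indexing and assuming (for brevity) that K is a multiple of 16; four independent accumulators let several VADDPD (or FMA) dependency chains run in parallel:

    __m256d acc0 = _mm256_setzero_pd(), acc1 = _mm256_setzero_pd();
    __m256d acc2 = _mm256_setzero_pd(), acc3 = _mm256_setzero_pd();
    for (int k = 0; k < K; k += 16) {
      acc0 = _mm256_add_pd(acc0, _mm256_mul_pd(_mm256_load_pd(&A[i+(k+ 0)*lda]), _mm256_load_pd(&B[(k+ 0)+j*lda])));
      acc1 = _mm256_add_pd(acc1, _mm256_mul_pd(_mm256_load_pd(&A[i+(k+ 4)*lda]), _mm256_load_pd(&B[(k+ 4)+j*lda])));
      acc2 = _mm256_add_pd(acc2, _mm256_mul_pd(_mm256_load_pd(&A[i+(k+ 8)*lda]), _mm256_load_pd(&B[(k+ 8)+j*lda])));
      acc3 = _mm256_add_pd(acc3, _mm256_mul_pd(_mm256_load_pd(&A[i+(k+12)*lda]), _mm256_load_pd(&B[(k+12)+j*lda])));
    }
    __m256d cij_vec = _mm256_add_pd(_mm256_add_pd(acc0, acc1), _mm256_add_pd(acc2, acc3));
    C[i+j*lda] = hsum256_pd(cij_vec);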

Also, ideally nest your loops in a way that gives you contiguous access to two of your three arrays, since cache access patterns are critical for matmul (lots of data reuse). Even if you don't get fancy and work on small cache-sized blocks at a time, just transposing one of your input matrices can be a win, since that costs O(N^2) and speeds up the O(N^3) process. I see your inner loop currently has a stride of lda while accessing A[].
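
For reference, a hedged sketch of that up-front transpose, assuming column-major storage with leading dimension lda and a hypothetical scratch array At of at least lda*lda doubles (the name At is just a placeholder):

    // One-time O(M*K) cost: At[k + i*lda] = A[i + k*lda]
    for (int i = 0; i < M; ++i)
      for (int k = 0; k < K; ++k)
        At[k + i*lda] = A[i + k*lda];
    // The k loop can then load &At[k + i*lda], walking memory contiguously
    // instead of striding by lda on every iteration.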
