Matrix Multiplication of size 100*100 using SSE Intrinsics

Problem description
int MAX_DIM = 100;
float a[MAX_DIM][MAX_DIM] __attribute__ ((aligned(16)));
float b[MAX_DIM][MAX_DIM] __attribute__ ((aligned(16)));
float d[MAX_DIM][MAX_DIM] __attribute__ ((aligned(16)));

/*
 * I fill these arrays with some values
 */

for (int i = 0; i < MAX_DIM; i += 1) {
    for (int j = 0; j < MAX_DIM; j += 4) {
        for (int k = 0; k < MAX_DIM; k += 4) {
            __m128 result  = _mm_load_ps(&d[i][j]);
            __m128 a_line  = _mm_load_ps(&a[i][k]);
            __m128 b_line0 = _mm_load_ps(&b[k][j+0]);
            __m128 b_line1 = _mm_loadu_ps(&b[k][j+1]);
            __m128 b_line2 = _mm_loadu_ps(&b[k][j+2]);
            __m128 b_line3 = _mm_loadu_ps(&b[k][j+3]);
            result = _mm_add_ps(result, _mm_mul_ps(_mm_shuffle_ps(a_line, a_line, 0x00), b_line0));
            result = _mm_add_ps(result, _mm_mul_ps(_mm_shuffle_ps(a_line, a_line, 0x55), b_line1));
            result = _mm_add_ps(result, _mm_mul_ps(_mm_shuffle_ps(a_line, a_line, 0xaa), b_line2));
            result = _mm_add_ps(result, _mm_mul_ps(_mm_shuffle_ps(a_line, a_line, 0xff), b_line3));
            _mm_store_ps(&d[i][j], result);
        }
    }
}
The code above is my attempt at matrix multiplication using SSE. It works as follows: I take 4 elements from a row of a, multiply them by 4 elements from a column of b, then move on to the next 4 elements in the row of a and the next 4 elements in the column of b.
I get the error Segmentation fault (core dumped), and I don't really know why. I'm using gcc 5.4.0 on Ubuntu 16.04.5.
Edit: The segmentation fault was solved by _mm_loadu_ps. There is also something wrong with the logic; I would be grateful if someone helped me find it.
"The segmentation fault was solved by _mm_loadu_ps. Also there is something wrong with logic..."

You're loading 4 overlapping windows on b[k][j+0..7]. (This is why you needed loadu.)
Perhaps you meant to load b[k][j+0], +4, +8, +12? If so, you should align b by 64, so all four loads come from the same cache line (for performance). Strided access is not great, but using all 64 bytes of every cache line you touch is a lot better than getting row-major vs. column-major totally wrong in scalar code with no blocking.
"I take 4 elements from a row of a, multiply it by 4 elements from a column of b"

I'm not sure your text description describes your code. Unless you've already transposed b, you can't load multiple values from the same column with a SIMD load, because they aren't contiguous in memory.
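As a quick illustration of that point (the helper name here is mine, not from the question): pulling four elements of one column of b into a vector takes four separate scalar loads, e.g. via _mm_set_ps, precisely because consecutive column elements are a whole row apart in memory.

```c
#include <immintrin.h>

#define MAX_DIM 100
float b[MAX_DIM][MAX_DIM] __attribute__ ((aligned(16)));

/* Gather b[k+0..3][j] (one column, four rows) into a vector.
 * This compiles to four scalar loads + shuffles, not one vector
 * load: the elements are MAX_DIM floats apart in memory.
 * _mm_set_ps takes arguments highest-lane first, so b[k][j]
 * ends up in lane 0. */
__m128 load_column(int k, int j)
{
    return _mm_set_ps(b[k+3][j], b[k+2][j], b[k+1][j], b[k+0][j]);
}
```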
C multidimensional arrays are "row major": the last index is the one that varies most quickly when moving to the next higher memory address. Did you think that _mm_loadu_ps(&b[k][j+1]) was going to give you b[k+0..3][j+1]? If so, this is a duplicate of SSE matrix-matrix multiplication. (That question is using 32-bit integer, not 32-bit float, but it's the same layout problem. See that for a working loop structure.)
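For reference, here is a minimal sketch of that working loop structure adapted to the question's float arrays (the function name and the zero-initialization of d are my additions): broadcast one scalar of a with _mm_set1_ps and multiply it by a contiguous row of b, so every vector load and store is a plain row access.

```c
#include <immintrin.h>
#include <string.h>

#define MAX_DIM 100   /* multiple of 4, so j += 4 never runs past a row */

float a[MAX_DIM][MAX_DIM] __attribute__ ((aligned(16)));
float b[MAX_DIM][MAX_DIM] __attribute__ ((aligned(16)));
float d[MAX_DIM][MAX_DIM] __attribute__ ((aligned(16)));

void matmul_sse(void)
{
    memset(d, 0, sizeof(d));   /* d accumulates, so start from zero */
    for (int i = 0; i < MAX_DIM; i++) {
        for (int k = 0; k < MAX_DIM; k++) {
            __m128 a_ik = _mm_set1_ps(a[i][k]);  /* broadcast one scalar of a */
            for (int j = 0; j < MAX_DIM; j += 4) {
                /* rows are 400 bytes (a multiple of 16), so these
                 * aligned loads/stores are safe for j % 4 == 0 */
                __m128 dv = _mm_load_ps(&d[i][j]);
                __m128 bv = _mm_load_ps(&b[k][j]);   /* contiguous row of b */
                dv = _mm_add_ps(dv, _mm_mul_ps(a_ik, bv));
                _mm_store_ps(&d[i][j], dv);
            }
        }
    }
}
```

Note the loop order: j is innermost, so d[i][j..j+3] stays hot while k walks down the rows of b.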
To debug this, put a simple pattern of values into b[]. Like:

#include <stdalign.h>
alignas(64) float b[MAX_DIM][MAX_DIM] = {
0000, 0001, 0002, 0003, 0004, ...,
0100, 0101, 0102, ...,
0200, 0201, 0202, ...,
};
// i.e. for (...) b[i][j] = 100 * i + j;
Then when you step through your code in the debugger, you can see what values end up in your vectors.

For your a[][] values, maybe use 90000.0 + 100 * i + j, so if you're looking at registers (instead of C variables) you can still tell which values are a and which are b.
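A sketch of this debugging idea as plain code, for anyone without a debugger handy (the helper names are mine): fill b[] with the pattern, redo two of the question's loads, and the overlapping windows show up immediately in the stored lanes.

```c
#include <immintrin.h>

#define MAX_DIM 100
float b[MAX_DIM][MAX_DIM] __attribute__ ((aligned(16)));

/* Fill b with the debug pattern: b[i][j] == 100*i + j, so every
 * lane of a loaded vector tells you exactly where it came from. */
void fill_debug_pattern(void)
{
    for (int i = 0; i < MAX_DIM; i++)
        for (int j = 0; j < MAX_DIM; j++)
            b[i][j] = 100.0f * i + j;
}

/* Redo the question's first two loads for one (k, j) and store the
 * lanes out so they can be inspected without a debugger. */
void inspect_loads(int k, int j, float line0[4], float line1[4])
{
    _mm_storeu_ps(line0, _mm_load_ps (&b[k][j+0])); /* k=2,j=0: 200 201 202 203 */
    _mm_storeu_ps(line1, _mm_loadu_ps(&b[k][j+1])); /*          201 202 203 204 (overlap!) */
}
```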
Related:

- Ulrich Drepper's What Every Programmer Should Know About Memory shows an optimized matmul with cache-blocking with SSE intrinsics for double precision. Should be straightforward to adapt for float.
- How does BLAS get such extreme performance? (You might want to just use an optimized matmul library; tuning matmul for optimal cache-blocking is non-trivial but important.)
- Matrix Multiplication with blocks
- Poor maths performance in C vs Python/numpy has some links to other questions
- how to optimize matrix multiplication (matmul) code to run fast on a single processor core