How to optimize and speed up matrix multiplication in C++?


Problem description

This is an optimized implementation of matrix multiplication; the routine performs the operation C := C + A * B, where A, B, and C are n-by-n matrices stored in column-major format. On exit, A and B retain their input values.

void matmul_optimized(int n, int *A, int *B, int *C)
{
    // The matrices are treated bitwise: over GF(2), multiplication of terms
    // is the & operator and (carry-free) addition is the ^ operator.
    int i, j, k;
    int cij;
    for (i = 0; i < n; ++i) {
        for (j = 0; j < n; ++j) {
            cij = C[i + j * n]; // accumulate into a scalar so the inner loop
                                // does no repeated index arithmetic on C
            for (k = 0; k < n; ++k) {
                cij ^= A[i + k * n] & B[k + j * n]; // product via &, sum via ^
            }
            C[i + j * n] = cij; // write the final result back into C
        }
    }
}

How do I speed up this matrix multiplication further, based on the above function/method?

This function has been tested on matrices up to 2048 by 2048.

The function matmul_optimized is exercised alongside a reference matmul in the driver below.

#include <stdio.h>
#include <stdlib.h>

#include "cpucycles.c"
#include "helper_functions.c"
#include "matmul_reference.c"
#include "matmul_optimized.c"

int main()
{
    int i, j;
    int n = 1024; // Number of rows or columns in the square matrices
    int *A, *B;   // Input matrices
    int *C1, *C2; // Output matrices from the reference and optimized implementations

    // Performance and correctness measurement declarations
    long int CLOCK_start, CLOCK_end, CLOCK_total, CLOCK_ref, CLOCK_opt;
    long int COUNTER, REPEAT = 5;
    int difference;
    float speedup;

    // Allocate memory for the matrices
    A = malloc(n * n * sizeof(int));
    B = malloc(n * n * sizeof(int));
    C1 = malloc(n * n * sizeof(int));
    C2 = malloc(n * n * sizeof(int));

    // Fill bits in A, B, C1
    fill(A, n * n);
    fill(B, n * n);
    fill(C1, n * n);

    // Initialize C2 = C1
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            C2[i * n + j] = C1[i * n + j];

    // Measure performance of the reference implementation
    CLOCK_total = 0;
    for (COUNTER = 0; COUNTER < REPEAT; COUNTER++)
    {
        CLOCK_start = cpucycles();
        matmul_reference(n, A, B, C1);
        CLOCK_end = cpucycles();
        CLOCK_total = CLOCK_total + CLOCK_end - CLOCK_start;
    }
    CLOCK_ref = CLOCK_total / REPEAT;
    printf("n=%d Avg cycle count for reference implementation = %ld\n", n, CLOCK_ref);

    // Measure performance of the optimized implementation
    CLOCK_total = 0;
    for (COUNTER = 0; COUNTER < REPEAT; COUNTER++)
    {
        CLOCK_start = cpucycles();
        matmul_optimized(n, A, B, C2);
        CLOCK_end = cpucycles();
        CLOCK_total = CLOCK_total + CLOCK_end - CLOCK_start;
    }
    CLOCK_opt = CLOCK_total / REPEAT;
    printf("n=%d Avg cycle count for optimized implementation = %ld\n", n, CLOCK_opt);

    speedup = (float)CLOCK_ref / (float)CLOCK_opt;

    // Check correctness by comparing C1 and C2
    difference = 0;
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            if (C1[i * n + j] != C2[i * n + j])
                difference++; // count mismatches; a signed sum of differences could cancel to zero

    if (difference == 0)
        printf("Speedup factor = %.2f\n", speedup);
    if (difference != 0)
        printf("Reference and optimized implementations do not match\n");

    //print(C2, n);

    free(A);
    free(B);
    free(C1);
    free(C2);
    return 0;
}

Answer

Optimizing matrix-matrix multiplication requires careful attention to a number of issues:

  • First, you need to be able to use vector instructions. Only vector instructions can access the parallelism inherent in the architecture. So either your compiler needs to be able to map to vector instructions automatically, or you have to do so by hand, for example by calling the vector intrinsic library for AVX-2 instructions (for x86 architectures).
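For this particular XOR/AND kernel, one way to help the compiler vectorize (my sketch, not part of the original answer; the function name is hypothetical) is to interchange the loops into j-k-i order, so the innermost loop runs over i, which is unit-stride in the column-major A and C:

```c
/* Hypothetical loop-interchanged variant. With j-k-i ordering the innermost
 * loop touches A and C at unit stride (column-major storage), a pattern that
 * compilers like gcc/clang at -O3 can typically map to SIMD and/xor ops. */
void matmul_jki(int n, const int *A, const int *B, int *C)
{
    for (int j = 0; j < n; ++j) {
        for (int k = 0; k < n; ++k) {
            int b = B[k + j * n];           /* invariant in the inner loop */
            for (int i = 0; i < n; ++i)     /* unit stride over A and C    */
                C[i + j * n] ^= A[i + k * n] & b;
        }
    }
}
```

Alternatively, the same inner loop can be written by hand with AVX2 intrinsics (`_mm256_loadu_si256`, `_mm256_and_si256`, `_mm256_xor_si256`), at the cost of portability.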

Next, you need to pay careful attention to the memory hierarchy. If you don't, your performance can easily drop to less than 5% of peak.
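As an illustration of that point (my sketch, not the answer's code), the standard remedy is cache blocking: partition the loops into tiles so a tile of B and a strip of C are reused while still resident in cache. The function name is hypothetical and the tile size is only a placeholder to be tuned per machine:

```c
/* Hypothetical cache-blocked version of the XOR/AND multiply.
 * BLK is a tuning knob; 64 is a guess, tune it against your L1/L2 sizes.
 * Assumes column-major storage, as in the question. */
enum { BLK = 64 };

static int min_int(int a, int b) { return a < b ? a : b; }

void matmul_blocked(int n, const int *A, const int *B, int *C)
{
    for (int jj = 0; jj < n; jj += BLK)           /* tile the j loop */
        for (int kk = 0; kk < n; kk += BLK)       /* tile the k loop */
            for (int j = jj; j < min_int(jj + BLK, n); ++j)
                for (int k = kk; k < min_int(kk + BLK, n); ++k) {
                    int b = B[k + j * n];         /* reused from cache */
                    for (int i = 0; i < n; ++i)
                        C[i + j * n] ^= A[i + k * n] & b;
                }
}
```

Each (j, k) pair is still visited exactly once, and XOR accumulation is order-independent, so the result is identical to the untiled loop.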

Once you do this right, you will hopefully have broken the computation up into computational chunks small enough that you can also parallelize it via OpenMP or pthreads.
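A sketch of that last step (my illustration, with a hypothetical function name): because each column of C is computed independently, the outer j loop can be parallelized with a single OpenMP pragma. Compile with -fopenmp; without it the pragma is ignored and the code simply runs serially:

```c
/* Hypothetical OpenMP parallelization over columns of C. Each thread owns a
 * disjoint set of columns, so the loop body needs no synchronization. */
void matmul_parallel(int n, const int *A, const int *B, int *C)
{
    #pragma omp parallel for schedule(static)
    for (int j = 0; j < n; ++j)
        for (int k = 0; k < n; ++k) {
            int b = B[k + j * n];
            for (int i = 0; i < n; ++i)
                C[i + j * n] ^= A[i + k * n] & b;
        }
}
```

In a tuned implementation you would parallelize over the blocked loops rather than this simple form, so each thread works on cache-sized tiles.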

A document that carefully steps through what is required can be found at http://www.cs.utexas.edu/users/flame/laff/pfhp/LAFF-On-PfHP.html. (It is very much a work in progress.) At the end of it all, you will have an implementation that gets close to the performance attained by high-performance libraries like Intel's Math Kernel Library (MKL) or the BLAS-like Library Instantiation Software (BLIS).

(And, actually, you CAN then also effectively incorporate Strassen's algorithm. But that is another story, told in Unit 3.5.3 of those notes.)

You may find the following thread relevant: How does BLAS get such extreme performance?
