Performance drops when running Eigen dense matrix multiplications over multiple cores on ARM / Raspberry PI


Question


I have found a significant performance drop when running Eigen dense matrix multiplications on 2 or 3 threads in parallel on 32- or 64-bit ARM on a Raspberry PI 4.

I can't understand this issue because the RPI 4 has 4 cores and can theoretically handle up to 4 threads in true parallelism. Furthermore, I cannot reproduce the problem on my laptop (4-core Intel i9 processor), where each thread keeps the same performance regardless of whether I am running 1, 2, or 3 threads in parallel.

In my experiments (see this repo for details), I'm running different threads on a 4-core Raspberry Pi 4 (Buster). The problem occurs on both the 32- and 64-bit versions. To illustrate the issue, I wrote a program where each thread holds its own data and then performs dense matrix multiplications on that data as a totally independent unit of processing:

void worker(const std::string & id) {

  const MatrixXd A = 10 * MatrixXd::Random(size, size);
  const MatrixXd B = 10 * MatrixXd::Random(size, size);
  MatrixXd C;
  double test = 0;

  for (int step = 0; step < 30; ++step) {
      test += foo(A, B, C);
  }

  std::cout << "test value is:" << test << "\n";

}

where foo is simply a loop with 100 matrix multiplication calls:

const int size = 512;

float foo(const MatrixXd &A, const MatrixXd &B, MatrixXd &C) {
  float result = 0.0;
  for (int i = 0; i < 100; ++i)
  {
      C.noalias() = A * B;

      int x = 0;
      int y = 0;

      result += C(x, y);
  }
  return result;
}

Using the std::chrono facilities, I have found that each step in the thread loop:

test += foo(A, B, C);

takes nearly 9,000 ms if I am running only a single thread:

int main(int argc, char ** argv)
{

    Eigen::initParallel();

    std::cout << Eigen::nbThreads() << " eigen threads\n";

    std::thread t0(worker, "t-0");
    t0.join();

    return 0;
}
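For reference, this is roughly how such a per-step measurement can be wrapped; a minimal sketch using std::chrono::steady_clock (the helper time_ms is hypothetical, not from the repo; the actual instrumentation in the linked repo may differ):

#include <chrono>

// Hypothetical helper: run a callable once and return the elapsed
// wall-clock time in milliseconds.
template <typename F>
long long time_ms(F &&f) {
    const auto begin = std::chrono::steady_clock::now();
    f();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count();
}

// Usage inside worker():
//   const long long ms = time_ms([&] { test += foo(A, B, C); });
//   std::cout << id << " step took " << ms << " ms\n";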

The problem comes up when I try to run 2 or more threads in parallel:

std::thread t0(worker, "t-0");
std::thread t1(worker, "t-1");

t0.join();
t1.join();

By my measurements (the detailed results can be found in the mentioned repository), when I am running two threads in parallel, each cycle of 100 multiplications takes 11,000 ms or more. When I am running 3 threads the performance is far worse (~23,000 ms).

In the same experiment on my laptop (Ubuntu 20.04 64-bit, 4-core Intel i9-9900K processor), the performance of each thread is nearly the same (~1,600 ms) whether I run one, two, or three threads.

The code I am using in this experiment, plus compilation instructions etc., can be found in this repo: https://github.com/doleron/eigen3-multithread-arm-issue

EDIT regarding @Surt's answer:

In order to test @Surt's hypothesis, I performed some slightly different experiments:

  1. Running 100 cycles of 100,000 multiplications of 16x16 matrices. The results can be seen in the following chart:

  2. Running 100 cycles of 100,000 multiplications of 64x64 matrices. The results can be seen in the following chart:

By my count, the total cache footprint required for nine 64x64 matrices is:

64 x 64 x sizeof(double) x 9 = 294,912 bytes

This amount of memory represents 28.1% of the 1 MiB cache, which leaves some room for other objects in the processor's cache. The nine matrices are 3 matrices per thread, namely A, B and C. Note I'm using C.noalias() = A * B; to avoid a temporary matrix for A * B.

  3. Running 100 cycles of 100,000 multiplications of 128x128 matrices. The results can be seen in the following chart:

The expected amount of memory for nine 128x128 matrices is 1,179,648 bytes, more than 112% of the total available cache. So this last scenario is very likely to hit the processor's cache bottleneck (the arithmetic for all three sizes is reproduced in the sketch below).
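For completeness, the same arithmetic for all three matrix sizes can be checked with a small standalone program (the 1 MiB figure is the RPI 4's shared L2 cache quoted in the answer below):

#include <cstddef>
#include <cstdio>

// Working-set size of 9 double matrices (3 threads x {A, B, C}) as a
// fraction of the RPI 4's 1 MiB shared L2 cache.
int main() {
    const std::size_t cache = 1024 * 1024; // 1 MiB
    for (std::size_t n : {16, 64, 128}) {
        const std::size_t bytes = n * n * sizeof(double) * 9;
        std::printf("%3zux%-3zu: %9zu bytes = %5.1f%% of L2\n",
                    n, n, bytes, 100.0 * bytes / cache);
    }
}

This prints 18,432 bytes (1.8%) for 16x16, 294,912 bytes (28.1%) for 64x64, and 1,179,648 bytes (112.5%) for 128x128, matching the numbers above.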

I think the results shown in the previous charts confirm @Surt's hypothesis, and I will accept their answer as correct. Checking the charts carefully, it is possible to see a slight difference between the single-thread scenario and the scenarios with 2 or more threads when the matrix size is 16 or 64. I think that is due to general OS scheduler overhead.

Solution

Basically, your PI has a much smaller cache than your desktop PC's CPU, which means your program runs into cache collisions much more often on the PI than on the PC.

Each of your 512x512 matrices takes 512 * 512 * 4 bytes (sizeof(float)), or 1 MB per instance. (The code actually uses MatrixXd, i.e. double, so each matrix is really 2 MB, which only makes matters worse.)

On your PC, with perhaps 12 MB of L3 cache, you will never need to go to RAM (except maybe on allocation), while the tiny cache on the PI will be blown away.

The Raspberry Pi 4 uses a Broadcom BCM2711 SoC with a 1.5 GHz 64-bit quad-core ARM Cortex-A72 processor, with 1 MiB shared L2 cache.
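If you want to confirm that figure on the device itself, the cache hierarchy is usually exposed through sysfs; a minimal sketch, assuming the common Linux layout where index2 is the unified L2 (the index and path can differ between kernels and SoCs):

#include <fstream>
#include <iostream>
#include <string>

int main() {
    // Path is an assumption based on the usual Linux sysfs layout; it may
    // be absent or numbered differently on some kernels.
    std::ifstream f("/sys/devices/system/cpu/cpu0/cache/index2/size");
    std::string size;
    if (f >> size)
        std::cout << "L2 cache: " << size << "\n"; // e.g. "1024K"
    else
        std::cout << "cache info not exposed on this kernel\n";
}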

So the PI will spend a lot of time pulling data in from RAM.

If, on the other hand, you had split the same work between the different threads, then you might have seen improved performance (or at least no decrease); some blocking scheme might even have leveraged the L1 cache (on both machines). A sketch of that idea follows.
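As an illustration of that suggestion (not code from the question): one way to split a single product across threads is to share A and B read-only and give each thread a distinct band of rows of C, so the threads cooperate on one working set instead of competing with three independent ones. A minimal sketch:

#include <Eigen/Dense>
#include <thread>
#include <vector>

using Eigen::MatrixXd;

// Sketch: compute C = A * B with nthreads threads, each writing a disjoint
// horizontal slice of C. A and B are shared read-only, so their cached
// blocks can be reused across cores instead of being evicted by per-thread
// private copies.
void parallel_multiply(const MatrixXd &A, const MatrixXd &B, MatrixXd &C,
                       int nthreads) {
    const int rows = static_cast<int>(A.rows());
    C.resize(rows, B.cols()); // size once, before the threads start
    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; ++t) {
        const int begin = rows * t / nthreads;
        const int count = rows * (t + 1) / nthreads - begin;
        pool.emplace_back([&, begin, count] {
            // Disjoint row blocks: no write sharing between threads.
            C.middleRows(begin, count).noalias() =
                A.middleRows(begin, count) * B;
        });
    }
    for (auto &th : pool) th.join();
}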
