Multi-threaded matrix multiplication performance issue


Question

I am using Java for multi-threaded matrix multiplication as practice in multi-threaded programming. The following is code that I took from another Stack Overflow post.

public class MatMulConcur {

    // Number of worker threads; the question varies this from 1 to 100.
    private static final int NUM_OF_THREAD = 1;
    // Result matrix shared by all workers; each worker writes only its own rows.
    private static Mat matC;

    // Mat is the asker's matrix class (not shown in the question).
    public static Mat matmul(Mat matA, Mat matB) {
        matC = new Mat(matA.getNRows(), matB.getNColumns());
        return mul(matA, matB);
    }

    private static Mat mul(Mat matA, Mat matB) {
        int numRowForThread;
        int numRowA = matA.getNRows();
        int startRow = 0;

        Worker[] myWorker = new Worker[NUM_OF_THREAD];

        // Split the rows of A into contiguous bands, one band per worker.
        for (int j = 0; j < NUM_OF_THREAD; j++) {
            if (j < NUM_OF_THREAD - 1) {
                numRowForThread = numRowA / NUM_OF_THREAD;
            } else {
                // The last worker also takes the leftover rows.
                numRowForThread = (numRowA / NUM_OF_THREAD) + (numRowA % NUM_OF_THREAD);
            }
            myWorker[j] = new Worker(startRow, startRow + numRowForThread, matA, matB);
            myWorker[j].start();
            startRow += numRowForThread;
        }

        // Wait for every worker to finish before returning the result.
        for (Worker worker : myWorker) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                // Preserve the interrupt status instead of silently swallowing it.
                Thread.currentThread().interrupt();
            }
        }
        return matC;
    }

    private static class Worker extends Thread {

        private final int startRow, stopRow;
        private final Mat matA, matB;

        public Worker(int startRow, int stopRow, Mat matA, Mat matB) {
            super();
            this.startRow = startRow;
            this.stopRow = stopRow;
            this.matA = matA;
            this.matB = matB;
        }

        // Computes rows [startRow, stopRow) of the product with the naive triple loop.
        @Override
        public void run() {
            for (int i = startRow; i < stopRow; i++) {
                for (int j = 0; j < matB.getNColumns(); j++) {
                    double sum = 0;
                    for (int k = 0; k < matA.getNColumns(); k++) {
                        sum += matA.get(i, k) * matB.get(k, j);
                    }
                    matC.set(i, j, sum);
                }
            }
        }
    }
}

I ran this program with 1, 10, 20, ..., 100 threads, but performance got worse instead of better. The timings are as follows:

  1. 1 thread: 18 ms
  2. 10 threads: 18 ms
  3. 20 threads: 35 ms
  4. 30 threads: 38 ms
  5. 40 threads: 43 ms
  6. 50 threads: 48 ms
  7. 60 threads: 57 ms
  8. 70 threads: 66 ms
  9. 80 threads: 74 ms
  10. 90 threads: 87 ms
  11. 100 threads: 98 ms

Any ideas?

Recommended answer

People think that using multiple threads will automatically (magically!) make any computation go faster. This is not so [1].

There are a number of factors that can make multi-threading speedup less than you expect, or indeed result in a slowdown.

  1. A computer with N cores (or hyperthreads) can do computations at most N times as fast as a computer with 1 core. This means that when you have T threads where T > N, the computational performance will be capped at N. (Beyond that, the threads only make progress by time slicing.)

  2. A computer has a certain amount of memory bandwidth; i.e. it can only perform a certain number of read/write operations per second on main memory. If you have an application where the demand exceeds what the memory subsystem can achieve, it will stall (for a few nanoseconds). If there are many cores executing many threads at the same time, then it is the aggregate demand that matters.

  3. A typical multi-threaded application working on shared variables or data structures will either use volatile or explicit synchronization to do this. Both of these increase the demand on the memory system.

  4. When explicit synchronization is used and two threads want to hold a lock at the same time, one of them will be blocked. This lock contention slows down the computation. Indeed, the computation is likely to be slowed down if there was any past contention on the lock.

  5. Thread creation is expensive. Even acquiring an existing thread from a thread pool can be relatively expensive. If the task that you perform with the thread is too small, the setup costs can outweigh the possible speedup.
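Points 1 and 5 suggest capping the number of workers at the core count and reusing pooled threads rather than starting one thread per slice. A minimal sketch of that idea follows. It assumes plain `double[][]` arrays in place of the question's `Mat` class (which is not shown), and the class name `PooledMatMul` and its `multiply` method are hypothetical, not part of the original code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: double[][] stands in for the question's Mat class.
class PooledMatMul {

    // Multiplies a (n x m) by b (m x p) using at most `threads` workers.
    static double[][] multiply(double[][] a, double[][] b, int threads) {
        int n = a.length, m = b.length, p = b[0].length;
        double[][] c = new double[n][p];

        // Never use more workers than there are cores or rows to process.
        int cores = Runtime.getRuntime().availableProcessors();
        int workers = Math.max(1, Math.min(threads, Math.min(cores, n)));

        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<?>> pending = new ArrayList<>();
            int chunk = (n + workers - 1) / workers; // rows per task, rounded up
            for (int start = 0; start < n; start += chunk) {
                final int from = start;
                final int to = Math.min(n, start + chunk);
                // Each task writes a disjoint band of rows, so no locking is needed.
                pending.add(pool.submit(() -> {
                    for (int i = from; i < to; i++) {
                        for (int j = 0; j < p; j++) {
                            double sum = 0;
                            for (int k = 0; k < m; k++) {
                                sum += a[i][k] * b[k][j];
                            }
                            c[i][j] = sum;
                        }
                    }
                }));
            }
            for (Future<?> f : pending) {
                f.get(); // waits for the task and surfaces any failure
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while multiplying", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("worker failed", e.getCause());
        } finally {
            pool.shutdown();
        }
        return c;
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}};
        double[][] b = {{5, 6}, {7, 8}};
        // Requests 100 threads; the cap reduces this to the core or row count.
        System.out.println(Arrays.deepToString(multiply(a, b, 100)));
        // prints [[19.0, 22.0], [43.0, 50.0]]
    }
}
```

Whether this beats the single-threaded loop still depends on the matrix size: for small inputs, even a pooled hand-off can cost more than the arithmetic it parallelizes.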

There is also the possibility that the benchmark itself is poorly written; e.g. the JVM may not have been properly warmed up before the timing measurements were taken.
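As a rough illustration of what warming up means, the sketch below runs the workload several times untimed before measuring, then reports the best of several timed runs. The class `WarmedUpBenchmark`, the helper `bestNanos`, the 200 x 200 size, and the iteration counts are all assumptions chosen for illustration; a dedicated harness such as JMH is generally a more reliable option for serious measurements.

```java
// Hypothetical sketch of a warmed-up timing loop.
class WarmedUpBenchmark {

    // Runs `task` untimed `warmups` times, then returns the fastest of `runs` timed runs in nanoseconds.
    static long bestNanos(Runnable task, int warmups, int runs) {
        for (int i = 0; i < warmups; i++) {
            task.run(); // gives the JIT a chance to compile the hot loops
        }
        long best = Long.MAX_VALUE;
        for (int i = 0; i < runs; i++) {
            long t0 = System.nanoTime();
            task.run();
            best = Math.min(best, System.nanoTime() - t0);
        }
        return best;
    }

    // Naive single-threaded multiply used as the workload being measured.
    static double[][] multiply(double[][] a, double[][] b) {
        int n = a.length, m = b.length, p = b[0].length;
        double[][] c = new double[n][p];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < p; j++) {
                double sum = 0;
                for (int k = 0; k < m; k++) {
                    sum += a[i][k] * b[k][j];
                }
                c[i][j] = sum;
            }
        }
        return c;
    }

    public static void main(String[] args) {
        int n = 200; // assumed size; the question does not state it
        double[][] a = new double[n][n];
        double[][] b = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                a[i][j] = i + j;
                b[i][j] = i - j;
            }
        }
        // Storing the result keeps the work from being discarded as dead code.
        double[][][] sink = new double[1][][];
        long ns = bestNanos(() -> sink[0] = multiply(a, b), 20, 10);
        System.out.printf("best of 10 runs: %.3f ms (c[0][0] = %.1f)%n", ns / 1e6, sink[0][0][0]);
    }
}
```

Timing a single cold run, as the figures in the question may have done, mixes class loading and JIT compilation time into the result.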

There is insufficient detail in your question to be sure which of the above factors is affecting your application's performance. But it is likely to be a combination of 1, 2 and 5, depending on how many cores are used, how big the CPU's memory caches are, how big the matrix is, and other factors.
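To judge which factors apply, it can help to print the core count next to a rough estimate of the working set and the amount of arithmetic. The sketch below does that for a hypothetical 100 x 100 problem. The size is an assumption (the question does not state it), and `WorkloadInfo` and its methods are illustrative names only.

```java
// Hypothetical diagnostic: reports the quantities the answer says matter.
class WorkloadInfo {

    // Bytes for A (n x m), B (m x p) and C (n x p) stored as 8-byte doubles,
    // ignoring object headers and any overhead of the Mat class.
    static long workingSetBytes(int n, int m, int p) {
        return 8L * ((long) n * m + (long) m * p + (long) n * p);
    }

    // Number of multiply-add steps performed by the naive triple loop.
    static long multiplyAdds(int n, int m, int p) {
        return (long) n * m * p;
    }

    public static void main(String[] args) {
        int n = 100; // assumed matrix dimension, not taken from the question
        System.out.println("cores available: " + Runtime.getRuntime().availableProcessors());
        System.out.println("working set (bytes): " + workingSetBytes(n, n, n));
        System.out.println("multiply-adds: " + multiplyAdds(n, n, n));
    }
}
```

For an input as small as this assumed one, the whole multiplication is only about a million multiply-adds, so the cost of starting dozens of threads can plausibly dominate the running time.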

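[1] Indeed, if it were so, we would not need to buy computers with lots of cores. We could just use more and more threads. Provided you had enough memory, you could do an infinite amount of computation on a single machine. Bitcoin mining would be a doddle. Of course, it isn't true.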
