Analysis: Performance of ForkJoinPool


The Question

Since Fork-Join seems to be the current hype and is recommended in many answers, I thought: why not do some research on how fast it actually is?

To measure this, I wrote a small program (see the code below) that does some adding of numbers and forks it out with various parameters, including the number of threads, the fork-depth and the fork-spread, and then measured the execution time, especially the time spent on the actual calculation versus the time spent on forking.

Abstract Answer

While well implemented, ForkJoin is an extremely inefficient way to execute tasks in parallel, as the cost per fork is very high. A naive problem-optimized implementation can easily achieve 99% of thread execution time (which beats everything measured with Fork-Join), and therefore such an implementation is always faster than a Fork-Join implementation. In addition, if the actual task per fork is minor, a Fork-Join implementation can be much slower than even a single-threaded linear implementation.

So Fork-Join is more a question of whether it helps the architecture of your code, as it has no performance benefit over other implementations. Fork-Join should therefore only be used if:

  • Performance is not critical and tasks frequently need to wait for the results of other tasks to continue. So basically, if the Fork-Join structure considerably simplifies the task over a naive implementation.

  • The actual tasks greatly outweigh the cost of forking, so the loss becomes negligible. In my test, a loop adding 2 values had to loop at least 10000 times per fork to get reasonable performance.

See here for a more in-depth analysis that I was pointed to.
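
The naive problem-optimized implementation used as the baseline above is not shown in the question. As a rough illustration of what such an implementation might look like, the sketch below statically partitions the same leaf work across plain threads; the class name and partitioning scheme are illustrative, while the constants mirror the benchmark code further down.

import java.util.concurrent.atomic.AtomicLong;

// Illustrative naive alternative: split the LOOPS leaf tasks evenly across
// NUM_THREADS plain threads; no forking, no work-stealing.
public class NaiveParallel {

  static final int NUM_THREADS = 4;
  static final int LOOPS = 4000000;
  static final int CALCULATION_N = 10000;

  public static void main(String[] args) throws InterruptedException {
    final AtomicLong checksum = new AtomicLong(0);
    final Thread[] threads = new Thread[NUM_THREADS];
    for (int t = 0; t < NUM_THREADS; ++t) {
      threads[t] = new Thread(() -> {
        int local = 0;
        for (int loop = 0; loop < LOOPS / NUM_THREADS; ++loop) {
          int a = 0, b = 1, h; // same Fibonacci leaf loop as in the benchmark
          for (int n = 0; n < CALCULATION_N; ++n) {
            h = b;
            b = a + b;
            a = h;
          }
          local += b;
        }
        checksum.addAndGet(local);
      });
      threads[t].start();
    }
    for (final Thread thread : threads) {
      thread.join();
    }
    System.out.println("Checksum: " + checksum.get());
  }
}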


Test Setup

In my program I had a RecursiveTask calculate a Fibonacci series for a given N, which reduces the actual calculation to 3 assignments and 1 addition. For any given CPU this should be a minor task.

Over the tests I varied the number of threads, the number of forks per task and the length of the Fibonacci loop. In addition I did some tests with the async parameter, but setting it to false only showed a minor reduction in calculation time, so I skipped that. The spread parameter (forks per fork) was likewise mostly skipped, as it made no significant difference in the results.

In general the calculation time was very stable; the actual percentage of time spent on the task usually varied by less than 1%. Each test set was therefore run about 5 times (or more if the numbers were unstable) on an otherwise idle system with 4 cores (+4 hyper-threads), and the median execution time was selected.

Correct execution was verified through various test variables; in particular, the number of threads actually used was confirmed to never differ from the initially given parallelism parameter.

Detailed Test Results

Where:

  • Time total is the total time the entire calculation took from the perspective of the main thread.
  • Time task is the time spent on actually calculating the Fibonacci series in all forks combined.
  • Time task percentage is the relative gain thanks to threading (time task / time total).
  • spread->depth is the (set) spread (fork per fork) and the (calculated) forking-depth.
  • threads is the actual number of threads used.
  • task-time/thread is the time each thread actually spent on calculating the Fibonacci series, overall.
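
For example, in the two-thread Fib(1000) row below: time task / time total = 5244.241 ms / 3160.818 ms ≈ 165.91%, and dividing by the 2 threads gives the task-time/thread of ≈ 82.96%. Percentages above 100% simply mean that the task time summed across all threads exceeds the wall-clock time.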

Spread->depth test:

Time total: 8766.670 ms, time task: 1717.418 ms ( 19.59%), spread->depth:  2->26, thread#: 1, task-time/thread: 19.59%
Time total: 7872.244 ms, time task: 1421.478 ms ( 18.06%), spread->depth: 10-> 8, thread#: 1, task-time/thread: 18.06%
Time total: 7336.052 ms, time task: 1280.036 ms ( 17.45%), spread->depth: 100-> 4, thread#: 1, task-time/thread: 17.45%

Conclusion: The number of forks has only a minor effect (still, fewer forks = better); the implementation seems to be fairly sophisticated. Similar results were collected with other settings, so I skip those here.

Fib(0) (almost all time spent on forking)

Time total: 7866.777 ms, time task: 1421.488 ms ( 18.07%), spread->depth: 10-> 8, thread#: 1, task-time/thread: 18.07%
Time total: 7085.142 ms, time task: 1349.207 ms ( 19.04%), spread->depth: 10-> 8, thread#: 2, task-time/thread:  9.52%
Time total: 6580.609 ms, time task: 1476.467 ms ( 22.44%), spread->depth: 10-> 8, thread#: 4, task-time/thread:  5.61%

Conclusion: With a very minor task, most time is spent on forking, making a single-threaded implementation about 5 times faster than any Fork-Join setup. Even with multiple threads it is impossible to gain any performance increase using Fork-Join.

Fib(100)

Time total: 12487.634 ms, time task: 5707.720 ms ( 45.71%), spread->depth: 10-> 8, thread#: 1, task-time/thread: 45.71%
Time total:  8386.855 ms, time task: 5768.881 ms ( 68.78%), spread->depth: 10-> 8, thread#: 2, task-time/thread: 34.39%
Time total:  7078.769 ms, time task: 6086.997 ms ( 85.99%), spread->depth: 10-> 8, thread#: 4, task-time/thread: 21.50%

Conclusion: This seems to be close to the break-even point for single-threaded execution, while multi-threading begins to have an impact. Still, a single-threaded implementation would be faster than any Fork-Join setup.

Fib(1000)

Time total:  5941.344 ms, time task:  5228.258 ms ( 88.00%), spread->depth: 10-> 7, thread#: 1, task-time/thread: 88.00%
Time total:  3160.818 ms, time task:  5244.241 ms (165.91%), spread->depth: 10-> 7, thread#: 2, task-time/thread: 82.96%
Time total: 16301.697 ms, time task: 53351.694 ms (327.28%), spread->depth: 10-> 8, thread#: 4, task-time/thread: 81.82%

Conclusion: Times start to stabilize here for multi-threaded execution, with a near-linear gain, while still ~20% of the calculation time per thread is spent on forking. While at this point forking can increase performance through threading, a naive implementation would still be noticeably faster.

Fib(10000)

Time total:  5204.786 ms, time task:  5119.133 ms ( 98.35%), spread->depth: 10-> 6, thread#: 1, task-time/thread: 98.35%
Time total: 26033.889 ms, time task: 51084.118 ms (196.22%), spread->depth: 10-> 7, thread#: 2, task-time/thread: 98.11%
Time total: 13183.573 ms, time task: 51637.471 ms (391.68%), spread->depth: 10-> 7, thread#: 4, task-time/thread: 97.92%

Conclusion: At this size the calculation outweighs the cost of forking. While a naive implementation would still be slightly faster, the loss caused by forking is negligible if the task would have been much harder to implement any other way.

Code

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinWorkerThread;
import java.util.concurrent.RecursiveTask;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

public class Test {

  static final int NUM_THREADS = 4;
  static final int SPREAD = 10;
  static final int LOOPS = 4000000;
  static final int CALCULATION_N = 10000;
  static final boolean DO_ASYNC = true;
  //---
  static final long MAX_DEPTH = Math.round(Math.log(LOOPS) / Math.log(SPREAD)); // try to have the execution take about the same time

  private static class Task extends RecursiveTask<Integer> {

    final static AtomicLong timeExecute = new AtomicLong(0);
    final static AtomicLong totalLoops = new AtomicLong(0);
    final long depth;

    public Task(final long depth) {
      this.depth = depth;
    }

    @Override
    protected Integer compute() {
      if (depth < MAX_DEPTH) {
        final Task[] subTasks = new Task[SPREAD];
        for (int i = 0; i < subTasks.length; ++i) {
          subTasks[i] = new Task(depth + 1);
        }
        try {
          invokeAll(subTasks); // fork all subtasks and wait for them to complete
          final long startTime = System.nanoTime();
          int result = 0;
          for (final Task task : subTasks) {
            if (task.isCompletedNormally()) {
              result += task.get();
            }
          }
          timeExecute.addAndGet(System.nanoTime() - startTime);
          return result;
        } catch (Exception e) {
          this.completeExceptionally(e);
          return null;
        }
      } else { // leaf task: run the actual Fibonacci loop
        totalLoops.incrementAndGet();
        final long startTime = System.nanoTime();
        int a = 0, b = 1, h;
        for (int n = 0; n < CALCULATION_N; ++n) {
          h = b;
          b = a + b;
          a = h;
        }
        timeExecute.addAndGet(System.nanoTime() - startTime);
        return b;
      }
    }
  }

  public static void main(String[] args) {
    final AtomicInteger threadCount = new AtomicInteger(0);
    final ForkJoinPool pool = new ForkJoinPool(NUM_THREADS, new ForkJoinPool.ForkJoinWorkerThreadFactory() {
      @Override
      public ForkJoinWorkerThread newThread(ForkJoinPool pool) {
        threadCount.getAndIncrement();
        final ForkJoinWorkerThread result = ForkJoinPool.defaultForkJoinWorkerThreadFactory.newThread(pool);
        result.setPriority(Thread.MIN_PRIORITY);
        return result;
      }
    }, null, DO_ASYNC);
    final long startTime = System.nanoTime();
    final Integer result = pool.invoke(new Task(0));
    final double duration = ((double) (System.nanoTime() - startTime)) / 1000000.0;
    final double executionDuration = ((double) Task.timeExecute.get()) / 1000000.0;
    final double executionPercent = executionDuration / duration * 100.0;
    final double executionPercentPerThread = executionPercent / ((double) NUM_THREADS);

    System.out.println("Completed: " + result + " in " + Task.totalLoops.get() + " loops.");
    System.out.println(String.format("Time total: %8.3f ms, time task: %8.3f ms (%6.2f%%), spread->depth: %2d->%2d, thread#: %1d, task-time/thread: %5.2f%%", duration, executionDuration, executionPercent, SPREAD, MAX_DEPTH, threadCount.get(), executionPercentPerThread));
  }
}
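
A note on the depth formula: Math.log is the natural logarithm, so MAX_DEPTH ≈ log_SPREAD(LOOPS); with the constants as listed, round(ln(4000000) / ln(10)) = round(6.60) = 7, matching the 10-> 7 rows above. The other spread->depth combinations in the tables were presumably produced with different LOOPS values.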

Feel free to point out any mistakes or to make suggestions for improvement. I will accept the most valuable answer for some bonus points.

Solution

Suggestions:

  • Print the number of forks made + an estimate of the cost of the work done (i.e. the number of additions, or the total length of the BigIntegers being summed if you switch to them). This proportion will show how effective your forking strategy is and give you an understanding of the minimal job size that makes sense.
  • Check your algorithm - Fibonacci grows exponentially, and your task returns an Integer, so you should get an overflow very soon (see the sketch below).
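
As a sketch of the overflow fix suggested in the second point, the benchmark's leaf Fibonacci loop could be adapted to BigInteger; this is an illustrative adaptation, not part of the original code.

import java.math.BigInteger;

// Leaf Fibonacci loop with BigInteger, so intermediate values no longer
// overflow int (illustrative adaptation of the benchmark's inner loop).
public class BigIntFib {

  static BigInteger fibonacci(final int n) {
    BigInteger a = BigInteger.ZERO, b = BigInteger.ONE, h;
    for (int i = 0; i < n; ++i) {
      h = b;
      b = a.add(b);
      a = h;
    }
    return b;
  }

  public static void main(String[] args) {
    System.out.println(fibonacci(100)); // 573147844013817084101 -- far beyond int range
  }
}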

So, the goal is to choose a threshold which would say to fork or not to fork:

One of the main things to consider when implementing an algorithm using fork/join parallelism is choosing the threshold which determines whether a task will execute a sequential computation rather than forking parallel sub-tasks.

If the threshold is too large, then the program might not create enough tasks to fully take advantage of the available processors/cores.

If the threshold is too small, then the overhead of task creation and management could become significant.

In general, some experimentation will be necessary to find an appropriate threshold value. Source

This also could be useful: How to determine the proper work division threshold of a fork-join task.
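
The threshold idiom the quoted passage describes is conventionally written as a sequential cutoff inside compute(). The following is a generic sketch; the array-summing task and the THRESHOLD value are illustrative and would be tuned by experimentation, as the quote advises.

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Sequential-cutoff pattern: compute directly below THRESHOLD, otherwise
// fork parallel sub-tasks (illustrative sketch, not the benchmark code above).
public class SumTask extends RecursiveTask<Long> {

  static final int THRESHOLD = 10000; // to be found by experimentation

  final long[] data;
  final int from, to;

  SumTask(final long[] data, final int from, final int to) {
    this.data = data;
    this.from = from;
    this.to = to;
  }

  @Override
  protected Long compute() {
    if (to - from <= THRESHOLD) { // small enough: sequential computation
      long sum = 0;
      for (int i = from; i < to; ++i) {
        sum += data[i];
      }
      return sum;
    }
    final int mid = (from + to) >>> 1; // otherwise: fork two halves
    final SumTask left = new SumTask(data, from, mid);
    final SumTask right = new SumTask(data, mid, to);
    left.fork();
    return right.compute() + left.join();
  }

  public static void main(String[] args) {
    final long[] data = new long[1000000];
    java.util.Arrays.fill(data, 1);
    System.out.println(new ForkJoinPool().invoke(new SumTask(data, 0, data.length))); // 1000000
  }
}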
