Approx. of π used to compare Sequential v/s Parallel speeds in Java. Why was .parallel() slower?


Question


Can someone please explain why the sequential version of the π approximation was faster than the parallel one?

I can't figure it out.

I'm playing around with a very well-known π-approximation example. I pick random points in the unit square ((0, 0) to (1, 1)) and count how many of them fall inside the unit circle. That fraction should be π / 4.

import java.util.stream.LongStream;

public class PIEstimation {
    final static int NUM_SAMPLES = 100000000;

    public static void main(String[] args) {
        sequentialVersion();
        parallelVersion();
        System.out.println("               Real PI:= " + Math.PI);
    }

    // Sequential estimate: sample NUM_SAMPLES points on a single thread.
    public static void sequentialVersion() {
        final long start = System.nanoTime();

        final long count = LongStream
            .rangeClosed(1, NUM_SAMPLES)
            .filter(e -> {
                double x = Math.random();
                double y = Math.random();
                return x * x + y * y < 1;
            }).count();

        final long duration = ((System.nanoTime() - start) / 1_000_000);

        System.out.println("Sequential Version: PI ~ " + 4.0 * (count / (double) NUM_SAMPLES) + " calculated in "
            + duration + " msecs");
    }

    // Parallel estimate: same pipeline, but run on the common fork-join pool.
    public static void parallelVersion() {
        final long start = System.nanoTime();

        final long count = LongStream
            .rangeClosed(1, NUM_SAMPLES)
            .parallel()
            .filter(e -> {
                double x = Math.random();
                double y = Math.random();
                return x * x + y * y < 1;
            }).count();

        final long duration = ((System.nanoTime() - start) / 1_000_000);

        System.out.println("  Parallel Version: PI ~ " + 4.0 * (count / (double) NUM_SAMPLES) + " calculated in "
            + duration + " msecs");
    }
}

The results:

Sequential Version: PI ~ 3.14176568 calculated in  4893 msecs
  Parallel Version: PI ~ 3.1417546  calculated in 12044 msecs
               Real PI:= 3.141592653589793

Solution

I get even worse results running in parallel on my machine (3.0 GHz Intel Core i7, two cores, four threads):

sequential: PI ~ 3.14175124 calculated in  4952 msecs
  parallel: PI ~ 3.14167776 calculated in 21320 msecs

I suspect the main reason is that Math.random() is thread-safe, and so it synchronizes around every call. Since there are multiple threads all trying to get random numbers at the same time, they're all contending for the same lock. This adds a tremendous amount of overhead. Note that the specification for Math.random() says the following:

This method is properly synchronized to allow correct use by more than one thread. However, if many threads need to generate pseudorandom numbers at a great rate, it may reduce contention for each thread to have its own pseudorandom-number generator.
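
To see that contention in isolation from the stream machinery, here is a minimal standalone sketch. The ContentionDemo class name, thread count, and iteration count are mine, purely for illustration; it just times the same number of calls to Math.random() and to ThreadLocalRandom from several threads:

import java.util.concurrent.ThreadLocalRandom;
import java.util.function.DoubleSupplier;

public class ContentionDemo {
    static final int THREADS = 4;
    static final int CALLS_PER_THREAD = 10_000_000;

    public static void main(String[] args) throws InterruptedException {
        System.out.println("Math.random():     " + time(Math::random) + " msecs");
        System.out.println("ThreadLocalRandom: " + time(() -> ThreadLocalRandom.current().nextDouble()) + " msecs");
    }

    // Runs the supplied generator in a tight loop on THREADS threads and
    // returns the elapsed wall-clock time in milliseconds.
    static long time(DoubleSupplier gen) throws InterruptedException {
        Thread[] threads = new Thread[THREADS];
        long start = System.nanoTime();
        for (int i = 0; i < THREADS; i++) {
            threads[i] = new Thread(() -> {
                double sink = 0;
                for (int c = 0; c < CALLS_PER_THREAD; c++) {
                    sink += gen.getAsDouble();
                }
                // Keep the JIT from eliminating the loop as dead code.
                if (sink == -1) System.out.println(sink);
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        return (System.nanoTime() - start) / 1_000_000;
    }
}

With several threads all hitting the shared, synchronized generator behind Math.random(), the first timing should come out noticeably slower, while the ThreadLocalRandom loop has no shared state to fight over.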

To avoid lock contention, use ThreadLocalRandom instead:

long count = LongStream.rangeClosed(1, NUM_SAMPLES)
                       .parallel()
                       .filter(e -> {
                           ThreadLocalRandom cur = ThreadLocalRandom.current();
                           double x = cur.nextDouble();
                           double y = cur.nextDouble();
                           return x * x + y * y < 1;
                       })
                       .count();

This gives the following results:

sequential2: PI ~ 3.14169156 calculated in 1171 msecs
  parallel2: PI ~ 3.14166796 calculated in  648 msecs

which is a 1.8x speedup, not too bad for a two-core machine. Note that the ThreadLocalRandom version is also faster when run sequentially, probably because there's no locking overhead at all.

Aside: Normally I'd suggest using JMH for benchmarks. However, this benchmark seems to run long enough to give a reasonable indication of relative speeds. For more precise results, though, I do recommend using JMH.
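
For what it's worth, a minimal JMH sketch of the two ThreadLocalRandom variants could look like the following. The class name PiBench and the fork/warmup/measurement settings are illustrative assumptions, and it requires the jmh-core and jmh-generator-annprocess dependencies on the classpath:

import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.stream.LongStream;

import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Fork(1)
@Warmup(iterations = 3)
@Measurement(iterations = 5)
public class PiBench {
    static final int NUM_SAMPLES = 100_000_000;

    // Sequential baseline using a per-thread generator.
    @Benchmark
    public long sequentialThreadLocal() {
        return countInside(LongStream.rangeClosed(1, NUM_SAMPLES));
    }

    // Parallel version of the same pipeline.
    @Benchmark
    public long parallelThreadLocal() {
        return countInside(LongStream.rangeClosed(1, NUM_SAMPLES).parallel());
    }

    private static long countInside(LongStream samples) {
        return samples.filter(e -> {
            ThreadLocalRandom cur = ThreadLocalRandom.current();
            double x = cur.nextDouble();
            double y = cur.nextDouble();
            return x * x + y * y < 1;
        }).count();
    }
}

Run through the JMH runner (e.g. org.openjdk.jmh.Main), this reports average time per estimate with proper warmup and forking.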

UPDATE

Here are additional results (requested by user3666197 in comments), using a NUM_SAMPLES value of 1_000_000_000 compared to the original 100_000_000. I've copied the results from above for easy comparison.

NUM_SAMPLES = 100_000_000

sequential:  PI ~ 3.14175124 calculated in    4952 msecs
parallel:    PI ~ 3.14167776 calculated in   21320 msecs
sequential2: PI ~ 3.14169156 calculated in    1171 msecs
parallel2:   PI ~ 3.14166796 calculated in     648 msecs

NUM_SAMPLES = 1_000_000_000

sequential:  PI ~ 3.141572896 calculated in  47730 msecs
parallel:    PI ~ 3.141543836 calculated in 228969 msecs
sequential2: PI ~ 3.1414865   calculated in  12843 msecs
parallel2:   PI ~ 3.141635704 calculated in   7953 msecs

The sequential and parallel results are (mostly) the same code as in the question, and sequential2 and parallel2 are using my modified ThreadLocalRandom code. The new timings are overall roughly 10x longer, as one would expect. The longer parallel2 run isn't quite as fast as one would expect, though it's not totally out of line, showing about a 1.6x speedup on a two-core machine.
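
One more aside that isn't from the original answer: ThreadLocalRandom is not the only way to avoid sharing a generator. A chunked variant can give each slice of the work its own SplittableRandom, split from a single seeded root before the parallel stream starts. A minimal sketch, where the class name, chunk count, and seed are arbitrary illustration:

import java.util.SplittableRandom;
import java.util.stream.LongStream;

public class PiSplittable {
    static final int CHUNKS = 256;                          // plenty of tasks for work stealing
    static final long SAMPLES_PER_CHUNK = 100_000_000L / CHUNKS;

    public static void main(String[] args) {
        // Split all generators from one seeded root up front, so each chunk
        // gets an independent generator that is never shared between threads.
        SplittableRandom root = new SplittableRandom(42);
        SplittableRandom[] rngs = new SplittableRandom[CHUNKS];
        for (int i = 0; i < CHUNKS; i++) {
            rngs[i] = root.split();
        }

        long inside = LongStream.range(0, CHUNKS)
                                .parallel()
                                .map(chunk -> countInside(rngs[(int) chunk]))
                                .sum();

        System.out.println("PI ~ " + 4.0 * inside / (SAMPLES_PER_CHUNK * CHUNKS));
    }

    // Counts how many of SAMPLES_PER_CHUNK random points fall inside the unit circle.
    static long countInside(SplittableRandom rnd) {
        long count = 0;
        for (long i = 0; i < SAMPLES_PER_CHUNK; i++) {
            double x = rnd.nextDouble();
            double y = rnd.nextDouble();
            if (x * x + y * y < 1) {
                count++;
            }
        }
        return count;
    }
}

Because chunk i always uses the i-th split of the same root and processes a fixed number of samples, the estimate is reproducible no matter how the chunks are scheduled across threads.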
