Multiplying arrays element-wise has unexpected performance in C#


Problem description


I want to find the best way to multiply two arrays element-wise. This is one part of a wider project where performance is important but not the only consideration.

I started writing some functions today in C# (LINQPad), so nothing has been optimised in any way yet. The output from the code below is as follows:

Environment.ProcessorCount: 4
Vector<double>.Count: 4

For sequential: 129ms, sum: 2.30619276241231E+25
Plinq: 344ms, sum: 2.30619276241231E+25
Parallel.For: 137ms, 2.30619276241231E+25
Simd sequential: 100ms, sum: 2.30619276241231E+25
Simd parallel: 761ms

Each line shows the execution time for the multiplication, plus a sum over the results as a check. There are a few odd results here (and I'm a little rusty in C#, so it could well be my code):

  • the regular for loop is faster than Parallel.For
  • PLINQ is very slow relative to the others - have I done something silly here?
  • SIMD is the fastest, but not by much
  • intermittently, the SIMD method takes a lot longer
  • is parallel SIMD even possible (bonus points for giving an implementation or explanation)?

My code is below; it references the NuGet System.Numerics.Vectors package. I'd appreciate any comments, suggestions, corrections or alternatives...

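using System;
using System.Collections.Concurrent;
using System.Diagnostics;   // Stopwatch (imported by default in LINQPad)
using System.Linq;          // Sum, Skip, Take, ParallelEnumerable (imported by default in LINQPad)
using System.Numerics;
using System.Threading.Tasks;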

void Main()
{
    var random = new Random();

    var arraySize = 20_000_001;

    var x = new double[arraySize];
    var y = new double[arraySize];

    for (var i = 0; i < x.Length; ++i)
    {
        x[i] = random.Next();
        y[i] = random.Next();
    }

    Console.WriteLine($"Environment.ProcessorCount: {Environment.ProcessorCount}");
    Console.WriteLine($"Vector<double>.Count: {Vector<double>.Count}\n");

    MultiplyFor(x, y);
    MultiplyPlinq(x, y);
    MultiplyParallelFor(x, y);
    MultiplySIMD(x, y);
    MultiplyParallelSIMD(x, y);

}

void MultiplyPlinq(double[] x, double[] y)
{
    var result = new double[x.Length];

    var sw = new Stopwatch();

    sw.Start();

    result = ParallelEnumerable.Range(0, x.Length).Select(i => x[i] * y[i]).ToArray();

    sw.Stop();

    Console.WriteLine($"Plinq: {sw.ElapsedMilliseconds}ms, sum: {SumCheck(result)}");
}

double SumCheck(double[] x)
{
    return Math.Round(x.Sum() , 4);
}

void MultiplyFor(double[] x, double[] y)
{
    var result = new double[x.Length];

    var sw = new Stopwatch();

    sw.Start();

    for (var i = 0; i < x.Length; ++i)
    {
        result[i] = x[i] * y[i];
    }

    sw.Stop();

    Console.WriteLine($"For sequential: {sw.ElapsedMilliseconds}ms, sum: {SumCheck(result)}");

}

void MultiplyParallelFor(double[] x, double[] y)
{
    var result = new double[x.Length];

    var sw = new Stopwatch();

    sw.Start();

    Parallel.For(0, x.Length, i =>
    {
        result[i] = x[i] * y[i];
    });

    sw.Stop();

    Console.WriteLine($"Parallel.For: {sw.ElapsedMilliseconds}ms, {SumCheck(result)}");

}

void MultiplySIMD(double[] x, double[] y)
{
    var sw = new Stopwatch();

    sw.Start();

    var result = MultiplyByVectors(x, y);

    sw.Stop();

    // 2 cores, 4 logical, 256b register
    Console.WriteLine($"Simd sequential: {sw.ElapsedMilliseconds}ms, sum: {SumCheck(result)}");
}

double[] MultiplyByVectors(double[] x, double[] y)
{
    var result = new double[x.Length];

    var vectorSize = Vector<double>.Count;

    int i;

    for (i = 0; i < x.Length - vectorSize; i += vectorSize)
    {
        var vx = new Vector<double>(x, i);
        var vy = new Vector<double>(y, i);
        (vx * vy).CopyTo(result, i);
    }

    for (; i < x.Length; i++)
    {
        result[i] = x[i] * y[i];
    }

    return result;
}

void MultiplyParallelSIMD(double[] x, double[] y)
{
    var sw = new Stopwatch();

    sw.Start();

    var chunkSize = (int)(x.Length / Environment.ProcessorCount);

    Parallel.For(0, Environment.ProcessorCount, i => {

        var complete = i * chunkSize;
        var take = Math.Min(chunkSize, x.Length - i * chunkSize);
        var xSegment = x.Skip((int)complete).Take((int)take);
        var ySegment = y.Skip((int)complete).Take((int)take);
        var result = MultiplyByVectors(xSegment.ToArray(), ySegment.ToArray());

    });

    sw.Stop();

    Console.WriteLine($"Simd parallel: {sw.ElapsedMilliseconds}ms");

}

Solution

Parallel.For in its simplest form is not suitable for very granular workloads, because the overhead of invoking an anonymous function on each iteration negates the benefits of parallelism (anonymous functions can't be inlined). The solution is to partition the data, so that multiple partitions are processed in parallel while each partition is processed with a fast direct loop:

Parallel.ForEach(Partitioner.Create(0, x.Length), range =>
{
    for (int i = range.Item1; i < range.Item2; i++)
    {
        result[i] = x[i] * y[i];
    }
});

In its current implementation, the built-in Partitioner creates three times as many partitions as there are CPU cores.
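To check what the partitioner actually hands out on a given machine, something along the following lines may help. This is only a sketch: the class and method names (PartitionSketch, DefaultRanges, Multiply) are made up for illustration, and the explicit rangeSize overload of Partitioner.Create is shown as an alternative to the default, not as something the loop above requires.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical helper class, for illustration only.
public static class PartitionSketch
{
    // Lists the [from, to) index ranges produced by the default range
    // partitioner. length must be greater than 0, because
    // Partitioner.Create rejects an empty range.
    public static Tuple<int, int>[] DefaultRanges(int length)
    {
        return Partitioner.Create(0, length)
                          .GetDynamicPartitions()
                          .OrderBy(r => r.Item1)
                          .ToArray();
    }

    // Same multiplication as in the answer, but with a caller-chosen
    // range size instead of the default one.
    public static double[] Multiply(double[] x, double[] y, int rangeSize)
    {
        if (x.Length != y.Length)
            throw new ArgumentException("Arrays must have the same length.");

        var result = new double[x.Length];
        if (x.Length == 0) return result; // nothing to partition

        Parallel.ForEach(Partitioner.Create(0, x.Length, rangeSize), range =>
        {
            // Plain loop over one range; no delegate call per element.
            for (int i = range.Item1; i < range.Item2; i++)
            {
                result[i] = x[i] * y[i];
            }
        });

        return result;
    }
}
```

Printing DefaultRanges(x.Length).Length shows how many ranges the default partitioner created for the array, and passing different rangeSize values to Multiply is one way to see how sensitive the timing is to the partition size.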

Regarding parallelizing SIMD operations: in my own experiments I haven't observed impressive performance improvements on my PC. My theory (and this is just wild speculation, not an educated guess) is that the SIMD calculations happen so fast that the RAM can't keep up with the rate at which the CPU crunches the data.
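For completeness, parallel SIMD is possible by combining the two ideas: partition the index range, then use Vector<double> inside each partition. The following is a minimal sketch under that assumption, not a tuned implementation; the class name ParallelSimdSketch is made up for illustration. Unlike the MultiplyParallelSIMD in the question, it does not copy segments with Skip/Take/ToArray and it writes into one shared result array. Given the memory-bandwidth speculation above, it may well be no faster than the sequential SIMD version.

```csharp
using System;
using System.Collections.Concurrent;
using System.Numerics;
using System.Threading.Tasks;

// Hypothetical helper class, for illustration only.
public static class ParallelSimdSketch
{
    public static double[] Multiply(double[] x, double[] y)
    {
        if (x.Length != y.Length)
            throw new ArgumentException("Arrays must have the same length.");

        var result = new double[x.Length];
        if (x.Length == 0) return result; // Partitioner.Create rejects an empty range

        int vectorSize = Vector<double>.Count;

        Parallel.ForEach(Partitioner.Create(0, x.Length), range =>
        {
            int i = range.Item1;
            int end = range.Item2;

            // Vectorised part: vectorSize elements per iteration, read
            // directly from x and y and written directly into result.
            for (; i <= end - vectorSize; i += vectorSize)
            {
                var vx = new Vector<double>(x, i);
                var vy = new Vector<double>(y, i);
                (vx * vy).CopyTo(result, i);
            }

            // Scalar tail for the elements that do not fill a whole vector.
            for (; i < end; i++)
            {
                result[i] = x[i] * y[i];
            }
        });

        return result;
    }
}
```

Each partition only touches indices inside its own range, so the parallel writes to result do not overlap. Calling ParallelSimdSketch.Multiply(x, y) in place of MultiplyByVectors(x, y) inside the question's timing code would allow a like-for-like comparison.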
