Vectorized C# code with SIMD using Vector<T> running slower than classic loop
Question
I've seen a few articles describing how Vector<T> is SIMD-enabled and implemented using JIT intrinsics, so the compiler will correctly emit AVX/SSE/... instructions when it is used, allowing much faster code than classic, linear loops (example here).
I decided to try rewriting one of my methods to see if I could get some speedup, but so far I've failed: the vectorized code runs about 3 times slower than the original, and I'm not exactly sure why. Here are the two versions of a method that checks whether the items at each position of two Span<float> instances fall on the same side of a threshold value.
// Classic implementation
public static unsafe bool MatchElementwiseThreshold(this Span<float> x1, Span<float> x2, float threshold)
{
    fixed (float* px1 = &x1.DangerousGetPinnableReference(), px2 = &x2.DangerousGetPinnableReference())
        for (int i = 0; i < x1.Length; i++)
            if (px1[i] > threshold != px2[i] > threshold)
                return false;
    return true;
}
// Vectorized
public static unsafe bool MatchElementwiseThresholdSIMD(this Span<float> x1, Span<float> x2, float threshold)
{
    // Setup the test vector
    int l = Vector<float>.Count;
    float* arr = stackalloc float[l];
    for (int i = 0; i < l; i++)
        arr[i] = threshold;
    Vector<float> cmp = Unsafe.Read<Vector<float>>(arr);
    fixed (float* px1 = &x1.DangerousGetPinnableReference(), px2 = &x2.DangerousGetPinnableReference())
    {
        // Iterate in chunks
        int
            div = x1.Length / l,
            mod = x1.Length % l,
            i = 0,
            offset = 0;
        for (; i < div; i += 1, offset += l)
        {
            Vector<float>
                v1 = Unsafe.Read<Vector<float>>(px1 + offset),
                v1cmp = Vector.GreaterThan<float>(v1, cmp),
                v2 = Unsafe.Read<Vector<float>>(px2 + offset),
                v2cmp = Vector.GreaterThan<float>(v2, cmp);
            float*
                pcmp1 = (float*)Unsafe.AsPointer(ref v1cmp),
                pcmp2 = (float*)Unsafe.AsPointer(ref v2cmp);
            for (int j = 0; j < l; j++)
                if (pcmp1[j] == 0 != (pcmp2[j] == 0))
                    return false;
        }
        // Test the remaining items, if any
        if (mod == 0) return true;
        for (i = x1.Length - mod; i < x1.Length; i++)
            if (px1[i] > threshold != px2[i] > threshold)
                return false;
    }
    return true;
}
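For comparison, here is a sketch (not from the original post) of the same loop with two likely bottlenecks removed: the threshold vector can be built with the Vector<float>(float) broadcast constructor instead of a stackalloc loop, and the per-lane readback through Unsafe.AsPointer can be replaced by a single mask comparison, since Vector.GreaterThan on float vectors has an overload returning Vector<int> lane masks that Vector.EqualsAll can compare in one step. The pointer-based signature is an assumption to keep the sketch self-contained without the preview-era DangerousGetPinnableReference API.

```csharp
// Sketch only: same semantics as the vectorized method above, but the
// scalar per-lane mask readback is replaced by one vector comparison.
using System;
using System.Numerics;
using System.Runtime.CompilerServices;

public static class ThresholdMatch
{
    public static unsafe bool MatchElementwiseThresholdSimd2(float* px1, float* px2, int length, float threshold)
    {
        // Broadcast the threshold into every lane; no stackalloc loop needed
        Vector<float> cmp = new Vector<float>(threshold);
        int l = Vector<float>.Count, i = 0;
        for (; i <= length - l; i += l)
        {
            // GreaterThan(Vector<float>, Vector<float>) returns Vector<int>
            // masks (all bits set or zero per lane), comparable directly
            Vector<int>
                m1 = Vector.GreaterThan(Unsafe.Read<Vector<float>>(px1 + i), cmp),
                m2 = Vector.GreaterThan(Unsafe.Read<Vector<float>>(px2 + i), cmp);
            if (!Vector.EqualsAll(m1, m2))
                return false;
        }
        // Scalar tail for the remaining items
        for (; i < length; i++)
            if (px1[i] > threshold != px2[i] > threshold)
                return false;
        return true;
    }
}
```

Comparing the Vector<float> masks directly would be wrong here, because an all-bits-set float lane is a NaN and NaN never compares equal to itself; the Vector<int> overload avoids that.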
As I said, I've tested both versions using BenchmarkDotNet, and the one using Vector<T> runs around 3 times slower than the other one. I tried running the tests with spans of different lengths (from around 100 to over 2000), but the vectorized method remains much slower.
Am I missing something obvious here?
Thanks!
EDIT: the reason I'm using unsafe code and trying to optimize this method as much as possible without parallelizing it is that it is already being called from within a Parallel.For iteration.
Plus, being able to parallelize the code over multiple threads is generally not a good reason to leave the individual parallel tasks unoptimized.
Answer
I had the same problem. The solution was to uncheck the Prefer 32-bit option in the project properties.
SIMD is only enabled for 64-bit processes, so make sure your app either targets x64 directly or is compiled as Any CPU and not marked as 32-bit preferred. [source]
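To confirm the setting took effect, the acceleration state can be checked at runtime via Vector.IsHardwareAccelerated. A minimal sketch (the program structure is mine, not from the original answer):

```csharp
// Minimal runtime check: in a process running as 32-bit ("Prefer 32-bit"),
// Vector.IsHardwareAccelerated reports false and Vector<T> operations fall
// back to scalar software emulation, which explains the slowdown.
using System;
using System.Numerics;

class VectorCheck
{
    static void Main()
    {
        Console.WriteLine($"64-bit process: {Environment.Is64BitProcess}");
        Console.WriteLine($"SIMD accelerated: {Vector.IsHardwareAccelerated}");
        // Lane count depends on the widest supported register,
        // e.g. 8 floats with AVX, 4 with SSE
        Console.WriteLine($"Vector<float>.Count: {Vector<float>.Count}");
    }
}
```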