Vectorized C# code with SIMD using Vector<T> running slower than classic loop
Question
I've seen a few articles describing how Vector<T> is SIMD-enabled and implemented using JIT intrinsics, so the compiler will correctly emit AVX/SSE/... instructions when it is used, allowing much faster code than classic, linear loops (example here).
I decided to try rewriting one of my methods to see if I could get some speedup, but so far I've failed: the vectorized code runs about 3 times slower than the original, and I'm not exactly sure why. Here are the two versions of a method that checks whether the items at each position of two Span<float> instances fall on the same side of a threshold value.
// Classic implementation
public static unsafe bool MatchElementwiseThreshold(this Span<float> x1, Span<float> x2, float threshold)
{
    fixed (float* px1 = &x1.DangerousGetPinnableReference(), px2 = &x2.DangerousGetPinnableReference())
        for (int i = 0; i < x1.Length; i++)
            if (px1[i] > threshold != px2[i] > threshold)
                return false;
    return true;
}
// Vectorized
public static unsafe bool MatchElementwiseThresholdSIMD(this Span<float> x1, Span<float> x2, float threshold)
{
    // Setup the test vector
    int l = Vector<float>.Count;
    float* arr = stackalloc float[l];
    for (int i = 0; i < l; i++)
        arr[i] = threshold;
    Vector<float> cmp = Unsafe.Read<Vector<float>>(arr);
    fixed (float* px1 = &x1.DangerousGetPinnableReference(), px2 = &x2.DangerousGetPinnableReference())
    {
        // Iterate in chunks
        int
            div = x1.Length / l,
            mod = x1.Length % l,
            i = 0,
            offset = 0;
        for (; i < div; i += 1, offset += l)
        {
            Vector<float>
                v1 = Unsafe.Read<Vector<float>>(px1 + offset),
                v1cmp = Vector.GreaterThan<float>(v1, cmp),
                v2 = Unsafe.Read<Vector<float>>(px2 + offset),
                v2cmp = Vector.GreaterThan<float>(v2, cmp);
            float*
                pcmp1 = (float*)Unsafe.AsPointer(ref v1cmp),
                pcmp2 = (float*)Unsafe.AsPointer(ref v2cmp);
            for (int j = 0; j < l; j++)
                if (pcmp1[j] == 0 != (pcmp2[j] == 0))
                    return false;
        }
        // Test the remaining items, if any
        if (mod == 0) return true;
        for (i = x1.Length - mod; i < x1.Length; i++)
            if (px1[i] > threshold != px2[i] > threshold)
                return false;
    }
    return true;
}
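For comparison, here is a sketch (not from the original post) of the same loop with two likely bottlenecks removed: the threshold vector can be built with the Vector<float>(float) broadcast constructor instead of a stackalloc loop, and the per-lane readback through Unsafe.AsPointer can be replaced by a single mask comparison, since Vector.GreaterThan on float vectors has an overload returning Vector<int> lane masks that Vector.EqualsAll can compare in one step. The pointer-based signature is an assumption to keep the sketch self-contained without the preview-era DangerousGetPinnableReference API.

```csharp
// Sketch only: same semantics as the vectorized method above, but the
// scalar per-lane mask readback is replaced by one vector comparison.
using System;
using System.Numerics;
using System.Runtime.CompilerServices;

public static class ThresholdMatch
{
    public static unsafe bool MatchElementwiseThresholdSimd2(float* px1, float* px2, int length, float threshold)
    {
        // Broadcast the threshold into every lane; no stackalloc loop needed
        Vector<float> cmp = new Vector<float>(threshold);
        int l = Vector<float>.Count, i = 0;
        for (; i <= length - l; i += l)
        {
            // GreaterThan(Vector<float>, Vector<float>) returns Vector<int>
            // masks (all bits set or zero per lane), comparable directly
            Vector<int>
                m1 = Vector.GreaterThan(Unsafe.Read<Vector<float>>(px1 + i), cmp),
                m2 = Vector.GreaterThan(Unsafe.Read<Vector<float>>(px2 + i), cmp);
            if (!Vector.EqualsAll(m1, m2))
                return false;
        }
        // Scalar tail for the remaining items
        for (; i < length; i++)
            if (px1[i] > threshold != px2[i] > threshold)
                return false;
        return true;
    }
}
```

Comparing the Vector<float> masks directly would be wrong here, because an all-bits-set float lane is a NaN and NaN never compares equal to itself; the Vector<int> overload avoids that.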
As I said, I've tested both versions using BenchmarkDotNet, and the one using Vector<T> runs around 3 times slower than the other one. I tried running the tests with spans of different lengths (from around 100 to over 2000), but the vectorized method remains much slower.
Am I missing something obvious here?
Thanks!
EDIT: the reason I'm using unsafe code and trying to optimize this method as much as possible without parallelizing it is that it is already being called from within a Parallel.For iteration.
Plus, being able to parallelize the code over multiple threads is generally not a good reason to leave the individual parallel tasks unoptimized.
Answer
I had the same problem. The solution was to uncheck the Prefer 32-bit option in the project properties.
SIMD is only enabled for 64-bit processes, so make sure your app either targets x64 directly or is compiled as Any CPU and not marked as 32-bit preferred. [source]
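To confirm the setting took effect, the acceleration state can be checked at runtime via Vector.IsHardwareAccelerated. A minimal sketch (the program structure is mine, not from the original answer):

```csharp
// Minimal runtime check: in a process running as 32-bit ("Prefer 32-bit"),
// Vector.IsHardwareAccelerated reports false and Vector<T> operations fall
// back to scalar software emulation, which explains the slowdown.
using System;
using System.Numerics;

class VectorCheck
{
    static void Main()
    {
        Console.WriteLine($"64-bit process: {Environment.Is64BitProcess}");
        Console.WriteLine($"SIMD accelerated: {Vector.IsHardwareAccelerated}");
        // Lane count depends on the widest supported register,
        // e.g. 8 floats with AVX, 4 with SSE
        Console.WriteLine($"Vector<float>.Count: {Vector<float>.Count}");
    }
}
```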