Why is casting a struct via Pointer slow, while Unsafe.As is fast?

The Question

I wanted to make a few integer-sized structs (i.e. 32 and 64 bits) that are easily convertible to/from primitive unmanaged types of the same size (i.e. Int32 and UInt32 for the 32-bit-sized struct in particular).

The structs would then expose additional functionality for bit manipulation / indexing that is not available on integer types directly. Basically, as a sort of syntactic sugar, improving readability and ease of use.
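
For illustration only, the kind of sugar meant here might look roughly like the sketch below (BitView32 and its members are placeholder names for this example, not the actual struct shown further down):

public struct BitView32 {
  private readonly uint _value;

  public BitView32(uint value) => _value = value;

  // Read a single bit by position (0 = least significant).
  public bool this[int bit] => ((_value >> bit) & 1u) != 0;

  // Read the n-th byte (0 = least significant).
  public byte GetByte(int n) => (byte)(_value >> (n * 8));
}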

The important part, however, was performance, in that there should essentially be 0 cost for this extra abstraction (at the end of the day the CPU should "see" the same bits as if it was dealing with primitive ints).

Below is just the very basic struct I came up with. It does not have all the functionality, but enough to illustrate my questions:

[StructLayout(LayoutKind.Explicit, Pack = 1, Size = 4)]
public struct Mask32 {
  [FieldOffset(3)]
  public byte Byte1;
  [FieldOffset(2)]
  public ushort UShort1;
  [FieldOffset(2)]
  public byte Byte2;
  [FieldOffset(1)]
  public byte Byte3;
  [FieldOffset(0)]
  public ushort UShort2;
  [FieldOffset(0)]
  public byte Byte4;

  [DebuggerStepThrough, MethodImpl(MethodImplOptions.AggressiveInlining)]
  public static unsafe implicit operator Mask32(int i) => *(Mask32*)&i;
  [DebuggerStepThrough, MethodImpl(MethodImplOptions.AggressiveInlining)]
  public static unsafe implicit operator Mask32(uint i) => *(Mask32*)&i;
}

The Test

I wanted to test the performance of this struct. In particular, I wanted to see whether it would let me get individual bytes just as quickly as using regular bitwise arithmetic: (i >> 8) & 0xFF (to get the 3rd byte, for example).
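
As a concrete sanity check (this assumes a little-endian machine, which is also what the correspondence between the FieldOffset fields above and the shift amounts relies on), both approaches should yield the same byte:

int i = 0x12345678;

// Bitwise arithmetic: the byte at bit offset 8.
byte viaShift = (byte)((i >> 8) & 0xFF);   // 0x56

// Explicit-layout struct: Byte3 sits at FieldOffset(1), which is the same
// byte of the underlying int on a little-endian machine.
Mask32 m = i;                              // uses the implicit operator defined above
byte viaStruct = m.Byte3;                  // 0x56 as well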

Below you will see a benchmark I came up with:

public unsafe class MyBenchmark {

  const int count = 50000;

  [Benchmark(Baseline = true)]
  public static void Direct() {
    var j = 0;
    for (int i = 0; i < count; i++) {
      //var b1 = i.Byte1();
      //var b2 = i.Byte2();
      var b3 = i.Byte3();
      //var b4 = i.Byte4();
      j += b3;
    }
  }


  [Benchmark]
  public static void ViaStructPointer() {
    var j = 0;
    int i = 0;
    var s = (Mask32*)&i;
    for (; i < count; i++) {
      //var b1 = s->Byte1;
      //var b2 = s->Byte2;
      var b3 = s->Byte3;
      //var b4 = s->Byte4;
      j += b3;
    }
  }

  [Benchmark]
  public static void ViaStructPointer2() {
    var j = 0;
    int i = 0;
    for (; i < count; i++) {
      var s = *(Mask32*)&i;
      //var b1 = s.Byte1;
      //var b2 = s.Byte2;
      var b3 = s.Byte3;
      //var b4 = s.Byte4;
      j += b3;
    }
  }

  [Benchmark]
  public static void ViaStructCast() {
    var j = 0;
    for (int i = 0; i < count; i++) {
      Mask32 m = i;
      //var b1 = m.Byte1;
      //var b2 = m.Byte2;
      var b3 = m.Byte3;
      //var b4 = m.Byte4;
      j += b3;
    }
  }

  [Benchmark]
  public static void ViaUnsafeAs() {
    var j = 0;
    for (int i = 0; i < count; i++) {
      var m = Unsafe.As<int, Mask32>(ref i);
      //var b1 = m.Byte1;
      //var b2 = m.Byte2;
      var b3 = m.Byte3;
      //var b4 = m.Byte4;
      j += b3;
    }
  }

}

The Byte1(), Byte2(), Byte3(), and Byte4() are just extension methods that do get inlined and simply extract the n-th byte with bitwise operations and a cast:

[DebuggerStepThrough, MethodImpl(MethodImplOptions.AggressiveInlining)]
public static byte Byte1(this int it) => (byte)(it >> 24);
[DebuggerStepThrough, MethodImpl(MethodImplOptions.AggressiveInlining)]
public static byte Byte2(this int it) => (byte)((it >> 16) & 0xFF);
[DebuggerStepThrough, MethodImpl(MethodImplOptions.AggressiveInlining)]
public static byte Byte3(this int it) => (byte)((it >> 8) & 0xFF);
[DebuggerStepThrough, MethodImpl(MethodImplOptions.AggressiveInlining)]
public static byte Byte4(this int it) => (byte)it;

Fixed the code to make sure variables are actually used. Also commented out 3 of 4 variables to really test struct casting / member access rather than actually using the variables.
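
One way to make sure the accumulated value is really observed (so the JIT cannot discard the loop as dead code) is to return it from the benchmark method, since BenchmarkDotNet consumes benchmark return values. A sketch of the baseline rewritten in that style (not the exact code that produced the numbers below):

[Benchmark(Baseline = true)]
public static int Direct() {
  var j = 0;
  for (int i = 0; i < count; i++) {
    var b3 = i.Byte3();
    j += b3;
  }
  return j;   // returning j forces the result to be consumed
}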

I ran these in the Release build with optimizations on x64.

Intel Core i7-3770K CPU 3.50GHz (Ivy Bridge), 1 CPU, 8 logical cores and 4 physical cores
Frequency=3410223 Hz, Resolution=293.2360 ns, Timer=TSC
  [Host]     : .NET Framework 4.6.1 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.6.1086.0
  DefaultJob : .NET Framework 4.6.1 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.6.1086.0


            Method |      Mean |     Error |    StdDev | Scaled | ScaledSD |
------------------ |----------:|----------:|----------:|-------:|---------:|
            Direct |  14.47 us | 0.3314 us | 0.2938 us |   1.00 |     0.00 |
  ViaStructPointer | 111.32 us | 0.6481 us | 0.6062 us |   7.70 |     0.15 |
 ViaStructPointer2 | 102.31 us | 0.7632 us | 0.7139 us |   7.07 |     0.14 |
     ViaStructCast |  29.00 us | 0.3159 us | 0.2800 us |   2.01 |     0.04 |
       ViaUnsafeAs |  14.32 us | 0.0955 us | 0.0894 us |   0.99 |     0.02 |

New results after fixing the code:

            Method |      Mean |     Error |    StdDev | Scaled | ScaledSD |
------------------ |----------:|----------:|----------:|-------:|---------:|
            Direct |  57.51 us | 1.1070 us | 1.0355 us |   1.00 |     0.00 |
  ViaStructPointer | 203.20 us | 3.9830 us | 3.5308 us |   3.53 |     0.08 |
 ViaStructPointer2 | 198.08 us | 1.8411 us | 1.6321 us |   3.45 |     0.06 |
     ViaStructCast |  79.68 us | 1.5478 us | 1.7824 us |   1.39 |     0.04 |
       ViaUnsafeAs |  57.01 us | 0.8266 us | 0.6902 us |   0.99 |     0.02 |

Questions

The benchmark results were surprising for me, and that's why I have a few questions:

Fewer questions remain after altering the code so that the variables actually get used.

  1. Why is the pointer stuff so slow?
  2. Why is the cast taking twice as long as the baseline case? Aren't implicit/explicit operators inlined?
  3. How come the new System.Runtime.CompilerServices.Unsafe package (v. 4.5.0) is so fast? I thought it would at least involve a method call...
  4. More generally, how can I make essentially a zero-cost struct that would simply act as a "window" onto some memory or a biggish primitive type like UInt64 so that I can more effectively manipulate / read that memory? What's the best practice here?

The Answer

The answer to this appears to be that the JIT compiler can make certain optimisations better when you are using Unsafe.As().

Unsafe.As() is implemented very simply like this:

public static ref TTo As<TFrom, TTo>(ref TFrom source)
{
    return ref source;
}

That's it!

Here's a test program I wrote to compare that with casting:

using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

namespace Demo
{
    [StructLayout(LayoutKind.Explicit, Pack = 1, Size = 4)]
    public struct Mask32
    {
        [FieldOffset(3)]
        public byte Byte1;
        [FieldOffset(2)]
        public ushort UShort1;
        [FieldOffset(2)]
        public byte Byte2;
        [FieldOffset(1)]
        public byte Byte3;
        [FieldOffset(0)]
        public ushort UShort2;
        [FieldOffset(0)]
        public byte Byte4;
    }

    public static unsafe class Program
    {
        static int count = 50000000;

        public static int ViaStructPointer()
        {
            int total = 0;

            for (int i = 0; i < count; i++)
            {
                var s = (Mask32*)&i;
                total += s->Byte1;
            }

            return total;
        }

        public static int ViaUnsafeAs()
        {
            int total = 0;

            for (int i = 0; i < count; i++)
            {
                var m = Unsafe.As<int, Mask32>(ref i);
                total += m.Byte1;
            }

            return total;
        }

        public static void Main(string[] args)
        {
            var sw = new Stopwatch();

            sw.Restart();
            ViaStructPointer();
            Console.WriteLine("ViaStructPointer took " + sw.Elapsed);

            sw.Restart();
            ViaUnsafeAs();
            Console.WriteLine("ViaUnsafeAs took " + sw.Elapsed);
        }
    }
}

The results I get on my PC (x64 release build) are as follows:

ViaStructPointer took 00:00:00.1314279
ViaUnsafeAs took 00:00:00.0249446

As you can see, ViaUnsafeAs is indeed much quicker.

So let's look at what the compiler has generated:

public static unsafe int ViaStructPointer()
{
    int total = 0;
    for (int i = 0; i < Program.count; i++)
    {
        total += (*(Mask32*)(&i)).Byte1;
    }
    return total;
}

public static int ViaUnsafeAs()
{
    int total = 0;
    for (int i = 0; i < Program.count; i++)
    {
        total += (Unsafe.As<int, Mask32>(ref i)).Byte1;
    }
    return total;
}   

OK, there's nothing obvious there. But what about the IL?

.method public hidebysig static int32 ViaStructPointer () cil managed 
{
    .locals init (
        [0] int32 total,
        [1] int32 i,
        [2] valuetype Demo.Mask32* s
    )

    IL_0000: ldc.i4.0
    IL_0001: stloc.0
    IL_0002: ldc.i4.0
    IL_0003: stloc.1
    IL_0004: br.s IL_0017
    .loop
    {
        IL_0006: ldloca.s i
        IL_0008: conv.u
        IL_0009: stloc.2
        IL_000a: ldloc.0
        IL_000b: ldloc.2
        IL_000c: ldfld uint8 Demo.Mask32::Byte1
        IL_0011: add
        IL_0012: stloc.0
        IL_0013: ldloc.1
        IL_0014: ldc.i4.1
        IL_0015: add
        IL_0016: stloc.1

        IL_0017: ldloc.1
        IL_0018: ldsfld int32 Demo.Program::count
        IL_001d: blt.s IL_0006
    }

    IL_001f: ldloc.0
    IL_0020: ret
}

.method public hidebysig static int32 ViaUnsafeAs () cil managed 
{
    .locals init (
        [0] int32 total,
        [1] int32 i,
        [2] valuetype Demo.Mask32 m
    )

    IL_0000: ldc.i4.0
    IL_0001: stloc.0
    IL_0002: ldc.i4.0
    IL_0003: stloc.1
    IL_0004: br.s IL_0020
    .loop
    {
        IL_0006: ldloca.s i
        IL_0008: call valuetype Demo.Mask32& [System.Runtime.CompilerServices.Unsafe]System.Runtime.CompilerServices.Unsafe::As<int32, valuetype Demo.Mask32>(!!0&)
        IL_000d: ldobj Demo.Mask32
        IL_0012: stloc.2
        IL_0013: ldloc.0
        IL_0014: ldloc.2
        IL_0015: ldfld uint8 Demo.Mask32::Byte1
        IL_001a: add
        IL_001b: stloc.0
        IL_001c: ldloc.1
        IL_001d: ldc.i4.1
        IL_001e: add
        IL_001f: stloc.1

        IL_0020: ldloc.1
        IL_0021: ldsfld int32 Demo.Program::count
        IL_0026: blt.s IL_0006
    }

    IL_0028: ldloc.0
    IL_0029: ret
}

Aha! The only difference here is this:

ViaStructPointer: conv.u
ViaUnsafeAs:      call valuetype Demo.Mask32& [System.Runtime.CompilerServices.Unsafe]System.Runtime.CompilerServices.Unsafe::As<int32, valuetype Demo.Mask32>(!!0&)
                  ldobj Demo.Mask32

On the face of it, you would expect conv.u to be faster than the two instructions used for Unsafe.As. However, it seems that the JIT compiler is able to optimise those two instructions much better than the single conv.u.

It's reasonable to ask why that is - unfortunately I don't have an answer to that yet! I'm almost certain that the call to Unsafe::As<>() is being inlined by the JITTER, and it is being further optimised by the JIT.

There is some information about the Unsafe class' optimisations here.

Note that the IL generated for Unsafe.As<> is simply this:

.method public hidebysig static !!TTo& As<TFrom, TTo> (
        !!TFrom& source
    ) cil managed aggressiveinlining 
{
    .custom instance void System.Runtime.Versioning.NonVersionableAttribute::.ctor() = (
        01 00 00 00
    )
    IL_0000: ldarg.0
    IL_0001: ret
}

Now I think it becomes clearer as to why that can be optimised so well by the JITTER.
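
As a follow-up sketch (not something benchmarked above), the original Mask32 could base its conversion operators on Unsafe.As instead of pointer casts, so that the implicit conversion takes the fast path measured here:

// Inside Mask32, replacing the pointer-based operators from the question:
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static implicit operator Mask32(int i) => Unsafe.As<int, Mask32>(ref i);
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static implicit operator Mask32(uint i) => Unsafe.As<uint, Mask32>(ref i);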
