为什么我的应用程序会花费其生命的24%进行空检查? [英] Why does my application spend 24% of its life doing a null check?

查看:89
本文介绍了为什么我的应用程序会花费其生命的24%进行空检查?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个性能关键的二进制决策树,我想将这个问题集中在一行代码上.下面是二叉树迭代器的代码,以及针对它进行性能分析的结果.

I've got a performance critical binary decision tree, and I'd like to focus this question on a single line of code. The code for the binary tree iterator is below with the results from running performance analysis against it.

        public ScTreeNode GetNodeForState(int rootIndex, float[] inputs)
        {
0.2%        ScTreeNode node = RootNodes[rootIndex].TreeNode;

24.6%       while (node.BranchData != null)
            {
0.2%            BranchNodeData b = node.BranchData;
0.5%            node = b.Child2;
12.8%           if (inputs[b.SplitInputIndex] <= b.SplitValue)
0.8%                node = b.Child1;
            }

0.4%        return node;
        }

BranchData是一个字段,而不是一个属性.我这样做是为了防止没有被内联的风险.

BranchData is a field, not a property. I did this to prevent the risk of it not being inlined.

BranchNodeData类如下:

The BranchNodeData class is as follows:

public sealed class BranchNodeData
{
    /// <summary>
    /// The index of the data item in the input array on which we need to split
    /// </summary>
    internal int SplitInputIndex = 0;

    /// <summary>
    /// The value that we should split on
    /// </summary>
    internal float SplitValue = 0;

    /// <summary>
    /// The nodes children
    /// </summary>
    internal ScTreeNode Child1;
    internal ScTreeNode Child2;
}

如您所见,while循环/空值检查对性能产生了巨大影响.这棵树很大,所以我希望搜索一片叶子会花费一些时间,但是我想了解在那条线上花费的时间不成比例.

As you can see, the while loop / null check is a massive hit on performance. The tree is massive, so I would expect searching for a leaf to take a while, but I'd like to understand the disproportionate amount of time spent on that one line.

我尝试过:

  • 从一会儿分离出Null检查-击中Null检查.
  • 向对象添加一个布尔字段并进行检查,这没有什么区别.所比较的内容无关紧要,关键在于比较.

这是分支预测问题吗?如果是这样,我该怎么办?如果有的话?

Is this a branch prediction issue? If so, what can I do about it? If anything?

我不会假装理解 CIL ,但是我会把它发布给任何人这样他们就可以尝试从中抓取一些信息.

I won't pretend to understand the CIL, but I'll post it for anyone does so they can try to scrape some information from it.

.method public hidebysig
instance class OptimalTreeSearch.ScTreeNode GetNodeForState (
    int32 rootIndex,
    float32[] inputs
) cil managed
{
    // Method begins at RVA 0x2dc8
    // Code size 67 (0x43)
    .maxstack 2
    .locals init (
        [0] class OptimalTreeSearch.ScTreeNode node,
        [1] class OptimalTreeSearch.BranchNodeData b
    )

    IL_0000: ldarg.0
    IL_0001: ldfld class [mscorlib]System.Collections.Generic.List`1<class OptimalTreeSearch.ScRootNode> OptimalTreeSearch.ScSearchTree::RootNodes
    IL_0006: ldarg.1
    IL_0007: callvirt instance !0 class [mscorlib]System.Collections.Generic.List`1<class OptimalTreeSearch.ScRootNode>::get_Item(int32)
    IL_000c: ldfld class OptimalTreeSearch.ScTreeNode OptimalTreeSearch.ScRootNode::TreeNode
    IL_0011: stloc.0
    IL_0012: br.s IL_0039
    // loop start (head: IL_0039)
        IL_0014: ldloc.0
        IL_0015: ldfld class OptimalTreeSearch.BranchNodeData OptimalTreeSearch.ScTreeNode::BranchData
        IL_001a: stloc.1
        IL_001b: ldloc.1
        IL_001c: ldfld class OptimalTreeSearch.ScTreeNode OptimalTreeSearch.BranchNodeData::Child2
        IL_0021: stloc.0
        IL_0022: ldarg.2
        IL_0023: ldloc.1
        IL_0024: ldfld int32 OptimalTreeSearch.BranchNodeData::SplitInputIndex
        IL_0029: ldelem.r4
        IL_002a: ldloc.1
        IL_002b: ldfld float32 OptimalTreeSearch.BranchNodeData::SplitValue
        IL_0030: bgt.un.s IL_0039

        IL_0032: ldloc.1
        IL_0033: ldfld class OptimalTreeSearch.ScTreeNode OptimalTreeSearch.BranchNodeData::Child1
        IL_0038: stloc.0

        IL_0039: ldloc.0
        IL_003a: ldfld class OptimalTreeSearch.BranchNodeData OptimalTreeSearch.ScTreeNode::BranchData
        IL_003f: brtrue.s IL_0014
    // end loop

    IL_0041: ldloc.0
    IL_0042: ret
} // end of method ScSearchTree::GetNodeForState

我决定进行分支预测测试,如果在一段时间内添加了相同的内容,那么我们就有了

I decided to do a branch prediction test, I added an identical if within the while, so we have

while (node.BranchData != null)

if (node.BranchData != null)

在里面.然后,我对此进行了性能分析,执行第一次比较所需的时间比执行总是返回true的第二次比较要花费六倍的时间.因此,看来这确实是一个分支预测问题-我猜我对此无能为力?!

inside that. I then ran performance analysis against that, and it took six times longer to execute the first comparison as it did to execute the second comparison that always returned true. So it looks like it is indeed a branch prediction issue - and I'm guessing there's nothing I can do about it?!

另一个编辑

如果必须从RAM中加载node.BranchData进行while检查,也会发生上述结果-然后将其缓存为if语句.

The above result would also occur if node.BranchData had to be loaded from the RAM for the while check - it would then be cached for the if statement.

这是我关于类似主题的第三个问题.这次,我只关注一行代码. 关于这个问题,我的其他问题是:

This is my third question on a similar topic. This time I'm focusing on a single line of code. My other questions on this subject are:

  • Could I use a faster data structure than a tree for this?
  • Micro optimisations iterating through a tree in C#

推荐答案

那棵大树

到目前为止,处理器执行的最昂贵的操作是不执行指令,而是访问内存.现代的 CPU 的执行核心比内存快许多倍.公共汽车.与距离有关的问题是,电信号必须传播得越远,越难将信号传递到电线的另一端而不会被破坏.解决该问题的唯一方法是使其变慢.将CPU连接到计算机中RAM的电线存在很大问题,您可以弹出机箱并查看电线.

By far the most expensive thing a processor ever does is not executing instructions, it is accessing memory. The execution core of a modern CPU is many times faster than the memory bus. A problem related to distance, the further an electrical signal has to travel, the harder it gets to get that signal delivered to the other end of the wire without it being corrupted. The only cure for that problem is to make it go slower. A big problem with the wires that connect the CPU to the RAM in your machine, you can pop the case and see the wires.

处理器有一个针对此问题的对策,它们使用 cache (将字节的副本存储在RAM中的缓冲区).重要的一个是 L1缓存,通常用于数据的16 KB和用于数据的16 KB指示.很小,允许它靠近执行引擎.从L1缓存中读取字节通常需要2或3个CPU周期.接下来是更大更慢的L2缓存.高档处理器还具有L3高速缓存,更大,更慢.随着制程技术的进步,这些缓冲器占用的空间越来越小,并且随着距离内核的增加而自动变快,这是更新的处理器变得更好以及如何使用越来越多的晶体管的主要原因.

Processors have a counter-measure for this problem, they use caches, buffers that store a copy of the bytes in RAM. An important one is the L1 cache, typically 16 kilobytes for data and 16 kilobytes for instructions. Small, allowing it to be close to the execution engine. Reading bytes from the L1 cache typically takes 2 or 3 CPU cycles. Next up is the L2 cache, bigger and slower. Upscale processors also have an L3 cache, bigger and slower yet. As process technology improves, those buffers take less space and automatically becomes faster as they get closer to the core, a big reason why newer processors are better and how they manage to use an ever increasing number of transistors.

但是,这些缓存不是完美的解决方案.如果其中一个缓存中的数据不可用,则处理器仍将在内存访问上停顿.直到非常慢的内存总线提供了数据,它才能继续.一条指令可能会丢失100个CPU周期.

Those caches are however not a perfect solution. The processor will still stall on a memory access if the data is not available in one of the caches. It cannot continue until the very slow memory bus has supplied the data. Losing a fat hundred CPU cycles is possible on a single instruction.

树结构是一个问题,它们对缓存友好.它们的节点往往分散在整个地址空间中.访问内存的最快方法是通过读取顺序地址. L1高速缓存的存储单位为64字节.换句话说,一旦处理器读取一个字节,下一个63就会非常快,因为它们将出现在缓存中.

Tree structures are a problem, they are not cache friendly. Their nodes tend to be scattered throughout the address space. The fastest way to access memory is by reading from sequential addresses. The unit of storage for the L1 cache is 64 bytes. Or in other words, once the processor reads one byte, the next 63 are very fast since they'll be present in the cache.

这使数组成为目前最有效的数据结构.也是.NET List<>类根本不是列表的原因,它使用数组进行存储.其他集合类型(例如Dictionary)的结构相同,在结构上与数组在远程上并不相似,但在内部是通过数组实现的.

Which makes an array by far the most efficient data structure. Also the reason that the .NET List<> class isn't a list at all, it uses an array for storage. The same for other collection types, like Dictionary, structurally not remotely similar to an array, but internally implemented with arrays.

因此,您的while()语句很可能因为CPU停顿而受苦,因为它取消引用了访问BranchData字段的指针.下一条语句非常便宜,因为while()语句已经完成了从内存中检索值的繁重工作.分配局部变量很便宜,处理器使用缓冲区进行写操作.

So your while() statement is very likely to be suffering from CPU stalls because it is dereferencing a pointer to access the BranchData field. The next statement is very cheap because the while() statement already did the heavy lifting of retrieving the value from memory. Assigning the local variable is cheap, a processor uses a buffer for writes.

否则,这不是一个简单的问题可以解决,将树展平为数组很可能是不切实际的.至少因为您通常无法预测树的节点将以什么顺序访问.一棵红黑树可能会有所帮助,这个问题尚不清楚.因此,得出一个简单的结论是,它已经以您希望的速度运行.而且,如果您需要使其运行得更快,那么您将需要具有更快的内存总线的更好的硬件. DDR4 今年将成为主流.

Not otherwise a simple problem to solve, flattening your tree into arrays is very likely to be unpractical. Not in the least because you typically cannot predict in which order the nodes of the tree will be visited. A red-black tree might help, it isn't clear from the question. So a simple conclusion to draw is that it is already running as fast you can hope for. And if you need it to go faster then you'll need better hardware with a faster memory bus. DDR4 is going mainstream this year.

这篇关于为什么我的应用程序会花费其生命的24%进行空检查?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆