Why is insertion into my tree faster on sorted input than random input?


Problem Description

Now I've always heard binary search trees are faster to build from randomly selected data than ordered data, simply because ordered data requires explicit rebalancing to keep the tree height at a minimum.

Recently I implemented an immutable treap, a special kind of binary search tree which uses randomization to keep itself relatively balanced. In contrast to what I expected, I found I can consistently build a treap about 2x faster and generally better balanced from ordered data than unordered data -- and I have no idea why.

Here's my treap implementation:

  • http://pastebin.com/VAfSJRwZ

And here's a test program:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Diagnostics;

namespace ConsoleApplication1
{

    class Program
    {
        static Random rnd = new Random();
        const int ITERATION_COUNT = 20;

        static void Main(string[] args)
        {
            List<double> rndTimes = new List<double>();
            List<double> orderedTimes = new List<double>();

            rndTimes.Add(TimeIt(50, RandomInsert));
            rndTimes.Add(TimeIt(100, RandomInsert));
            rndTimes.Add(TimeIt(200, RandomInsert));
            rndTimes.Add(TimeIt(400, RandomInsert));
            rndTimes.Add(TimeIt(800, RandomInsert));
            rndTimes.Add(TimeIt(1000, RandomInsert));
            rndTimes.Add(TimeIt(2000, RandomInsert));
            rndTimes.Add(TimeIt(4000, RandomInsert));
            rndTimes.Add(TimeIt(8000, RandomInsert));
            rndTimes.Add(TimeIt(16000, RandomInsert));
            rndTimes.Add(TimeIt(32000, RandomInsert));
            rndTimes.Add(TimeIt(64000, RandomInsert));
            rndTimes.Add(TimeIt(128000, RandomInsert));
            string rndTimesAsString = string.Join("\n", rndTimes.Select(x => x.ToString()).ToArray());

            orderedTimes.Add(TimeIt(50, OrderedInsert));
            orderedTimes.Add(TimeIt(100, OrderedInsert));
            orderedTimes.Add(TimeIt(200, OrderedInsert));
            orderedTimes.Add(TimeIt(400, OrderedInsert));
            orderedTimes.Add(TimeIt(800, OrderedInsert));
            orderedTimes.Add(TimeIt(1000, OrderedInsert));
            orderedTimes.Add(TimeIt(2000, OrderedInsert));
            orderedTimes.Add(TimeIt(4000, OrderedInsert));
            orderedTimes.Add(TimeIt(8000, OrderedInsert));
            orderedTimes.Add(TimeIt(16000, OrderedInsert));
            orderedTimes.Add(TimeIt(32000, OrderedInsert));
            orderedTimes.Add(TimeIt(64000, OrderedInsert));
            orderedTimes.Add(TimeIt(128000, OrderedInsert));
            string orderedTimesAsString = string.Join("\n", orderedTimes.Select(x => x.ToString()).ToArray());

            Console.WriteLine("Done");
        }

        static double TimeIt(int insertCount, Action<int> f)
        {
            Console.WriteLine("TimeIt({0}, {1})", insertCount, f.Method.Name);

            List<double> times = new List<double>();
            for (int i = 0; i < ITERATION_COUNT; i++)
            {
                Stopwatch sw = Stopwatch.StartNew();
                f(insertCount);
                sw.Stop();
                times.Add(sw.Elapsed.TotalMilliseconds);
            }

            return times.Average();
        }

        static void RandomInsert(int insertCount)
        {
            Treap<double> tree = new Treap<double>((x, y) => x.CompareTo(y));
            for (int i = 0; i < insertCount; i++)
            {
                tree = tree.Insert(rnd.NextDouble());
            }
        }

        static void OrderedInsert(int insertCount)
        {
            Treap<double> tree = new Treap<double>((x, y) => x.CompareTo(y));
            for(int i = 0; i < insertCount; i++)
            {
                tree = tree.Insert(i + rnd.NextDouble());
            }
        }
    }
}

And here's a chart comparing random and ordered insertion times in milliseconds:

Insertions         Random          Ordered         RandomTime / OrderedTime
50                 1.031665        0.261585        3.94
100                0.544345        1.377155        0.4
200                1.268320        0.734570        1.73
400                2.765555        1.639150        1.69
800                6.089700        3.558350        1.71
1000               7.855150        4.704190        1.67
2000               17.852000       12.554065       1.42
4000               40.157340       22.474445       1.79
8000               88.375430       48.364265       1.83
16000              197.524000      109.082200      1.81
32000              459.277050      238.154405      1.93
64000              1055.508875     512.020310      2.06
128000             2481.694230     1107.980425     2.24

I don't see anything in the code which makes ordered input asymptotically faster than unordered input, so I'm at a loss to explain the difference.

Why is building a treap from ordered input so much faster than from random input?

Recommended Answer

Self-balancing trees exist to fix the problems associated with non-randomly-distributed data. By definition, they trade away a bit of best-case performance to vastly improve the worst-case performance associated with unbalanced BSTs, specifically that of sorted input.

You're actually overthinking this problem, because slower insertion of random data vs. ordered data is a characteristic of any balanced tree. Try it on an AVL tree and you'll see the same results.

Cameron had the right idea, removing the priority check to force the worst case. If you do that and instrument your tree so you can see how many rebalances are happening for each insert, it actually becomes very obvious what's going on. When inserting sorted data, the tree always rotates left and the root's right child is always empty. Insertion always results in exactly one rebalance because the insertion node has no children and no recursion occurs. On the other hand, when you run it on the random data, almost immediately you start to see multiple rebalances happening on every insert, as many as 5 or 6 of them in the smallest case (50 inserts), because it's happening on subtrees as well.

With priority checking turned back on, not only are rebalances typically less expensive due to more nodes being pushed into the left subtree (where they never come out of because no insertions happen there), but they are also less likely to occur. Why? Because in the treap, high-priority nodes float to the top, and the constant left-rotations (not accompanied by right-rotations) start to push all the high-priority nodes into the left subtree as well. The result is that rebalances happen less frequently due to the uneven distribution of probability.

If you instrument the rebalancing code you'll see that this is true: for both the sorted and the random input you end up with almost identical numbers of left-rotations, but the random input also produces roughly the same number of right-rotations, for about twice as many rotations in all. This shouldn't be surprising - uniformly random input should produce a roughly symmetric split between left- and right-rotations. You'll also see only about 60-70% as many top-level rebalances for the sorted input, which is perhaps surprising; again, that's due to the sorted input skewing the natural distribution of priorities.

You can also verify this by inspecting the full tree at the end of an insertion loop. With the random input, priorities tend to decrease fairly linearly by level; with the sorted input, priorities tend to stay very high until you get to one or two levels from the bottom.

Hopefully I've done a decent job explaining this... let me know if any of it is too vague.
