Why is insertion into my tree faster on sorted input than random input?


Problem description



Now I've always heard binary search trees are faster to build from randomly selected data than ordered data, simply because ordered data requires explicit rebalancing to keep the tree height at a minimum.

Recently I implemented an immutable treap, a special kind of binary search tree which uses randomization to keep itself relatively balanced. In contrast to what I expected, I found I can consistently build a treap about 2x faster and generally better balanced from ordered data than unordered data -- and I have no idea why.

Here's my treap implementation:

And here's a test program:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Diagnostics;

namespace ConsoleApplication1
{

    class Program
    {
        static Random rnd = new Random();
        const int ITERATION_COUNT = 20;

        static void Main(string[] args)
        {
            List<double> rndTimes = new List<double>();
            List<double> orderedTimes = new List<double>();

            rndTimes.Add(TimeIt(50, RandomInsert));
            rndTimes.Add(TimeIt(100, RandomInsert));
            rndTimes.Add(TimeIt(200, RandomInsert));
            rndTimes.Add(TimeIt(400, RandomInsert));
            rndTimes.Add(TimeIt(800, RandomInsert));
            rndTimes.Add(TimeIt(1000, RandomInsert));
            rndTimes.Add(TimeIt(2000, RandomInsert));
            rndTimes.Add(TimeIt(4000, RandomInsert));
            rndTimes.Add(TimeIt(8000, RandomInsert));
            rndTimes.Add(TimeIt(16000, RandomInsert));
            rndTimes.Add(TimeIt(32000, RandomInsert));
            rndTimes.Add(TimeIt(64000, RandomInsert));
            rndTimes.Add(TimeIt(128000, RandomInsert));
            string rndTimesAsString = string.Join("\n", rndTimes.Select(x => x.ToString()).ToArray());

            orderedTimes.Add(TimeIt(50, OrderedInsert));
            orderedTimes.Add(TimeIt(100, OrderedInsert));
            orderedTimes.Add(TimeIt(200, OrderedInsert));
            orderedTimes.Add(TimeIt(400, OrderedInsert));
            orderedTimes.Add(TimeIt(800, OrderedInsert));
            orderedTimes.Add(TimeIt(1000, OrderedInsert));
            orderedTimes.Add(TimeIt(2000, OrderedInsert));
            orderedTimes.Add(TimeIt(4000, OrderedInsert));
            orderedTimes.Add(TimeIt(8000, OrderedInsert));
            orderedTimes.Add(TimeIt(16000, OrderedInsert));
            orderedTimes.Add(TimeIt(32000, OrderedInsert));
            orderedTimes.Add(TimeIt(64000, OrderedInsert));
            orderedTimes.Add(TimeIt(128000, OrderedInsert));
            string orderedTimesAsString = string.Join("\n", orderedTimes.Select(x => x.ToString()).ToArray());

            Console.WriteLine("Done");
        }

        static double TimeIt(int insertCount, Action<int> f)
        {
            Console.WriteLine("TimeIt({0}, {1})", insertCount, f.Method.Name);

            List<double> times = new List<double>();
            for (int i = 0; i < ITERATION_COUNT; i++)
            {
                Stopwatch sw = Stopwatch.StartNew();
                f(insertCount);
                sw.Stop();
                times.Add(sw.Elapsed.TotalMilliseconds);
            }

            return times.Average();
        }

        static void RandomInsert(int insertCount)
        {
            Treap<double> tree = new Treap<double>((x, y) => x.CompareTo(y));
            for (int i = 0; i < insertCount; i++)
            {
                tree = tree.Insert(rnd.NextDouble());
            }
        }

        static void OrderedInsert(int insertCount)
        {
            Treap<double> tree = new Treap<double>((x, y) => x.CompareTo(y));
            for(int i = 0; i < insertCount; i++)
            {
                tree = tree.Insert(i + rnd.NextDouble());
            }
        }
    }
}

And here's a chart comparing random and ordered insertion times in milliseconds:

Insertions         Random          Ordered         RandomTime / OrderedTime
50                 1.031665        0.261585        3.94
100                0.544345        1.377155        0.4
200                1.268320        0.734570        1.73
400                2.765555        1.639150        1.69
800                6.089700        3.558350        1.71
1000               7.855150        4.704190        1.67
2000               17.852000       12.554065       1.42
4000               40.157340       22.474445       1.79
8000               88.375430       48.364265       1.83
16000              197.524000      109.082200      1.81
32000              459.277050      238.154405      1.93
64000              1055.508875     512.020310      2.06
128000             2481.694230     1107.980425     2.24

I don't see anything in the code which makes ordered input asymptotically faster than unordered input, so I'm at a loss to explain the difference.

Why is it so much faster to build a treap from ordered input than random input?

Solution

Self-balancing trees exist to fix the problems associated with non-randomly-distributed data. By definition, they trade away a bit of the best-case performance to vastly improve the worst-case performance associated with non-balanced BSTs, specifically that of sorted input.
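The worst case being traded away can be seen directly with a plain, unbalanced BST. The following is a minimal sketch in Python (not the author's C# `Treap`; the `BSTNode`, `bst_insert`, and `height` names are mine) showing that sorted input degenerates the tree into a linked list, while random input keeps the height logarithmic:

```python
import random

class BSTNode:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def bst_insert(root, key):
    """Plain BST insert with no rebalancing at all."""
    if root is None:
        return BSTNode(key)
    if key < root.key:
        root.left = bst_insert(root.left, key)
    else:
        root.right = bst_insert(root.right, key)
    return root

def height(node):
    if node is None:
        return 0
    return 1 + max(height(node.left), height(node.right))

def build(keys):
    root = None
    for k in keys:
        root = bst_insert(root, k)
    return root

n = 500
sorted_height = height(build(range(n)))  # every key goes right: a 500-long chain
rnd = random.Random(1)
random_height = height(build([rnd.random() for _ in range(n)]))

print("sorted height:", sorted_height)   # n (degenerate)
print("random height:", random_height)   # roughly O(log n)
```

With no rebalancing, sorted input makes every insert walk the full right spine, which is why ordered data is the canonical worst case that balanced trees are designed to fix.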

You're actually overthinking this problem, because slower insertion of random data vs. ordered data is a characteristic of any balanced tree. Try it on an AVL and you'll see the same results.

Cameron had the right idea, removing the priority check to force the worst case. If you do that and instrument your tree so you can see how many rebalances are happening for each insert, it actually becomes very obvious what's going on. When inserting sorted data, the tree always rotates left and the root's right child is always empty. Insertion always results in exactly one rebalance because the insertion node has no children and no recursion occurs. On the other hand, when you run it on the random data, almost immediately you start to see multiple rebalances happening on every insert, as many as 5 or 6 of them in the smallest case (50 inserts), because it's happening on subtrees as well.

With priority checking turned back on, not only are rebalances typically less expensive due to more nodes being pushed into the left subtree (which they never leave, because no insertions happen there), but they are also less likely to occur. Why? Because in the treap, high-priority nodes float to the top, and the constant left-rotations (not accompanied by right-rotations) start to push all the high-priority nodes into the left subtree as well. The result is that rebalances happen less frequently due to the uneven distribution of priorities.

If you instrument the rebalancing code you'll see that this is true; for both the sorted and random input, you end up with almost identical numbers of left-rotations, but the random input also gives the same number of right-rotations, which makes for twice as many in all. This shouldn't be surprising - Gaussian input should result in a Gaussian distribution of rotations. You'll also see that there are only about 60-70% as many top-level rebalances for the sorted input, which perhaps is surprising, and again, that's due to the sorted input messing with the natural distribution of priorities.
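The instrumentation described above can be sketched with a small mutable treap in Python (not the author's immutable C# `Treap`; the `CountingTreap` and `rotation_counts` names are mine, and max-heap priorities are assumed). It counts left and right rotations separately for sorted versus random input:

```python
import random

class Node:
    def __init__(self, key, priority):
        self.key = key
        self.priority = priority
        self.left = None
        self.right = None

class CountingTreap:
    """Treap (max-heap on priority) that counts rotations performed on insert."""
    def __init__(self, seed=None):
        self.root = None
        self.rnd = random.Random(seed)
        self.left_rotations = 0
        self.right_rotations = 0

    def insert(self, key):
        self.root = self._insert(self.root, key)

    def _insert(self, node, key):
        if node is None:
            return Node(key, self.rnd.random())
        if key < node.key:
            node.left = self._insert(node.left, key)
            if node.left.priority > node.priority:   # heap violation on the left
                node = self._rotate_right(node)
        else:
            node.right = self._insert(node.right, key)
            if node.right.priority > node.priority:  # heap violation on the right
                node = self._rotate_left(node)
        return node

    def _rotate_left(self, node):
        self.left_rotations += 1
        r = node.right
        node.right, r.left = r.left, node
        return r

    def _rotate_right(self, node):
        self.right_rotations += 1
        l = node.left
        node.left, l.right = l.right, node
        return l

def rotation_counts(keys, seed=0):
    t = CountingTreap(seed)
    for k in keys:
        t.insert(k)
    return t.left_rotations, t.right_rotations

n = 2000
sorted_left, sorted_right = rotation_counts(range(n))
rnd = random.Random(42)
random_left, random_right = rotation_counts([rnd.random() for _ in range(n)])

print("sorted: left=%d right=%d" % (sorted_left, sorted_right))
print("random: left=%d right=%d" % (random_left, random_right))
```

Because sorted keys always enter along the right spine, every rebalance on the way back up is a left-rotation, so the right-rotation counter stays at zero for sorted input; random input produces both kinds, consistent with the roughly 2x total described above.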

You can also verify this by inspecting the full tree at the end of an insertion loop. With the random input, priorities tend to decrease fairly linearly by level; with the sorted input, priorities tend to stay very high until you get to one or two levels from the bottom.

Hopefully I've done a decent job explaining this... let me know if any of it is too vague.
