Why is processing a sorted array slower than an unsorted array?


Problem Description


I have a list of 500000 randomly generated Tuple<long,long,string> objects on which I am performing a simple "between" search:

var data = new List<Tuple<long,long,string>>(500000);
...
var cnt = data.Count(t => t.Item1 <= x && t.Item2 >= x);

When I generate my random array and run my search for 100 randomly generated values of x, the searches complete in about four seconds. Knowing of the great wonders that sorting does to searching, however, I decided to sort my data - first by Item1, then by Item2, and finally by Item3 - before running my 100 searches. I expected the sorted version to perform a little faster because of branch prediction: my thinking has been that once we get to the point where Item1 == x, all further checks of t.Item1 <= x would predict the branch correctly as "no take", speeding up the tail portion of the search. Much to my surprise, the searches took twice as long on a sorted array!

I tried switching around the order in which I ran my experiments, and used a different seed for the random number generator, but the effect was the same: searches in an unsorted array ran nearly twice as fast as searches in the same array after sorting!

Does anyone have a good explanation of this strange effect? The source code of my tests follows; I am using .NET 4.0.


private const int TotalCount = 500000;
private const int TotalQueries = 100;
private static long NextLong(Random r) {
    var data = new byte[8];
    r.NextBytes(data);
    return BitConverter.ToInt64(data, 0);
}
private class TupleComparer : IComparer<Tuple<long,long,string>> {
    public int Compare(Tuple<long,long,string> x, Tuple<long,long,string> y) {
        var res = x.Item1.CompareTo(y.Item1);
        if (res != 0) return res;
        res = x.Item2.CompareTo(y.Item2);
        return (res != 0) ? res : String.CompareOrdinal(x.Item3, y.Item3);
    }
}
static void Test(bool doSort) {
    var data = new List<Tuple<long,long,string>>(TotalCount);
    var random = new Random(1000000007);
    var sw = new Stopwatch();
    sw.Start();
    for (var i = 0 ; i != TotalCount ; i++) {
        var a = NextLong(random);
        var b = NextLong(random);
        if (a > b) {
            var tmp = a;
            a = b;
            b = tmp;
        }
        var s = string.Format("{0}-{1}", a, b);
        data.Add(Tuple.Create(a, b, s));
    }
    sw.Stop();
    if (doSort) {
        data.Sort(new TupleComparer());
    }
    Console.WriteLine("Populated in {0}", sw.Elapsed);
    sw.Reset();
    var total = 0L;
    sw.Start();
    for (var i = 0 ; i != TotalQueries ; i++) {
        var x = NextLong(random);
        var cnt = data.Count(t => t.Item1 <= x && t.Item2 >= x);
        total += cnt;
    }
    sw.Stop();
    Console.WriteLine("Found {0} matches in {1} ({2})", total, sw.Elapsed, doSort ? "Sorted" : "Unsorted");
}
static void Main() {
    Test(false);
    Test(true);
    Test(false);
    Test(true);
}


Populated in 00:00:01.3176257
Found 15614281 matches in 00:00:04.2463478 (Unsorted)
Populated in 00:00:01.3345087
Found 15614281 matches in 00:00:08.5393730 (Sorted)
Populated in 00:00:01.3665681
Found 15614281 matches in 00:00:04.1796578 (Unsorted)
Populated in 00:00:01.3326378
Found 15614281 matches in 00:00:08.6027886 (Sorted)

Solution

When you are using the unsorted list all tuples are accessed in memory-order. They have been allocated consecutively in RAM. CPUs love accessing memory sequentially because they can speculatively request the next cache line so it will always be present when needed.

When you are sorting the list you put it into random order because your sort keys are randomly generated. This means that the memory accesses to tuple members are unpredictable. The CPU cannot prefetch memory and almost every access to a tuple is a cache miss.
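The effect above can be reproduced without sorting at all: shuffle the references while the objects stay at their allocation addresses, and the traversal becomes pointer-chasing to random heap locations. A minimal sketch (in Java rather than the question's C#, since the mechanism is identical for any GC'd object list; the `LocalityDemo` and `Pair` names are illustrative, not from the original):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Objects are allocated consecutively as they are created. Traversing the
// list in allocation order lets the hardware prefetcher stream cache lines;
// traversing a shuffled copy of the same references dereferences pointers
// to effectively random heap addresses -- the same access pattern the
// sorted tuple list produces, since its sort keys are random.
public class LocalityDemo {
    static final class Pair {
        final long a, b;
        Pair(long a, long b) { this.a = a; this.b = b; }
    }

    // Same "between" count as the question's data.Count(...) query.
    static long countBetween(List<Pair> data, long x) {
        long cnt = 0;
        for (Pair p : data) if (p.a <= x && p.b >= x) cnt++;
        return cnt;
    }

    public static void main(String[] args) {
        Random r = new Random(1000000007);
        List<Pair> inOrder = new ArrayList<>();
        for (int i = 0; i < 500_000; i++) {
            long a = r.nextLong(), b = r.nextLong();
            inOrder.add(new Pair(Math.min(a, b), Math.max(a, b)));
        }
        // Same Pair objects, same heap addresses -- only the visit order changes.
        List<Pair> shuffled = new ArrayList<>(inOrder);
        Collections.shuffle(shuffled, new Random(42));

        long x = r.nextLong();
        long t0 = System.nanoTime();
        long c1 = countBetween(inOrder, x);
        long t1 = System.nanoTime();
        long c2 = countBetween(shuffled, x);
        long t2 = System.nanoTime();
        System.out.println("counts equal: " + (c1 == c2));
        System.out.println("sequential ns: " + (t1 - t0) + ", shuffled ns: " + (t2 - t1));
    }
}
```

Both passes count the same matches; only the memory-access order differs, so any timing gap between them is attributable to cache behavior rather than to the work done per element.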

This is a nice example for a specific advantage of GC memory management: data structures which have been allocated together and are used together perform very nicely. They have great locality of reference.

The penalty from cache misses outweighs the saved branch prediction penalty in this case.

Try switching to a struct-tuple. This will restore performance because no pointer-dereference needs to occur at runtime to access tuple members.
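The struct-tuple advice is C#-specific (a `struct` stores its fields inline in the list's backing array). As a sketch of the same layout idea in Java, which has no value-type tuples, the closest equivalent is parallel primitive arrays; the `StructLikeScan` name and method are hypothetical, not part of the original answer:

```java
// Layout fix sketched with parallel primitive arrays: the longs live
// inline in the arrays, so the scan reads memory sequentially with no
// per-element pointer dereference, and sorting reorders values in place
// instead of scattering references across the heap.
public class StructLikeScan {
    static long countBetween(long[] item1, long[] item2, long x) {
        long cnt = 0;
        for (int i = 0; i < item1.length; i++) {
            if (item1[i] <= x && item2[i] >= x) cnt++;
        }
        return cnt;
    }

    public static void main(String[] args) {
        long[] a = {1, 3, 7};
        long[] b = {5, 4, 9};
        // Intervals (1,5) and (3,4) contain 4; (7,9) does not.
        System.out.println(countBetween(a, b, 4)); // prints 2
    }
}
```

The design trade-off is the same one the struct suggestion makes: value storage costs a copy on every move during sorting, but every read during the hot counting loop becomes a sequential array access.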

Chris Sinclair notes in the comments that "for TotalCount around 10,000 or less, the sorted version does perform faster". This is because a small list fits entirely into the CPU cache. The memory accesses might be unpredictable but the target is always in cache. I believe there is still a small penalty because even a load from cache takes some cycles. But that seems not to be a problem because the CPU can juggle multiple outstanding loads, thereby increasing throughput. Whenever the CPU hits a wait for memory it will still speed ahead in the instruction stream to queue as many memory operations as it can. This technique is used to hide latency.

This kind of behavior shows how hard it is to predict performance on modern CPUs. The fact that we are only 2x slower when going from sequential to random memory access tells me how much is going on under the covers to hide memory latency. A memory access can stall the CPU for 50-200 cycles. Given that number, one could expect the program to become >10x slower when introducing random memory accesses.
