C#最快的2套排序号的交集 [英] C# fastest intersection of 2 sets of sorted numbers

查看:125
本文介绍了C#最快的2套排序号的交集的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我计算了2套数字排序的交叉点在我的应用程序的时间关键部分。这种计算是整个应用的最大瓶颈,所以我需要加速这一过程。

I'm calculating intersection of 2 sets of sorted numbers in a time-critical part of my application. This calculation is the biggest bottleneck of the whole application so I need to speed it up.

我已经试过了一堆简单的选择,我目前使用的:

I've tried a bunch of simple options and am currently using this:

foreach (var index in firstSet)
{
    if (secondSet.BinarySearch(index) < 0)
        continue;

    //do stuff
}

两个 firstSet secondSet 是类型列表中。

我使用LINQ也尝试:

I've also tried using LINQ:

var intersection = firstSet.Where(t => secondSet.BinarySearch(t) >= 0).ToList();

然后通过路口

但由于这两个组进行排序,我觉得有一个更好的方式来做到这一点。请注意,我不能从集合中删除项目,使它们更小。这两套通常由各约50个项目。

But as both of these sets are sorted I feel there's a better way to do it. Note that I can't remove items from sets to make them smaller. Both sets usually consist of about 50 items each.

请帮我家伙,我没有很多时间来熟悉这件事情做。谢谢你。

Please help me guys as I don't have a lot of time to get this thing done. Thanks.

注:我做这个约530万次。所以每微秒计数。

NOTE: I'm doing this about 5.3 million times. So every microsecond counts.

推荐答案

如果您有两套这两者都是排序,可以实现比提供开箱即用LINQ什么更快的交集。

If you have two sets which are both sorted, you can implement a faster intersection than anything provided out of the box with LINQ.

基本上保持两个的IEnumerator&LT; T&GT; 光标开,每个组。在任何时候,任何推进具有较小值。如果他们在任何时候匹配,直到你到达任何迭代器的提前结束他们两个,等等。

Basically, keep two IEnumerator<T> cursors open, one for each set. At any point, advance whichever has the smaller value. If they match at any point, advance them both, and so on until you reach the end of either iterator.

有关这样做的好处是,你只需要遍历各设置一次,而且你可以在O做到这一点(1)内存。

The nice thing about this is that you only need to iterate over each set once, and you can do it in O(1) memory.

下面是一个简单的实现 - 未经检验的,但它确实编译:)它假定两个输入序列是​​重复和无排序,都是按规定的比较器(通过在的Comparer&LT; T&GT; .DEFAULT ):

Here's a sample implementation - untested, but it does compile :) It assumes that both of the incoming sequences are duplicate-free and sorted, both according to the comparer provided (pass in Comparer<T>.Default):

(有在回答最后更多的文字!)

(There's more text at the end of the answer!)

static IEnumerable<T> IntersectSorted<T>(this IEnumerable<T> sequence1,
    IEnumerable<T> sequence2,
    IComparer<T> comparer)
{
    using (var cursor1 = sequence1.GetEnumerator())
    using (var cursor2 = sequence2.GetEnumerator())
    {
        if (!cursor1.MoveNext() || !cursor2.MoveNext())
        {
            yield break;
        }
        var value1 = cursor1.Current;
        var value2 = cursor2.Current;

        while (true)
        {
            int comparison = comparer.Compare(value1, value2);
            if (comparison < 0)
            {
                if (!cursor1.MoveNext())
                {
                    yield break;
                }
                value1 = cursor1.Current;
            }
            else if (comparison > 0)
            {
                if (!cursor2.MoveNext())
                {
                    yield break;
                }
                value2 = cursor2.Current;
            }
            else
            {
                yield return value1;
                if (!cursor1.MoveNext() || !cursor2.MoveNext())
                {
                    yield break;
                }
                value1 = cursor1.Current;
                value2 = cursor2.Current;
            }
        }
    }
}

编辑:正如评论指出的,在某些情况下,你可能有一个输入,它比其他的要大很多,在这种情况下,你可能会节省大量的时间使用二进制搜索从内部的小集合中的每个元素大集。这需要随机访问大集,但(这是二进制搜索只是一个prerequisite)。你甚至可以比一个幼稚的二进制搜索它稍微好一点通过比赛从的 previous 的结果给一个下界的二进制搜索。因此,假设你是在一组与每个整数寻找值1000,2000和3000,从0到19,999。在第一次迭代中,你需要看看在整个集 - 分别为您开始下/上指标将是0和19,999。之后你会发现,比赛在指数1000,但是,的下一步的步骤(在那里你正在寻找2000)与2000年较低的指数开始当你的进步,范围在其中要逐步缩小搜索。不管这是值得额外的实施成本与否另当别论,但是。

As noted in comments, in some cases you may have one input which is much larger than the other, in which case you could potentially save a lot of time using a binary search for each element from the smaller set within the larger set. This requires random access to the larger set, however (it's just a prerequisite of binary search). You can even make it slightly better than a naive binary search by using the match from the previous result to give a lower bound to the binary search. So suppose you were looking for values 1000, 2000 and 3000 in a set with every integer from 0 to 19,999. In the first iteration, you'd need to look across the whole set - your starting lower/upper indexes would be 0 and 19,999 respectively. After you'd found a match at index 1000, however, the next step (where you're looking for 2000) can start with a lower index of 2000. As you progress, the range in which you need to search gradually narrows. Whether or not this is worth the extra implementation cost or not is a different matter, however.

这篇关于C#最快的2套排序号的交集的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆