相交和任何或包含和任何.找到至少一个公共元素哪个更有效? [英] intersect and any or contains and any. Which is more efficient to find at least one common element?

查看:42
本文介绍了相交和任何或包含和任何.找到至少一个公共元素哪个更有效?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

如果我有两个列表,并且想知道是否至少有一个公共元素,那么我有两个选择:

lst1.Intersect(lst2).Any();

Lst1.Any(x => lst2.Contains(x));

这两个选项给了我期望的结果,但是我不知道什么是最好的选择.哪个更有效?为什么呢?

谢谢.

创建此帖子时,除了解决方案外,我还在寻找原因.我知道我可以运行测试,但是我不知道结果的原因.一个比另一个快吗?总是一个解决方案比另一个解决方案最佳吗?

因此,由于这个原因,我接受了Matthew的回答,不仅是针对测试代码,还他解释了何时一个比另一个更好,以及为什么.我也非常感谢尼古拉斯和奥伦的贡献.

谢谢.

解决方案

Oren的答案在秒表的使用方式上有错误.在测量了Any()所花费的时间之后,不会在循环结束时重置它.

请注意,它如何回到循环的开始,而秒表永远不会是Reset(),因此添加到intersect 的时间包括Any()花费的时间. /p>

以下是更正的版本.

在任何调试器外部运行的发布版本都会在我的PC上显示以下结果:

Intersect:    1ms
Any:       6743ms

请注意,我如何为此测试制作两个不重叠的字符串列表.另请注意,这是最坏的情况.

如果有许多交集(或恰好在数据开始处发生交集),那么Oren正确地说Any()应该更快.

如果实际数据通常包含交集,则最好使用Any().否则,请使用Intersect().这完全取决于数据.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

namespace Demo
{
    class Program
    {
        void run()
        {
            double intersect = 0;
            double any = 0;

            Stopwatch stopWatch = new Stopwatch();

            List<string> L1 = Enumerable.Range(0, 10000).Select(x => x.ToString()).ToList();
            List<string> L2 = Enumerable.Range(10000, 10000).Select(x => x.ToString()).ToList();

            for (int i = 0; i < 10; i++)
            {
                stopWatch.Restart();
                Intersect(L1, L2);
                stopWatch.Stop();
                intersect += stopWatch.ElapsedMilliseconds;

                stopWatch.Restart();
                Any(L1, L2);
                stopWatch.Stop();
                any += stopWatch.ElapsedMilliseconds;
            }

            Console.WriteLine("Intersect: " + intersect + "ms");
            Console.WriteLine("Any: " + any + "ms");
        }

        private static bool Any(List<string> lst1, List<string> lst2)
        {
            return lst1.Any(lst2.Contains);
        }

        private static bool Intersect(List<string> lst1, List<string> lst2)
        {
            return lst1.Intersect(lst2).Any();
        }            

        static void Main()
        {
            new Program().run();
        }
    }
}

出于比较目的,我编写了自己的测试以比较int序列:

intersect took 00:00:00.0065928
any took       00:00:08.6706195

代码:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

namespace Demo
{
    class Program
    {
        void run()
        {
            var lst1 = Enumerable.Range(0, 10000);
            var lst2 = Enumerable.Range(10000, 10000);

            int count = 10;

            DemoUtil.Time(() => lst1.Intersect(lst2).Any(), "intersect", count);
            DemoUtil.Time(() => lst1.Any(lst2.Contains),     "any",      count);
        }

        static void Main()
        {
            new Program().run();
        }
    }

    static class DemoUtil
    {
        public static void Print(this object self)
        {
            Console.WriteLine(self);
        }

        public static void Print(this string self)
        {
            Console.WriteLine(self);
        }

        public static void Print<T>(this IEnumerable<T> self)
        {
            foreach (var item in self)
                Console.WriteLine(item);
        }

        public static void Time(Action action, string title, int count)
        {
            var sw = Stopwatch.StartNew();

            for (int i = 0; i < count; ++i)
                action();

            (title + " took " + sw.Elapsed).Print();
        }
    }
}

如果我也将此时间设置为重叠范围,方法是将列表更改为该范围,并将count设置为10000:

var lst1 = Enumerable.Range(10000, 10000);
var lst2 = Enumerable.Range(10000, 10000);

我得到这些结果:

intersect took 00:00:03.2607476
any took 00:00:00.0019170

在这种情况下,Any()显然要快得多.

结论

对于Any(),最坏情况下的性能非常差,但对于Intersect(),可以接受. 最佳情况下的性能对于Any()而言非常好,而对于Intersect()则很差. (Any()的最佳情况可能是Intersect()的最坏情况!)

Any()方法在最坏的情况下为O(N ^ 2),在最坏的情况下为O(1). Intersect()方法始终为O(N)(因为它使用散列而不是排序,否则将为O(N(Log(N))).

您还必须考虑内存使用情况:Intersect()方法需要复制其中一个输入,而Any()不需要.

因此,要做出最佳决策,您确实需要了解真实数据的特征,并实际执行测试.

在最坏的情况下,如果您真的不希望Any()变成O(N ^ 2),则应使用Intersect().但是,可能最好使用Any().

当然,在大多数情况下,这都不重要!

除非您发现代码的这一部分是瓶颈,否则这只是出于学术目的.如果没有问题,您不应该在这种分析上浪费时间. :)

If I have two list and I want to know if there are at least one common element, I have this two options:

lst1.Intersect(lst2).Any();

Lst1.Any(x => lst2.Contains(x));

The two options give me the result that I expect, however I don't know what is the best option. Which is more efficient? And why?

Thanks.

EDIT: when I created this post, apart of the solution, I was looking the reason. I know that I can run tests, but I wouldn't know the reason of the result. One is faster than the other? Is always one solution best than the other?

So for this reason, I hace accepted the answer of Matthew, not only for the test code, but also he explain when one is better than other and why. I appreciate a lot the contributions of Nicholas and Oren too.

Thanks.

解决方案

Oren's answer has an error in the way the stopwatch is being used. It isn't being reset at the end of the loop after the time taken by Any() has been measured.

Note how it goes back to the start of the loop with the stopwatch never being Reset() so that the time that is added to intersect includes the time taken by Any().

Following is a corrected version.

A release build run outside any debugger gives this result on my PC:

Intersect:    1ms
Any:       6743ms

Note how I'm making two non-overlapping string lists for this test. Also note that this is a worst-case test.

Where there are many intersections (or intersections that happen to occur near the start of the data) then Oren is quite correct to say that Any() should be faster.

If the real data usually contains intersections then it's likely that it is better to use Any(). Otherwise, use Intersect(). It's very data dependent.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

namespace Demo
{
    class Program
    {
        void run()
        {
            double intersect = 0;
            double any = 0;

            Stopwatch stopWatch = new Stopwatch();

            List<string> L1 = Enumerable.Range(0, 10000).Select(x => x.ToString()).ToList();
            List<string> L2 = Enumerable.Range(10000, 10000).Select(x => x.ToString()).ToList();

            for (int i = 0; i < 10; i++)
            {
                stopWatch.Restart();
                Intersect(L1, L2);
                stopWatch.Stop();
                intersect += stopWatch.ElapsedMilliseconds;

                stopWatch.Restart();
                Any(L1, L2);
                stopWatch.Stop();
                any += stopWatch.ElapsedMilliseconds;
            }

            Console.WriteLine("Intersect: " + intersect + "ms");
            Console.WriteLine("Any: " + any + "ms");
        }

        private static bool Any(List<string> lst1, List<string> lst2)
        {
            return lst1.Any(lst2.Contains);
        }

        private static bool Intersect(List<string> lst1, List<string> lst2)
        {
            return lst1.Intersect(lst2).Any();
        }            

        static void Main()
        {
            new Program().run();
        }
    }
}

For comparative purposes, I wrote my own test comparing int sequences:

intersect took 00:00:00.0065928
any took       00:00:08.6706195

The code:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

namespace Demo
{
    class Program
    {
        void run()
        {
            var lst1 = Enumerable.Range(0, 10000);
            var lst2 = Enumerable.Range(10000, 10000);

            int count = 10;

            DemoUtil.Time(() => lst1.Intersect(lst2).Any(), "intersect", count);
            DemoUtil.Time(() => lst1.Any(lst2.Contains),     "any",      count);
        }

        static void Main()
        {
            new Program().run();
        }
    }

    static class DemoUtil
    {
        public static void Print(this object self)
        {
            Console.WriteLine(self);
        }

        public static void Print(this string self)
        {
            Console.WriteLine(self);
        }

        public static void Print<T>(this IEnumerable<T> self)
        {
            foreach (var item in self)
                Console.WriteLine(item);
        }

        public static void Time(Action action, string title, int count)
        {
            var sw = Stopwatch.StartNew();

            for (int i = 0; i < count; ++i)
                action();

            (title + " took " + sw.Elapsed).Print();
        }
    }
}

If I also time this for overlapping ranges by changing the lists to this and upping the count to 10000:

var lst1 = Enumerable.Range(10000, 10000);
var lst2 = Enumerable.Range(10000, 10000);

I get these results:

intersect took 00:00:03.2607476
any took 00:00:00.0019170

In this case Any() is clearly much faster.

Conclusion

The worst-case performance is very bad for Any() but acceptible for Intersect(). The best-case performance is extremely good for Any() and bad for Intersect(). (and best-case for Any() is probably worst-case for Intersect()!)

The Any() approach is O(N^2) in the worst case and O(1) in the best case. The Intersect() approach is always O(N) (since it uses hashing, not sorting, otherwise it would be O(N(Log(N))).

You must also consider the memory usage: the Intersect() method needs to take a copy of one of the inputs, whereas Any() doesn't.

Therefore to make the best decision you really need to know the characteristics of the real data, and actually perform tests.

If you really don't want the Any() to turn into an O(N^2) in the worst case, then you should use Intersect(). However, the chances are that you will be best off using Any().

And of course, most of the time none of this matters!

Unless you've discovered this part of the code to be a bottleneck, this is of merely academic interest. You shouldn't waste your time with this kind of analysis if there's no problem. :)

这篇关于相交和任何或包含和任何.找到至少一个公共元素哪个更有效?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
相关文章
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆