With two very large lists/collections - how to detect and/or remove duplicates efficiently


Problem Description

Why

When comparing and deduplicating across two lists, coders under time pressure often don't find the most runtime-efficient implementation. Two nested for-loops are a common go-to solution for many coders. One might try a CROSS JOIN with LINQ, but this is clearly inefficient. Coders need a memorable and code-efficient approach that is also relatively runtime-efficient.

This question was created after seeing a more specific one: Delete duplicates in a single dataset relative to another one in C# - it's more specialised with its use of Datasets, and the term "dataset" would not help people searching for this in the future. No other generalised question was found.

What

I have used the term List/Collection to help with this more general coding problem.

var setToDeduplicate = new List<int>() { 1,2,3,4,5,6,7,8,9,10,11,.....}; //All integer values 1-1M 

var referenceSet = new List<int>() { 1,3,5,7,9,....}; //All odd integer values 1-1M

var deduplicatedSet = deduplicationFunction(setToDeduplicate, referenceSet);

By implementing the deduplicationFunction function, the input data and output should be clear. The output can be IEnumerable. The expected output for this example input would be the even numbers from 1-1M: {2,4,6,8,...}

Note: There may be duplicates within the referenceSet. The values in both sets are indicative only, so I'm not looking for a mathematical solution - this should also work for random number inputs in both sets.

If this is approached with simple LINQ functions it will be too slow, on the order of O(1M * 0.5M) comparisons. There needs to be a faster approach for such large sets.

Speed is important, but incremental improvements that come with a large bloat of code will be of less value. Also, ideally it would work for other data types, including data model objects, but answering this specific question should be enough. Other data types would simply involve some more pre-processing or a slight change to the answer.

Solution Summary

Here's the test code, for results which follow:

using System;
using System.Collections.Generic;
using System.Data;
using System.Diagnostics;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace Test
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("Preparing...");

            List<int> set1 = new List<int>();
            List<int> set2 = new List<int>();

            Random r = new Random();
            var max = 10000;

            for (int i = 0; i < max; i++)
            {
                set1.Add(r.Next(0, max));
                set2.Add(r.Next(0, max/2) * 2);
            }

            Console.WriteLine("First run...");

            Stopwatch sw = new Stopwatch();
            IEnumerable<int> result;
            int count;

            while (true)
            {
                sw.Start();
                result = deduplicationFunction(set1, set2);
                var results1 = result.ToList();
                count = results1.Count;
                sw.Stop();
                Console.WriteLine("Dictionary and Where - Count: {0}, Milliseconds: {1:0.00}.", count, sw.ElapsedTicks / (decimal)10000);
                sw.Reset();


                sw.Start();
                result = deduplicationFunction2(set1, set2);
                var results2 = result.ToList();
                count = results2.Count;
                sw.Stop();
                Console.WriteLine("  HashSet ExceptWith - Count: {0}, Milliseconds: {1:0.00}.", count, sw.ElapsedTicks / (decimal)10000);
                sw.Reset();

                sw.Start();
                result = deduplicationFunction3(set1, set2);
                var results3 = result.ToList();
                count = results3.Count;
                sw.Stop();
                Console.WriteLine("     Sort Dual Index - Count: {0}, Milliseconds: {1:0.00}.", count, sw.ElapsedTicks / (decimal)10000);
                sw.Reset();

                sw.Start();
                result = deduplicationFunction4(set1, set2);
                var results4 = result.ToList();
                count = results4.Count;
                sw.Stop();
                Console.WriteLine("Presorted Dual Index - Count: {0}, Milliseconds: {1:0.00}.", count, sw.ElapsedTicks / (decimal)10000);
                sw.Reset();


                set2.RemoveAt(set2.Count - 1); //Remove the last item, because it was added in the 3rd test

                sw.Start();
                result = deduplicationFunction5(set1, set2);
                var results5 = result.ToList();
                count = results5.Count;
                sw.Stop();
                Console.WriteLine("        Nested Index - Count: {0}, Milliseconds: {1:0.00}.", count, sw.ElapsedTicks / (decimal)10000);
                sw.Reset();


                Console.ReadLine();

                Console.WriteLine("");
                Console.WriteLine("Next Run");
                Console.WriteLine("");
            }

        }


        //Returns an IEnumerable from which more can be chained or simply terminated with ToList by the caller
        static IEnumerable<int> deduplicationFunction(List<int> Set, List<int> Reference)
        {
            //Create a hashset first, which is much more efficient for searching
            var ReferenceHashSet = Reference
                                .Distinct() //Inserting duplicate keys in a dictionary will cause an exception
                                .ToDictionary(x => x, x => x); //If there was a ToHashSet function, that would be nicer

            int throwAway;
            return Set.Distinct().Where(y => ReferenceHashSet.TryGetValue(y, out throwAway) == false);
        }

        //Returns an IEnumerable from which more can be chained or simply terminated with ToList by the caller
        static IEnumerable<int> deduplicationFunction2(List<int> Set, List<int> Reference)
        {
            //Create a hashset first, which is much more efficient for searching
            var SetAsHash = new HashSet<int>();

            Set.ForEach(x =>
            {
                if (SetAsHash.Contains(x))
                    return;

                SetAsHash.Add(x);
            }); // .Net 4.7.2 - ToHashSet will reduce this code to a single line.
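            //On .NET 4.7.2+ (or .NET Core 2.0+), the declaration and loop above
            //could be replaced with: var SetAsHash = Set.ToHashSet();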

            SetAsHash.ExceptWith(Reference); // This is ultimately what we're testing

            return SetAsHash.AsEnumerable();
        }

        static IEnumerable<int> deduplicationFunction3(List<int> Set, List<int> Reference)
        {
            Set.Sort();
            Reference.Sort();
            Reference.Add(Set[Set.Count - 1] + 1); //Ensure the last set item is non-duplicate for an In-built stop clause. This is easy for int list items, just + 1 on the last item.

            return deduplicationFunction4(Set, Reference);
        }

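        //Merge-style dual-index walk: both lists must already be sorted. Advance
        //whichever index points at the smaller value; equal values are duplicates
        //and are skipped, while Set values below the current Reference value are
        //yielded. The sentinel appended in deduplicationFunction3 ensures the
        //final Set items are compared before Reference runs out.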
        static IEnumerable<int> deduplicationFunction4(List<int> Set, List<int> Reference)
        {
            int i1 = 0;
            int i2 = 0;
            int thisValue = Set[i1];
            int thisReference = Reference[i2];
            while (true)
            {
                var difference = thisReference - thisValue;

                if (difference < 0)
                {
                    i2++; //Compare side is too low, there might be an equal value to be found
                    if (i2 == Reference.Count)
                        break;
                    thisReference = Reference[i2];
                    continue;
                }

                if (difference > 0) //thisValue is not in Reference - not a duplicate
                    yield return thisValue;

                GoFurther:
                i1++;
                if (i1 == Set.Count)
                    break;
                if (Set[i1] == thisValue) //Eliminates duplicates
                    goto GoFurther; //I rarely use goto statements, but this is a good situation

                thisValue = Set[i1];
            }
        }

        static IEnumerable<int> deduplicationFunction5(List<int> Set, List<int> Reference)
        {
            var found = false;
            var lastValue = 0;
            var thisValue = 0;
            for (int i = 0; i < Set.Count; i++)
            {
                thisValue = Set[i];

                if (thisValue == lastValue)
                    continue;

                lastValue = thisValue;

                found = false;
                for (int x = 0; x < Reference.Count; x++)
                {
                    if (thisValue != Reference[x])
                        continue;

                    found = true;
                    break;
                }

                if (found)
                    continue;

                yield return thisValue;
            }
        }
    }
}

I'll use this to compare performance of multiple approaches. (I'm particularly interested in Hash-approach vs dual-index-on-sorted-approach at this stage, although ExceptWith enables a terse solution)

Results so far on 10k items in each set:

First Run

  • Dictionary and Where - Count: 3565, Milliseconds: 16.38.
  • HashSet ExceptWith - Count: 3565, Milliseconds: 5.33.
  • Sort Dual Index - Count: 3565, Milliseconds: 6.34.
  • Presorted Dual Index - Count: 3565, Milliseconds: 1.14.
  • Nested Index - Count: 3565, Milliseconds: 964.16.

Good Run

  • Dictionary and Where - Count: 3565, Milliseconds: 1.21.
  • HashSet ExceptWith - Count: 3565, Milliseconds: 0.94.
  • Sort Dual Index - Count: 3565, Milliseconds: 1.09.
  • Presorted Dual Index - Count: 3565, Milliseconds: 0.76.
  • Nested Index - Count: 3565, Milliseconds: 628.60.

Selected answers:

  • @backs' HashSet.ExceptWith approach - is marginally faster with minimal code and uses an interesting function, ExceptWith; however it is weakened by its lack of versatility, and the fact that the interesting function is less commonly known.
  • One of my answers: HashSet > Where(..Contains..) - is only a tiny bit slower than @backs', but uses a code pattern that uses LINQ and is very versatile beyond lists of primitive elements. I believe this is the more common scenario I find myself in when coding, and trust this is the case for many other coders.
  • Special thanks to @TheGeneral for his benchmarking of some of the answers and also some interesting unsafe versions, and for helping to make @Backs' answer more efficient for a follow-up test.

Recommended Answer

HashSet and Where

You must use a HashSet (or Dictionary) for speed:

//Returns an IEnumerable from which more can be chained or simply terminated with ToList by the caller
IEnumerable<int> deduplicationFunction(List<int> Set, List<int> Reference)
{
    //Create a hashset first, which is much more efficient for searching
    var ReferenceHashSet = Reference
                        .Distinct() //Inserting duplicate keys in a dictionary will cause an exception
                        .ToDictionary(x => x, x => x); //If there was a ToHashSet function, that would be nicer

    int throwAway;
    return Set.Where(y => ReferenceHashSet.TryGetValue(y, out throwAway) == false);
}

That's a lambda expression version. It uses a Dictionary, which provides adaptability for varying the value if needed. Literal for-loops could be used, and perhaps some more incremental performance improvement gained, but relative to having two nested loops, this is already an amazing improvement.
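
For illustration only (not one of the benchmarked functions), a literal for-loop version of the same idea might look like the sketch below; deduplicationFunctionForLoop is a hypothetical name:

static List<int> deduplicationFunctionForLoop(List<int> Set, List<int> Reference)
{
    var referenceHash = new HashSet<int>(Reference); //O(1) lookups against the reference
    var seen = new HashSet<int>();                   //Removes duplicates within Set itself
    var output = new List<int>();

    for (int i = 0; i < Set.Count; i++)
    {
        int value = Set[i];
        if (!referenceHash.Contains(value) && seen.Add(value))
            output.Add(value);
    }

    return output;
}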

Learning a few things while looking at other answers, here is a faster implementation:

static IEnumerable<int> deduplicationFunction(List<int> Set, List<int> Reference)
{
    //Create a hashset first, which is much more efficient for searching
    var ReferenceHashSet = new HashSet<int>(Reference);
    return Set.Where(y => ReferenceHashSet.Contains(y) == false).Distinct();
}
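
As a quick usage sketch with the question's data (Enumerable.Range standing in for the literal lists shown earlier):

var setToDeduplicate = Enumerable.Range(1, 1000000).ToList();  //All integer values 1-1M
var referenceSet = Enumerable.Range(0, 500000)
                             .Select(i => i * 2 + 1)
                             .ToList();                        //All odd values 1-999999
var deduplicatedSet = deduplicationFunction(setToDeduplicate, referenceSet).ToList();
//deduplicatedSet now holds the even values 2, 4, 6, ..., 1000000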

Importantly, this approach (while a tiny bit slower than @backs' answer) is still versatile enough to use for database entities, AND other types can easily be used for the duplicate-check field.

Here's an example of how the code is easily adjusted for use with a Person kind of database entity list.

static IEnumerable<Person> deduplicatePeople(List<Person> Set, List<Person> Reference)
{
    //Create a hashset first, which is much more efficient for searching
    var ReferenceHashSet = new HashSet<int>(Reference.Select(p => p.ID));
    return Set.Where(y => ReferenceHashSet.Contains(y.ID) == false)
            .GroupBy(p => p.ID).Select(p => p.First()); //The groupby and select should accomplish DistinctBy(..p.ID)
}
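
For completeness, a minimal assumed shape for the Person entity used above - the real type would come from your data model, so treat this as illustrative only:

class Person
{
    public int ID { get; set; }   //The duplicate-check field used above
    public string Name { get; set; }
}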
