Optimize matching elements from two large data sets using Levenshtein distance (comparing each element to each other element)

Problem description

I am currently working on a problem to find the best matches of data in a list, namely "ListA", from another list called "ListB". Whenever I find a match of an element of "ListA" with any element in "ListB" with a confidence of 70% or greater, I add the matched string from ListB and the string from ListA to a tuple, which I then save in a database.

The Levenshtein algorithm gives me a number which I compare with my threshold of 70, and if the value returned is equal to or greater than the 70 percent threshold, I pair it with the original string element of "ListA".
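Concretely, the thresholding step looks like this (sketched in Python for illustration; `is_match` is a hypothetical helper, not part of the original code):

```python
def is_match(distance: int, a: str, b: str, threshold: float = 0.70) -> bool:
    """Turn an edit distance into a similarity ratio and apply the 70% cutoff."""
    length = max(len(a), len(b))
    similarity = 1.0 - distance / length
    return similarity >= threshold

# "kitten" vs "sitting": distance 3, max length 7 -> similarity ~0.57, no match
# "apple"  vs "appls":   distance 1, max length 5 -> similarity 0.80, match
```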

如果"ListA"和"ListB"中的记录在数千个值以内,并且如果我将记录增加到一百万个,则为每个过程计算一个距离大约需要一个小时,因此我为此过程编写的代码可以正常工作列表A的元素.

The code that I have written for this process works fine when the records in "ListA" and "ListB" number in the thousands, but if I increase the records to a million it takes about an hour to calculate the distances for each element of ListA.

I need to optimize the process for huge data sets. Please advise where I need to make improvements.

My code for the process so far looks like this:

  public static async Task PerformFuzzyMatch()
  {
      var exceptions = new ConcurrentQueue<Exception>();

      // Fetch the ListA row count & ListB from the SQL tables
      var ListACount = await FuzzyMatchRepository.FetchListACount();
      var ListB = await FuzzyMatchRepository.FetchListBAsync();

      // Split the ListA data into smaller chunks and loop through those chunks
      var splitGroupSize = 1000;
      var sourceDataBatchesCount = ListACount / splitGroupSize;

      // Loop through the smaller chunks of ListA
      for (int b = 0; b < sourceDataBatchesCount; b++)
      {
          // ConcurrentBag because Parallel.For adds to it from multiple threads;
          // a plain List<T> is not safe for concurrent writes
          var currentBatchMatchedWords = new ConcurrentBag<Tuple<string, string, double>>();
          int skipRowCount = b * splitGroupSize;
          int takeRowCount = splitGroupSize;

          // Get a chunk of ListA according to skipRowCount and takeRowCount
          var currentSourceDataBatch = FuzzyMatchRepository.FetchSourceDataBatch(skipRowCount, takeRowCount);

          // Loop through ListB and, in parallel, calculate the distance between
          // each ListB element and the current chunk of ListA
          for (int i = 0; i < ListB.Count; i++)
          {
              Parallel.For(
                  0,
                  currentSourceDataBatch.Count,
                  new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount * 10 },
                  cntr =>
                  {
                      try
                      {
                          // Levenshtein distance between one element of the ListA chunk and one element of ListB
                          int leven = LevenshteinDistance(currentSourceDataBatch[cntr], ListB[i]);
                          int length = Math.Max(currentSourceDataBatch[cntr].Length, ListB[i].Length);
                          double similarity = 1.0 - (double)leven / length;
                          if ((similarity * 100) >= 70)
                          {
                              currentBatchMatchedWords.Add(Tuple.Create(currentSourceDataBatch[cntr], ListB[i], similarity));
                          }
                      }
                      catch (Exception ex)
                      {
                          exceptions.Enqueue(ex);
                      }
                  });
          }
      }
  }

And the algorithm it calls to calculate the distance is:

public static int LevenshteinDistance(this string input, string comparedTo, bool caseSensitive = false)
{
    if (string.IsNullOrWhiteSpace(input) || string.IsNullOrWhiteSpace(comparedTo))
    {
        return -1;
    }

    if (!caseSensitive)
    {
        input = Common.Hashing.InvariantUpperCaseStringExtensions.ToUpperInvariant(input);
        comparedTo = Common.Hashing.InvariantUpperCaseStringExtensions.ToUpperInvariant(comparedTo);
    }

    int inputLen = input.Length;
    int comparedToLen = comparedTo.Length;

    // (inputLen + 1) x (comparedToLen + 1): row/column 0 represent the empty prefix
    int[,] matrix = new int[inputLen + 1, comparedToLen + 1];

    //initialize
    for (var i = 0; i <= inputLen; i++)
    {
        matrix[i, 0] = i;
    }
    for (var j = 0; j <= comparedToLen; j++)
    {
        matrix[0, j] = j;
    }

    //analyze
    for (var i = 1; i <= inputLen; i++)
    {
        ushort si = input[i - 1];
        for (var j = 1; j <= comparedToLen; j++)
        {
            ushort tj = comparedTo[j - 1];
            int cost = (si == tj) ? 0 : 1;

            int above = matrix[i - 1, j];
            int left = matrix[i, j - 1];
            int diag = matrix[i - 1, j - 1];
            int cell = FindMinimumOptimized(above + 1, left + 1, diag + cost);

            //transposition
            if (i > 1 && j > 1)
            {
                int trans = matrix[i - 2, j - 2] + 1;
                if (input[i - 2] != comparedTo[j - 1])
                {
                    trans++;
                }
                if (input[i - 1] != comparedTo[j - 2])
                {
                    trans++;
                }
                if (cell > trans)
                {
                    cell = trans;
                }
            }
            matrix[i, j] = cell;
        }
    }
    return matrix[inputLen, comparedToLen];
}

And the FindMinimumOptimized helper is:

public static int FindMinimumOptimized(int a, int b, int c)
{
    return Math.Min(a, Math.Min(b, c));
}
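As a cross-check for the C# implementation, here is a compact Python reference for the same restricted Damerau-Levenshtein recurrence (insert/delete/substitute plus adjacent transpositions), using the standard (n+1) × (m+1) table:

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein: edit operations plus adjacent transpositions."""
    n, m = len(a), len(b)
    # (n + 1) x (m + 1) matrix -- the extra row/column represent the empty prefix
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            # transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[n][m]

# damerau_levenshtein("kitten", "sitting") -> 3
# damerau_levenshtein("ca", "ac")          -> 1 (one transposition)
```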

Recommended answer

This is a computation that inherently has quadratic cost O(N^2). This is always going to scale very badly as the number of items increases.
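The scale of the problem is easy to see from the raw pair counts (assuming, as the question does, that both lists grow together):

```python
# Every element of ListA is compared against every element of ListB.
pairs_small = 1_000 * 1_000          # thousands of records: 1e6 comparisons
pairs_large = 1_000_000 * 1_000_000  # a million records:    1e12 comparisons

print(pairs_large // pairs_small)    # prints 1000000: a million-fold more work
```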

You can parallelize this but that's just a constant factor gain in the time it takes to execute.

Can you find some other criterion to match against that is based on equality? In that case you could use a hash-based algorithm to very quickly create candidates for checking. For example, assuming that you are matching text articles and you expect almost all Levenshtein matches to occur in articles written on the same calendar day, you could match by date first (at O(N) cost) and only then quadratically compare all items of that day. You could also compare items for the previous and next day to allow for some slack.
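This "blocking" idea can be sketched as follows (Python for illustration; the blocking key here, first letter plus a coarse length bucket, is a made-up example; with real data you would use an equality-based field such as the date mentioned above):

```python
from collections import defaultdict

# Hypothetical blocking key: first letter plus a coarse length bucket.
def block_key(s: str):
    return (s[0].upper(), len(s) // 5)

def build_blocks(items):
    """One O(N) pass: group items by their blocking key."""
    blocks = defaultdict(list)
    for item in items:
        blocks[block_key(item)].append(item)
    return blocks

list_a = ["apple pie", "appel pie", "banana split", "banan split"]
list_b = ["apple pie", "banana split"]

blocks_b = build_blocks(list_b)
# Only pairs that share a key become candidates for the expensive distance check.
candidates = [(a, b) for a in list_a for b in blocks_b.get(block_key(a), [])]
# 4 candidate pairs instead of the full 4 x 2 = 8 cross product
```

The quadratic comparison then runs only within each block, which is where the real savings come from when the blocks are small relative to the full lists.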

If you can't do that you have to accept quadratic scaling.

A good code pattern is this:

var pairs = (from a in listA
             from b in listB //cross product
             select new { a, b });

var matches =
    pairs
    .AsParallel()
    .Where(pair => IsLevenshteinMatch(pair))
    .ToList();

You can throw out all of that complex code you have written for this. When working with concurrency or parallelism it often pays to ponder the best design a bit. Often, there are highly compact solutions for common problems.
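The same cross-product-then-filter shape exists outside LINQ too. A Python sketch (here `difflib.SequenceMatcher.ratio` stands in for the Levenshtein-based check; it is a different similarity metric, used only to keep the example self-contained):

```python
from itertools import product
from difflib import SequenceMatcher

def is_fuzzy_match(pair, threshold=0.70):
    a, b = pair
    return SequenceMatcher(None, a, b).ratio() >= threshold

list_a = ["apple pie", "banana split"]
list_b = ["appel pie", "cherry tart"]

# Cross product, then filter -- the shape of the LINQ query above.
matches = [p for p in product(list_a, list_b) if is_fuzzy_match(p)]
```

The filtering step is trivially parallelizable (e.g. with a process pool) because each pair is checked independently, which is exactly what `AsParallel()` exploits in the C# version.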
