使用Levenshtein距离检测字符串相似度(慢) [英] Detecting string similarity using Levenshtein distance (is SLOW)

查看:90
本文介绍了使用Levenshtein距离检测字符串相似度(慢)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

myTable行计数为150000行.我需要比较名称"列中的值.如果我检测到相似度< =相似度级别,并且NUM列值相同,则应将其记录下来.
他们遍历这张表的方式需要永远的时间.还有什么其他解决方案?

myTable row count is 150000 rows. I need to compare the values in NAME column. If I detect similarity <= similarity level and if the NUM column values are the same I should log it.
They way I loop through this table takes forever. What other solutions are there?

DataTable dt1 = new DataTable();
dt1.Load(DbInfo.DataRdr(Conn, "SELECT * FROM myTable"));
for (int i = 0; i < dt1.Rows.Count; i++)
{
 for (int j = 0; i + 1 < dt1.Rows.Count; j++)
 {
   if (dt1.Rows[i]["NUM"].ToString() == dt1.Rows[j]["NUM"].ToString())
   {
      if (dt1.Rows[i]["Name"].ToString().
LevenshteinDistance(dt1.Rows[j]["Name"].ToString()) <= 10)
      {
        Logging.Write(...);
       }
   }
  }
}
    public static int LevenshteinDistance(this string s, string t)
    {
        if (s == null)
            throw new ArgumentNullException("s");
        if (t == null)
            throw new ArgumentNullException("t");
        int n = s.Length;
        int m = t.Length;
        int[,] d = new int[n+1,m+1];
        if (n == 0 || m == 0)
            return Math.Max(m, n);
        for (int i = 0; i <= n; i++)
        {
            d[i, 0] = i;
        }
        for (int i = 0; i < m; i++)
        {
            d[0, i] = i;
        }
        for (int i = 0; i < n; i++)
        {
            for (int j = 0; j < m; j++)
            {
                int cost = (t[j] == s[i]) ? 0 : 1;
                d[i + 1, j + 1] = Math.Min(Math.Min(d[i, j + 1] + 1, d[i + 1, j] + 1), d[i, j] + cost);
            }
        }
        return d[n, m];
    }myTable row count is 150000 rows.
I need to compare the values in NAME column. If I detect similarity <= similarity level and if the NUM column values are the same I should log it. 

They way I loop through this table takes forever. What other solutions are there?

<pre lang="cs">DataTable dt1 = new DataTable();
dt1.Load(DbInfo.DataRdr(Conn, "SELECT * FROM myTable"));

   for (int i = 0; i < dt1.Rows.Count; i++)
   {

     for (int j = 0; j < dt1.Rows.Count; j++)
     {
       if (dt1.Rows[i]["Name"].ToString().LevenshteinDistance(dt1.Rows[j]            ["Name"].ToString()) <= 10)
       {
          if (dt1.Rows[i]["NUM"].ToString() == dt1.Rows[i]["NUM"].ToString())
          {
            Logging.Write(...);
           }
       }
      }

  }

推荐答案

您可以从
开始
You can start with

for (int j = i + 1; j < dt1.Rows.Count; j++)



因为一旦比较了两个单词,倒数始终是相同的数字.



since once two words have been compared, the inverse is always the same number.


假设您只对相似的单词感兴趣,那么您可以修改LevenshteinDistance以便停止比较一旦达到最大相似度级别,就会有2个字符串.

Assuming that you are only interested in similar words, then you can modify LevenshteinDistance so that you will stop comparing 2 strings as soon as the maximum similarity level has been reached.

public static int? LevenshteinDistance(this string s, string t, 
    int similarityLimit)
{
    // Some code...

    // In the inner loop, add:
    if (cost > similarityLevel)
    {
        return null;
    }
}



并在调用代码中:



And in calling code:

int limit = 10;
int? similarity = LevenshteinDistance(row1, row2, limit);
bool isSimilar = similarity.HasValue && similarity.Value <= limit;



假设很多对的成本远高于限制,则将大大减少执行时间.

然后,由于您只想知道NAME NUM 相等时是否相似,因此您可能可以逆转这两个测试,因为第二个测试(字符串比较)应该快得多.因此,假设NUM 并非全部相等,则可以消除对LevenshteinDistance的大多数调用.

因此,您的内部循环可能看起来像这样:



Assuming that a lot of pairs have a cost much higher than the limit, it will greatly reduce the execution time.

Then since you are only interested to know if NAME ar similar when NUM are equal, then you might be able to reverse those 2 tests since the second one (a string comparison) should be much faster. Thus you can eliminate most calls to LevenshteinDistance assuming that NUM are not all equals.

Thus your inner loop might look like this:

// First do the "fast" check
if (dt1.Rows[i]["NUM"].ToString() == dt1.Rows[i]["NUM"].ToString())
{
    // Then do the "slow" check...
    if (dt1.Rows[i]["Name"].ToString().LevenshteinDistance(
        dt1.Rows[j]["Name"].ToString()) <= 10)
    {
        Logging.Write(...);
    }
}



然后,假设您拥有大于等于的NUM的百分之几,则可以按项目的NUM对其进行排序.一种方法是使用如下查询对阅读器返回的数据进行排序:



Then assuming that you have more than a few percents of NUM that are equals, you can sort items by their NUM. One way to do it would be to sort the data returned by the reader using a query like:

SELECT * FROM myTable ORDER BY NUM

如果数据库已建立索引,则它可能是较快的选项.否则,可以使用Dictionary<string, List<int>>之类的方法完成,其中将每个不同的NUM添加到字典中,并且列表中填充表中的索引.

If the database is indexed, it might be the faster option. Otherwise, it might be done by using something like Dictionary<string, List<int>> where each distinct NUM are added to the dictionary and the list is filled with the indexes in the table.

var numDictionary = new Dictionary<string, List<int>>;
for (int i = 0; i < dt1.Rows.Count; i++)
{
    var rowNum = dt1.Rows[i]["NUM"].ToString();

    List<int> listIndexes;
    if (!numDictionary.TryGetValue(rowNum, out listIndexes))
    {
        list = new List<int>;
        numDictionary.Add(rowNum, list);
    }
    list.Add(i);
}



然后,您可以像原始算法一样使用嵌套循环为字典中的每个列表计算LevenshteinDistance.

请注意,内部循环可以在外部循环的索引处停止.也就是说,在j等于i之前停止.

如果顺序无关紧要,则可以直接在日志中写入项目.如果不是,请在某些结构中填充您需要输出的信息,并在输出结果之前对该结构进行排序.



Then you can compute the LevenshteinDistance for each list in the dictionary using nested loop as you have done in the original algorithm.

Note that the internal loop can stop at the index of the external loop. That is, stop just before j is equal to i.

If the order does not matters, then you can write items in the log directly. If not, fill some structures with the information you wil need to output and sort that structure before outputting the results.


这篇关于使用Levenshtein距离检测字符串相似度(慢)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆