测量两个字符串之间相似度的有效方法是什么? (Levenshtein距离使堆栈太深) [英] What is an efficient way to measure similarity between two strings? (Levenshtein Distance makes stack too deep)

查看:120
本文介绍了测量两个字符串之间相似度的有效方法是什么? (Levenshtein距离使堆栈太深)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

因此,我从以下内容开始: http://en.wikibooks.org / wiki / Algorithm_Implementation / Strings / Levenshtein_distance#Ruby

So, I started with this: http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#Ruby

对于非常小的字符串,这非常有用。但是,我的字符串长度可能超过10,000个字符-并且由于Levenshtein距离是递归的,因此在Ruby on Rails应用程序中会导致堆栈太深的错误。

Which works great for really small strings. But, my strings can be upwards of 10,000 characters long -- and since the Levenshtein Distance is recursive, this causes a stack too deep error in my Ruby on Rails app.

那么,还有另一种可能不太耗费大量堆栈资源的方法来查找两个大字符串之间的相似性吗?

So, is there another, maybe less stack intensive method of finding the similarity between two large strings?

或者,我需要一种使堆栈更大的方法尺寸。 (不过,我认为这不是解决问题的正确方法)

Alternatively, I'd need a way to make the stack have much larger size. (I don't think this is the right way to solve the problem, though)

推荐答案

请考虑使用非递归版本避免过多的调用堆栈开销。 Seth Schroeder Ruby中的迭代实现,它使用多维数组代替;它似乎与Levenshtein距离的动态编程方法有关(如Wikipedia文章中伪代码中所述) )。赛斯的红宝石代码如下:

Consider a non-recursive version to avoid the excessive call stack overhead. Seth Schroeder has an iterative implementation in Ruby which uses multi-dimensional arrays instead; it appears to be related to the dynamic programming approach for Levenshtein distance (as outlined in the pseudocode for the Wikipedia article). Seth's ruby code is reproduced below:

def levenshtein(s1, s2)
  d = {}
  (0..s1.size).each do |row|
    d[[row, 0]] = row
  end
  (0..s2.size).each do |col|
    d[[0, col]] = col
    end
  (1..s1.size).each do |i|
    (1..s2.size).each do |j|
      cost = 0
      if (s1[i-1] != s2[j-1])
        cost = 1
      end
      d[[i, j]] = [d[[i - 1, j]] + 1,
                   d[[i, j - 1]] + 1,
                   d[[i - 1, j - 1]] + cost
                  ].min
      next unless @@damerau
      if (i > 1 and j > 1 and s1[i-1] == s2[j-2] and s1[i-2] == s2[j-1])
        d[[i, j]] = [d[[i,j]],
                     d[[i-2, j-2]] + cost
                    ].min
      end
    end
  end
  return d[[s1.size, s2.size]]
end

这篇关于测量两个字符串之间相似度的有效方法是什么? (Levenshtein距离使堆栈太深)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆