测量两个字符串之间相似性的有效方法是什么?(Levenshtein 距离使堆栈太深) [英] What is an efficient way to measure similarity between two strings? (Levenshtein Distance makes stack too deep)

查看:26
本文介绍了测量两个字符串之间相似性的有效方法是什么?(Levenshtein 距离使堆栈太深)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

所以,我从这个开始:http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#Ruby

这对于非常小的字符串非常有用.但是,我的字符串长度可能超过 10,000 个字符 - 由于 Levenshtein 距离是递归的,这会在我的 Ruby on Rails 应用程序中导致堆栈过深错误.

Which works great for really small strings. But, my strings can be upwards of 10,000 characters long -- and since the Levenshtein Distance is recursive, this causes a stack too deep error in my Ruby on Rails app.

那么,是否还有另一种可能不需要堆栈密集型的方法来查找两个大字符串之间的相似性?

So, is there another, maybe less stack intensive method of finding the similarity between two large strings?

或者,我需要一种方法来使堆栈具有更大的大小.(不过,我认为这不是解决问题的正确方法)

Alternatively, I'd need a way to make the stack have much larger size. (I don't think this is the right way to solve the problem, though)

推荐答案

考虑使用非递归版本以避免过多的调用堆栈开销.Seth Schroeder 有一个 Ruby 中的迭代实现,它改用多维数组;它似乎与 Levenshtein 距离的动态编程方法有关(如 维基百科文章的伪代码中所述).Seth 的 ruby​​ 代码转载如下:

Consider a non-recursive version to avoid the excessive call stack overhead. Seth Schroeder has an iterative implementation in Ruby which uses multi-dimensional arrays instead; it appears to be related to the dynamic programming approach for Levenshtein distance (as outlined in the pseudocode for the Wikipedia article). Seth's ruby code is reproduced below:

def levenshtein(s1, s2)
  d = {}
  (0..s1.size).each do |row|
    d[[row, 0]] = row
  end
  (0..s2.size).each do |col|
    d[[0, col]] = col
    end
  (1..s1.size).each do |i|
    (1..s2.size).each do |j|
      cost = 0
      if (s1[i-1] != s2[j-1])
        cost = 1
      end
      d[[i, j]] = [d[[i - 1, j]] + 1,
                   d[[i, j - 1]] + 1,
                   d[[i - 1, j - 1]] + cost
                  ].min
      next unless @@damerau
      if (i > 1 and j > 1 and s1[i-1] == s2[j-2] and s1[i-2] == s2[j-1])
        d[[i, j]] = [d[[i,j]],
                     d[[i-2, j-2]] + cost
                    ].min
      end
    end
  end
  return d[[s1.size, s2.size]]
end

这篇关于测量两个字符串之间相似性的有效方法是什么?(Levenshtein 距离使堆栈太深)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆