计算两个文本块之间的百分比差异的算法 [英] Algorithm to calculate percent difference between two blobs of text
问题描述
我一直在研究如何找到有效的解决方案。我研究了差异引擎(Google的diff-match-patch,python的diff)和一些最长的通用链算法。
I've been researching on finding an efficient solution to this. I've looked into diffing engines (google's diff-match-patch, python's diff) and some some longest common chain algorithms.
我希望能给大家一些有关如何解决此问题的建议。
I was hoping on getting you guys suggestions on how to solve this issue. Any algorithm or library in particular you would like to recommend?
推荐答案
我不知道什么是最长的[[链? substring?]]与百分比差异有关,尤其是在评论中看到您期望两个字符串中间的一个字符之间的百分比差异很小(因此,它们最长的公共子字符串大约是字符串的一半)。字符串的长度)。
I don't know what "longest common [[chain? substring?]]" has to do with "percent difference", especially after seeing in a comment that you expect a very small % difference between two strings that differ by one character in the middle (so their longest common substring is about one half of the strings' length).
忽略最长共同点的陌生性,并将百分比差异定义为字符串之间的编辑距离除以最大长度(乘以100)当然;-),怎么办?
Ignoring the "longest common" strangeness, and defining "percent difference" as the edit distance between the strings divided by the max length (times 100 of course;-), what about:
def levenshtein_distance(first, second):
"""Find the Levenshtein distance between two strings."""
if len(first) > len(second):
first, second = second, first
if len(second) == 0:
return len(first)
first_length = len(first) + 1
second_length = len(second) + 1
distance_matrix = [[0] * second_length for x in range(first_length)]
for i in range(first_length):
distance_matrix[i][0] = i
for j in range(second_length):
distance_matrix[0][j]=j
for i in xrange(1, first_length):
for j in range(1, second_length):
deletion = distance_matrix[i-1][j] + 1
insertion = distance_matrix[i][j-1] + 1
substitution = distance_matrix[i-1][j-1]
if first[i-1] != second[j-1]:
substitution += 1
distance_matrix[i][j] = min(insertion, deletion, substitution)
return distance_matrix[first_length-1][second_length-1]
def percent_diff(first, second):
return 100*levenshtein_distance(a, b) / float(max(len(a), len(b)))
a = "the quick brown fox"
b = "the quick vrown fox"
print '%.2f' % percent_diff(a, b)
Levenshtein函数来自< a href = http://www.korokithakis.net/node/87 rel = noreferrer> Stavros的博客。在这种情况下,结果将是5.26(差异百分比)。
The Levenshtein function is from Stavros' blog. The result in this case would be 5.26 (percent difference).
这篇关于计算两个文本块之间的百分比差异的算法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!