编辑距离算法说明 [英] Edit distance algorithm explanation

查看:110
本文介绍了编辑距离算法说明的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

根据维基百科,计算两个字符串a和b之间的Levenshtein距离的递归公式的定义如下:

According to wikipedia, the definition of the recursive formula which calculates the Levenshtein distance between two strings a and b is the following:

我不明白为什么我们不考虑删除 a [j] ,或者我们插入 b [i] 。另外,如果我错了,请纠正我,插入的情况与删除的情况不一样吗?我的意思是,我们可以从第二个字符串中插入相同的字符,而不是从一个字符串中删除字符。那么,为什么不将插入/删除操作合并到一个成本等于 min {cost_insert,cost_delete} 的操作中呢?

I don't understand why we don't take into consideration the cases in which we delete a[j], or we insert b[i]. Also, correct me if I am wrong, isn't the case of insertion the same as the case of the deletion? I mean, instead of deleting a character from one string, we could insert the same character in the second string, and the opposite. So why not merge the insert/delete operations into one operation with cost equal to min{cost_insert, cost_delete}?

推荐答案

这没有完成,因为不允许您同时编辑两个字符串。 (从维基百科开始)编辑距离的定义是:

This is not done, because you are not allowed to edit both strings. The definition of the edit distance (from wikipedia) is:


将a转换为b的最小权重的一系列编辑操作。

the minimum-weight series of edit operations that transforms a into b.

因此,您正在专门寻找要在字符串 a上执行的一系列操作(的权重)将其转换为字符串 b

So you are specifically looking for (the weight of) a sequence of operations to execute on the string a to transform it into string b.

此外,编辑距离不一定是对称的。如果插入和删除的成本相同,则距离是对称的: d(a,b)= d(b,a)

Also, the edit distance is not necessarily symmetric. If your costs for inserts and deletions are identical the distance is symmetric: d(a,b) = d(b,a)

以Wikipedia为例,但费用不同:

Consider the wikipedia example but with different costs:


  • 插入费用:w_ins = 1

  • 删除成本:w_del = 2

  • 替换成本:w_sub = 1

小猫和坐着的距离仍然是3,

kitten -> sitten  (substitution k->s, cost 1)
sitten -> sittin  (substitution e->i, cost 1)
sittin -> sitting (insertion of g, cost 1)
=> d(kitten, sitting) = 3

坐着的距离和小猫不是:

sitting -> kitting  (substitution s->k, cost 1)
kitting -> kitteng  (substitution i->e, cost 1)
kitteng -> kitten (deletion of g, cost 2)
=> d(kitten, sitting) = 4

您看到 d(小猫,坐在)!= d(小猫,坐在)

另一方面,如果您使用对称成本,则为Levenshtein距离(是一个包含单位成本的编​​辑距离),则可以假设 d(a,b)= d(b,a)成立。然后,通过考虑反例也不会赢得任何收益。您丢失的是在哪个字符串中替换了哪个字符的信息,这使得事后提取操作序列变得更加困难。

On the other hand if you do use symmetric costs, as the Levenshtein distance (which is an edit distance with unit costs) does, you can assume that d(a,b) = d(b,a) holds. Then you do not win anything by also considering the inverse cases. What you lose is the information which character has been replaced in which string, what makes it harder to extract the sequence of operations afterwards.

Wagner-Fisher算法(在问题中显示)可以通过回溯路径从DP矩阵中提取该算法成本最低。考虑在 to foo 之间的两个编辑矩阵,其单位成本为:

The Wagner-Fisher algorithm which you are showing in your question can extract this from the DP matrix by backtracking the path with minimal costs. Consider this two edit matrices between to and foo with unit costs:

  t o       f o o
f 1 2     t 1 2 3
o 2 1     o 2 1 2
o 3 2

请注意,如果将 d(to,foo)的矩阵转置,则会得到 d(foo,to )。注意,由此,在第一矩阵中的插入变为在第二矩阵中的删除,反之亦然。因此,这就是您正在寻找的对称性再次出现的地方。

Note that if you transpose the matrix for d(to, foo) you get the matrix for d(foo, to). Note that by this, an insertion in the first matrix becomes a deletion in the second matrix and vice versa. So this is where this symmetry you are looking for is coming up again.

我希望这会有所帮助:)

I hope this helps :)

这篇关于编辑距离算法说明的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆