无法遵循最小编辑距离背后的直觉 [英] Unable to follow the intuition behind minimum edit distance

查看:56
本文介绍了无法遵循最小编辑距离背后的直觉的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

所以我刚刚开始阅读MED,但完全无法遵循它. 假设我必须将"WATER"转换为"ATERW" 现在我可以替代:

So i just began reading on MED, but am totally unable to follow it. Suppose I have to convert "WATER" to "ATERW" Now I can substitute :

W->A, A->T, T->E, E->R, R->W

因此总费用= 2 + 2 + 2 + 2 + 2 = 10(所有替换项)

Thus total cost = 2+2+2+2+2 =10 (all substitutions)

但是我知道那是不正确的,应该是这样

However that isnt correct I know, it should be like this

WATER-
-ATERW

因此,这里的总费用= 1 + 1 = 2(删除和插入) 但是,我的问题是,程序如何知道它不应该匹配'W'->'A',而是删除两个字符串中的'W'并匹配'ATER'?直觉/逻辑是如何被灌输到程序中的?

Thus total cost here = 1+1 =2 (deletion and insertion) But then my question is that how does the program know that it shouldnt match 'W'->'A', and rather delete 'W' and match 'ATER' in both the string ?? How is that intuition/logic inculcated into the program ?

推荐答案

首先,您应该检查有关Levenshtein距离的维基百科页面:

First you should check the wikipedia page about Levenshtein distance: https://en.wikipedia.org/wiki/Levenshtein_distance

就像编辑距离一样,除了编辑费用.

It is just like edit distance, except for the edit cost.

如您所见,要解决此问题,您必须构建一个矩阵(这是一种动态编程方法).行代表源词,列代表目标词.

As you can see, to solve this problem you have to build a matrix (it is a dynamic programing approach). Rows represent the source word, columns represent the target word.

首先,用基本情况初始化矩阵:

First you initialize the matrix with base cases:

  • 要将空字符串转换为空字符串,您需要进行0次操作;
  • 要将空字符串转换为A,您需要进行1次操作(插入);
  • 要将空字符串转换为AT,您需要进行2次操作(插入);
  • 要将空字符串转换为ATE,您需要进行3次操作(插入);
  • 依此类推...
  • 要将W转换为空字符串,您需要进行1次操作(删除);
  • 要将WA转换为空字符串,您需要进行2次操作(删除);
  • 要将WAT转换为空字符串,您需要进行3次操作(删除);
  • 依此类推...

现在您有了第一行和第一列.

Now you got your first row and your first column.

  _ A T E R W
_ 0 1 2 3 4 5
W 1 ? ? ? ? ?
A 2 ? ? ? ? ?
T 3 ? ? ? ? ?
E 4 ? ? ? ? ?
R 5 ? ? ? ? ?

想法是逐个单元填充矩阵.要填充的第一个将成为第二行(2,2)的第二列的单元格.它对应于源单词(即WATER)的字母W和目标单词(ATERW)的字母A.

The idea is to fill the matrix cell by cell. The first one to fill is gonna be the cell of the second column of the second row (2,2). It corresponds to the letter W of the source word (ie WATER), and to the letter A of the target word (ATERW).

让我们看一下周围的值,让我们为每个值添加一个编辑成本.然后,我们选择最小值. 对于插入(左单元格)或删除(上单元格),编辑成本始终为1.对于替换(左上单元格),如果字母不同,则编辑成本为1,否则为0.

Let's take a look at the value around and let's add an edit cost to each of these values. We'll then pick the minimum. For an insertion (left cell) or a deletion (upper cell) the edit cost is always 1. For a substitution (upper-left cell) the edit cost is 1 if the letters are different, 0 otherwise.

我们有:

  • 插入(来自单元格2,1):1(单元格的值)+1(编辑成本);
  • 替换(来自单元格1,1):0(单元格的值)+1(不同字母的编辑成本);
  • 删除(来自1,2单元格):1(单元格的值)+1(编辑成本).

现在选择最小值:1(替换).单元格2,2为现在的值1.

Now pick the minimum value: 1 (SUBSTITUTION). Cell 2,2 as now value 1.

  _ A T E R W
_ 0 1 2 3 4 5
W 1 1 ? ? ? ?
A 2 ? ? ? ? ?
T 3 ? ? ? ? ?
E 4 ? ? ? ? ?
R 5 ? ? ? ? ?

现在让我们对单元格2,3进行相同的操作.它对应于源单词(即WATER)的后一个W,并对应于目标单词(ATERW)的字母T.

Now let's do the same for cell 2,3. It corresponds to the later W of the source word (ie WATER), and to the letter T of the target word (ATERW).

我们有:

  • 插入(来自单元格2,2):1(单元格的值)+1(编辑成本);
  • 替换(来自单元1,2):1(单元的值)+1(不同字母的编辑成本);
  • 删除(从单元格1,3开始):2(单元格的值)+1(编辑成本).

现在选择最小值:2(插入或替换).单元格2,3为现在的值2.

Now pick the minimum value: 2 (INSERTION or SUBSTITUTION). Cell 2,3 as now value 2.

这意味着将W转换为AT的成本为2.

It means that the cost to convert W into AT is 2.

  _ A T E R W
_ 0 1 2 3 4 5
W 1 1 2 ? ? ?
A 2 ? ? ? ? ?
T 3 ? ? ? ? ?
E 4 ? ? ? ? ?
R 5 ? ? ? ? ?

如您所见,我们使用先前的计算(单元格2,2中的值)填充当前单元格(2,3).这就是动态编程的思想.

As you can see we use previous computation (value in cell 2,2) to fill the current cell (2,3). That's the idea of dynamic programming.

重复直到矩阵被填满.看起来应该像这样:

Repeat until the matrix is filled. It should looks like this:

  _ A T E R W
_ 0 1 2 3 4 5
W 1 1 2 3 4 4
A 2 1 2 3 4 5
T 3 2 1 2 3 4
E 4 3 2 1 2 3
R 5 4 3 2 1 2

看看最后一个单元格(6,6):值为2.它对应于将"WATER"转换为"ATERW"的成本.

Take a look at the very last cell (6,6): value is 2. It corresponds to the cost to transform 'WATER' into 'ATERW'.

为了恢复执行的编辑操作的顺序,您可以使用反向指针表.对于给定的单元格,每一行都给出了您从中选取最小值的单元格.

So as to recover the sequence of edit operations performed you can use a backpointer table. Each row gives, for a given cell, the cell from which you picked the minimum value.

2,2 1,1
2,3 2,2
2,4 2,3
2,5 2,4
2,6 1,5
3,2 2,1
...
6,5 5,4
6,6 6,5

现在您可以向后解析表并构建路径,即(6,6)->(6,5)->(5,4)...

Now you can parse the table backward and build the path, i.e., (6,6) -> (6,5) -> (5,4)...

这篇关于无法遵循最小编辑距离背后的直觉的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆