添加例外莱文斯坦 - 距离样算法 [英] Adding exceptions to Levenshtein-Distance-like algorithm
问题描述
我想计算出有类似的多达6个变量序列。目前我使用的是集合计数器返回不同的变量作为我的编辑距离的频率。
I'm trying to compute how similar a sequence of up to 6 variables are. Currently I'm using a Collections Counter to return the frequency of different variables as my edit-distance.
默认情况下,在编辑一个变量的距离(添加/分/变更)为1或0。我想改变的距离取决于变量,我设置了该变量。什么价值
所以我可以说某些变量类似于其他变量,并为他们多么相似的值。 我还想说的某些变量值得比平时少跌多的距离。
So I can say certain variables are similar to other variables, and provide a value for how similar they are. I also want to say certain variables are worth less or more distance than usual.
下面是我的previous岗位作为背景:<一href="http://stackoverflow.com/questions/32456943/modify-levenshtein-distance-to-ignore-order">Modify莱文斯坦 - 距离忽视订单
Here is my previous post as context: Modify Levenshtein-Distance to ignore order
例如:
# 'c' and 'k' are quite similar, so their distance from eachother is 0.5 instead of 1
>>> groups = {['c','k'] : 0.5}
# the letter 'e' is less significant, and 'x' is very significant
>>> exceptions = {'e': 0.3, 'x': 1.5}
>>> distance('woke', 'woc')
0.8
说明:
woke
k -> c = 1
woce
-e = 1
woc
Distance = 2
# With exceptions:
woke
k -> c = 0.5
woce
-e = 0.3
woc
Distance = 0.8
我怎么能做到这一点?这将是可以实现这个反算法?
目前的code(谢谢你,大卫Eisenstat)
Current code (thank you David Eisenstat)
def distance(s1, s2):
cnt = collections.Counter()
for c in s1:
cnt[c] += 1
for c in s2:
cnt[c] -= 1
return sum(abs(diff) for diff in cnt.values()) // 2 + \
(abs(sum(cnt.values())) + 1) // 2
推荐答案
我结束了将过程分成几个阶段,然后通过每个阶段的串进行迭代。我不知道它是有效的,因为它可以,但它的工作原理。
I ended up dividing the process into a few stages then iterating through the strings for each stage. I'm not sure if its as efficient as it could be but it works.
总结我试图实现(与编辑距离算法)
Summing up what I was trying to achieve (in relation to Edit-distance algorithms)
- 从一个字母到另一个距离为1。
变化的J - &GT; K = 1
- 0是没有任何区别的。例如
变化的J - &GT; J = 0
- 类似的信件可以值得小于1(由我指定)如
C
和K
音同,因此C,K = 0.5
,变动c - &GT; K = 0.5
- 在某些字母可能价值更多或更少(由我指定)如
X
是罕见的,所以我希望它有更多的重量,X = 1.4
,x更改 - &GT; K = 1.4
- Distance from one letter to another is 1.
change j -> k = 1
- 0 being no difference at all. e.g.
change j -> j = 0
- Similar letters can be worth less than 1 (specified by me) e.g.
c
andk
sound the same, thereforec, k = 0.5
,change c -> k = 0.5
- Certain letters could be worth more or less (specified by me) e.g.
x
is uncommon so I want it to have more weight,x = 1.4
,change x -> k = 1.4
创建2字典,1 类似于书信,1 的异常
Created 2 dictionaries, 1 for similar letters, 1 for exceptions
- 填充计数器 - 遍历两个字符串
- 匹配相似的项 - ,迭代字符串1,如果在类似快译通,迭代字符串2,如果在同类词典
- 更新计数 - 删除类似的项目,
- 查找距离 - 加起来绝对频率,占区别在字符串的长度
- 包含例外距离 - 异常值的基础上的字母频率帐户
- Populate Counter - Iterate through both strings
- Match similar items - Iterate string1, if in similar dict, iterate string2, if in similar dict
- Update Counter - remove similar items,
- Find Distance - add up absolute frequencies, account for difference in string length
- Include exceptions distance - Account for exception values based on frequency of letters
这篇关于添加例外莱文斯坦 - 距离样算法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!