添加例外莱文斯坦 - 距离样算法 [英] Adding exceptions to Levenshtein-Distance-like algorithm

查看:213
本文介绍了添加例外莱文斯坦 - 距离样算法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想计算出有类似的多达6个变量序列。目前我使用的是集合计数器返回不同的变量作为我的编辑距离的频​​率。

I'm trying to compute how similar a sequence of up to 6 variables are. Currently I'm using a Collections Counter to return the frequency of different variables as my edit-distance.

默认情况下,在编辑一个变量的距离(添加/分/变更)为1或0。我想改变的距离取决于变量,我​​设置了该变量。什么价值

所以我可以说某些变量类似于其他变量,并为他们多么相似的值。 我还想说的某些变量值得比平时少跌多的距离。

So I can say certain variables are similar to other variables, and provide a value for how similar they are. I also want to say certain variables are worth less or more distance than usual.

下面是我的previous岗位作为背景:<一href="http://stackoverflow.com/questions/32456943/modify-levenshtein-distance-to-ignore-order">Modify莱文斯坦 - 距离忽视订单

Here is my previous post as context: Modify Levenshtein-Distance to ignore order

例如:

# 'c' and 'k' are quite similar, so their distance from eachother is 0.5 instead of 1
>>> groups = {['c','k'] : 0.5}

# the letter 'e' is less significant, and 'x' is very significant
>>> exceptions = {'e': 0.3, 'x': 1.5}

>>> distance('woke', 'woc')
0.8

说明:

woke
k -> c = 1
woce
-e = 1
woc
Distance = 2

# With exceptions:
woke
k -> c = 0.5
woce
-e = 0.3
woc
Distance = 0.8

我怎么能做到这一点?这将是可以实现这个反算法?

目前的code(谢谢你,大卫Eisenstat)

Current code (thank you David Eisenstat)

def distance(s1, s2):
    cnt = collections.Counter()
    for c in s1:
        cnt[c] += 1
    for c in s2:
        cnt[c] -= 1
    return sum(abs(diff) for diff in cnt.values()) // 2 + \
        (abs(sum(cnt.values())) + 1) // 2

推荐答案

我结束了将过程分成几个阶段,然后通过每个阶段的串进行迭代。我不知道它是有效的,因为它可以,但它的工作原理。

I ended up dividing the process into a few stages then iterating through the strings for each stage. I'm not sure if its as efficient as it could be but it works.

总结我试图实现(与编辑距离算法)

Summing up what I was trying to achieve (in relation to Edit-distance algorithms)

  • 从一个字母到另一个距离为1。变化的J - &GT; K = 1
  • 0是没有任何区别的。例如变化的J - &GT; J = 0
  • 类似的信件可以值得小于1(由我指定)如 C K 音同,因此 C,K = 0.5 变动c - &GT; K = 0.5
  • 在某些字母可能价值更多或更少(由我指定)如 X 是罕见的,所以我希望它有更多的重量, X = 1.4 x更改 - &GT; K = 1.4
  • Distance from one letter to another is 1. change j -> k = 1
  • 0 being no difference at all. e.g. change j -> j = 0
  • Similar letters can be worth less than 1 (specified by me) e.g. c and k sound the same, therefore c, k = 0.5, change c -> k = 0.5
  • Certain letters could be worth more or less (specified by me) e.g. x is uncommon so I want it to have more weight, x = 1.4, change x -> k = 1.4

创建2字典,1 类似于书信,1 的异常

Created 2 dictionaries, 1 for similar letters, 1 for exceptions

  1. 填充计数器 - 遍历两个字符串
  2. 匹配相似的项 - ,迭代字符串1,如果在类似快译通,迭代字符串2,如果在同类词典
  3. 更新计数 - 删除类似的项目,
  4. 查找距离 - 加起来绝对频率,占区别在字符串的长度
  5. 包含例外距离 - 异常值的基础上的字母频率帐户
  1. Populate Counter - Iterate through both strings
  2. Match similar items - Iterate string1, if in similar dict, iterate string2, if in similar dict
  3. Update Counter - remove similar items,
  4. Find Distance - add up absolute frequencies, account for difference in string length
  5. Include exceptions distance - Account for exception values based on frequency of letters

这篇关于添加例外莱文斯坦 - 距离样算法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆