Is there an edit distance algorithm that takes "chunk transposition" into account?

Question

I put "chunk transposition" in quotes because I don't know whether or what the technical term should be. Just knowing if there is a technical term for the process would be very helpful.

The Wikipedia article on edit distance gives some good background on the concept.

By taking "chunk transposition" into account, I mean that

Turing, Alan.

should match

Alan Turing

more closely than it matches

Turing Machine

I.e. the distance calculation should detect when substrings of the text have simply been moved within the text. This is not the case with the common Levenshtein distance formula.
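To make the problem concrete, here is a minimal sketch of the standard Levenshtein distance (insertions, deletions and substitutions, each costing 1) applied to the three strings above. The function name `levenshtein` is my own choice for illustration, not taken from any particular library.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance with unit costs."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


query = "Turing, Alan."
for candidate in ("Alan Turing", "Turing Machine"):
    print(candidate, levenshtein(query, candidate))
```

With this measure, "Turing Machine" comes out closer to "Turing, Alan." than "Alan Turing" does, because the shared prefix "Turing" lines up position by position while the moved chunk does not.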

The strings will be a few hundred characters long at most -- they are author names or lists of author names which could be in a variety of formats. I'm not doing DNA sequencing (though I suspect people that do will know a bit about this subject).

Recommended answer

Have a look at the Jaccard distance metric (JDM). It's an oldie-but-goodie that's pretty adept at token-level discrepancies such as last name first versus first name first. For two strings, the calculation is simply the number of unique characters the two strings have in common divided by the total number of unique characters between them (in other words, the intersection over the union). For example, given the two arguments "JEFFKTYZZER" and "TYZZERJEFF", the numerator is 7 and the denominator is 8, yielding a value of 0.875. Strictly speaking, that ratio is the Jaccard similarity; the Jaccard distance is 1 minus it, which is 0.125 here. My choice of characters as tokens is not the only one available, BTW: n-grams are often used as well.
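A minimal sketch of that calculation, assuming plain Python sets. The helper names `jaccard_similarity`, `char_ngrams` and `word_tokens` are mine, and the word tokenizer (lowercasing and splitting on non-word characters) is just one possible choice for author names, not something prescribed above.

```python
import re


def jaccard_similarity(a, b) -> float:
    """Size of the intersection divided by size of the union of two token collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty inputs count as identical
    return len(a & b) / len(a | b)


def char_ngrams(s: str, n: int) -> set:
    """All contiguous character n-grams of s."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def word_tokens(s: str) -> set:
    """Lowercased word tokens, ignoring punctuation (an assumption for names)."""
    return set(re.findall(r"\w+", s.lower()))


# Character tokens, as in the example above: 7 shared / 8 total unique characters.
print(jaccard_similarity("JEFFKTYZZER", "TYZZERJEFF"))  # 0.875

# Character bigrams as tokens instead of single characters.
print(jaccard_similarity(char_ngrams("JEFFKTYZZER", 2), char_ngrams("TYZZERJEFF", 2)))

# Word tokens applied to the strings from the question.
query = word_tokens("Turing, Alan.")
print(jaccard_similarity(query, word_tokens("Alan Turing")))
print(jaccard_similarity(query, word_tokens("Turing Machine")))
```

With word tokens, "Turing, Alan." and "Alan Turing" share both tokens and score 1.0, while "Turing Machine" shares one of three and scores about 0.33, which is the ordering the question asks for. Because sets ignore order and repetition entirely, this measure cannot tell genuinely reordered text from an anagram, so larger n-grams or word tokens are usually a safer choice than single characters.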
