String similarity -> Levenshtein distance
Problem description
I'm using the Levenshtein algorithm to measure the similarity between two strings. This is a very important part of the program I'm writing, so it needs to be effective. The problem is that the algorithm does not treat the following pair as similar:
CONAIR
AIRCON
The algorithm gives a distance of 6. Since the longer word has 6 letters (you normalize by the word with the most letters), the difference is 100%, so the similarity is 0%.
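To make the numbers above concrete, here is a minimal sketch of the plain Levenshtein distance together with the length-normalized similarity described in the question. The function names `levenshtein` and `similarity` are illustrative and not taken from any particular library.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance where insert, delete and substitute each cost 1."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1], normalized by the longer string's length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein("CONAIR", "AIRCON"))  # 6
print(similarity("CONAIR", "AIRCON"))   # 0.0
```

Every letter of CONAIR sits three positions away from its counterpart in AIRCON, so no alignment of single-character edits can reuse a match cheaply, and the distance reaches the maximum for two 6-letter words.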
I need a way to measure the similarity between two strings that also handles cases like the one above.
Is there a better algorithm I can use? Or what would you recommend?
EDIT: I've also looked into the Damerau–Levenshtein algorithm, which adds transpositions. The problem is that these transpositions apply only to adjacent characters (not to blocks of several characters).
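The limitation mentioned in the edit can be checked with a short sketch of the restricted Damerau–Levenshtein variant (often called optimal string alignment), which adds a single extra rule for swapping two adjacent characters. The name `osa_distance` is illustrative, and this is one common formulation rather than the only one.

```python
def osa_distance(a: str, b: str) -> int:
    """Levenshtein distance plus adjacent-character transpositions (cost 1)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Transposition: the last two characters are swapped.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("AB", "BA"))          # 1: one adjacent swap
print(osa_distance("CONAIR", "AIRCON"))  # 6: no adjacent swap applies
```

No pair of neighbouring letters in CONAIR appears reversed in AIRCON, so the transposition rule never fires and the result is the same as the plain Levenshtein distance.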
I would split each term into unigrams, bigrams, and trigrams, then calculate the cosine similarity between the resulting n-gram count vectors.
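A minimal sketch of that suggestion follows, assuming character-level n-grams and raw counts as vector components; the answer does not specify weighting, so other choices (for example TF-IDF weights) are equally possible. The names `ngram_profile` and `cosine_similarity` are illustrative.

```python
from collections import Counter
from math import sqrt


def ngram_profile(text: str, sizes=(1, 2, 3)) -> Counter:
    """Count every character n-gram of the given sizes in text."""
    profile = Counter()
    for n in sizes:
        for start in range(len(text) - n + 1):
            profile[text[start:start + n]] += 1
    return profile


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between the two n-gram count vectors."""
    pa, pb = ngram_profile(a), ngram_profile(b)
    # Counter returns 0 for n-grams that are missing from pb.
    dot = sum(count * pb[gram] for gram, count in pa.items())
    norm_a = sqrt(sum(c * c for c in pa.values()))
    norm_b = sqrt(sum(c * c for c in pb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # at least one string is empty
    return dot / (norm_a * norm_b)


print(round(cosine_similarity("CONAIR", "AIRCON"), 3))  # 0.8
```

For CONAIR and AIRCON each profile holds 15 distinct n-grams (6 unigrams, 5 bigrams, 4 trigrams), of which 12 are shared (all 6 unigrams, the bigrams CO, ON, AI, IR, and the trigrams CON and AIR), giving 12/15 = 0.8. Unlike the edit-distance measures, this score is insensitive to where a block such as AIR or CON appears in the string.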