String similarity -> Levenshtein distance

Views: 635 | Published: 2015/11/30 13:50:16 | Tags: string, algorithm, levenshtein-distance, similarity


Problem description

I'm using the Levenshtein algorithm to measure the similarity between two strings. This is a very important part of the program I'm writing, so it needs to be effective. The problem is that the algorithm doesn't treat the following pair as similar:

CONAIR
AIRCON

The algorithm gives a distance of 6. Measured against the longer of the two words, which here has 6 letters, the difference is 100%, so the similarity is 0%.
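To make the numbers above concrete, here is a minimal sketch of the standard dynamic-programming Levenshtein distance, together with the similarity measure as I understand it from the description (1 minus the distance divided by the length of the longer string). The function names `levenshtein` and `similarity` are made up for this illustration and are not taken from the asker's program.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance using insertions, deletions and substitutions."""
    # prev[j] is the distance between the prefix of a processed so far
    # (before the current character) and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed measure: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein("CONAIR", "AIRCON"))  # 6
print(similarity("CONAIR", "AIRCON"))   # 0.0
```

Every alignment of the two words either substitutes all six letters or pays three deletions and three insertions to line up one of the shared halves, which is why the distance cannot drop below 6.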

I need a way to measure the similarity between two strings that also handles cases like the one above.

Is there a better algorithm I can use? Or what would you recommend?

EDIT: I've also looked into the "Damerau–Levenshtein" algorithm, which adds transpositions. The problem is that these transpositions only apply to adjacent characters, not to blocks of several characters.
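The limitation mentioned in the edit can be checked with a small sketch of the restricted Damerau–Levenshtein variant, often called optimal string alignment. This is an assumption about which variant was tried, and `osa_distance` is a hypothetical name. Swapping two adjacent characters costs 1, but no adjacent pair in CONAIR appears reversed in AIRCON, so the transposition rule never fires for this pair.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    # d[i][j] is the distance between a[:i] and b[:j].
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters only.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]


print(osa_distance("AB", "BA"))          # 1
print(osa_distance("CONAIR", "AIRCON"))  # 6
```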

Solution

I would divide the term into unigrams, bigrams and trigrams, then calculate cosine similarity.
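A minimal sketch of that idea, under a few assumptions the answer does not spell out: the n-grams are character n-grams, all three sizes are counted into a single vector, and each n-gram is weighted by its raw count. The names `char_ngrams` and `cosine_similarity` are hypothetical.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text: str, sizes=(1, 2, 3)) -> Counter:
    """Count every character n-gram of the given sizes in text."""
    counts = Counter()
    for n in sizes:
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between the n-gram count vectors of a and b."""
    va, vb = char_ngrams(a), char_ngrams(b)
    # A Counter returns 0 for n-grams it has not seen.
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm_a = sqrt(sum(c * c for c in va.values()))
    norm_b = sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty string shares nothing with anything
    return dot / (norm_a * norm_b)


print(round(cosine_similarity("CONAIR", "AIRCON"), 2))  # 0.8
```

For the pair from the question, each word yields 15 distinct n-grams, and 12 of them are shared (all 6 letters, the bigrams CO, ON, AI, IR, and the trigrams CON and AIR), so the score is 12/15 = 0.8 instead of 0. Because the comparison ignores where in the string an n-gram occurs, moving a block of characters only affects the n-grams that span the block boundary.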
