String similarity -> Levenshtein distance

Problem description

I'm using the Levenshtein algorithm to measure the similarity between two strings. This is a very important part of the program I'm writing, so it needs to be effective. The problem is that the algorithm does not treat the following strings as similar:

CONAIR
AIRCON

The algorithm gives a distance of 6. For a word of 6 letters (you take the length of the longer word), that is a difference of 100%, so the similarity is 0%.
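To make the arithmetic above concrete, here is a minimal sketch of a plain Levenshtein distance and the normalisation described in the question (distance divided by the length of the longer word). The function names `levenshtein` and `similarity` are made up for illustration and are not taken from the asker's program.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insertions, deletions, substitutions cost 1 each."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein("CONAIR", "AIRCON"))  # 6
print(similarity("CONAIR", "AIRCON"))   # 0.0
```

Moving the block "CON" from the front to the back costs three deletions plus three insertions, which is why the score bottoms out even though both words contain the same two chunks.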

I need a way to measure the similarity between two strings that also handles cases like the one above.

Is there a better algorithm I can use? Or what would you recommend?

EDIT: I've also looked into the "Damerau–Levenshtein" algorithm, which adds transpositions. The problem is that these transpositions only apply to adjacent characters, not to whole blocks of characters.
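The limitation mentioned in the edit can be checked with a small sketch. The code below implements the restricted variant of Damerau–Levenshtein (often called optimal string alignment), which is an assumption about which variant is meant; the name `osa_distance` is hypothetical.

```python
def osa_distance(a: str, b: str) -> int:
    """Levenshtein distance plus transposition of two ADJACENT characters."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete everything from a[:i]
    for j in range(cols):
        d[0][j] = j  # insert everything from b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition: "xy" in a lines up with "yx" in b.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("ab", "ba"))          # 1, a single adjacent swap
print(osa_distance("CONAIR", "AIRCON"))  # 6, no adjacent swap helps here
```

None of the adjacent pairs in CONAIR appears reversed in AIRCON, so the transposition rule never fires and the result is the same as with plain Levenshtein.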

Solution

I would split each term into character unigrams, bigrams and trigrams, then calculate the cosine similarity between the two resulting sets of n-gram counts.
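One way to read this suggestion, as a rough sketch: count every character n-gram of length 1 to 3 in each string, treat the counts as sparse vectors, and take the cosine of the angle between them. The helper names `char_ngrams` and `cosine_similarity`, and the choice to weight all n-gram sizes equally, are assumptions made for illustration; the answer does not specify them.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text: str, sizes=(1, 2, 3)) -> Counter:
    """Count all character n-grams of the given sizes (assumed: 1, 2 and 3)."""
    counts = Counter()
    for n in sizes:
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between the two n-gram count vectors."""
    va, vb = char_ngrams(a), char_ngrams(b)
    # Only n-grams present in both strings contribute to the dot product.
    dot = sum(va[g] * vb[g] for g in va.keys() & vb.keys())
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0  # empty input: defined here as 0.0


print(round(cosine_similarity("CONAIR", "AIRCON"), 3))  # 0.8
```

For CONAIR and AIRCON, all 6 unigrams, 4 of the 5 bigrams and 2 of the 4 trigrams are shared, which gives 12 / 15 = 0.8. Because n-grams ignore where in the string they occur, moving a block such as "CON" only breaks the few n-grams that straddle the block boundary.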
