Alternative to Levenshtein and Trigram

Question

Say I have the following two strings in my database:

(1) 'Levi Watkins Learning Center - Alabama State University'
(2) 'ETH Library'

My software receives free text inputs from a data source, and it should match those free texts to the pre-defined strings in the database (the ones above).

For example, if the software gets the string 'Alabama University', it should recognize that this is more similar to (1) than it is to (2).

At first, I thought of using a well-known string metric like Levenshtein-Damerau or Trigrams, but this leads to unwanted results as you can see here:

http://fuzzy-string.com/Compare/Transform.aspx?r=ETH+Library&q=Alabama+University

Difference to (1): 37
Difference to (2): 14

(2) wins because it is much shorter than (1), even though (1) contains both words (Alabama and University) of the search string.
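
The length effect described above can be reproduced with a few lines of Python. This is a minimal sketch of plain Levenshtein distance (insertions, deletions and substitutions, each costing 1), not the Damerau variant and not necessarily the exact algorithm behind the linked tool; the names `levenshtein` and `rank` are chosen here for illustration only. Because 'Alabama University' can be turned into (1) purely by inserting the 37 missing characters, the distance to (1) is dominated by the length difference, while the distance to (2) can never exceed 18, the length of the longer of the two strings.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain edit distance: insertions, deletions, substitutions (cost 1 each)."""
    # Keep the shorter string in the inner loop so the rows stay small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from the first i chars of a to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def rank(query: str, candidates: list[str]) -> list[str]:
    """Order candidates from smallest to largest edit distance to the query."""
    return sorted(candidates, key=lambda c: levenshtein(query, c))


QUERY = "Alabama University"
CANDIDATES = [
    "Levi Watkins Learning Center - Alabama State University",  # (1)
    "ETH Library",                                              # (2)
]

if __name__ == "__main__":
    for candidate in CANDIDATES:
        print(levenshtein(QUERY, candidate), candidate)
    print("best match:", rank(QUERY, CANDIDATES)[0])
```

Ranking by raw edit distance therefore puts (2) first, which is exactly the unwanted result: the metric counts characters that have to change and has no notion of shared words.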

I also tried it with trigrams (using the JavaScript library FuzzySet), but I got similar results there.

Is there a string metric that would recognize the similarity of the search string to (1)?

Recommended answer

You could try the Word Mover's Distance (https://github.com/mkusner/wmd) instead. One notable advantage of this algorithm is that it takes the implied meanings of words into account when computing the distance between documents, rather than comparing characters. The paper can be found here
