Alternative to Levenshtein and Trigram
Problem description
Say I have the following two strings in my database:
(1) 'Levi Watkins Learning Center - Alabama State University'
(2) 'ETH Library'
My software receives free text inputs from a data source, and it should match those free texts to the pre-defined strings in the database (the ones above).
For example, if the software gets the string 'Alabama University', it should recognize that this is more similar to (1) than it is to (2).
At first, I thought of using a well-known string metric such as Damerau-Levenshtein or trigrams, but this leads to unwanted results, as you can see here:
http://fuzzy-string.com/Compare/Transform.aspx?r=ETH+Library&q=Alabama+University
Difference to (1): 37
Difference to (2): 14
(2) wins because it is much shorter than (1), even though (1) contains both words (Alabama and University) of the search string.
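The length effect described above can be reproduced with a few lines of Python. This is only a minimal sketch using plain Levenshtein distance (no transpositions), not the implementation behind the linked site, and the function name levenshtein is my own. Because 'Alabama University' can be turned into (1) by insertions alone, the distance to (1) equals the length difference of 37, whereas the distance to the much shorter (2) can never exceed 18.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute), row by row."""
    prev = list(range(len(b) + 1))          # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # delete ca
                            curr[j - 1] + 1,     # insert cb
                            prev[j - 1] + cost)) # substitute or match
        prev = curr
    return prev[-1]


query = "Alabama University"
s1 = "Levi Watkins Learning Center - Alabama State University"
s2 = "ETH Library"

print(levenshtein(query, s1))  # 37: every extra character in (1) costs one insertion
print(levenshtein(query, s2))  # much smaller, so (2) "wins"
```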
I also tried it with trigrams (using the JavaScript library FuzzySet), but I got similar results there.
Is there a string metric that would recognize the similarity of the search string to (1)?
Recommended answer
You could try the Word Mover's Distance (https://github.com/mkusner/wmd) instead. A key advantage of this algorithm is that it takes the meaning of words into account, by way of word embeddings, when computing the distance between documents. The paper can be found here
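To give a feel for how this behaves on the strings from the question, here is a rough sketch of the relaxed variant of Word Mover's Distance (a cheap lower bound in which each word simply moves all of its weight to the nearest word in the other document). It is not the code from the linked repository. The 2-D vectors in TOY_VECTORS are invented purely for illustration, and the helper names are mine. A real setup would load pretrained embeddings such as word2vec and would preferably use an exact WMD solver.

```python
import math
import re
from collections import Counter

# Hypothetical 2-D "embeddings", made up for this example only.
TOY_VECTORS = {
    "alabama": (1.0, 0.0), "state": (0.9, 0.2),
    "university": (0.0, 1.0), "learning": (0.1, 0.9),
    "library": (0.3, 0.8), "center": (0.4, 0.6),
    "levi": (0.7, 0.7), "watkins": (0.6, 0.8),
    "eth": (-1.0, -1.0),
}


def nbow(text):
    """Normalized bag of words: token -> share of the document's total weight."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t in TOY_VECTORS]
    counts = Counter(tokens)
    total = sum(counts.values())  # assumes at least one known token
    return {tok: n / total for tok, n in counts.items()}


def one_sided_cost(src, dst):
    """Each source word moves all of its weight to its nearest target word."""
    return sum(
        weight * min(math.dist(TOY_VECTORS[s], TOY_VECTORS[d]) for d in dst)
        for s, weight in src.items()
    )


def relaxed_wmd(a, b):
    """Larger of the two one-sided costs; a lower bound on the exact WMD."""
    da, db = nbow(a), nbow(b)
    return max(one_sided_cost(da, db), one_sided_cost(db, da))


QUERY = "Alabama University"
S1 = "Levi Watkins Learning Center - Alabama State University"
S2 = "ETH Library"

print(relaxed_wmd(QUERY, S1))  # smaller: both query words occur in (1)
print(relaxed_wmd(QUERY, S2))  # larger under these toy vectors
```

With these made-up vectors, (1) comes out closer to the query than (2). The extra words in (1) still add some cost, but the penalty depends on how far they are from the query words in embedding space rather than on raw string length.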