Problems with Levenshtein algorithm in Java
Question
I want to use the Levenshtein algorithm for the following task: when a user on my website searches for a value (typing characters into an input), I want to instantly check for suggestions via AJAX, as Google Instant does.
I have the impression that the Levenshtein algorithm is far too slow for such a task. To check its behaviour, I first implemented it in Java, printing the two Strings in every recursive call of the method.
public class Levenshtein {

    public static void main(String[] arg) {
        String a = "Hallo Zusammen";
        String b = "jfdss Zusammen";
        int res = levenshtein(a, b);
        System.out.println(res);
    }

    public static int levenshtein(String s, String t) {
        int len_s = s.length();
        int len_t = t.length();
        int cost = 0;
        System.out.println("s: " + s + ", t: " + t);
        if (len_s > 0 && len_t > 0) {
            if (s.charAt(0) != t.charAt(0)) cost = 1;
        }
        if (len_s == 0) {
            return len_t;
        } else {
            if (len_t == 0) {
                return len_s;
            } else {
                String news = s.substring(0, s.length() - 1);
                String newt = t.substring(0, t.length() - 1);
                return min(levenshtein(news, t) + 1,
                           levenshtein(s, newt) + 1,
                           levenshtein(news, newt) + cost);
            }
        }
    }

    public static int min(int a, int b, int c) {
        return Math.min(Math.min(a, b), c);
    }
}
However, there are a few points:
- I added the check if(len_s>0 && len_t>0) because I was getting a StringIndexOutOfBoundsException with the above test values.
- With the above test values, the algorithm seems to run forever.
Are there optimizations that would make the algorithm work for me, or should I use a completely different one to accomplish this task?
Recommended answer
1) A few words about improving the Levenshtein distance algorithm
The recursive implementation of Levenshtein distance has exponential complexity.
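To get a feeling for that growth, here is a minimal sketch that counts the calls made by a naive recursion. The class NaiveLevenshteinDemo and its methods are hypothetical names used only for illustration. Unlike the code in the question, this sketch compares the last characters of the two strings, because the recursion works by dropping the last character.

```java
// Hypothetical demo class (not from the original post): counts how many
// calls the naive recursive Levenshtein distance makes.
final class NaiveLevenshteinDemo {
    static long calls = 0;

    static int naive(String s, String t) {
        calls++;
        if (s.isEmpty()) return t.length();
        if (t.isEmpty()) return s.length();
        // Compare the LAST characters, since each step strips the last character.
        int cost = s.charAt(s.length() - 1) == t.charAt(t.length() - 1) ? 0 : 1;
        String shorterS = s.substring(0, s.length() - 1);
        String shorterT = t.substring(0, t.length() - 1);
        return Math.min(
                Math.min(naive(shorterS, t) + 1, naive(s, shorterT) + 1),
                naive(shorterS, shorterT) + cost);
    }

    // Resets the counter, runs the recursion once and returns the call count.
    static long countCalls(String s, String t) {
        calls = 0;
        naive(s, t);
        return calls;
    }

    public static void main(String[] args) {
        String a = "abcdefghi";
        String b = "jklmnopqr";
        // Print the call count for two strings of equal length n = 1..9.
        for (int n = 1; n <= a.length(); n++) {
            long c = countCalls(a.substring(0, n), b.substring(0, n));
            System.out.println(n + " chars: " + c + " calls");
        }
    }
}
```

The call count depends only on the two lengths, not on the characters, and it multiplies by a roughly constant factor with every added character. That is why two 14-character strings, with a println in every call, appear to run forever.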
I'd suggest using memoization and implementing Levenshtein distance without recursion, which reduces the complexity to O(N^2) for two strings of length about N (and needs O(N^2) memory):
public static int levenshteinDistance( String s1, String s2 ) {
    return dist( s1.toCharArray(), s2.toCharArray() );
}

public static int dist( char[] s1, char[] s2 ) {
    // distance matrix - memoizes distances between prefixes,
    // which is what lets us avoid recursion
    int[][] d = new int[ s1.length + 1 ][ s2.length + 1 ];

    // d[i][j] holds the distance between
    // the first i characters of s1 and the first j characters of s2
    for( int i = 0; i < s1.length + 1; i++ ) {
        d[ i ][ 0 ] = i;
    }
    for( int j = 0; j < s2.length + 1; j++ ) {
        d[ 0 ][ j ] = j;
    }

    for( int i = 1; i < s1.length + 1; i++ ) {
        for( int j = 1; j < s2.length + 1; j++ ) {
            int d1 = d[ i - 1 ][ j ] + 1;      // deletion
            int d2 = d[ i ][ j - 1 ] + 1;      // insertion
            int d3 = d[ i - 1 ][ j - 1 ];      // match or substitution
            if ( s1[ i - 1 ] != s2[ j - 1 ] ) {
                d3 += 1;
            }
            d[ i ][ j ] = Math.min( Math.min( d1, d2 ), d3 );
        }
    }
    return d[ s1.length ][ s2.length ];
}
Or, even better: notice that each cell of the distance matrix needs only information from the previous row (plus the cell to its left in the current row), so you can reduce the memory requirement to O(N):
public static int dist( char[] s1, char[] s2 ) {
    // memoize only the previous row of the distance matrix
    int[] prev = new int[ s2.length + 1 ];
    for( int j = 0; j < s2.length + 1; j++ ) {
        prev[ j ] = j;
    }

    for( int i = 1; i < s1.length + 1; i++ ) {
        // calculate the current row of the distance matrix
        int[] curr = new int[ s2.length + 1 ];
        curr[ 0 ] = i;
        for( int j = 1; j < s2.length + 1; j++ ) {
            int d1 = prev[ j ] + 1;
            int d2 = curr[ j - 1 ] + 1;
            int d3 = prev[ j - 1 ];
            if ( s1[ i - 1 ] != s2[ j - 1 ] ) {
                d3 += 1;
            }
            curr[ j ] = Math.min( Math.min( d1, d2 ), d3 );
        }
        // the current row becomes the previous row for the next iteration
        prev = curr;
    }
    return prev[ s2.length ];
}
2) A few words about autocomplete
Levenshtein distance is preferable only when you need to find (near-)exact matches of the whole string.
But what if your keyword is apple and the user typed green apples? The Levenshtein distance between query and keyword would be large (7). And the Levenshtein distance between apple and bcdfghk (a dummy string) would be 7 too!
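Both distances can be checked with a small sketch. The class DistanceCheck is a hypothetical name used only for illustration, and its distance method is a compact row-by-row variant of the approach from part 1.

```java
// Hypothetical helper class (not from the original post) that verifies the
// two distances quoted above.
final class DistanceCheck {
    static int distance(String a, String b) {
        // prev[j] = distance between the processed prefix of a and the first j chars of b
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;

        for (int i = 1; i <= a.length(); i++) {
            int[] curr = new int[b.length() + 1];
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                curr[j] = Math.min(Math.min(prev[j] + 1, curr[j - 1] + 1), substitution);
            }
            prev = curr;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("green apples", "apple")); // prints 7
        System.out.println(distance("bcdfghk", "apple"));      // prints 7
    }
}
```

The relevant query and the meaningless string get the same score, so distance alone cannot tell them apart.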
I'd suggest using a full-text search engine (e.g. Lucene). The trick is to represent each keyword with an n-gram model.
In a few words:
1) Represent each keyword as a document containing its n-grams: apple -> [ap, pp, pl, le].
2) After transforming each keyword into a set of n-grams, index each keyword-document by n-gram in your search engine. You'll have to create an index like this:
...
ap -> apple, map, happy ...
pp -> apple ...
pl -> apple, place ...
...
3) So now you have an n-gram index. When you get a query, split it into n-grams. You then have the set of the user's query n-grams, and all you need is to match the most similar documents in your search engine. As a first draft, that would be enough.
4) For better suggestions, you can rank the search engine's results by Levenshtein distance.
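Steps 1 to 4 can be sketched in plain Java. This is not Lucene: a HashMap stands in for the search engine's index, and the class NGramSuggester and all of its methods are hypothetical names used only for illustration. Candidates are ordered by the number of n-grams shared with the query, with Levenshtein distance breaking ties, which is one possible reading of step 4.

```java
import java.util.*;

// Hypothetical in-memory stand-in for a search engine's n-gram index.
final class NGramSuggester {
    // n-gram -> keywords that contain it (the inverted index from step 2)
    private final Map<String, Set<String>> index = new HashMap<>();
    private final int n;

    NGramSuggester(int n, Collection<String> keywords) {
        this.n = n;
        for (String keyword : keywords) {
            for (String gram : ngrams(keyword, n)) {
                index.computeIfAbsent(gram, g -> new TreeSet<>()).add(keyword);
            }
        }
    }

    // Step 1: "apple" with n = 2 becomes [ap, pp, pl, le].
    static List<String> ngrams(String text, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    Set<String> lookup(String gram) {
        return index.getOrDefault(gram, Collections.emptySet());
    }

    // Steps 3 and 4: collect keywords sharing n-grams with the query, then rank them.
    List<String> suggest(String query, int limit) {
        Map<String, Integer> shared = new HashMap<>();
        for (String gram : new HashSet<>(ngrams(query, n))) {
            for (String keyword : lookup(gram)) {
                shared.merge(keyword, 1, Integer::sum);
            }
        }
        List<String> candidates = new ArrayList<>(shared.keySet());
        Comparator<String> byShared = Comparator.comparingInt((String k) -> -shared.get(k));
        Comparator<String> byDistance = Comparator.comparingInt((String k) -> distance(query, k));
        candidates.sort(byShared.thenComparing(byDistance)
                                .thenComparing(Comparator.<String>naturalOrder()));
        return new ArrayList<>(candidates.subList(0, Math.min(limit, candidates.size())));
    }

    // Row-by-row Levenshtein distance, as in part 1.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            int[] curr = new int[b.length() + 1];
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                curr[j] = Math.min(Math.min(prev[j] + 1, curr[j - 1] + 1), substitution);
            }
            prev = curr;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        NGramSuggester suggester =
                new NGramSuggester(2, Arrays.asList("apple", "map", "happy", "place"));
        System.out.println(suggester.suggest("green apples", 2)); // prints [apple, happy]
    }
}
```

With the query green apples, the keyword apple shares four bigrams (ap, pp, pl, le) and comes first even though its Levenshtein distance to the query is 7. A real search engine would additionally weight n-grams by how rare they are, which this sketch does not attempt.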
P.S. I'd suggest looking through the book "Introduction to Information Retrieval".