Problems with the Levenshtein algorithm in Java


Question

I want to use the Levenshtein algorithm for the following task: when a user on my website searches for some value (by typing characters into an input field), I want to check for suggestions instantly via AJAX, the way Google Instant does.

I have the impression that the Levenshtein algorithm is far too slow for such a task. To check its behaviour, I first implemented it in Java, printing the two Strings on every recursive call of the method.

public class Levenshtein {
    public static void main(String[] arg){
        String a = "Hallo Zusammen";
        String b = "jfdss Zusammen";

        int res = levenshtein(a, b);

        System.out.println(res);
    }

    public static int levenshtein(String s, String t){
        int len_s = s.length();
        int len_t = t.length();
        int cost = 0;

        System.out.println("s: " + s + ", t: " + t);

        if(len_s>0 && len_t>0){
            if(s.charAt(0) != t.charAt(0)) cost = 1;
        }

        if(len_s == 0){
            return len_t;
        }else{
            if(len_t == 0){
                return len_s;
            }else{
                String news = s.substring(0, s.length()-1);
                String newt = t.substring(0, t.length()-1);
                return min(levenshtein(news, t) + 1,
                            levenshtein(s, newt) + 1,
                            levenshtein(news, newt) + cost);
            }
        }
    }

    public static int min(int a, int b, int c) {
          return Math.min(Math.min(a, b), c);
    }
}

A few points, however:

  • I added the check if(len_s>0 && len_t>0) myself, because I was getting a StringIndexOutOfBoundsException with the test values above.
  • With the test values above, the algorithm seems to run forever.

Are there optimizations that would make the algorithm work for me, or should I use a completely different one for this task?

Recommended answer

1) A few words about improving the Levenshtein distance algorithm

The recursive implementation of Levenshtein distance has exponential complexity.
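To make the blow-up concrete, here is a minimal sketch that only counts how often a recursion with the same three-way branching as the naive method is entered. `CallCounter`, `visit` and `countCalls` are hypothetical names introduced for this illustration; they are not part of the question's code.

```java
// Illustration only: counts calls made by a recursion that branches
// three ways per step, as the naive Levenshtein recursion does.
// All names here are hypothetical.
class CallCounter {
    static long calls = 0;

    // Mirrors the naive recursion's shape: stop when either length is 0,
    // otherwise recurse on (s-1, t), (s, t-1) and (s-1, t-1).
    static void visit(int lenS, int lenT) {
        calls++;
        if (lenS == 0 || lenT == 0) {
            return;
        }
        visit(lenS - 1, lenT);
        visit(lenS, lenT - 1);
        visit(lenS - 1, lenT - 1);
    }

    // Resets the counter and returns the number of calls for the given lengths.
    static long countCalls(int lenS, int lenT) {
        calls = 0;
        visit(lenS, lenT);
        return calls;
    }

    public static void main(String[] args) {
        // Print the call count for a few equal-length inputs.
        for (int n = 2; n <= 10; n += 2) {
            System.out.println("length " + n + ": " + countCalls(n, n) + " calls");
        }
    }
}
```

Two strings of length 2 already take 19 calls, and the count grows several-fold with each extra character. With two 14-character strings and a println on every call, as in the question, the run only looks infinite.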

I'd suggest using memoization (a table of already-computed distances) and implementing Levenshtein distance without recursion. This reduces the complexity to O(N^2) time, at the cost of O(N^2) memory:

public static int levenshteinDistance( String s1, String s2 ) {
    return dist( s1.toCharArray(), s2.toCharArray() );
}

public static int dist( char[] s1, char[] s2 ) {

    // distance matrix - to memoize distances between substrings
    // needed to avoid recursion
    int[][] d = new int[ s1.length + 1 ][ s2.length + 1 ];

    // d[i][j] holds the distance between these prefixes:
    // s1.substring(0, i) and s2.substring(0, j)

    for( int i = 0; i < s1.length + 1; i++ ) {
        d[ i ][ 0 ] = i;
    }

    for(int j = 0; j < s2.length + 1; j++) {
        d[ 0 ][ j ] = j;
    }

    for( int i = 1; i < s1.length + 1; i++ ) {
        for( int j = 1; j < s2.length + 1; j++ ) {
            int d1 = d[ i - 1 ][ j ] + 1;
            int d2 = d[ i ][ j - 1 ] + 1;
            int d3 = d[ i - 1 ][ j - 1 ];
            if ( s1[ i - 1 ] != s2[ j - 1 ] ) {
                d3 += 1;
            }
            d[ i ][ j ] = Math.min( Math.min( d1, d2 ), d3 );
        }
    }
    return d[ s1.length ][ s2.length ];
}

Or, even better: notice that each cell of the distance matrix only needs information from the previous row (plus the cell to its left in the current row), so you can reduce the memory requirement to O(N):

public static int dist( char[] s1, char[] s2 ) {

    // memoize only previous line of distance matrix     
    int[] prev = new int[ s2.length + 1 ];

    for( int j = 0; j < s2.length + 1; j++ ) {
        prev[ j ] = j;
    }

    for( int i = 1; i < s1.length + 1; i++ ) {

        // calculate current line of distance matrix     
        int[] curr = new int[ s2.length + 1 ];
        curr[0] = i;

        for( int j = 1; j < s2.length + 1; j++ ) {
            int d1 = prev[ j ] + 1;
            int d2 = curr[ j - 1 ] + 1;
            int d3 = prev[ j - 1 ];
            if ( s1[ i - 1 ] != s2[ j - 1 ] ) {
                d3 += 1;
            }
            curr[ j ] = Math.min( Math.min( d1, d2 ), d3 );
        }

        // define current line of distance matrix as previous     
        prev = curr;
    }
    return prev[ s2.length ];
}
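As a quick sanity check of the two-row idea, the following self-contained sketch computes the distance for the question's test strings. `LevenshteinCheck` and `distance` are hypothetical names. Unlike the code above, it swaps two preallocated row buffers instead of allocating a new row per iteration; that is a small variation of mine, not something the answer prescribes.

```java
// Sketch of the two-row technique; names are hypothetical.
class LevenshteinCheck {
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance from a's prefix of length i to the empty string
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int deletion = prev[j] + 1;
                int insertion = curr[j - 1] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(deletion, insertion), substitution);
            }
            // Reuse the buffers: the current row becomes the previous one.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        // The two strings differ only in their first five characters.
        System.out.println(distance("Hallo Zusammen", "jfdss Zusammen")); // prints 5
    }
}
```

The double loop does 14 × 14 cell updates for these inputs, so the result comes back immediately.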

2) A few words about autocomplete

Levenshtein distance is preferable only if you need to find exact (or nearly exact) matches.

But what if your keyword is apple and the user types green apples? The Levenshtein distance between the query and the keyword would be large (7 points). And the Levenshtein distance between apple and bcdfghk (a dumb string) would be 7 points too!

I'd suggest using a full-text search engine (e.g. Lucene). The trick is that you have to use an n-gram model to represent each keyword.

In a few words:
1) You have to represent each keyword as a document that contains its n-grams: apple -> [ap, pp, pl, le].

2) After transforming each keyword into a set of n-grams, you have to index each keyword document by n-gram in your search engine. You'll have to create an index like this:

...
ap -> apple, map, happy ...
pp -> apple ...
pl -> apple, place ...
...
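Steps 1 and 2 can be sketched in plain Java. This is an assumption-laden illustration that uses an in-memory map in place of a real search engine such as Lucene. `BigramIndex`, `bigrams`, `add` and `lookup` are hypothetical names, and n is fixed at 2 to match the example above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// In-memory stand-in for an n-gram index (not Lucene); names are hypothetical.
class BigramIndex {
    // n-gram -> keywords containing that n-gram
    private final Map<String, Set<String>> index = new HashMap<>();

    // Step 1: split a keyword into overlapping 2-character n-grams.
    static List<String> bigrams(String word) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 2 <= word.length(); i++) {
            grams.add(word.substring(i, i + 2));
        }
        return grams;
    }

    // Step 2: register the keyword under each of its n-grams.
    void add(String keyword) {
        for (String gram : bigrams(keyword)) {
            index.computeIfAbsent(gram, k -> new TreeSet<>()).add(keyword);
        }
    }

    // Returns the keywords indexed under the given n-gram (empty if none).
    Set<String> lookup(String gram) {
        return index.getOrDefault(gram, Collections.emptySet());
    }

    public static void main(String[] args) {
        BigramIndex idx = new BigramIndex();
        for (String kw : new String[] {"apple", "map", "happy", "place"}) {
            idx.add(kw);
        }
        System.out.println(bigrams("apple")); // prints [ap, pp, pl, le]
        System.out.println("ap -> " + idx.lookup("ap"));
        System.out.println("pl -> " + idx.lookup("pl"));
    }
}
```

A real engine would also handle tokenization, case folding and persistence. The sketch only shows the shape of the index.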

3) Now you have an n-gram index. When you get a query, you have to split it into n-grams as well, which gives you the set of the user's query n-grams. All you then need is to match the most similar documents from your search engine. As a first draft, this would be enough.

4) For better suggestions, you may rank the search engine's results by Levenshtein distance.
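Steps 3 and 4 might look roughly like the sketch below, again with an in-memory map standing in for the search engine. `SuggestSketch` and `suggest` are hypothetical names. Ordering candidates by the number of shared n-grams first, and using Levenshtein distance only to break ties, is one possible reading of "rank by Levenshtein distance" and an assumption of mine, not the answer's prescription.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of query-time matching; all names are hypothetical.
class SuggestSketch {
    // Distinct 2-character n-grams of a string.
    static Set<String> bigrams(String s) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 2 <= s.length(); i++) {
            grams.add(s.substring(i, i + 2));
        }
        return grams;
    }

    // Two-row Levenshtein distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(prev[j] + 1, curr[j - 1] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    static List<String> suggest(String query, List<String> keywords, int limit) {
        // Build the n-gram index (a real system would do this once, up front).
        Map<String, Set<String>> index = new HashMap<>();
        for (String kw : keywords) {
            for (String g : bigrams(kw)) {
                index.computeIfAbsent(g, k -> new HashSet<>()).add(kw);
            }
        }
        // Step 3: count how many query n-grams each keyword shares.
        Map<String, Integer> shared = new HashMap<>();
        for (String g : bigrams(query)) {
            for (String kw : index.getOrDefault(g, Collections.emptySet())) {
                shared.merge(kw, 1, Integer::sum);
            }
        }
        // Step 4: most shared n-grams first, Levenshtein distance as tie-breaker.
        List<String> candidates = new ArrayList<>(shared.keySet());
        candidates.sort((x, y) -> {
            int byShared = Integer.compare(shared.get(y), shared.get(x));
            if (byShared != 0) return byShared;
            return Integer.compare(levenshtein(query, x), levenshtein(query, y));
        });
        return new ArrayList<>(candidates.subList(0, Math.min(limit, candidates.size())));
    }

    public static void main(String[] args) {
        List<String> keywords = Arrays.asList("apple", "map", "happy", "place", "banana");
        // "apple" shares the most n-grams with the query, so it is listed first.
        System.out.println(suggest("green apples", keywords, 3));
    }
}
```

For the query green apples, the keyword apple shares four n-grams (ap, pp, pl, le) and therefore comes first, even though its Levenshtein distance to the query is 7. Keywords that share no n-gram with the query, such as banana here, are never considered.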

P.S. I'd suggest looking through the book "Introduction to Information Retrieval".
