修改Levenshtein距离算法以不计算所有距离 [英] Modifying Levenshtein Distance algorithm to not calculate all distances

查看:76
本文介绍了修改Levenshtein距离算法以不计算所有距离的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在研究模糊搜索实现,作为实现的一部分,我们正在使用Apache的StringUtils.getLevenshteinDistance。目前,我们要为模糊搜索指定一个特定的最大平均响应时间。经过各种改进并进行了一些分析后,花费最多时间的地方是计算Levenshtein距离。在三个或三个以上字母的搜索字符串上,它大约占总时间的80-90%。

I'm working on a fuzzy search implementation and as part of the implementation, we're using Apache's StringUtils.getLevenshteinDistance. At the moment, we're going for a specific maxmimum average response time for our fuzzy search. After various enhancements and with some profiling, the place where the most time is spent is calculating the Levenshtein distance. It takes up roughly 80-90% of the total time on search strings three letters or more.

现在,我知道这里可以做些限制,但是我已经阅读了之前的SO问题和LD的Wikipedia链接,如果愿意将阈值限制为设置的最大距离,这可以帮助减少算法花费的时间,但是我不确定该怎么做完全是这样

Now, I know there are some limitations to what can be done here, but I've read on previous SO questions and on the Wikipedia link for LD that if one is willing limit the threshold to a set maximum distance, that could help curb the time spent on the algorithm, but I'm not sure how to do this exactly.


如果我们只对
距离感兴趣,如果它小于
阈值k,那么就足够了到
计算矩阵中宽度为
2k + 1的对角线。这样,
算法可以在O(kl)时间运行,
,其中l是最短
字符串的长度。[3]

If we are only interested in the distance if it is smaller than a threshold k, then it suffices to compute a diagonal stripe of width 2k+1 in the matrix. In this way, the algorithm can be run in O(kl) time, where l is the length of the shortest string.[3]

在下面,您将看到StringUtils的原始LH代码。之后就是我的修改。我正在尝试从i,j对角线计算出一定长度的距离(因此,在我的示例中,是i,j对角线上方和下方的两个对角线)。但是,这是不正确的,因为我已经做到了。例如,在最高对角线上,总是要选择正上方的单元格值,该值将为0。如果有人可以向我展示如何如我所描述的那样使此函数起作用,或者有关如何使它起作用的一些一般性建议, 这将不胜感激。

Below you will see the original LH code from StringUtils. After that is my modification. I'm trying to basically calculate the distances of a set length from the i,j diagonal (so, in my example, two diagonals above and below the i,j diagonal). However, this can't be correct as I've done it. For example, on the highest diagonal, it's always going to choose the cell value directly above, which will be 0. If anyone could show me how to make this functional as I've described, or some general advice on how to make it so, it would be greatly appreciated.

public static int getLevenshteinDistance(String s, String t) {
        if (s == null || t == null) {
            throw new IllegalArgumentException("Strings must not be null");
        }

        int n = s.length(); // length of s
        int m = t.length(); // length of t

        if (n == 0) {
            return m;
        } else if (m == 0) {
            return n;
        }

        if (n > m) {
            // swap the input strings to consume less memory
            String tmp = s;
            s = t;
            t = tmp;
            n = m;
            m = t.length();
        }

        int p[] = new int[n+1]; //'previous' cost array, horizontally
        int d[] = new int[n+1]; // cost array, horizontally
        int _d[]; //placeholder to assist in swapping p and d

        // indexes into strings s and t
        int i; // iterates through s
        int j; // iterates through t

        char t_j; // jth character of t

        int cost; // cost

        for (i = 0; i<=n; i++) {
            p[i] = i;
        }

        for (j = 1; j<=m; j++) {
            t_j = t.charAt(j-1);
            d[0] = j;

            for (i=1; i<=n; i++) {
                cost = s.charAt(i-1)==t_j ? 0 : 1;
                // minimum of cell to the left+1, to the top+1, diagonally left and up +cost
                d[i] = Math.min(Math.min(d[i-1]+1, p[i]+1),  p[i-1]+cost);
            }

            // copy current distance counts to 'previous row' distance counts
            _d = p;
            p = d;
            d = _d;
        }

        // our last action in the above loop was to switch d and p, so p now 
        // actually has the most recent cost counts
        return p[n];
    }

我的修改(仅对for循环):

My modifications (only to the for loops):

  for (j = 1; j<=m; j++) {
        t_j = t.charAt(j-1);
        d[0] = j;

        int k = Math.max(j-2, 1);
        for (i = k; i <= Math.min(j+2, n); i++) {
            cost = s.charAt(i-1)==t_j ? 0 : 1;
            // minimum of cell to the left+1, to the top+1, diagonally left and up +cost
            d[i] = Math.min(Math.min(d[i-1]+1, p[i]+1),  p[i-1]+cost);
        }

        // copy current distance counts to 'previous row' distance counts
        _d = p;
        p = d;
        d = _d;
    }


推荐答案

实现窗口的问题正在处理每行中第一个条目左侧和最后一个条目上方的值。

The issue with implementing the window is dealing with the value to the left of the first entry and above the last entry in each row.

一种方法是将最初填写的值从1而不是0开始,然后忽略遇到的任何0。您必须从最终答案中减去1。

One way is to start the values you initially fill in at 1 instead of 0, then just ignore any 0s that you encounter. You'll have to subtract 1 from your final answer.

另一种方法是在第一个和最后一个左上方的条目中填充较高的值,这样就不会选择最小的支票他们。那是我前一天必须实现它时选择的方式:

Another way is to fill the entries left of first and above last with high values so the minimum check will never pick them. That's the way I chose when I had to implement it the other day:

public static int levenshtein(String s, String t, int threshold) {
    int slen = s.length();
    int tlen = t.length();

    // swap so the smaller string is t; this reduces the memory usage
    // of our buffers
    if(tlen > slen) {
        String stmp = s;
        s = t;
        t = stmp;
        int itmp = slen;
        slen = tlen;
        tlen = itmp;
    }

    // p is the previous and d is the current distance array; dtmp is used in swaps
    int[] p = new int[tlen + 1];
    int[] d = new int[tlen + 1];
    int[] dtmp;

    // the values necessary for our threshold are written; the ones after
    // must be filled with large integers since the tailing member of the threshold 
    // window in the bottom array will run min across them
    int n = 0;
    for(; n < Math.min(p.length, threshold + 1); ++n)
        p[n] = n;
    Arrays.fill(p, n, p.length, Integer.MAX_VALUE);
    Arrays.fill(d, Integer.MAX_VALUE);

    // this is the core of the Levenshtein edit distance algorithm
    // instead of actually building the matrix, two arrays are swapped back and forth
    // the threshold limits the amount of entries that need to be computed if we're 
    // looking for a match within a set distance
    for(int row = 1; row < s.length()+1; ++row) {
        char schar = s.charAt(row-1);
        d[0] = row;

        // set up our threshold window
        int min = Math.max(1, row - threshold);
        int max = Math.min(d.length, row + threshold + 1);

        // since we're reusing arrays, we need to be sure to wipe the value left of the
        // starting index; we don't have to worry about the value above the ending index
        // as the arrays were initially filled with large integers and we progress to the right
        if(min > 1)
            d[min-1] = Integer.MAX_VALUE;

        for(int col = min; col < max; ++col) {
            if(schar == t.charAt(col-1))
                d[col] = p[col-1];
            else 
                // min of: diagonal, left, up
                d[col] = Math.min(p[col-1], Math.min(d[col-1], p[col])) + 1;
        }
        // swap our arrays
        dtmp = p;
        p = d;
        d = dtmp;
    }

        if(p[tlen] == Integer.MAX_VALUE)
            return -1;
    return p[tlen];
}

这篇关于修改Levenshtein距离算法以不计算所有距离的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆