Implementing a simple Trie for efficient Levenshtein Distance calculation - Java

Question

Done. Below is the code that finally passed all of my tests. Again, this is modeled after Murilo Vasconcelo's modified version of Steve Hanov's algorithm. Thanks to everyone who helped!

/**
 * Computes the minimum Levenshtein Distance between the given word (represented as an array of Characters) and the
 * words stored in theTrie. This algorithm is modeled after Steve Hanov's blog article "Fast and Easy Levenshtein
 * distance using a Trie" and Murilo Vasconcelo's revised version in C++.
 * 
 * http://stevehanov.ca/blog/index.php?id=114
 * http://murilo.wordpress.com/2011/02/01/fast-and-easy-levenshtein-distance-using-a-trie-in-c/
 * 
 * @param word the characters of the input word, as a list
 * @return the minimum Levenshtein Distance
 */
private int computeMinimumLevenshteinDistance(ArrayList<Character> word) {

    theTrie.minLevDist = Integer.MAX_VALUE;

    int iWordLength = word.size();
    int[] currentRow = new int[iWordLength + 1];

    for (int i = 0; i <= iWordLength; i++) {
        currentRow[i] = i;
    }

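    // Start one traversal per child of the root: each child holds the first letter of some stored word.
    // (Passing the root itself together with a letter of the input would count an extra leading character.)
    for (Character c : theTrie.root.children.keySet()) {
        traverseTrie(theTrie.root.children.get(c), c, word, currentRow);
    }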
    return theTrie.minLevDist;
}

/**
 * Recursive helper function. Traverses theTrie in search of the minimum Levenshtein Distance.
 * 
 * @param node the current TrieNode
 * @param letter the letter stored at the current TrieNode
 * @param word the characters of the input word
 * @param previousRow the row of the Levenshtein Distance matrix computed for the parent node
 */
private void traverseTrie(TrieNode node, char letter, ArrayList<Character> word, int[] previousRow) {

    int size = previousRow.length;
    int[] currentRow = new int[size];
    currentRow[0] = previousRow[0] + 1;

    int minimumElement = currentRow[0];
    int insertCost, deleteCost, replaceCost;

    for (int i = 1; i < size; i++) {

        insertCost = currentRow[i - 1] + 1;
        deleteCost = previousRow[i] + 1;

        if (word.get(i - 1) == letter) {
            replaceCost = previousRow[i - 1];
        } else {
            replaceCost = previousRow[i - 1] + 1;
        }

        currentRow[i] = minimum(insertCost, deleteCost, replaceCost);

        if (currentRow[i] < minimumElement) {
            minimumElement = currentRow[i];
        }
    }

    if (currentRow[size - 1] < theTrie.minLevDist && node.isWord) {
        theTrie.minLevDist = currentRow[size - 1];
    }

    if (minimumElement < theTrie.minLevDist) {

        for (Character c : node.children.keySet()) {
            traverseTrie(node.children.get(c), c, word, currentRow);
        }
    }
}
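
For readers who want to run the listing above, here is a minimal self-contained sketch of the same idea. Several things in it are assumptions rather than code from this post: the class name `TrieLevenshteinSketch`, the nested `Node` class (a stand-in for `TrieNode`), the `best` field (a stand-in for `theTrie.minLevDist`), and the `minimum` helper, which the post uses but never shows. The search starts from every child of the root, as in Steve Hanov's article.

```java
import java.util.HashMap;
import java.util.Map;

class TrieLevenshteinSketch {

    // Hypothetical minimal node: children keyed by letter, plus an end-of-word flag.
    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    private final Node root = new Node();
    private int best; // plays the role of theTrie.minLevDist

    void insert(String word) {
        Node current = root;
        for (char letter : word.toCharArray()) {
            current = current.children.computeIfAbsent(letter, k -> new Node());
        }
        current.isWord = true;
    }

    // Minimum Levenshtein distance between `word` and any stored word
    // (Integer.MAX_VALUE if nothing is stored).
    int minDistance(String word) {
        best = Integer.MAX_VALUE;
        int[] firstRow = new int[word.length() + 1];
        for (int i = 0; i < firstRow.length; i++) {
            firstRow[i] = i; // cost of turning the empty prefix into word[0..i)
        }
        if (root.isWord) {
            best = word.length(); // the empty string itself is stored
        }
        for (Map.Entry<Character, Node> e : root.children.entrySet()) {
            traverse(e.getValue(), e.getKey(), word, firstRow);
        }
        return best;
    }

    private void traverse(Node node, char letter, String word, int[] previousRow) {
        int size = previousRow.length;
        int[] currentRow = new int[size];
        currentRow[0] = previousRow[0] + 1;
        int rowMin = currentRow[0];
        for (int i = 1; i < size; i++) {
            int insertCost = currentRow[i - 1] + 1;
            int deleteCost = previousRow[i] + 1;
            int replaceCost = previousRow[i - 1] + (word.charAt(i - 1) == letter ? 0 : 1);
            currentRow[i] = minimum(insertCost, deleteCost, replaceCost);
            rowMin = Math.min(rowMin, currentRow[i]);
        }
        if (node.isWord && currentRow[size - 1] < best) {
            best = currentRow[size - 1];
        }
        // Prune: row minima never decrease further down, so skip subtrees that cannot beat `best`.
        if (rowMin < best) {
            for (Map.Entry<Character, Node> e : node.children.entrySet()) {
                traverse(e.getValue(), e.getKey(), word, currentRow);
            }
        }
    }

    private static int minimum(int a, int b, int c) {
        return Math.min(a, Math.min(b, c));
    }

    public static void main(String[] args) {
        TrieLevenshteinSketch trie = new TrieLevenshteinSketch();
        for (String w : new String[] {"cat", "cart", "dog"}) {
            trie.insert(w);
        }
        System.out.println(trie.minDistance("cat")); // 0 (exact match)
        System.out.println(trie.minDistance("cot")); // 1 ("cat", one substitution)
    }
}
```

Each recursive call allocates one row of `word.length() + 1` ints, so memory in use at any moment is bounded by the depth of the trie times the length of the input word.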

UPDATE 2

Finally, I've managed to get this to work for most of my test cases. My implementation is practically a direct translation from Murilo's C++ version of Steve Hanov's algorithm. So how should I refactor this algorithm and/or make optimizations? Below is the code...

public int search(String word) {

    theTrie.minLevDist = Integer.MAX_VALUE;

    int size = word.length();
    int[] currentRow = new int[size + 1];

    for (int i = 0; i <= size; i++) {
        currentRow[i] = i;
    }
    for (int i = 0; i < size; i++) {
        char c = word.charAt(i);
        if (theTrie.root.children.containsKey(c)) {
            searchRec(theTrie.root.children.get(c), c, word, currentRow);
        }
    }
    return theTrie.minLevDist;
}
private void searchRec(TrieNode node, char letter, String word, int[] previousRow) {

    int size = previousRow.length;
    int[] currentRow = new int[size];
    currentRow[0] = previousRow[0] + 1;

    int insertCost, deleteCost, replaceCost;

    for (int i = 1; i < size; i++) {

        insertCost = currentRow[i - 1] + 1;
        deleteCost = previousRow[i] + 1;

        if (word.charAt(i - 1) == letter) {
            replaceCost = previousRow[i - 1];
        } else {
            replaceCost = previousRow[i - 1] + 1;
        }
        currentRow[i] = minimum(insertCost, deleteCost, replaceCost);
    }

    if (currentRow[size - 1] < theTrie.minLevDist && node.isWord) {
        theTrie.minLevDist = currentRow[size - 1];
    }

    if (minElement(currentRow) < theTrie.minLevDist) {

        for (Character c : node.children.keySet()) {
            searchRec(node.children.get(c), c, word, currentRow);

        }
    }
}

Thank you everyone who contributed to this question. I tried getting the Levenshtein Automata to work, but I couldn't make it happen.

So I'm looking for suggestions on refactoring and/or optimizations regarding the above code. Please let me know if there's any confusion. As always, I can provide the rest of the source code as needed.

So I've implemented a simple Trie data structure and I've been trying to follow Steve Hanov's Python tutorial to compute the Levenshtein Distance. Actually, I'm interested in computing the minimum Levenshtein Distance between a given word and the words in the Trie, so I've been following Murilo Vasconcelos's version of Steve Hanov's algorithm. It's not working very well, but here's my Trie class:

public class Trie {

    public TrieNode root;
    public int minLevDist;

    public Trie() {
        this.root = new TrieNode(' ');
    }

    public void insert(String word) {

        int length = word.length();
        TrieNode current = this.root;

        if (length == 0) {
            current.isWord = true;
        }
        for (int index = 0; index < length; index++) {

            char letter = word.charAt(index);
            TrieNode child = current.getChild(letter);

            if (child != null) {
                current = child;
            } else {
                current.children.put(letter, new TrieNode(letter));
                current = current.getChild(letter);
            }
            if (index == length - 1) {
                current.isWord = true;
            }
        }
    }
}

... and the TrieNode class:

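import java.util.HashMap;
import java.util.Map;

public class TrieNode {

    // Initial capacity hint for the children map; static so it is not duplicated in every node.
    public static final int ALPHABET = 26;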

    public char letter;
    public boolean isWord;
    public Map<Character, TrieNode> children;

    public TrieNode(char letter) {
        this.isWord = false;
        this.letter = letter;
        children = new HashMap<Character, TrieNode>(ALPHABET);
    }

    public TrieNode getChild(char letter) {

        if (children != null) {
            if (children.containsKey(letter)) {
                return children.get(letter); 
            }
        }
        return null;
    }
}

Now, I've tried to implement the search as Murilo Vasconcelos has it, but something is off and I need some help debugging this. Please give suggestions on how to refactor this and/or point out where the bugs are. The very first thing I'd like to refactor is the "minCost" global variable, but that's the smallest of things. Anyway, here's the code...

public void search(String word) {

    int size = word.length();
    int[] currentRow = new int[size + 1];

    for (int i = 0; i <= size; i++) {
        currentRow[i] = i;
    }
    for (int i = 0; i < size; i++) {
        char c = word.charAt(i);
        if (theTrie.root.children.containsKey(c)) {
            searchRec(theTrie.root.children.get(c), c, word, currentRow);
        }
    }
}

private void searchRec(TrieNode node, char letter, String word, int[] previousRow) {

    int size = previousRow.length;
    int[] currentRow = new int[size];
    currentRow[0] = previousRow[0] + 1;

    int replace, insertCost, deleteCost;

    for (int i = 1; i < size; i++) {

        char c = word.charAt(i - 1);

        insertCost = currentRow[i - 1] + 1;
        deleteCost = previousRow[i] + 1;
        replace = (c == letter) ? previousRow[i - 1] : (previousRow[i - 1] + 1);

        currentRow[i] = minimum(insertCost, deleteCost, replace);
    }

    if (currentRow[size - 1] < minCost && !node.isWord) {
        minCost = currentRow[size - 1];
    }
    Integer minElement = minElement(currentRow);
    if (minElement < minCost) {

        for (Map.Entry<Character, TrieNode> entry : node.children.entrySet()) {
            searchRec(node, entry.getKey(), word, currentRow);
        }
    }
}

I apologize for the lack of comments. So what am I doing wrong?

I've been reading an article, Fast and Easy Levenshtein distance using a Trie, in hopes of figuring out an efficient way to compute the Levenshtein Distance between two Strings. My main goal with this is, given a large set of words, to be able to find the minimal Levenshtein Distance between each input word and this set of words.

In my trivial implementation, I compute the Levenshtein Distance between an input word and the set of words, for each input word, and return the minimum. It works, but it is not efficient...
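
A brute-force baseline of this kind could look roughly like the sketch below. It is not the code from the original project: the class name `BruteForceMinLevenshtein` and both method names are hypothetical, and the distance routine is the standard two-row dynamic program. It runs one full distance computation per dictionary word, which is the cost the Trie approach tries to avoid by sharing work across common prefixes.

```java
import java.util.Arrays;
import java.util.List;

final class BruteForceMinLevenshtein {

    // Standard two-row dynamic program for the Levenshtein distance between a and b.
    static int distance(String a, String b) {
        int[] previous = new int[b.length() + 1];
        int[] current = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            current[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                current[j] = Math.min(
                        Math.min(current[j - 1] + 1, previous[j] + 1),
                        previous[j - 1] + cost);
            }
            int[] tmp = previous; // swap rows so `previous` holds the newest row
            previous = current;
            current = tmp;
        }
        return previous[b.length()];
    }

    // Scans the whole dictionary; returns Integer.MAX_VALUE for an empty dictionary.
    static int minDistance(String input, List<String> dictionary) {
        int best = Integer.MAX_VALUE;
        for (String candidate : dictionary) {
            best = Math.min(best, distance(input, candidate));
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> dictionary = Arrays.asList("cat", "cart", "dog");
        System.out.println(minDistance("cot", dictionary)); // 1
    }
}
```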

I've been looking for implementations of a Trie, in Java, and I've come across two seemingly good sources:

  • Koders.com version (http://www.koders.com/java/fid0F06E53F2CFCC6E591C38752F355A7178F92FFE5.aspx?s=trie#L11)
  • code.google.com version

However, these implementations seem too complicated for what I'm trying to do. As I've been reading through them to understand how they work and how Trie data structures work in general, I've only become more confused.

So how would I implement a simple Trie data structure in Java? My intuition tells me that each TrieNode should store the String it represents and also references to letters of the alphabet, not necessarily all letters. Is my intuition correct?

Once that is implemented, the next task is to compute the Levenshtein Distance. I read through the Python code example in the article above, but I don't speak Python, and my Java implementation runs out of Heap memory once I hit the recursive searching. So how would I compute the Levenshtein Distance using the Trie data structure? I have a trivial implementation, modeled after this source code, but it doesn't use a Trie... it is inefficient.

It would be really nice to see some code in addition to your comments and suggestions. After all, this is a learning process for me... I've never implemented a Trie... so I have plenty to learn from this experience.

Thanks.

P.S. I can provide any source code if need be. Also, I've already read through and tried using a BK-Tree as suggested in Nick Johnson's blog, but it's not as efficient as I think it can be... or maybe my implementation is wrong.
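
For comparison, a BK-Tree lookup for the minimum distance could be sketched as follows. This is an illustrative assumption, not the implementation from Nick Johnson's blog or from the original project: the names `BkTreeSketch`, `Node`, `insert`, and `minDistance` are hypothetical. Each node keeps its children keyed by their distance to that node, and the triangle inequality decides which subtrees can still contain a closer word.

```java
import java.util.HashMap;
import java.util.Map;

final class BkTreeSketch {

    private static final class Node {
        final String word;
        final Map<Integer, Node> children = new HashMap<>(); // key = distance to this node's word
        Node(String word) { this.word = word; }
    }

    private Node root;

    void insert(String word) {
        if (root == null) {
            root = new Node(word);
            return;
        }
        Node current = root;
        while (true) {
            int d = distance(word, current.word);
            if (d == 0) {
                return; // already stored
            }
            Node child = current.children.get(d);
            if (child == null) {
                current.children.put(d, new Node(word));
                return;
            }
            current = child;
        }
    }

    // Minimum distance from `query` to any stored word (Integer.MAX_VALUE if empty).
    int minDistance(String query) {
        return root == null ? Integer.MAX_VALUE : search(root, query, Integer.MAX_VALUE);
    }

    private int search(Node node, String query, int best) {
        int d = distance(query, node.word);
        best = Math.min(best, d);
        for (Map.Entry<Integer, Node> e : node.children.entrySet()) {
            // A word closer than `best` can only sit under a key k with |k - d| < best.
            if (Math.abs(e.getKey() - d) < best) {
                best = search(e.getValue(), query, best);
            }
        }
        return best;
    }

    // Standard two-row Levenshtein distance.
    static int distance(String a, String b) {
        int[] previous = new int[b.length() + 1];
        int[] current = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            current[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                current[j] = Math.min(Math.min(current[j - 1] + 1, previous[j] + 1), previous[j - 1] + cost);
            }
            int[] tmp = previous;
            previous = current;
            current = tmp;
        }
        return previous[b.length()];
    }
}
```

Unlike the Trie traversal, every visited node here costs a full distance computation, so how much work is saved depends on how many subtrees the `|k - d| < best` test rules out.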

Recommended answer

I've implemented the algorithm described in the article "Fast and Easy Levenshtein distance using a Trie" in C++, and it is really fast. If you want (and if you understand C++ better than Python), I can paste the code somewhere.

EDIT: I posted it on my blog: http://murilo.wordpress.com/2011/02/01/fast-and-easy-levenshtein-distance-using-a-trie-in-c/
