Improving search results using Levenshtein distance in Java

Question

I have the following working Java code that searches for a word in a list of words, and it works as expected:

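import java.util.HashSet;
import java.util.Set;

public class Levenshtein {
    private int[][] wordMartix;

    // Searcher.wordList is defined elsewhere in the project
    public Set<String> similarExists(String searchWord) {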

        int maxDistance = searchWord.length();
        int curDistance;
        int sumCurMax;
        String checkWord;

        // a Set prevents duplicate words in the returned list
        Set<String> fuzzyWordList = new HashSet<>();

        for (Object wordList : Searcher.wordList) {
            checkWord = String.valueOf(wordList);
            curDistance = calculateDistance(searchWord, checkWord);
            sumCurMax = maxDistance + curDistance;
            if (sumCurMax == checkWord.length()) {
                fuzzyWordList.add(checkWord);
            }
        }
        return fuzzyWordList;
    }

    public int calculateDistance(String inputWord, String checkWord) {
        wordMartix = new int[inputWord.length() + 1][checkWord.length() + 1];

        for (int i = 0; i <= inputWord.length(); i++) {
            wordMartix[i][0] = i;
        }

        for (int j = 0; j <= checkWord.length(); j++) {
            wordMartix[0][j] = j;
        }

        for (int i = 1; i < wordMartix.length; i++) {
            for (int j = 1; j < wordMartix[i].length; j++) {
                if (inputWord.charAt(i - 1) == checkWord.charAt(j - 1)) {
                    wordMartix[i][j] = wordMartix[i - 1][j - 1];
                } else {
                    int minimum = Integer.MAX_VALUE;
                    if ((wordMartix[i - 1][j]) + 1 < minimum) {
                        minimum = (wordMartix[i - 1][j]) + 1;
                    }

                    if ((wordMartix[i][j - 1]) + 1 < minimum) {
                        minimum = (wordMartix[i][j - 1]) + 1;
                    }

                    if ((wordMartix[i - 1][j - 1]) + 1 < minimum) {
                        minimum = (wordMartix[i - 1][j - 1]) + 1;
                    }

                    wordMartix[i][j] = minimum;
                }
            }
        }

        return wordMartix[inputWord.length()][checkWord.length()];
    }

}

Right now, when I search for a word like job, it returns a list:

Output

joborienterede
jobannoncer
jobfunktioner
perjacobsen
jakobsen
jobprofiler
jacob
jobtitler
jobbet
jobdatabaserne
jobfunktion
jakob
jobs
studenterjobber
johannesburg
jobmuligheder
jobannoncerne
jobbaser
job
joberfaringer

As you can see, the output contains a lot of related words but also unrelated ones such as jakob and jacob. That is correct as far as the Levenshtein formula goes, but I would like to build on it and write a method that fine-tunes my search so that I get more relevant and related words.
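One way to see why jacob and jakob pass the filter: the condition maxDistance + curDistance == checkWord.length() holds exactly when the distance equals the length difference between the two words. That happens only when the search word can be turned into the candidate by insertions alone, that is, when its letters occur in the candidate in the same order (j..o..b). The following is a minimal sketch of that equivalent test; the class and method names are hypothetical and not part of the code above.

```java
public class SubsequenceCheck {

    // Returns true if every character of searchWord occurs in checkWord in the same order,
    // not necessarily adjacent. Hypothetical helper, shown for illustration only.
    public static boolean isSubsequence(String searchWord, String checkWord) {
        int matched = 0;
        for (int i = 0; i < checkWord.length() && matched < searchWord.length(); i++) {
            if (checkWord.charAt(i) == searchWord.charAt(matched)) {
                matched++;
            }
        }
        return matched == searchWord.length();
    }

    public static void main(String[] args) {
        System.out.println(isSubsequence("job", "jacob"));    // true: j, o, b appear in order
        System.out.println(isSubsequence("job", "jobbaser")); // true
        System.out.println(isSubsequence("job", "jab"));      // false: no 'o'
    }
}
```

A contiguous test such as checkWord.contains(searchWord) would be stricter than this: it accepts jobbaser but rejects jacob and jakob.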

I have worked on it for a few hours and have run out of ideas.

My question: Is it possible to fine-tune the existing method so that it returns relevant/related words, or should I take another approach? Either way, I would appreciate input and inspiration on improving the search results.

Update

I asked this question a long time ago and never really found a solution. I am coming back to it because I now need a useful answer. Java code samples are welcome, but what matters most is a detailed answer that describes the available methods and approaches for ranking the best and most relevant search results and ignoring non-relevant words. I know this is an open and endless area, but I need some inspiration to get started.


Note: The oldest answer at the moment is based on one of the comments and is not helpful. It just sorts by distance, which does not mean better search results or quality.

So I sorted by distance, and the results were as follows:

job
jobs
jacob
jakob
jobbet
jakobsen
jobbaser
jobtitler
jobannoncer
jobfunktion
jobprofiler
perjacobsen
johannesburg
jobannoncerne
joberfaringer
jobfunktioner
jobmuligheder
jobdatabaserne
joborienterede
studenterjobber

So the word jobbaser is relevant and jacob/jakob are not, but the distance for jobbaser is larger than for jacob/jakob. So that did not really help.
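For reference, distance sorting of this kind can be sketched as below. This is an illustration, not the code that produced the list above: the class and method names are hypothetical, and the alphabetical tie-breaker is an assumption that happens to be consistent with the order shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class DistanceSort {

    // Plain character-level Levenshtein distance, keeping only two rows of the matrix.
    public static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            // the row just filled becomes the previous row for the next iteration
            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }
        return prev[b.length()];
    }

    // Returns a copy of candidates sorted by distance to searchWord, ties broken alphabetically.
    public static List<String> sortByDistance(String searchWord, List<String> candidates) {
        List<String> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator
                .comparingInt((String w) -> distance(searchWord, w))
                .thenComparing(Comparator.naturalOrder()));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("jobbaser", "jakob", "jobs", "jacob", "job");
        System.out.println(sortByDistance("job", words)); // [job, jobs, jacob, jakob, jobbaser]
    }
}
```

The short sample shows the same problem: jacob and jakob (distance 2) rank ahead of jobbaser (distance 5), because character distance to a three-letter query mostly measures how long the candidate is.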

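General feedback on the answers

  • @SergioMontoro: it almost solves the problem.

  • @uSeemSurprised: it solves the problem but needs continual manipulation.

  • @Gene: the concept is excellent, but it relies on external URLs.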

Thanks
I would like to personally thank all of you who contributed to this question. I have received nice answers and useful comments.

Special thanks for the answers from @SergioMontoro, @uSeemSurprised and @Gene; they are different but valid and useful answers.

@D.Kovács pointed to an interesting solution.

I wish I could give the bounty to all of those answers. Choosing one answer and giving it the bounty does not mean the other answers are invalid; it only means that the particular answer I chose was useful for me.

Recommended answer

Without understanding the meaning of the words, as @DrYap suggests, the next logical unit for comparing two words (if you are not looking for misspellings) is the syllable. It is very easy to modify Levenshtein to compare syllables instead of characters. The hard part is breaking the words into syllables. There is a Java implementation, TeXHyphenator-J, which can be used to split the words. Based on this hyphenation library, here is a modified version of the Levenshtein function written by Michael Gilleland & Chas Emerick. More about syllable detection here and here. Of course, you'll want to avoid syllable comparison of two single-syllable words, probably by handling that case with standard Levenshtein.

import net.davidashen.text.Hyphenator;

public class WordDistance {

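    public static void main(String[] args) throws Exception {
        Hyphenator h = new Hyphenator();
        // hyphen.tex holds the TeX hyphenation patterns and is loaded as a resource next to this class
        h.loadTable(WordDistance.class.getResourceAsStream("hyphen.tex"));
        // print the result so that running the class shows the distance
        System.out.println(getSyllableLevenshteinDistance(h, args[0], args[1]));
    }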

    /**
     * Calculates the syllable Levenshtein distance between two words.
     * <p>
     * The syllable Levenshtein distance is defined as the minimal number of
     * case-insensitive syllables you have to replace, insert or delete to transform s into t.
     * </p>
     *
     * @param h hyphenator used to split both words into syllables
     * @param s first word, must not be null
     * @param t second word, must not be null
     * @return the syllable Levenshtein distance
     * @throws IllegalArgumentException if either s or t is <b>null</b>
     */
    public static int getSyllableLevenshteinDistance(Hyphenator h, String s, String t) {
        if (s == null || t == null)
            throw new IllegalArgumentException("Strings must not be null");

        // char 173 is the soft hyphen (U+00AD), used here as the syllable separator
        final String hyphen = Character.toString((char) 173);
        final String[] ss = h.hyphenate(s).split(hyphen);
        final String[] st = h.hyphenate(t).split(hyphen);

        final int n = ss.length;
        final int m = st.length;

        if (n == 0)
            return m;
        else if (m == 0)
            return n;

        int p[] = new int[n + 1]; // 'previous' cost array, horizontally
        int d[] = new int[n + 1]; // cost array, horizontally

        for (int i = 0; i <= n; i++)
            p[i] = i;

        for (int j = 1; j <= m; j++) {
            d[0] = j;
            for (int i = 1; i <= n; i++) {
                int cost = ss[i - 1].equalsIgnoreCase(st[j - 1]) ? 0 : 1;
                // minimum of cell to the left+1, to the top+1, diagonally left and up +cost
                d[i] = Math.min(Math.min(d[i - 1] + 1, p[i] + 1), p[i - 1] + cost);
            }
            // copy current distance counts to 'previous row' distance counts
            int[] _d = p;
            p = d;
            d = _d;
        }

        // our last action in the above loop was to switch d and p, so p now actually has the most recent cost counts
        return p[n];
    }

}
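The distance computation above does not itself depend on the hyphenation library; only the splitting does. To try the idea without TeXHyphenator-J, here is a minimal sketch that takes syllable arrays directly. The class name, method name and the hand-made syllable splits are assumptions for illustration; a real hyphenator may split these Danish words differently.

```java
public class SyllableDistanceSketch {

    // Levenshtein distance where the unit of comparison is a whole syllable, not a character.
    public static int syllableDistance(String[] a, String[] b) {
        int[] prev = new int[b.length + 1];
        int[] cur = new int[b.length + 1];
        for (int j = 0; j <= b.length; j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length; i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length; j++) {
                // two syllables match only if they are equal as whole strings, ignoring case
                int cost = a[i - 1].equalsIgnoreCase(b[j - 1]) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            // the row just filled becomes the previous row for the next iteration
            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }
        return prev[b.length];
    }

    public static void main(String[] args) {
        // Hand-split syllables, for illustration only
        String[] jobannoncer = {"job", "an", "non", "cer"};
        String[] jobannoncerne = {"job", "an", "non", "cer", "ne"};
        String[] jakobsen = {"ja", "kob", "sen"};
        System.out.println(syllableDistance(jobannoncer, jobannoncerne)); // 1: one syllable appended
        System.out.println(syllableDistance(jobannoncer, jakobsen));      // 4: no syllable in common
    }
}
```

With these splits, words that share the syllable job stay close to each other, while jakobsen shares no syllable with them even though it shares several characters.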
