找到最匹配的拼写错误的城市的名字呢? [英] Find closest match for misspelled city names?

查看:138
本文介绍了找到最匹配的拼写错误的城市的名字呢?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有有许多不正确的拼写为同一城市的城市名单。一个城市拼错18倍!我想打扫一下,但它的服用时间。有一些算法可能会猜测为每个这些拼写错误的人的有效城市名?加权某种形式?这些数据是在MySQL和我有正确的拼写一个表,以及进行比较的。

I have a list of cities that have numerous incorrect spelling for the same city. One city is misspelled 18 times! I am trying to clean this up but its taking hours. Is there some algorithm that might "guess" at the valid city name for each of these misspelled ones? Some form of weighting? The data is in MySQL and I do have a table of the correct spelling as well to compare against.

对此有何想法?一个PHP的例子将有助于如果可能的话。

Any ideas on this? A PHP example would help if possible.

推荐答案

您可以使用damerau - 莱文施泰因函数来获取两个字符串之间的字符串距离。 (这还检查换位)

you could use a damerau-levenstein function to get the string distance between two strings. (This also checks for transposition)

http://en.wikipedia.org/wiki/Damerau%E2% 80%93Levenshtein_distance

如果你的表是很大的,那么你可能需要优化算法中有点打破,一旦串距离超过threashold。

If your tables are big then you might need to optimise the algo a bit to break once the string distance exceeds a threashold.

此外,如果你可以假设城市的首字母输入正确,应该减少比较的数量。

Also if you can assume that the first letter of the city is typed correctly that should reduce the number of comparisons to.

它不是PHP的,但如果它的任何帮助继承人的Java版本,我写道:

Its not PHP, but if its of any help heres a java version I wrote:

class LevinshteinDistance{
    public static void main(String args[]){
        if(args.length != 2){
            System.out.println("Displays the Levenshtein distance between 2 strings");
            System.out.println("Usage: LevenshteinDistance stringA stringB");
        }else{
            int distance = getLevenshteinDistance(args[0], args[1]);
            System.out.print(getLevenshteinMatrix(args[0], args[1]));
            System.out.println("Distance: "+distance);
        }
    }   

    /**
     * @param a first string for comparison
     * @param b second string for comparison
     * @param caseSensitive whether or not to use case sensitive matching
     * @return a levenshtein string distance
     */  
    public static int getLevenshteinDistance(String a, String b, boolean caseSensitive){
        if(! caseSensitive){
        a = a.toUpperCase();
        b = b.toUpperCase();
        }
        int[][] matrix = generateLevenshteinMatrix(a, b);
        return matrix[a.length()][b.length()];
    }

    /**
     * @param a first string for comparison
     * @param b second string for comparison
     * @return a case sensitive levenshtein string distance
     */  
    public static int getLevenshteinDistance(String a, String b){
        int[][] matrix = generateLevenshteinMatrix(a, b);
        return matrix[a.length()][b.length()];
    }

    /**
     * @param a first string for comparison
     * @param b second string for comparison
     * @return a  case sensitive string representation of the search matrix
     */  
    public static String getLevenshteinMatrix(String a, String b){
        int[][] matrix = generateLevenshteinMatrix(a, b);
        StringBuilder result = new StringBuilder();
        final int ROWS = a.length()+1;
        final int COLS = b.length()+1;

        result.append(rowSeperator(COLS-1, false));
        result.append("|    "+b+" |\n");
        result.append(rowSeperator(COLS-1, true));  

        for(int r=0; r<ROWS; r++){
            result.append('|');
            if(r > 0){
                result.append(a.charAt(r-1));
            }else{
                result.append(' ');
            }
            result.append(" |");
            for(int c=0; c<COLS; c++){
                result.append(matrix[r][c]);
            }
            result.append(" |\n");
        }       
        result.append(rowSeperator(COLS-1, false));
        return result.toString();   
    }   


    private static String rowSeperator(final int LEN, boolean hasGap){
        StringBuilder result = new StringBuilder();
        if(hasGap){
            result.append("+  +-");
        }else{
            result.append("+----");
        }
        for(int i=0; i<LEN; i++) 
            result.append('-');
        result.append("-+\n");
        return result.toString();
    }

    private static int[][] generateLevenshteinMatrix(String a, String b){
        final int ROWS = a.length()+1;
        final int COLS = b.length()+1;
        int matrix[][] = new int[ROWS][COLS];

        for(int r=0; r<ROWS; r++){
            matrix[r][0]=r;
        }
        for(int c=0; c<COLS; c++){ 
            matrix[0][c]=c;
        }

        for(int r=1; r<ROWS; r++){
            char cA = a.charAt(r-1);
            for(int c=1; c<COLS; c++){
                    char cB = b.charAt(c-1);
                int cost = (cA == cB)?0:1;

                int deletion =  matrix[r-1][c]+1; 
                int insertion = matrix[r][c-1]+1;
                int substitution = matrix[r-1][c-1]+cost;
                int minimum = Math.min(Math.min(deletion, insertion), substitution);    

                if( (r > 1 && c > 1) && a.charAt(r-2) == cB && cA == b.charAt(c-2) ){
                    int transposition = matrix[r-2][c-2]+cost;
                    minimum = Math.min(minimum, transposition);
                }
                matrix[r][c] = minimum;
            }
        }   
        return matrix;
    }
}

这篇关于找到最匹配的拼写错误的城市的名字呢?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆