高效的算法来找到所有与QUOT;字符等于"字符串? [英] Efficient algorithm to find all "character-equal" strings?

查看:109
本文介绍了高效的算法来找到所有与QUOT;字符等于"字符串?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我们如何可以写一个高效的功能,它输出同形字等效 的输入字符串?

How can we write an efficient function that outputs "homoglyph equivalents" of an input string?

示例1 (伪code):

homoglyphs_list = [
                     ["o", "0"], // "o" and "0" are homoglyphs
                     ["i", "l", "1"] // "i" and "l" and "1" are homoglyphs
                  ]

input_string    = "someinput"
output          = [
                   "someinput", "s0meinput", "somelnput",
                   "s0melnput", "some1nput", "s0me1nput"
                  ]

例2

homoglyphs_list = [
                     ["rn", "m", "nn"],
                  ]

input_string    = "rnn"
output          = ["rnn", "rm", "mn", "rrn", "nnn", "nm", "nrn"]

示例3

homoglyphs_list = [
                     ["d", "ci", "a"], // "o" and "0" are homoglyphs
                     ["i", "l", "1"] // "i" and "l" and "1" are homoglyphs
                  ]
                  /*
                     notice that with the two rules above,
                     we can infer "d" = "ci" = "a" = "cl" = "c1"
                  */

input_string    = "pacerier"
output          = [
                   "pacerier",  "pacerler",  "pacer1er",  "pcicerier",
                   "pcicerler", "pcicer1er", "pclcerier", "pc1cerier",
                   "pclcerler", "pc1cerler", "pclcer1er", "pc1cer1er",
                   "pdcerier",  "pdcerler",  "pdcer1er"
                  ]

注:输出数组中成员的顺序并不重要,我们可以假设给定的同形字映射假定的正确的(投入不会给我们一个无限循环)。

Note: The order of the members in the output array isn't important, and we can assume that the given homoglyph mappings are assumed to be proper (inputs wouldn't give us an "infinite loop").

我现在的算法工作,但它的使用原始的暴力破解和性能是可怕的。例如。的输入MMMMM与homoglyphs [RN,M,NN] 需要38秒运行:

My current algorithm works, but it's using raw bruteforcing and performance is awful. E.g. an input of "mmmmm" with homoglyphs ["rn", "m", "nn"] takes 38 seconds to run:

// This is php code (non-pseudo just so we could test the running time), 
// but the question remains language-agnostic

public function Func($in, Array $mappings){
    $out_table = array();
    $out_table[$in] = null;
    while(true){
        $number_of_entries_so_far = count($out_table);
        foreach(array_keys($out_table) as $key){
            foreach($mappings as $mapping){
                foreach($mapping as $value){
                    for($a=0, $aa=strlen($key); $a<$aa; ++$a){
                        $pos = strpos($key, $value, $a);
                        if($pos === false){
                            continue;
                        }
                        foreach($mapping as $equal_value){
                            if($value === $equal_value){
                                continue;
                            }
                            $out_table[substr($key, 0, $pos) . $equal_value . substr($key, $pos + strlen($value))] = null;
                        }
                    }

                }
            }
        }
        if($number_of_entries_so_far === count($out_table)){
            // it means we have tried bruteforcing but added no additional entries,
            // so we can quit now
            break;
        }
    }
    return array_keys($out_table);
}

我们怎样才能实现有效的(快)同形字扩展算法?

How can we implement an efficient (fast) homoglyph expansion algorithm?

推荐答案

您应该获得一些表现出来的递归实现,我还没有把太多,虽然想进入实际性能。这将继续从循环虽然输出字符串多次和计数输出每一次迭代。此外,我已经添加了一点缓存,以prevent通过计算同同形字的两倍,为简单起见,它不检查针对的映射,但你可以实现它,或者干脆清除缓存每次调用其中,映射关系发生变化前, (但这是丑陋的,容易引入错误)。

You should gain some performance out of a recursive implementation, I haven't put much thought into actual performance though. This would keep from looping though the output string multiple times and counting the output every iteration. Also I've added a little "cache" to prevent from calculating the same homoglyph twice, for simplicity it does not check against the mappings, but you could implement it to, or simply clear the cache before every call where mappings change(but that's ugly and easily introduces bugs).

code是这样的:

cache = {}
def homoglyph(input_string, mappings):
    output = []
    character = input_string[0]
    if input_string in cache:
        return cache[input_string]
    elif not input_string:
        return []
    for charset in mappings:
        if character in charset:
            temp = input_string
            sub_homoglyphs = homoglyph(input_string[1:], mappings)
            for char in charset:
                temp[0] = char
                output.append(temp)
                #adding the homoglyph character to the rest of the string
                for subhomoglyph in sub_homoglyphs:
                    output.append(char+subhomoglyph)
    cache[input_string] = output
    return output

(这是使用Python编写因为我不精通PHP语法,我还没有实际运行该所以可能有错误,但逻辑的要点是存在的)

(This is written with python as I'm not well versed in php syntax, I haven't actually run this so there may be errors but the gist of the logic is there)

这篇关于高效的算法来找到所有与QUOT;字符等于&QUOT;字符串?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
相关文章
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆