算法的线性模式匹配? [英] Algorithm for linear pattern matching?

查看:137
本文介绍了算法的线性模式匹配?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我的零和一的线性表,我需要匹配多个简单的模式,找到的第一次出现。例如,我可能需要找到 0001101101 01010100100 10100100010 内长800万的名单。我只需要找到任一的第一次出现,然后返回在其发生的索引。但是,这样的循环,并在大名单访问可能是昂贵的,而我宁愿不这样做太多次了。

I have a linear list of zeros and ones and I need to match multiple simple patterns and find the first occurrence. For example, I might need to find 0001101101, 01010100100, OR 10100100010 within a list of length 8 million. I only need to find the first occurrence of either, and then return the index at which it occurs. However, doing the looping and accesses over the large list can be expensive, and I'd rather not do it too many times.

有没有更快的方法比做

foreach (patterns) {
    for (i=0; i < listLength; i++)
        for(t=0; t < patternlength; t++)
            if( list[i+t] != pattern[t] ) {
                 break;
            }
            if( t == patternlength - 1 ) {
                 return i;  // pattern found!
            }
        }
    }
}

编辑:顺便说一句,我照着上面的伪code实现这一计划,并表现还算可以,但没有壮观。我估计,我处理约600位在我处理器的单核一秒钟。我使用这个图像处理,并且这将要经过几千个800万像素的图像,所以每一点帮助。

BTW, I have implemented this program according to the above pseudocode, and performance is OK, but nothing spectacular. I'm estimating that I process about 6 million bits a second on a single core of my processor. I'm using this for image processing, and it's going to have to go through a few thousand 8 megapixel images, so every little bit helps.

编辑::如果现在还不清楚,我正与一个位数组,所以只有两种可能:1和0。而且它是用C ++。

If it's not clear, I'm working with a bit array, so there's only two possibilities: ONE and ZERO. And it's in C++.

编辑:谢谢你的指针BM和KMP算法。我注意到,在维基百科页面BM上,它说

Thanks for the pointers to BM and KMP algorithms. I noted that, on the Wikipedia page for BM, it says

该算法preprocesses目标   字符串(键),它被搜索   ,但没有被搜索的字符串   在(不像有些算法   要搜索preprocess字符串   然后,可以分摊为代价   在preprocessing通过搜索   反复)。

The algorithm preprocesses the target string (key) that is being searched for, but not the string being searched in (unlike some algorithms that preprocess the string to be searched and can then amortize the expense of the preprocessing by searching repeatedly).

这看起来很有趣,但它并没有给出这样的算法任何的例子。请问这样的事情也有帮助?

That looks interesting, but it didn't give any examples of such algorithms. Would something like that also help?

推荐答案

关键的谷歌搜索是多模式的字符串匹配。

The key for Googling is "multi-pattern" string matching.

早在1975年,阿霍和Corasick 发表的(线性时间)算法,这是用在 fgrep一样的原始版本。该算法随后得到了细化,许多研究人员。例如, Commentz·瓦尔特(1979年)结合阿霍和放大器; Corasick用的博耶和放大器;摩尔匹配。 巴埃萨 - 耶茨(1989)交流相结合的的博耶 - 穆尔-Horspool 的变种。 吴和曼伯(1994)做类似的工作。

Back in 1975, Aho and Corasick published a (linear-time) algorithm, which was used in the original version of fgrep. The algorithm subsequently got refined by many researchers. For example, Commentz-Walter (1979) combined Aho&Corasick with Boyer&Moore matching. Baeza-Yates (1989) combined AC with the Boyer-Moore-Horspool variant. Wu and Manber (1994) did similar work.

这是替代AC线的多模式匹配算法是拉宾和卡普的算法。

An alternative to the AC line of multi-pattern matching algorithms is Rabin and Karp's algorithm.

我建议先从阅读阿霍Corasick和拉宾,卡普维基百科的页面,然后再决定是否将是有意义,你的情况。如果是这样,也许已经是你的语言/运行时可实现。

I suggest to start with reading the Aho-Corasick and Rabin-Karp Wikipedia pages and then decide whether that would make sense in your case. If so, maybe there already is an implementation for your language/runtime available.

这篇关于算法的线性模式匹配?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆