改进 Boyer-Moore 字符串搜索 [英] improving Boyer-Moore string search

查看:46
本文介绍了改进 Boyer-Moore 字符串搜索的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我一直在玩 Boyer-Moore sting 搜索算法,并从 Shriphani Palakodety 的基本代码集开始,我创建了 2 个附加版本(v2 和 v3)——每个版本都做了一些修改,例如从循环而不是重构 while/if 条件.从 v1 到 v2,我看到了大约 10%-15% 的改进,从 v1 到 v3 有 25%-30% 的改进(显着).

I've been playing around with the Boyer-Moore sting search algorithm and starting with a base code set from Shriphani Palakodety I created 2 additional versions (v2 and v3) - each making some modifications such as removing len() function from the loop and than refactoring the while/if conditions. From v1 to v2 I see about a 10%-15% improvement and from v1 to v3 a 25%-30% improvement (significant).

我的问题是:有没有人有任何额外的 mod 可以进一步提高性能(如果你可以作为 v4 提交) - 保持基本的算法"真实于 Boyer-Moore.

My question is: does anyone have any additional mods that would improve performance even more (if you can submit as a v4) - keeping the base 'algorithm' true to Boyer-Moore.

#!/usr/bin/env python
import time

bcs = {} #the table

def goodSuffixShift(key):
    for i in range(len(key)-1, -1, -1):
        if key[i] not in bcs.keys():
            bcs[key[i]] = len(key)-i-1


#---------------------- v1 ----------------------
def searchv1(text, key):
    """base from Shriphani Palakodety fixed for single char"""
    i = len(key)-1
    index = len(key) -1
    j = i

    while True:
        if i < 0:
            return j + 1
        elif j > len(text):
            return "not found"
        elif text[j] != key[i] and text[j] not in bcs.keys():
            j += len(key)
            i = index
        elif text[j] != key[i] and text[j] in bcs.keys():
            j += bcs[text[j]]
            i = index
        else:
            j -= 1
            i -= 1

#---------------------- v2 ----------------------
def searchv2(text, key):
    """removed string len functions from loop"""
    len_text = len(text)
    len_key = len(key)
    i = len_key-1
    index = len_key -1
    j = i

    while True:
        if i < 0:
            return j + 1
        elif j > len_text:
            return "not found"
        elif text[j] != key[i] and text[j] not in bcs.keys():
            j += len_key
            i = index
        elif text[j] != key[i] and text[j] in bcs.keys():
            j += bcs[text[j]]
            i = index
        else:
            j -= 1
            i -= 1

#---------------------- v3 ----------------------
def searchv3(text, key):
    """from v2 plus modified 3rd if condition 
    breaking down the comparison for efficiency,
    modified the while loop to include the first
    if condition (opposite of it)
    """
    len_text = len(text)
    len_key = len(key)
    i = len_key-1
    index = len_key -1
    j = i

    while i >= 0 and j <= len_text:
        if text[j] != key[i]:
            if text[j] not in bcs.keys():
                j += len_key
                i = index
            else:
                j += bcs[text[j]]
                i = index
        else:
            j -= 1
            i -= 1

    if j > len_text:
        return "not found"
    else:
        return j + 1


key_list = ["POWER", "HOUSE", "COMP", "SCIENCE", "SHRIPHANI", "BRUAH", "A", "H"]

text = "SHRIPHANI IS A COMPUTER SCIENCE POWERHOUSE"

t1 = time.clock()
for key in key_list:
    goodSuffixShift(key)
    #print searchv1(text, key)
    searchv1(text, key)
    bcs = {}

t2 = time.clock()
print('v1 took %0.5f ms' % ((t2-t1)*1000.0))

t1 = time.clock()
for key in key_list:
    goodSuffixShift(key)
    #print searchv2(text, key)
    searchv2(text, key)
    bcs = {}

t2 = time.clock()
print('v2 took %0.5f ms' % ((t2-t1)*1000.0))

t1 = time.clock()
for key in key_list:
    goodSuffixShift(key)
    #print searchv3(text, key)
    searchv3(text, key)
    bcs = {}

t2 = time.clock()
print('v3 took %0.5f ms' % ((t2-t1)*1000.0))

推荐答案

使用in bcs.keys()"是创建一个列表,然后对该列表进行 O(N) 搜索——只需使用in bcs".

Using "in bcs.keys()" is creating a list and then doing an O(N) search of the list -- just use "in bcs".

在搜索功能中做 goodSuffixShift(key) 的事情.两个好处:调用者只有一个 API 可以使用,并且可以避免将 bcs 作为全局(可怕的 ** 2).

Do the goodSuffixShift(key) thing inside the search function. Two benefits: the caller has only one API to use, and you avoid having bcs as a global (horrid ** 2).

你的缩进有几个地方不正确.

Your indentation is incorrect in several places.

更新

这不是 Boyer-Moore 算法(它使用两个查找表).看起来更像 Boyer-Moore-Horspool 算法,它只使用第一个 BM 表.

This is not the Boyer-Moore algorithm (which uses TWO lookup tables). It looks more like the Boyer-Moore-Horspool algorithm, which uses only the first BM table.

一个可能的加速:在设置 bcs dict 后添加行 'bcsget = bcs.get'.然后替换:

A probable speedup: add the line 'bcsget = bcs.get' after setting up the bcs dict. Then replace:

if text[j] != key[i]:
    if text[j] not in bcs.keys():
        j += len_key
        i = index
    else:
        j += bcs[text[j]]
        i = index

与:

if text[j] != key[i]:
    j += bcsget(text[j], len_key)
    i = index

更新 2 -- 回归基础,比如在优化之前先让代码正确

版本 1 有一些错误,您已将这些错误带入版本 2 和 3.一些建议:

Version 1 has some bugs which you have carried forward into versions 2 and 3. Some suggestions:

将未找到的响应从未找到"更改为 -1.这使其与 text.find(key) 兼容,您可以使用它来检查结果.

Change the not-found response from "not found" to -1. This makes it compatible with text.find(key), which you can use to check your results.

获取更多文本值,例如"R" * 20、"X" * 20 和 "XXXSCIENCEYYY" 用于您现有的键值.

Get some more text values e.g. "R" * 20, "X" * 20, and "XXXSCIENCEYYY" for use with your existing key values.

加强测试工具,如下所示:

Lash up a test harness, something like this:

func_list = [searchv1, searchv2, searchv3]
def test():
    for text in text_list:    
        print '==== text is', repr(text)
        for func in func_list:
             for key in key_list:
                try:
                    result = func(text, key)
                except Exception, e:
                    print "EXCEPTION: %r expected:%d func:%s key:%r" % (e, expected, func.__name__, key)
                    continue
                expected = text.find(key)
                if result != expected:
                    print "ERROR actual:%d expected:%d func:%s key:%r" % (result, expected, func.__name__, key)

运行它,修复 v1 中的错误,继续进行这些修复,再次运行测试,直到它们都正常为止.然后你可以沿着同样的路线整理你的时序线束,看看性能如何.然后你可以在这里报告,我会告诉你我对 searchv4 函数应该是什么样子的想法;-)

Run that, fix the errors in v1, carry those fixes forward, run the tests again until they're all OK. Then you can tidy up your timing harness along the same lines, and see what the performance is. Then you can report back here, and I'll give you my idea of what a searchv4 function should look like ;-)

这篇关于改进 Boyer-Moore 字符串搜索的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆