博耶 - 穆尔好后缀启发式 [英] Boyer-Moore good-suffix heuristics

查看:285
本文介绍了博耶 - 穆尔好后缀启发式的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我明白是怎么坏的角色启发式的工作。当你发现不匹配的信 X ,只是转移得最右侧的 X 的模式将与对齐模式 X 字符串研究。而且很容易在code来实现。

I understand how the bad character heuristics work. When you find the mismatched letter x, just shift the pattern so the rightmost x in the pattern would be aligned with the x in the string. And it's easy to implement in code.

我想我也明白是怎么好后缀启发式的工作。当我们找到一个很好的后缀取值,发现在图案相同后缀不同的位置和移动它,因此取值在该模式将与取值在字符串中对齐。但我不知道如何做,在code。我们怎样才能找到,如果同样的后缀存在于图案另一个地方?我们如何知道它的位置?在code:

I think I also understand how the good-suffix heuristics work. When we find a good suffix s, find the same suffix in different location in the pattern and shift it so the s in the pattern would be aligned with the s in the string. But I don't understand how to do that in code. How do we find if the same suffix exists in another place in pattern? And how do we know its position? The code:

void bmPreprocess1()
{
    int i=m, j=m+1;
    f[i]=j;
    while (i>0)
    {
        while (j<=m && p[i-1]!=p[j-1])
        {
            if (s[j]==0) s[j]=j-i;
            j=f[j];
        }
        i--; j--;
        f[i]=j;
    }
}

从<一个href="http://www.iti.fh-flensburg.de/lang/algorithmen/pattern/bmen.htm">http://www.iti.fh-flensburg.de/lang/algorithmen/pattern/bmen.htm没有任何意义,我......可能有人写的尽可能的简单伪code完成这个任务?或解释不知何故?

from http://www.iti.fh-flensburg.de/lang/algorithmen/pattern/bmen.htm doesn't make sense to me... Could someone write as simple as possible pseudo code for this task? Or explain somehow?

推荐答案

首先,我将使用 P [I] 表示模式中的角色,<$ C $ ç> M 模式lenght, $ 在模式的最后一个字符,即 $ = P [M-1 ]

First, I will use p[i] denote a character in the pattern, m the pattern lenght, $ the last character in the pattern, i.e., $ = p[m-1].

有两种情况好后缀启发式情况1。

There are two scenarios for good suffix heuristics case 1.

情况1

考虑下面的例子,

    leading TEXT cXXXbXXXcXXXcXXX rest of the TEXT
         cXXXbXXXcXXXcXXX
                     ^ 
                     | mismatch here

因此​​,在该模式的子字符串 XXX cXXXbXXXcXXXcXXX 是不错的后缀。不匹配发生在字符 C 。这样的不匹配后,我们应该转移4到右侧或8?

So the sub string XXX in the pattern cXXXbXXXcXXXcXXX is the good suffix. The mismatch occurs at character c. So after the mismatch, should we shift 4 to the right or 8?

如果我们转移4如以下,那么同样mismath将再次出现( B mismathes C

If we shift 4 as in the following, then the same mismath will occur again (b mismathes c),

    leading TEXT cXXXbXXXcXXXcXXX rest of the TEXT
             cXXXbXXXcXXXcXXX
                     ^ 
                     | mismatch occurs here again

所以,我们实际上可以切换8个字符的权利在这种情况下。

So we can actually shift 8 characters to the right in this situation.

情况2

让我们看看另一个例子

    leading TEXT cXXXcXXXbXXXcXXX rest of the TEXT
            cXXXXcXXXbXXXcXXX
                         ^ 
                         | mismatch happens here

我们可以按住Shift 4个或8个或更多吗?很显然,我们如果我们转向8个或更多,我们就会错失良机,以配合文字图案。所以我们只能转移4个字符向右在这种情况下

Can we shift 4 or 8 or more here? Obviously we if we shift 8 or more, we will miss the opportunity to match the pattern with the text. So we can only shift 4 characters to the right in this situation.

那么,什么是这两种情况的区别?

So what is the difference between these two situations?

不同的是,在第一种情况下,不匹配的字符 C 加上良好的后缀 XXX ,这是 CXXX ,是一样的下一场比赛(从右边数)的 XXX 加上之前的字符。而在第二种情况下, CXXX 是不一样的下一场比赛(从右边数),加上那场比赛之前的字符。

The difference is that in the first case, the mismatched character c plus the good suffix XXX, which is cXXX, is the same as the next (count from the right) match for XXX plus the character before that. While in the second situation, cXXX is not the same as the next match (count from the right) plus the character before that match.

因此​​,对于任何给定的良好后缀(暂且称之为 XXX ),我们需要弄清楚的转变这两种情况,(1)字符(我们称之为它 C )的完美后缀加上良好的前后缀,在模式也匹配面前有好的后缀的右边)匹配+字符的下一个(计数, (2)性格加上完美后缀不匹配

So for any given GOOD SUFFIX (let us call it XXX) we need to figure out the shift in these two situations, (1) the character (let us call it c) before the GOOD SUFFIX plus the GOOD SUFFIX, in the pattern is also match the next (count from the right) match of the good suffix + the character before it , (2) the character plus the GOOD SUFFIX does not match

有关情况(1),例如,如果你有一个模式, 0XXXcXXXcXXXcXXXcXXXcXXX ,如果 c第一次测试后失败,你可以切换20个字符的权利,并调整 0XXX 与被测试的文本。

For situation (1), for example, if you have a pattern, 0XXXcXXXcXXXcXXXcXXXcXXX, if after the first test of c fails, you can shift 20 characters to the right, and align 0XXX with the text that been tested.

这是怎样的20号计算,

This is how the number 20 is calculated,

              1         2
    012345678901234567890123
    0XXXcXXXcXXXcXXXcXXXcXXX
    ^                   ^

不匹配出现的位置是20,移位的子串 0XXX 将位置从20到23和 0XXX 开始位置0,所以20 - 0 = 20

The position the mismatch occurs is 20, the shifted sub string 0XXX will take position from 20 to 23. And 0XXX starts with position 0, so 20 - 0 = 20.

有关情况(2),就像在这个例子中, 0XXXaXXXbXXXcXXX ,如果 C 第一次测试失败之后,您可以移动只有4个字符的权利,并调整 bXXX 与被测试的文本。

For situation (2), like in this example, 0XXXaXXXbXXXcXXX, if after the first test of c fails, you can shift only 4 characters to the right, and align bXXX with the text that been tested.

这是怎么数 4 计算,

    0123456789012345
    0XXXaXXXbXXXcXXX

在这里发生的不匹配为12的位置,下一个子拿那个地方是 bXXX 其中有8位,12开始 - 8 = 4。如果我们设置 J = 12 I = 8 ,则移的J - 我,这是 S [J] = j的 - 我。在code

The position where the mismatch occurs is 12, the next substring to take that place is bXXX which starts with position 8, 12 - 8 = 4. If we set j = 12, and i = 8, then the shift is j - i, which is s[j] = j - i; in the code.

边框

要考虑所有的好后缀,我们首先需要了解什么是所谓的边框。 边框是一个子这既是一种正确 preFIX和正确的后缀字符串。例如,对于一个字符串 XXXcXXX X 是一个边界, XX 是一个边界, XXX 太。但 XXXc 不是。我们需要确定的模式后缀的最宽边框的起点,这个信息被保存在阵列 F [I]

To consider all the good suffix, we first need to understand what is a so called border. A border is a substring which is both a proper prefix and a proper suffix of a string. For example, for a string XXXcXXX, X is a border, XX is a border, XXX too. But XXXc is not. We need to identify the starting point of the widest border of the suffix of the pattern and this info is saved in array f[i].

如何判断 F [I]

假设我们知道 F [i] = j的和其他 F [K] s的 I&LT; K&LT;米,这意味着从位置开始的后缀最宽的边框开始在位置Ĵ。我们希望找到 F [I-1] 根据 F [I]

Assume we know f[i] = j and for all other f[k]s with i < k < m, which means the widest border for the suffix starting from position i started at position j. We want to find f[i-1] based on f[i].

例如,一个模式 aabbccaacc ,在现在的位置 I = 4 ,后缀是 ccaacc ,并为最广泛的边界 CC P [8] P [9] ),所以 J = F [I = 4] = 8 。现在我们想知道 F [I-1] = F [3] 的基础上,我们有信息的 F [4] F [5] ,...对于 F [3] ,后缀现在 bccaacc 。在位置, J-1 = 7 ,它是 A != P [4- 1] B 。因此, BCC 不是一个边界。

For example, for a pattern aabbccaacc, at postion i=4, the suffix is ccaacc, and the widest border for that is cc (p[8]p[9]), so j = f[i=4] = 8. And now we want to know f[i-1] = f[3] based on the info we have for f[4], f[5], ... For f[3], the suffix now is bccaacc. At position, j-1=7, it is a != p[4-1] which is b. So bcc is not a border.

我们知道,宽度任何边框> = 1 bccaacc 已开始与 B 加的边界后缀从这个例子positin J = 8 ,这是 CC 启动。 CC 具有最宽的边框 C 在位置 J = F [8] 9 。因此,我们继续我们的搜索与比较 P [4-1] P [J-1] 。并且它们不相等试。现在的后缀是 P [9] = C 键,它只有零位长度的边界 10 。所以现在 J = F [9] ,它是 10 。因此,我们继续我们的搜索与比较 P [4-1] P [J-1] ,他们不平等,是字符串的结尾。然后 F [3] 只有长度为零的边界,这使得它等于10。

We know any border with width >= 1 of bccaacc has to begin with b plus the border of the suffix starting from positin j = 8, which is cc in this example. cc has the widest border c at position j = f[8] which is 9. So we continue our search with comparing p[4-1] against p[j-1]. And they are not equal again. Now the suffix is p[9] = c and it has only zero length border at position 10. so now j = f[9] and it is 10. So we continue our search with comparing p[4-1] against p[j-1], they are not equal and that is the end of the string. Then f[3] has only zero length border which make it equal to 10.

描述这一过程中更普遍的意义

因此​​, F [i] = j的表示这样的事情,

Therefore, f[i] = j means something like this,

    Position:    012345     i-1  i     j - 1   j       m
    pattern:     abcdef ... @    x ... ?       x ...   $ 

如果字符 @ 这在位置 I - 1 是相同的字符在位置的J - 1 ,我们知道, F [我 - 1] = j的 - 1; - 我; --j; F [i] = j的; 。边境后缀 @x ... $ 从位置开始 J-1

If character @ which at position i - 1 is the same as character ? at position j - 1, we know that f[i - 1] = j - 1; , or --i; --j; f[i] = j;. The border is suffix @x ... $ starting from position j-1.

但如果字符 @ 这在位置 I - 1 是性格不同的在位置的J - 1 , 我们必须继续我们的搜索到右侧。我们知道两个事实:(1),我们现在知道的边框宽度必须小于一个从位置Ĵ,即比小x开始... $ 。第二个边界,必须首先 @ ... 和字符结束 $ ,也可能是空的。

But if character @ which at position i - 1 is different from character ? at position j - 1, we have to continue our search to the right. We know two facts: (1) we know now the border width has to be smaller than the one started from position j, i.e, smaller than x...$. Second the border has to be begin with @... and ends with character $ or it could be empty.

,我们将继续我们的子字符串中搜索 X ... $ (从位置j到m)的边界开始 X 。那么接下来的边界应该在Ĵ等于 F [J]; ,即 J = F [J]; 。然后,我们才 X比较字符 @ 的字符,它是 J-1 。如果他们是平等的,我们找到了边界,如果没有,继续这个过程,直到J>时微米。该过程由下面的code,

Based on these two facts, we continue our search within sub string x ... $ (from position j to m) for a border begin with x. Then the next border should be at j which is equal to f[j];, i.e. j = f[j];. Then we compare character @ with the character before x, which is at j-1. If they are equal, we found the border, if not, continue the process until j > m. This process is shown by the following code,

    while (j<=m && p[i-1]!=p[j-1])
    {
        j=f[j];
    }

    i--; j--;
    f[i]=j;

现在看情况 P [我-1]!= P [J-1] ,这就是我们的情况谈了( 2), P [I] 匹配 P [J] ,而 P [我-1]!= P [J-1] ,所以我们从转向我 Ĵ,那就是 S [J] = j的 - 我。

Now look at condition p[i -1] !=p[j-1], this is what we talked about in situation (2),p[i]matchesp[j], butp[i -1] != p[j-1], so we shift from i to j, that that is s[j] = j - i;.

现在离开没有解释的唯一的事情是测试如果(S [J] == 0)时,较短的后缀有同样的边界将发生。例如,你有一个模式

Now the only thing left not explained is the test if (s[j] == 0) which will occur when a shorter suffix has the same border. For example, you have a pattern

    012345678
    addbddcdd

当你计算 F [我 - 1] I = 4 ,您将设置 S [7] 。但是,当你计算 F [I-1] I = 1 ,您将设置 S [7] 再次,如果你不具备测试如果(S [J] == 0)。这意味着如果你有不匹配的位置 6 ,你移 3 向右(调整 BDD 的位置 CDD 占用)没有 6 (不移位,直到添加的位置 CDD 占用)。

When you calculate f[i - 1] and i = 4, you will set s[7]. But when you calculate f[i-1] for i = 1, you will set s[7] again if you don't have the test if (s[j] == 0). This means if you have mismatch at position 6, you shift 3 to the right (align bdd to the positions cdd occupied) not 6 (not shift until add to the positions cdd occupied).

为code的意见

    void bmPreprocess1()
    {
        // initial condition, set f[m] = m+1;
        int i=m, j=m+1;

        f[i]=j;

        // calculate f[i], s[j] from i = m to 0.
        while (i>0)
        {
            // at this line, we know f[i], f[i+1], ... f[m].
            while (j<=m && p[i-1]!=p[j-1]) // calculate the value of f[i-1] and save it to j-1
            {
                if (s[j]==0) s[j]=j-i; // if s[j] is not occupied, set it.
                j=f[j]; // get the start position of the border of suffix p[j] ... p[m-1]
            }
            // assign j-1 to f[i-1]
            i--; j--;
            f[i]=j;
        }
    }

这篇关于博耶 - 穆尔好后缀启发式的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆