了解使用LCP阵列算法进行模式匹配 [英] Understanding the algorithm for pattern matching using an LCP array

查看:408
本文介绍了了解使用LCP阵列算法进行模式匹配的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

前言:我的问题主要是一个算法的问题,所以即使你不熟悉的后缀,LCP阵列,你或许可以帮我

Foreword: My question is mainly an algorithmic question, so even if you are not familiar with suffix and LCP arrays you can probably help me.

它描述了如何有效地使用后缀,LCP阵列字符串模式匹配这个

我的理解SA和LCP的工作,以及如何算法的运行时间可以从 O(P *日志(N))(其中改进P 是该模式的长度和 N 是字符串的长度)为 0(P +日志(N))(感谢Chris Eelmaa的回答 href="http://stackoverflow.com/a/28385677/3722347">和jogojapans回答的这里)。

I understood SA and LCP work and how the algorithm's runtime can be improved from O(P*log(N)) (where P is the length of the pattern and N is length of the string) to O(P+log(N)) (Thanks to Chris Eelmaa's answer here and jogojapans answer here).

我是想通过该算法在图4中说明 LLCP 的用法和 RLCP 。但我有问题,理解它是如何工作的。

I was trying to go through the algorithm in figure 4 which explains the usage of LLcp and RLcp. But I have problems understanding how it works.

该算法(摘自):

的使用变量名称说明:

lcp(v,w) : Length of the longest common prefix of v and w
W = w0..wP-1 : pattern of length P
A = a0..aN-1 : the text (length N)
Pos[0..N-1] : suffix array
L_W : index (in A) of first occurrence of the matched pattern
M : middle index of current substring
L : lower bound
R : upper bound
Lcp : array of size N-2 such that Lcp[M] = lcp(A_Pos[L_M], A_pos[M]) where L_M is the lower bound of the unique interval with M in the middle
Rcp : array of size N-2 such that Rcp[M] = lcp(A_Pos[R_M], A_pos[M]) where R_M is the upper bound of the unique interval with M in the middle

现在我想用下面的例子来试试算法(部分摘自<一个href="http://stackoverflow.com/questions/11373453/how-does-lcp-help-in-finding-the-number-of-occurences-of-a-pattern/">here):

Now I want to try the algorithm using the following example (partly taken from here):

 SA | LCP | Suffix entry
 -----------------------
 5  | N/A | a  
 3  | 1   | ana  
 1  | 3   | anana  
 0  | 0   | banana  
 4  | 0   | na  
 2  | 2   | nana 

A = "banana" ; N = 6
W = "ban" ; P = 3

我要尝试匹配的字符串,比如禁止和希望的算法返回0作为 L_W

I want to try to match a string, say ban and would expect the algorithm to return 0 as L_W.

下面是我怎么会通过算法步骤:

Here is how I would step through the algorithm:

l = lcp("a", "ban") = 0
r = lcp("nana", "ban") = 0
if 0 = 3 or 'b' =< 'a' then // which is NOT the case for both conditions
    L_W = 0
else if 0 < 3 or 'b' =< 'n' then // which is the case for both conditions 
    L_W = 6 // which means 'not found'
...
...

我觉得我失去了一些东西,但我无法找出什么。另外我想知道如何在precomputed LCP阵列可以使用,而不是调用 LCP(V,W)

推荐答案

目前,我通过这个自己的工作,而最后,我来到的是,无论在五号线的条件句是印刷错误。

I'm currently working through this myself, and the conclusion I came to is that both of the conditionals on line five are typographical errors.

很显然,我没有任何的证明的这个,我不是曼伯或斯堡,它更是一个合理的猜测:

Obviously, I don't have any proof for this, I'm not Manber or Myers, it's more of a reasoned guess:

这似乎是第一个块(行3至6)的目的是要认识到,当一个查询外树完全。通过比较第一个和最后一个后缀,这是第一个和最后的字典顺序记住,并发现它在错误的一边任何一个,你可以肯定它不是任何地方的后缀数组中,因此无处在文中都没有。

It seems that the purpose of that first chunk (lines three through six) is to recognise when a query is outside the tree entirely. By comparing it to the first and last suffixes, which are first and last in lexicographical order remember, and finding it on the "wrong side" of either one, you can be sure it isn't anywhere in the suffix array, and hence nowhere in the text at all.

第二个条件是三号线实现这一令人钦佩。如果我们无视说的平等的那一刻起的 P 的是不相等的,但第二个条件是真实的,那就是 - W [1]; SA [0] [1],注意那些的的S和没有的 1 的秒 - 这将则意味着W是字典序后缀数组的第一个条目之前,并因此没有在文中在所有

The second conditional on line three achieves this admirably. If we ignore the equality for the moment by saying that l and P are not equal, but that the second conditional is true -- that is, w[l] < SA[0][l], note those are ls and not 1 s -- that would then mean that W is lexicographically before the first entry of the suffix array, and thus not in the text at all.

这是可能只在您有没有他们的第一个条目被定义为是独一无二的,之前字典顺序任何其它字符定点后缀阵列情况下非常有用。

This is probably only useful in situations where you have suffix arrays that don't have their first entries as the sentinel which is defined as being unique and lexicographically before any other character.

(不过,我不知道为什么是的等于的阵列中的第一个后缀 - 这是的 P 等于将意味着,作为最长的共同preFIX两个字符串是 P 的长,ERGO,只要查询,ERGO,整个查询 - 与被混为一谈的第一个后缀。在一个基本水平,如果查询是数组中的第一个后缀,它的阵列中,从而在文本中。)

(That said, I'm not sure why being equal to the very first suffix in the array -- which is what l and P being equal would mean, as the longest common prefix in both strings is P long, ergo, as long as the query, ergo, the entire query -- is lumped together with being before the first suffix. On a fundamental level, if the query is the first suffix in the array, it's in the array, thus in the text.)

在另一方面,五号线是的字面上总是如此的当查询是不是因为不匹配的最后一个后缀的研究的&LT; P 的。当查询是不是最后的后缀,因为最后的后缀到达其字符串的末尾,瓦特[R] == SA [N-1] [R](值得注意其中,将在最后的非岗哨字符文本)。当查询是最后的后缀,R = P,并因此瓦特[R] == SA [N-1] [R]。

On the other hand, line five is literally always true. When the query isn't the final suffix because of a mismatch, r < P. When the query isn't the final suffix because the final suffix reaches the end of its string, w[r] == SA[n-1][r] (which, of note, would be the final non-sentry character in the text). When the query is the final suffix, r = P, and thus w[r] == SA[n-1][r].

所以...是啊。这只是一个合理的猜测,但我要说,该文件对五号线两个重要的印刷错误。当然,反点这一理论,是强调它是多么的不可能的错误不仅得到了过去的作者,但该杂志的编辑......特别是考虑到它在1989年最初收到,终于接受了在1992年,并且错误是在整张纸上的字面上的最关键的部分。

So... Yeah. It's only a reasoned guess, but I'd say that the paper had two critical typographical errors on line five. The counter-point to this theory, of course, is highlighting how incredibly improbable it is that an error not only got past the authors, but the editors of the journal... Especially given it was initially received in 1989, finally accepted in 1992, and the errors are on literally the single most critical part of the entire paper.

这篇关于了解使用LCP阵列算法进行模式匹配的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆