如何LCP在寻找一个模式出现的次数帮助吗? [英] How does LCP help in finding the number of occurences of a pattern?

查看:277
本文介绍了如何LCP在寻找一个模式出现的次数帮助吗?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我已阅读,在最长公共preFIX(LCP)可以用来找到一个模式字符串中出现的次数。

I have read that the Longest Common Prefix (LCP) could be used to find the number of occurences of a pattern in a string.

具体而言,你只需要创建一个文本的后缀数组,排序它,而不是做二进制搜索找到的范围,使您可以找出出现的次数那么,您只需计算LCP对于每个连续进入后缀数组中开始。

Specifically, you just need to create the suffix array of the text, sort it, and then instead of doing binary search to find the range so that you can figure out the number of occurences, you simply compute the LCP for each successive entry in the suffix array.

虽然使用二进制搜索找到出现的模式的数量是明显的,我不能找出LCP如何帮助查找出现的次数在这里。

Although using binary search to find the number of occurrences of a pattern is obvious I can't figure out how the LCP helps find the number of occurrences here.

例如对于此后缀数组香蕉

LCP  Suffix entry
N/A  a  
1    ana  
3    anana  
0    banana  
0    na  
2    nana  

如何在LCP帮助查找出现像香蕉和娜一子的数量并不明显我。

How does the LCP help find the number of occurrences of a substring like "banana" or "na" is not obvious to me.

任何帮助搞清楚LCP如何帮助这里?

Any help figuring out how LCP helps here?

推荐答案

我不知道任何方式使用LCP阵列的的执行二进制搜索,而不是的,但我相信你参考在<一个技术描述乌迪曼伯和基因迈尔斯href="http://scholar.google.com/scholar?q=suffix+arrays+a+new+method+for+on-line+string+searches&btnG=&hl=en&as_sdt=0%2C5">Suffix数组:上线字符串搜索的一种新方法

I do not know any way of using the LCP array instead of carrying out a binary search, but I believe what you refer to is the technique described by Udi Manber and Gene Myers in Suffix arrays: a new method for on-line string searches.

的想法是这样的:为了寻找出现的给定串P(长度米)的在文本T(长度N)的数目,

The idea is this: In order to find the number of occurrences of a given string P (length m) in a text T (length N),

  • 您使用针对T(就像你的建议)
  • 的后缀阵列二进制搜​​索
  • 但是,您加快它使用LCP阵列作为辅助数据结构。更具体地讲,你生成LCP阵列的特殊版本(我将其称为LCP-LR下面)并使用它。
  • You use binary search against the suffix array of T (just like you suggested)
  • But you speed it up using the LCP array as auxiliary data structure. More specifically, you generate a special version of the LCP array (I will call it LCP-LR below) and use that.

使用标准的二分查找问题(的没有的LCP的信息)的是,在每个澳(日志N)的比较你需要,你比较P为后缀阵列的电流输入,这意味着满弦长达M字符比较。所以复杂度为O(M * log n)的。

The issue with using standard binary search (without the LCP information) is that in each of the O(log N) comparisons you need to make, you compare P to the current entry of the suffix array, which means a full string comparison of up to m characters. So the complexity is O(m*log N).

在LCP-LR阵列有助于提高这个为O(M +日志N),通过以下方式:

The LCP-LR array helps improve this to O(m+log N), in the following way:

  • 在二进制搜索算法中的任何时候,你认为,像往常一样,一个范围(1,...,R)后缀阵列和它的中央点M,并决定是否继续在左子您的搜索-range(L,...,M)或在右子范围(M,...,R)。
  • 在为了做出决定,你比较P到串,M。如果P是相同的男,你都做了,但如果没有,你就会有比较P的前k个字符,然后决定P是否字典顺序小于或大于M.假设的结果是,P是大于M
  • 所以,在中间的下一步,你考虑(男,...,R)和一个新的中央点M':

  • At any point during the binary search algorithm, you consider, as usual, a range (L,...,R) of the suffix array and its central point M, and decide whether you continue your search in the left sub-range (L,...,M) or in the right sub-range (M,...,R).
  • In order to make the decision, you compare P to the string at M. If P is identical to M, you are done, but if not, you will have compared the first k characters of P and then decided whether P is lexicographically smaller or larger than M. Let's assume the outcome is that P is larger than M.
  • So, in the next step, you consider (M,...,R) and a new central point M' in the middle:

              M ...... M' ...... R
              |
       we know:
          lcp(P,M)==k

现在是LCP-LR是precomputed使得O(1)-lookup告诉您M和M的最长的共同preFIX',LCP(M ,M')。

The trick now is that LCP-LR is precomputed such that a O(1)-lookup tells you the longest common prefix of M and M', lcp(M,M').

您已经(从previous步骤)知道M本身具有的共同ķ字符以P一preFIX:LCP(P,M)= K。现在有三种可能:

You know already (from the previous step) that M itself has a prefix of k characters in common with P: lcp(P,M)=k. Now there are three possibilities:

  • 案例1:K&LT; LCP(M,M'),即P具有的更少的共同preFIX字符具有M大于M具有共同与M'。此装置M的第(k + 1)个字符'是相同的M,且由于P是字典顺序大于M时,它必须是按字典顺序大于M',太。所以我们在右半(M',...,R)的继续。
  • 案例2:K> LCP(M,M'),即P具有的更多的共同preFIX字符以M比M具有共同点M'。因此,如果我们比较P至M'的共同preFIX将小于k,并且M'将是字典顺序大于P,因此,而不实际在进行比较的,我们在左半继续(M,...,M')。
  • 案例3:K == LCP(M,M')。所以M和M'都是以P中的前k个字符相同。要决定我们是否在左或右半边继续,就足够了比较P至M从起点(K + 1)个字符
  • Case 1: k < lcp(M,M'), i.e. P has fewer prefix characters in common with M than M has in common with M'. This means the (k+1)-th character of M' is the same as that of M, and since P is lexicographically larger than M, it must be lexicographically larger than M', too. So we continue in the right half (M',...,R).
  • Case 2: k > lcp(M,M'), i.e. P has more prefix characters in common with M than M has in common with M'. Consequently, if we were to compare P to M', the common prefix would be smaller than k, and M' would be lexicographically larger than P, so, without actually making the comparison, we continue in the left half (M,...,M').
  • Case 3: k == lcp(M,M'). So M and M' are both identical with P in the first k characters. To decide whether we continue in the left or right half, it suffices to compare P to M' starting from the (k+1)-th character.

我们继续递归。

总体效应是,的P没有任何字符相对于文本的任何字符一次以上。的字符比较的总数由米界,所以总的复杂性是确实O(M +日志N)

The overall effect is that no character of P is compared to any character of the text more than once. The total number of character comparisons is bounded by m, so the total complexity is indeed O(m+log N).

显然,关键剩下的问题就是我们怎么precompute LCP-LR所以它能够告诉我们在O(1)时间后缀阵列中的任何两个条目之间的LCP?正如你所说的,标准的LCP数组告诉你的连续项只,即LCP(X-1,X)对任何x的LCP。但是,M和M'在上面的描述不一定是连续的条目,所以这是怎么一回事?

Obviously, the key remaining question is how did we precompute LCP-LR so it is able to tell us in O(1) time the lcp between any two entries of the suffix array? As you said, the standard LCP array tells you the lcp of consecutive entries only, i.e. lcp(x-1,x) for any x. But M and M' in the description above are not necessarily consecutive entries, so how is that done?

的关键,这是认识到,只有某些范围(L,...,R)永远不会发生二进制搜索期间:它总是从(0,...,N)和分割,在中心,然后继续左边或右边,并再次等等除以一半。如果你想起来:后缀阵列的每个条目的发生是完全一个可能的范围二进制搜索时中心点。所以有完全相同N个不同范围(L ... M ... R)的二进制搜索期间可以可能发挥作用,并且它足以precompute LCP(L,M)和LCP(M,R),用于这N个可能的范围。所以这是2 * N个不同precomputed值,因此LCP​​-LR是O(N)的大小。

The key to this is to realize that only certain ranges (L,...,R) will ever occur during the binary search: It always starts with (0,...,N) and divides that at the center, and then continues either left or right and divide that half again and so forth. If you think of it: Every entry of the suffix array occurs as central point of exactly one possible range during binary search. So there are exactly N distinct ranges (L...M...R) that can possibly play a role during binary search, and it suffices to precompute lcp(L,M) and lcp(M,R) for those N possible ranges. So that is 2*N distinct precomputed values, hence LCP-LR is O(N) in size.

此外,存在一个直接的递归算法来计算的LCP-LR的2 * N个值中O与标准的LCP阵列ndash的(N)的时间;我建议发布一个单独的问题,如果你需要的是一个详细的说明。

Moreover, there is a straight-forward recursive algorithm to compute the 2*N values of LCP-LR in O(N) time from the standard LCP array – I'd suggest posting a separate question if you need a detailed description of that.

综上所述:

  • 可以计算LCP-LR在O(n)时间和O(2 * N)从LCP = O(N)的空间
  • 二分查找过程中使用LCP-LR有助于从O(M *日志N)的加速搜索过程以O(M +日志N)
  • 如你所说,你可以用两个二进制搜索,以确定比赛范围P的左右结束,比赛范围的长度相当于事故为P的数量。
  • It is possible to compute LCP-LR in O(N) time and O(2*N)=O(N) space from LCP
  • Using LCP-LR during binary search helps accelerate the search procedure from O(M*log N) to O(M+log N)
  • As you suggested, you can use two binary searches to determine the left and right end of the match range for P, and the length of the match range corresponds with the number of occurrences for P.

这篇关于如何LCP在寻找一个模式出现的次数帮助吗?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆