LCP 如何帮助查找模式的出现次数? [英] How does LCP help in finding the number of occurrences of a pattern?

查看:22
本文介绍了LCP 如何帮助查找模式的出现次数?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我了解到最长公共前缀 (LCP) 可用于查找字符串中某个模式出现的次数.

I have read that the Longest Common Prefix (LCP) could be used to find the number of occurrences of a pattern in a string.

具体来说,您只需要创建文本的后缀数组,对其进行排序,然后无需进行二分查找来查找范围,这样您就可以计算出出现的次数,您只需计算每个连续的 LCP后缀数组中的条目.

Specifically, you just need to create the suffix array of the text, sort it, and then instead of doing binary search to find the range so that you can figure out the number of occurrences, you simply compute the LCP for each successive entry in the suffix array.

虽然使用二分搜索来查找模式的出现次数很明显,但我无法弄清楚 LCP 如何帮助找到此处出现的次数.

Although using binary search to find the number of occurrences of a pattern is obvious I can't figure out how the LCP helps find the number of occurrences here.

例如对于banana的这个后缀数组:

For example for this suffix array for banana:

LCP  Suffix entry
N/A  a  
1    ana  
3    anana  
0    banana  
0    na  
2    nana  

LCP 如何帮助找到像banana"或na"这样的子字符串出现的次数对我来说并不明显.

How does the LCP help find the number of occurrences of a substring like "banana" or "na" is not obvious to me.

了解 LCP 如何在这里提供帮助有什么帮助吗?

Any help figuring out how LCP helps here?

推荐答案

我不知道使用 LCP 数组的任何方式代替进行二分查找,但我相信你所指的是 Udi Manber 和 Gene Myers 在 后缀数组:一种在线字符串搜索的新方法.

I do not know any way of using the LCP array instead of carrying out a binary search, but I believe what you refer to is the technique described by Udi Manber and Gene Myers in Suffix arrays: a new method for on-line string searches.

(注意:以下解释已复制到 2014 年 4 月 9 日的维基百科文章中,参见 diff.如果您查看此处和维基百科上的修订历史记录,您会发现此处的修订历史是最先编写的.请不要插入诸如摘自维基百科"到我的答案中.)

(Note: The below explanation has been copied into a Wikipedia article on 9th April 2014, see diff. If you look at the revision history here and on Wikipedia, you'll see that the one here was written first. Please don't insert comments like "taken from Wikipedia" into my answer.)

想法是这样的:为了找到给定字符串P(长度m)在文本T(长度N)中出现的次数,

The idea is this: In order to find the number of occurrences of a given string P (length m) in a text T (length N),

  • 您对 T 的后缀数组使用二分查找(就像您建议的那样)
  • 但是您可以加速使用 LCP 数组作为辅助数据结构.更具体地说,您生成一个特殊版本的 LCP 数组(我将在下面将其称为 LCP-LR)并使用它.
  • You use binary search against the suffix array of T (just like you suggested)
  • But you speed it up using the LCP array as auxiliary data structure. More specifically, you generate a special version of the LCP array (I will call it LCP-LR below) and use that.

使用标准二分搜索(没有 LCP 信息)的问题在于,在每个 O(log N) 比较中,您需要比较 P到后缀数组的当前条目,这意味着最多 m 个字符的完整字符串比较.所以复杂度是O(m*log N).

The issue with using standard binary search (without the LCP information) is that in each of the O(log N) comparisons you need to make, you compare P to the current entry of the suffix array, which means a full string comparison of up to m characters. So the complexity is O(m*log N).

LCP-LR 阵列通过以下方式帮助将其改进为 O(m+log N):

The LCP-LR array helps improve this to O(m+log N), in the following way:

  • 在二分搜索算法的任何时候,你会像往常一样考虑后缀数组的范围 (L,...,R) 及其中心点 M,然后决定是否在左子中继续搜索-范围 (L,...,M) 或在正确的子范围 (M,...,R).
  • 为了做出决定,您将 P 与 M 处的字符串进行比较.如果 P 与 M 相同,则完成,但如果不相同,则您将比较 P 的前 k 个字符,然后再决定 P 是否为字典序小于或大于 M.让我们假设结果是 P 大于 M.
  • 因此,在下一步中,您考虑 (M,...,R) 和中间的新中心点 M':

  • At any point during the binary search algorithm, you consider, as usual, a range (L,...,R) of the suffix array and its central point M, and decide whether you continue your search in the left sub-range (L,...,M) or in the right sub-range (M,...,R).
  • In order to make the decision, you compare P to the string at M. If P is identical to M, you are done, but if not, you will have compared the first k characters of P and then decided whether P is lexicographically smaller or larger than M. Let's assume the outcome is that P is larger than M.
  • So, in the next step, you consider (M,...,R) and a new central point M' in the middle:

              M ...... M' ...... R
              |
       we know:
          lcp(P,M)==k

技巧现在是预先计算 LCP-LR,以便 O(1) 查找告诉您 M 和 M' 的最长公共前缀 lcp(M,M').

The trick now is that LCP-LR is precomputed such that a O(1)-lookup tells you the longest common prefix of M and M', lcp(M,M').

您已经知道(从上一步中)M 本身具有与 P 相同的 k 个字符的前缀:lcp(P,M)=k.现在有三种可能:

You know already (from the previous step) that M itself has a prefix of k characters in common with P: lcp(P,M)=k. Now there are three possibilities:

  • 情况 1:k 比 M 与 M' 共有的前缀字符要少.这意味着 M' 的第 (k+1) 个字符与 M 的相同,并且由于 P 在字典上大于 M,它也必须在字典上大于 M'.所以我们在右半部分继续(M',...,R).
  • 情况 2:k > lcp(M,M'),即 P 与 M 的共同前缀字符 比 M 与 M' 的共同前缀字符多.因此,如果我们将 P 与 M' 进行比较,公共前缀将小于 k,并且 M' 将在字典序上大于 P,因此,没有实际进行比较,我们继续左半部分 (M,...,M').
  • 情况 3:k == lcp(M,M').所以 M 和 M' 在前 k 个字符中都与 P 相同.要决定我们是在左半边还是右半边继续,只需将 P 与 M' 从第 (k+1) 个字符开始进行比较即可.
  • Case 1: k < lcp(M,M'), i.e. P has fewer prefix characters in common with M than M has in common with M'. This means the (k+1)-th character of M' is the same as that of M, and since P is lexicographically larger than M, it must be lexicographically larger than M', too. So we continue in the right half (M',...,R).
  • Case 2: k > lcp(M,M'), i.e. P has more prefix characters in common with M than M has in common with M'. Consequently, if we were to compare P to M', the common prefix would be smaller than k, and M' would be lexicographically larger than P, so, without actually making the comparison, we continue in the left half (M,...,M').
  • Case 3: k == lcp(M,M'). So M and M' are both identical with P in the first k characters. To decide whether we continue in the left or right half, it suffices to compare P to M' starting from the (k+1)-th character.

我们继续递归.

总体效果是P 的任何字符都不会与文本中的任何字符进行多次比较.字符比较的总数以m为界,所以总复杂度确实是O(m+log N).

The overall effect is that no character of P is compared to any character of the text more than once. The total number of character comparisons is bounded by m, so the total complexity is indeed O(m+log N).

显然,剩下的关键问题是我们如何预先计算 LCP-LR 以便它能够在 O(1) 时间内告诉我们后缀数组的任意两个条目之间的 lcp?正如您所说,标准 LCP 数组仅告诉您 连续条目 的 lcp,即任何 x 的 lcp(x-1,x).但是上面描述中的M和M'不一定是连续的条目,那是怎么做的呢?

Obviously, the key remaining question is how did we precompute LCP-LR so it is able to tell us in O(1) time the lcp between any two entries of the suffix array? As you said, the standard LCP array tells you the lcp of consecutive entries only, i.e. lcp(x-1,x) for any x. But M and M' in the description above are not necessarily consecutive entries, so how is that done?

关键是要意识到在二分查找过程中只有某些范围 (L,...,R) 会出现:它总是以 (0,...,N) 开头并在中心处除以它,然后继续向左或向右并再次划分那一半,依此类推.如果您想一想:后缀数组的每个条目都作为二分搜索期间恰好一个可能范围的中心点出现.因此,正好有 N 个不同的范围 (L...M...R) 可能在二分搜索过程中发挥作用,并且对于这 N 个可能的情况,预先计算 lcp(L,M) 和 lcp(M,R) 就足够了范围.所以这是 2*N 个不同的预计算值,因此 LCP-LR 的大小为 O(N).

The key to this is to realize that only certain ranges (L,...,R) will ever occur during the binary search: It always starts with (0,...,N) and divides that at the center, and then continues either left or right and divide that half again and so forth. If you think of it: Every entry of the suffix array occurs as central point of exactly one possible range during binary search. So there are exactly N distinct ranges (L...M...R) that can possibly play a role during binary search, and it suffices to precompute lcp(L,M) and lcp(M,R) for those N possible ranges. So that is 2*N distinct precomputed values, hence LCP-LR is O(N) in size.

此外,还有一个直接的递归算法,可以在 O(N) 时间内从标准 LCP 数组中计算 LCP-LR 的 2*N 个值.如果您需要详细说明,我建议您单独发布一个问题.

Moreover, there is a straight-forward recursive algorithm to compute the 2*N values of LCP-LR in O(N) time from the standard LCP array – I'd suggest posting a separate question if you need a detailed description of that.

总结:

  • 可以在 O(N) 时间和 O(2*N)=O(N) 空间从 LCP 计算 LCP-LR
  • 在二分搜索过程中使用 LCP-LR 有助于加快搜索过程从 O(M*log N) 到 O(M+log N)
  • 如您所建议的,您可以使用两次二分搜索来确定 P 的匹配范围的左端和右端,并且匹配范围的长度与 P 的出现次数相对应.
  • It is possible to compute LCP-LR in O(N) time and O(2*N)=O(N) space from LCP
  • Using LCP-LR during binary search helps accelerate the search procedure from O(M*log N) to O(M+log N)
  • As you suggested, you can use two binary searches to determine the left and right end of the match range for P, and the length of the match range corresponds with the number of occurrences for P.

这篇关于LCP 如何帮助查找模式的出现次数?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆