通过后缀数组的最长公共子字符串:我们真的需要唯一的标记吗? [英] Longest common substring via suffix array: do we really need unique sentinels?

查看:44
本文介绍了通过后缀数组的最长公共子字符串:我们真的需要唯一的标记吗?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在阅读有关LCP数组及其与后缀数组结合使用的信息,以解决最长的公共子字符串"问题.此视频指出,用于分隔单个字符串的标记必须唯一,而不是包含在任何字符串本身中.

I am reading about LCP arrays and their use, in conjunction with suffix arrays, in solving the "Longest common substring" problem. This video states that the sentinels used to separate individual strings must be unique, and not be contained in any of the strings themselves.

除非我弄错了,否则的原因是这样,当我们构造LCP数组时(通过比较相邻后缀共有多少个字符),在两个前哨碰巧位于以下位置的情况下,我们不计算前哨值我们正在比较的两个后缀中的索引相同.

Unless I am mistaken, the reason for this is so when we construct the LCP array (by comparing how many characters adjacent suffixes have in common) we don't count the sentinel value in the case where two sentinels happen to be at the same index in both the suffixes we are comparing.

这意味着我们可以编写如下代码:

This means we can write code like this:

for each character c in the shortest suffix
    if suffix_1[c] == suffix_2[c]
        increment count of common characters

但是,为了促进这一点,我们需要跳过一些箍以确保我们使用独特的哨兵,我在这里问过

However, in order to facilitate this, we need to jump through some hoops to ensure we use unique sentinels, which I asked about here.

但是,一种更简单(实现)的解决方案不是简单地计算共同字符的数量,在达到(单个,唯一)定点字符时停止,如下所示:

However, would a simpler (to implement) solution not be to simply count the number of characters in common, stopping when we reach the (single, unique) sentinel character, like this:

set sentinel = '#'
for each character c in the shortest suffix
    if suffix_1[c] == suffix_2[c]
        if suffix_1[c] != sentinel
            increment count of common characters
        else
            return

或者,我在这里错过了一些基本的东西吗?

Or, am I missing something fundamental here?

推荐答案

简短的版本是这主要是关于后缀数组构造算法如何工作的人工产物,与LCP计算无关,因此请提供您的后缀数组构造算法不需要这些标记,您可以安全地跳过它们."

The short version is "this is mostly an artifact of how suffix array construction algorithms work and has nothing to do with LCP calculations, so provided your suffix array building algorithm doesn't need those sentinels, you can safely skip them."

更长的答案:

从总体上讲,视频中描述的基本算法是这样的:

At a high level, the basic algorithm described in the video goes like this:

  1. 为字符串T 1 和T 2 构造广义后缀数组.
  2. 为生成的后缀数组构造一个LCP数组.
  3. 遍历LCP数组,查找来自不同字符串的相邻后缀对.
  4. 在任何两个这样的字符串之间找到最大的LCP;称为k.
  5. 从两个后缀中的任意一个中提取前k个字符.
  1. Construct a generalized suffix array for the strings T1 and T2.
  2. Construct an LCP array for that resulting suffix array.
  3. Iterate across the LCP array, looking for adjacent pairs of suffixes that come from different strings.
  4. Find the largest LCP between any two such strings; call it k.
  5. Extract the first k characters from either of the two suffixes.

那么,哨兵在这里出现在哪里?它们主要是在步骤(1)和(2)中提出的.该视频暗示使用线性时间后缀数组构造算法(SACA).用于生成两个或多个字符串的后缀数组的大多数快速SACA都假设,作为其操作的一部分,在这些字符串的末尾有不同的结束标记,并且算法的内部正确性通常依赖于此.因此,从这个意义上讲,可能纯粹是为了使用快速SACA而添加了标记标记,完全独立于您以后可能会使用的标记.

So, where do sentinels appear in here? They mostly come up in steps (1) and (2). The video alludes to using a linear-time suffix array construction algorithm (SACA). Most fast SACAs for generating suffix arrays for two or more strings assume, as part of their operation, that there are distinct endmarkers at the ends of those strings, and often the internal correctness of the algorithm relies on this. So in that sense, the endmarkers might need to get added in purely to use a fast SACA, completely independent of any later use you might have.

(为什么SACA需要这个?某些最快的SACA(例如SA-IS算法)假定字符串的最后一个字符是唯一的,在字典上位于所有内容的前面,并且在其他任何地方都没有出现.该具有多个字符串的算法,您需要某种内部定界符来标记一个字符串结束并在另一个字符串开始的位置.该字符需要充当强力的,我们现在使用第一个字符串"字符完成操作,这就是为什么需要按字典顺序在所有其他字符之前.)

(Why do SACAs need this? Some of the fastest SACAs, such as the SA-IS algorithm, assume the last character of the string is unique, lexicographically precedes everything, and doesn't appear anywhere else. In order to use that algorithm with multiple strings, you need some sort of internal delimiter to mark where one string ends and another starts. That character needs to act as a strong "and we're now done with the first string" character, which is why it needs to lexicographically precede all the other characters.)

假设您以这种方式将SACA用作黑匣子,那么从现在开始,这些哨兵是完全没有必要的.它们不能用来分辨哪个后缀来自哪个字符串(这应该由SACA提供),并且它们不能成为相邻字符串之间重叠部分的一部分.

Assuming you're using a SACA as a black box this way, from this point forward, those sentinels are completely unnecessary. They aren't used to tell which suffix comes from which string (this should be provided by the SACA), and they can't be a part of the overlap between adjacent strings.

因此,从某种意义上讲,您可以将这些标记视为使用快速SACA所需的实现细节,您需要执行此操作才能获得快速运行时间.

So in that sense, you can think of these sentinels as an implementation detail needed to use a fast SACA, which you'd need to do in order to get the fast runtime.

这篇关于通过后缀数组的最长公共子字符串:我们真的需要唯一的标记吗?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆