通过后缀数组的最长公共子字符串:sentinel的使用 [英] Longest common substring via suffix array: uses of sentinel

查看:60
本文介绍了通过后缀数组的最长公共子字符串:sentinel的使用的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在阅读一系列字符串中最长(常见)子字符串的(显然)众所周知的问题,并且一直在关注这两个视频,它们讨论了如何使用后缀数组解决问题:(请注意,这个问题并不不需要您观看):

I am reading about the (apparently) well known problem of the longest common substring in a series of strings, and have been following these two videos which talk about how to solve the problem using suffix arrays: (note that this question doesn't require you to watch them):

https://youtu.be/Ic80xQFWevc

https://youtu.be/DTLjHSToxmo

第一步是首先将所有源字符串连接成一个大字符串,并用一个唯一的"标记来分隔每个源,其中每个标记的ASCII码都小于任何字符串中可能出现的任何字符的ASCII码..这样我们就可以拥有单独的字符串

The first step is that we start by concatenating all the source strings into one big one, separating each with a 'unique' sentinel, where the ASCII code of each sentinel is less than that of any character that may occur in any string. So we could have the individual strings

abca
bcad
daca

并连接它们以给出

abca#bcad$daca%

现在,可能的哨兵数量有限,如果我们有大量的琴弦,则会导致问题.确实,有人在第一个链接的视频中指出了这一点,对此的回应是

Now, there are only a limited number of possible sentinels, which leads to problems if we have a large number of strings. Indeed, someone has pointed this out on the first linked video, the response to which was

正确,解决方案是将您的字母映射到自然数并根据您需要的哨兵数量向上移动.这使您能够在值[1,N]和您的字母之间总是有前哨在那之上.此技巧使后缀数组可伸缩,但是您需要撤消移位,对后缀中存储的真实值进行解码数组.

Correct, the solution is to map your alphabet to the natural numbers and shift up by the number of sentinels you need. This allows you to always have sentinels between the values say [1,N] and your alphabet above that. This trick makes the suffix array scaleable, but you need to undo the shift the decode the true value stored in the suffix array.

我不明白答案的意思.

我知道我可以在视频中发布我的问题,但是我不能保证(及时)回复,而且这里的观众范围更广,所以我在这里问人:有人可以解释一下吗这个答案意味着以及如何实现?

I know I could post my question on the video, but I am not guaranteed a (timely) response and the audience here is far wider, so am asking people here: could someone please explain what this answer means and how to implement it?

推荐答案

不确定如何比引用的注释更好/不同地解释它.也许一个例子会有所帮助.请注意,我这里不是使用真正的ASCII代码,因为我不想显示约100个源字符串的示例.因此,我们只假设A = 1,B = 2,C = 3,等等.

Not sure how to explain it better/different than in the quoted comment. Maybe an example will help. Note that I am not using the true ASCII codes here as I do not want to show an example with ~100 source strings. So instead, we will just assume A=1, B=2, C=3, etc.

因此,您的源字符串 abca bcad daca 将转换为 [1,2,3,1],[2,3,1,4],[4,1,3,1] ,但是为了适合三个标记,您必须将所有这些值上移3,即1到3现在是标记,并且A = 4,B = 5等.联接的字符串"(实际上现在是整数列表)为 [4,5,6,4,1,5,6,4,7,2,7,4,6​​,4,3] .然后,您可以将其转换回 defda ... 字符,执行算法,然后转换回,以免移位.

Thus, your source strings abca bcad daca would translate to [1,2,3,1],[2,3,1,4],[4,1,3,1], but in order to fit in the three sentinels, you have to shift all those values up by 3, i.e. 1 to 3 are now sentinels and A=4, B=5, etc.; the joined "string" (actually, it is a list of integers now) is [4,5,6,4, 1, 5,6,4,7, 2, 7,4,6,4, 3]. You can then translate those back to characters defda..., do the algorithm, and then translate back, undoing the shift.

但是,我要说的是,除了移位整数之外,我们还可以对标记使用负数,然后直接在整数列表上工作,而不是将其转换回字符(对于负数是不可能的)): [1,2,3,1,-1,2,3,1,4,-2,4,1,3,1,-3] (注意:我有不是观看了视频,并且不知道该特定算法的工作原理;可能是负数是个问题,例如,在使用某种最短路径"算法的情况下.)

However, I would argue that instead of shifting the integers, we could just as well use negative numbers for the sentinels and then work directly on the list of integers instead of converting those back to characters (which is not possible for negative numbers): [1,2,3,1, -1, 2,3,1,4, -2, 4,1,3,1, -3] (Note: I have not watched the video and do not know how this specific algorithm works; it could be that negative numbers are a problem, e.g. in case this is using some sort of "shortest path" algorithm.)

这篇关于通过后缀数组的最长公共子字符串:sentinel的使用的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆