字符串模式使用后缀数组和LCP匹配的执行情况(-LR) [英] Implementation of string pattern matching using Suffix Array and LCP(-LR)

查看:219
本文介绍了字符串模式使用后缀数组和LCP匹配的执行情况(-LR)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在过去的几个星期我试图找出如何有效地找到另一个字符串内的一个字符串模式。

During the last weeks I tried to figure out how to efficiently find a string pattern within another string.

我发现了很长一段时间,最有效的方式将使用后缀树了。然而,由于该数据结构是在空间非常昂贵的,我研究使用后缀阵列进一步(使用少得多的空间)的。不同的文件,如后缀芯片:一个上线的字符串搜索的新方法(曼伯&安培;迈尔斯,1993年)状态,即寻找一个子可以在O实现(P +日志(N))(其中,P图案的N和长度是通过使用二进制搜索和后缀数组使用LCP阵列沿所述串的长度)。

I found out that for a long time, the most efficient way would have been using a suffix tree. However, since this data structure is very expensive in space, I studied the use of suffix arrays further (which use far less space). Different papers such as "Suffix Arrays: A new method for on-line string searches" (Manber & Myers, 1993) state, that searching for a substring can be realised in O(P+log(N)) (where P is the length of the pattern and N is length of the string) by using binary search and suffix arrays along with LCP arrays.

我特别研究了后一份文件,了解搜索算法。 这个答案在帮助我理解该算法的表现非常出色(顺带做了它成<一个href=\"http://en.wikipedia.org/wiki/LCP_array#LCP_Array_usage_in_finding_the_number_of_occurrences_of_a_pattern\"相对=nofollow> LCP维基百科页面)。

I especially studied the latter paper to understand the search algorithm. This answer did a great job in helping me understand the algorithm (and incidentally made it into the LCP Wikipedia Page).

但我仍然在寻找实现这个算法的方式。尤其是上述LCP-LR阵列的建设显得非常复杂。

But I am still looking for an way to implement this algorithm. Especially the construction of the mentioned LCP-LR arrays seems very complicated.

参考文献:

&曼伯放大器;迈尔斯,1993:曼伯,乌迪;迈尔斯,基因,SIAM杂志上计​​算,1993年,第22卷(5),pp.935-948,的 http://epubs.siam.org/doi/pdf/10.1137/0222058

Manber & Myers, 1993: Manber, Udi ; Myers, Gene, SIAM Journal on Computing, 1993, Vol.22(5), pp.935-948, http://epubs.siam.org/doi/pdf/10.1137/0222058

更新1

只是为了强调我所感兴趣的是:我的理解LCP阵列,我找到了实现它们。然而,普通LCP数组将不适合有效的模式匹配(如在参考中所述)。因此,我感兴趣的实施LCP-LR阵列,这似乎不仅仅是实现一个LCP数组要复杂得多。

Just to emphasize on what I am interested in: I understood LCP arrays and I found ways to implement them. However, the "plain" LCP array would not be appropriate for efficient pattern matching (as described in the reference). Thus I am interested in implementing LCP-LR arrays which seems much more complicated than just implementing an LCP array

更新2

添加链接引用的文件

推荐答案

该TERMIN,可以帮助你:这是用来描述各种后缀阵列enchanced后缀阵列,以取代后缀树等阵列( LCP ,儿童)。

The termin that can help you: enchanced suffix array, which is used to describe suffix array with various other arrays in order to replace suffix tree (lcp, child).

这些可以是一些例子:

的https://$c$c.google.com/p/esaxx/ ESAXX

http://bibiserv.techfak.uni-bielefeld.de/mkesa/ MKESA

该esaxx人似乎做你想要什么,再加上,它有例如enumSubstring.cpp如何使用它。

The esaxx one seems to be doing what you want, plus, it has example enumSubstring.cpp how to use it.

如果你看看引用<一个href=\"http://www.google.co.uk/url?sa=t&rct=j&q=&esrc=s&source=web&cd=10&cad=rja&uact=8&ved=0CEsQFjAJ&url=http%3A%2F%2Fwebglimpse.net%2Fpubs%2Fsuffix.pdf&ei=szHXVMHdF8TZ7gbIu4DwCQ&usg=AFQjCNFj4JQQ2S1vdKZ09py7_SLZ45m_SA&bvm=bv.85464276,d.ZGU\"相对=nofollow>纸,它提到一个有用的属性(4.2)。既然这么不支持数学,有没有点在这里复制。

If you take a look at the referenced paper, it mentions an useful property (4.2). Since SO does not support math, there is no point to copy it here.

我已经做了快速实施,它使用线段树:

I've done quick implementation, it uses segment tree:

// note that arrSize is O(n)
// int arrSize = 2 * 2 ^ (log(N) + 1) + 1; // start from 1

// LCP = new int[N];
// fill the LCP...
// LCP_LR = new int[arrSize];
// memset(LCP_LR, maxValueOfInteger, arrSize);
// 

// init: buildLCP_LR(1, 1, N);
// LCP_LR[1] == [1..N]
// LCP_LR[2] == [1..N/2]
// LCP_LR[3] == [N/2+1 .. N]

// rangeI = LCP_LR[i]
//   rangeILeft  = LCP_LR[2 * i]
//   rangeIRight = LCP_LR[2 * i + 1]
// ..etc
void buildLCP_LR(int index, int low, int high)
{
    if(low == high)
    {
        LCP_LR[index] = LCP[low];
        return;
    }

    int mid = (low + high) / 2;

    buildLCP_LR(2*index, low, mid);
    buildLCP_LR(2*index+1, mid + 1, high);

    LCP_LR[index] = min(LCP_LR[2*index], LCP_LR[2*index + 1]);
}

这篇关于字符串模式使用后缀数组和LCP匹配的执行情况(-LR)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆