最长重复(k次)子字符串 [英] Longest repeated (k times) substring

查看:100
本文介绍了最长重复(k次)子字符串的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我知道这是一个比较棘手的话题,但是我已经达到了可以从已经回答的问题中获得帮助的极限。

I know this is a somewhat beaten topic, but I have reached the limit of help I can get from what's already been answered.

这是针对 Rosalind项目问题LREP 。我试图在字符串中找到最长的k字串子字符串,并且提供了后缀树,这很好。我知道我需要用每个节点的后代叶子数注释后缀表,然后找到具有> = k 后代的节点,最后找到其中最深的节点节点。从理论上讲,我已经定了。

This is for the Rosalind project problem LREP. I'm trying to find the longest k-peated substring in a string and I've been provided the suffix tree, which is nice. I know that I need to annotate the suffix table with the number of descendant leaves from each node, then find nodes with >=k descendants, and finally find the deepest of those nodes. Theory-wise I'm set.

我从以下资源中获得了很多帮助(糟糕,我只能发布2条信息):

I've gotten a lot of help from the following resources (oops, I can only post 2):

  • Find longest repetitive sequence in a string
  • Depth-first search (Python)

我可以获得从根到每个叶子的路径,但是我可以没有弄清楚如何对树进行预处理,这样我就可以从每个节点中获取后代的数量。我有一个单独的算法,可以处理较小的序列,但是它的计算复杂度是指数级的,因此对于较大的算法,它花费的时间太长。我知道使用DFS,我应该能够以线性复杂度执行整个任务。为了使该算法起作用,我需要能够在不到5分钟的时间内获得〜40,000长度的字符串中最长的k-peat。

I can get the paths from the root to each leaf, but I can't figure out how to pre-process the tree in such a way that I can get the number of descendants from each node. I have a separate algorithm that works on small sequences but it's in exponential complexity, so for larger stuff it takes way too long. I know with a DFS I should be able to perform the whole task in linear complexity. For this algorithm to work I need to be able to get the longest k-peat of an ~40,000 length string in less than 5 minutes.

以下是一些示例数据(第一个行:序列,第二行: k ,后缀表格式:父子位置长度):

Here's some sample data (first line: sequence, second line: k, suffix table format: parent child location length):

CATACATAC$
2
1 2 1 1
1 7 2 1
1 14 3 3
1 17 10 1
2 3 2 4
2 6 10 1
3 4 6 5
3 5 10 1
7 8 3 3
7 11 5 1
8 9 6 5
8 10 10 1
11 12 6 5
11 13 10 1
14 15 6 5
14 16 10 1

此输出应该是 CATAC

使用以下代码(从 LiteratePrograms )我已经能够获取路径,但是在较长的序列上解析路径仍然需要很长时间

With the following code (modified from LiteratePrograms) I've been able to get the paths, but it still takes a long time on longer sequences to parse out a path for each node.

#authors listed at
#http://en.literateprograms.org/Depth-first_search_(Python)?action=history&offset=20081013235803
class Vertex:
    def __init__(self, data):
        self.data = data
        self.successors = []

def depthFirstSearch(start, isGoal, result):
    if start in result:
        return False

    result.append(start)

    if isGoal(start):
        return True
    for v in start.successors:
        if depthFirstSearch(v, isGoal, result):
            return True

    # No path was found
    result.pop()
    return False

def lrep(seq,reps,tree):
    n = 2 * len(seq) - 1
    v = [Vertex(i) for i in xrange(n)]
    edges = [(int(x[0]),int(x[1])) for x in tree]
    for a, b in edges:
        v[a].successors.append(v[b])

    paths = {}
    for x in v:
        result = []
        paths[x.data] = []
        if depthFirstSearch(v[1], (lambda v: v.data == x.data), result):
            path = [u.data for u in result]
            paths[x.data] = path

我想做的是对树进行预处理,以找到满足后代>的节点。 = k 的要求,然后再找到深度。我什至还不了解如何计算深度。尽管我想我会拥有一些字典来跟踪路径中每个节点的深度,然后求和。

What I'd like to do is pre-process the tree to find nodes which satisfy the descendants >= k requirement prior to finding the depth. I haven't even gotten to how I'm going to calculate depth yet. Though I imagine I'll have some dictionary to keeps track of the depths of each node in the path then sums.

因此,我最重要的问题是: 如何预处理带有后代叶子的树?

So, my first-most-important question is: "How do I preprocess the tree with descendant leaves?"

我的次要的重要问题是:如何快速计算深度?

My second-less-important question is: "After that, how can I quickly compute depth?"

PS我应该说这不是作业或任何类似的事情。我只是一个生物化学家,试图通过一些计算挑战来拓展我的视野。

P.S. I should state that this isn't homework or anything of that sort. I'm just a biochemist trying to expand my horizons with some computational challenges.

推荐答案

关于基本字符串练习的好问题-操作。我不再记得后缀树了;)但是正如您所说的:从理论上讲,您已经设置好了。

Nice question for an excercise in basic string-operations. I didnt remember the suffix-tree anymore ;) But as you have stated: theory-wise, you are set.

该主题的 wikipedia-stub 有点混乱。您只需要知道自己是否是最外面的非叶节点,其子节点为 n> = k 。如果您在整个字符串中找到了从根节点到该节点的子字符串,则后缀树会告诉您,有 n 个可能的延续。因此必须在 n 个出现此字符串的位置。

The wikipedia-stub onto this topic is a bit confusing. You only need to know, if you are the outermost non-leaf-node with n >= k childs. If you found the substring from the root-node to this one in the whole string, the suffix-tree tells you, that there are n possible continuitations. So there must be n places, where this string occurs.

对此和许多类似问题的简单关键概念是进行深度优先搜索:在每个Node中,询问子元素的值并返回最大的给父母。根节点将获得最终结果。

A simple key-concept of this and many similar problems is to do a depth-first-search: In every Node, ask the child-elements for their value and return the maximum of it to the parent. The root-node will get the final result.

这些问题之间值的计算方式有所不同。在这里,每个节点都有三种可能性:

How the values are calculated differs between the problems. Here you have three possiblilitys for every node:


  1. 该节点没有子节点。它是一个叶节点,结果无效。

  2. 每个孩子都返回一个无效结果。其最后一个非叶子节点,结果为零(此节点之后没有更多字符)。如果此节点有 n 个子节点,则从根到此节点的每个边的缩进字符串在该节点中出现 n 次。整个字符串。如果我们至少需要 k 个节点和 k> n ,结果也无效。

  3. 一个或多个叶子返回有效值。结果是返回值的最大值加上附加有边缘的字符串的长度。

  1. The node have no childs. Its a leaf-node, the result is invalid.
  2. Every child returns an invalid result. Its the last non-leaf-node, the result is zero (no more characters after this node). If this node have n childs, the concated string of every edge from the root to this node appears n times in the whole string. If we need at least k nodes and k > n, the result is also invalid.
  3. One or more leafs return something valid. The result is the maximum of the returned value plus the length of the string attached the edge to it.

当然,您还必须返回对应的结点。否则,您将知道最长的重复子字符串多长时间,但不知道它在哪里。

Of course, you also have to return the correspondending node. Otherwise you will know, how long the longest repeated substring is but not where it is.

您应该首先尝试自己编写代码。如果您想收集所有必要的信息,则构造树很简单,但并不简单。不过,这是一个简单的示例。请注意:如果输入某种程度上无效,则将放弃所有的完整性检查,并且一切都会可怕地失败。例如。不要尝试使用除根索引之外的任何其他根索引,不要将节点引用为父节点,而以前未将其引用为子节点,等等。还有很大的改进空间* 提示;)*

You should try to code this by yourself first. Constructing the tree is simple but not trivial if you want to gather all necessary informations. Nevertheless here is a simple example. Please note: every sanity-checking is dropped out and everything will fail horribly, if the input is somehow invalid. E.g. do not try to use any other root-index than one, do not refere to nodes as a parent, which weren't referenced as a childs before, etc. Much room for improvement *hint;)*.

class Node(object):
    def __init__(self, idx):
        self.idx = idx     # not needed but nice for prints 
        self.parent = None # edge to parent or None
        self.childs = []   # list of edges

    def get_deepest(self, k = 2):
        max_value = -1
        max_node = None
        for edge in self.childs:
            r = edge.n2.get_deepest()
            if r is None: continue # leaf
            value, node = r
            value += len(edge.s)
            if value > max_value: # new best result
                max_value = value
                max_node = node
        if max_node is None:
            # we are either a leaf (no edge connected) or 
            # the last non-leaf.
            # The number of childs have to be k to be valid.
            return (0, self) if len(self.childs) == k else None
        else:
            return (max_value, max_node)

    def get_string_to_root(self):
        if self.parent is None: return "" 
        return self.parent.n1.get_string_to_root() + self.parent.s

class Edge(object):
    # creating the edge also sets the correspondending
    # values in the nodes
    def __init__(self, n1, n2, s):
        #print "Edge %d -> %d [ %s]" % (n1.idx, n2.idx, s)
        self.n1, self.n2, self.s = n1, n2, s
        n1.childs.append(self)
        n2.parent = self

nodes = {1 : Node(1)} # root-node
string = sys.stdin.readline()
k = int(sys.stdin.readline())
for line in sys.stdin:
    parent_idx, child_idx, start, length = [int(x) for x in line.split()]
    s = string[start-1:start-1+length]
    # every edge constructs a Node
    nodes[child_idx] = Node(child_idx)
    Edge(nodes[parent_idx], nodes[child_idx], s)

(depth, node) = nodes[1].get_deepest(k)
print node.get_string_to_root()

这篇关于最长重复(k次)子字符串的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆