如何基于模式的列表上记号化的条纹字符串 [英] How to tokenize a striped string based on a list of patterns

查看:119
本文介绍了如何基于模式的列表上记号化的条纹字符串的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

给定一个串S和模式的列表L [L <子> 1 ,...,L <子> N ],你会怎么发现在S中令牌名单匹配在L和图案,使得S中的字母匹配的总数最大化

Given a string S and a list L of patterns [L1, ..., Ln], how would you find the list of all tokens in S matching a pattern in L and so that the total number of matched letters in S is maximized?

一个虚拟的例子是S =thenuke,L = {的,然后,核弹},我们想检索[了,核弹],如果我们开始通过匹配那么,我们没有得到解决最大化的字母S是匹配的总数。

A dummy example would be S = "thenuke", L = {"the", "then", "nuke"} and we would like to retrieve ["the", "nuke"] as if we start by matching "then", we do not get the solution maximizing the total number of letters in S being matched.

我一直在寻找其他等问题,字符串匹配算法但没有发现任何有效的解决问题的最大化一部分。 这一定是如研究在生物信息学,但我没有在现场,所以任何帮助(包括链接到学术论文)深深AP preciated!

I have been looking at other SO questions, string matching algorithms but found nothing to efficiently solve the maximization part of the problem. This must have been studied e.g. in bioinformatics but I'm not in the field so any help (including link to academic papers) deeply appreciated!

推荐答案

这可以为O解决(| S | + | L | + K)的时间,其中k为所有字符串由L比赛的总人数S.有两个主要步骤:

This can be solved in O(|S| + |L| + k) time, where k is the total number of matches of all strings from L in S. There are two major steps:

  1. 运行阿霍Corasick 。这会给你从L的S.此运行的同时上述任何字符串的所有比赛。

  1. Run Aho-Corasick. This will give you all matches of any string from L in S. This runs in the same time as mentioned above.

初始化数组长度的整数,A,| S | + 1为零。 3月通过阵列,在位置我设置A [1]到A [I-1]如果是较大的,那么对于每场比赛,男,由L S中的位置i,设A [1 + | M |]到A的最大值[I + | M |]和A [I] + | M |。

Initialize an array, A, of integers of length |S| + 1 to all zeros. March through the array, at position i set A[i] to A[i-1] if it is larger, then for every match, M, from L in S at position i, set A[i+|M|] to the max of A[i+|M|] and A[i] + |M|.

下面是一些code,中去,正是这一点。它采用了包我写了,有一个方便的包装要求阿霍Corasick。

Here is some code, in Go, that does exactly this. It uses a package I wrote that has a convenient wrapper for calling Aho-Corasick.

package main

import (
  "fmt"
  "github.com/runningwild/stringz"
)

func main() {
  input := []byte("thenuke")
  patterns := [][]byte{[]byte("hen"), []byte("thenu"), []byte("uke")}

  // This runs Aho-Corasick on the input string and patterns, it returns a
  // map, matches, such that matches[i] is a list of indices into input where
  // patterns[i] matches the input string.  The running time of this is
  // O(|input| + |patterns| + k) and uses O(|patterns| + k) auxillary storage,
  // where k is the total number of matches found.
  find := stringz.FindSet(patterns)
  matches := find.In(input)

  // We want to invert the map so that it maps from index to all strings that
  // match at that index.
  at_pos := make([][]int, len(input)+1)
  for index, hits := range matches {
    for _, hit := range hits {
      at_pos[hit] = append(at_pos[hit], index)
    }
  }

  // Now we do a single pass through the string, at every position we will
  // keep track of how many characters in the input string we can match up to
  // that point.
  max := make([]int, len(input)+1)
  for i := range max {
    // If this position isn't as good as the previous position, then we'll use
    // the previous position.  It just means that there is a character that we
    // were unable to match anything to.
    if i > 0 && max[i-1] > max[i] {
      max[i] = max[i-1]
    }

    // Look through any string that matches at this position, if its length is
    // L, then L positions later in the string we can have matched L more
    // character if we use this string now.  This might mean that we don't use
    // another string that earlier we thought we'd be matching right now, we'll
    // find out later which one was better.
    for _, hit := range at_pos[i] {
      L := len(patterns[hit])
      if i+L < len(max) && max[i+L] < max[i]+L {
        max[i+L] = max[i] + L
      }
    }
  }

  fmt.Printf("%v\n", max)
}

这篇关于如何基于模式的列表上记号化的条纹字符串的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆