查找连续重复序列的算法 [英] Algorithm for finding continuous repeated sequences

查看:351
本文介绍了查找连续重复序列的算法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在寻找一种在基因组序列中找到短串联重复序列的算法.

I'm looking for an algorithm that finds short tandem repeats in a genome sequence.

基本上,给定一个非常长的字符串,该字符串只能包含4个字符"ATCG",我需要查找彼此相邻的2-5个字符之间的短重复.

Basically, given a really long string which can only consist of the 4 characters 'ATCG', I need to find short repeats between 2-5 characters long that are next to each other.

例如: TACATGAGATCATGATGATGATGATGGAGCTGTGAGATC 会重复给 ATGATGATG 或ATG 3次

ex: TACATGAGATCATGATGATGATGATGGAGCTGTGAGATC would give ATGATGATG or ATG repeated 3 times

该算法需要扩展到一百万个字符的字符串,因此我正在尝试尽可能接近线性运行时.

The algorithm needs to scale up to a string of 1 million characters so I'm trying to get as close to linear runtime as possible.

我当前的算法: 由于重复的长度可以是2-5个字符,因此我逐个字符地检查字符串,看看第N个字符是否与第N + X个字符相同,且X为2到5.每个X都有一个计数器,该计数器按顺序计数匹配并在不匹配时重置,我们知道X =计数器时是否存在重复.随后可以手动检查后续重复.

My current algorithm: Since the repeats can be 2-5 characters long, I check the string character by character and see if the Nth character is the same as the N+Xth character, with X being 2 through 5. With a counter for each X that counts sequential matches and resets at a mismatch, we know if there is a repeat when X = the counter. The subsequent repeats can then be checked manually.

推荐答案

您正在查看赋予您O(n)的每个字符,因为您在每个字符上比较了下一个( maximum )个五个字符这给你一个常数c:

You are looking at each character which gives you O(n), since you compare on each character the next (maximum) five characters this gives you a constant c:

var data    = get_input();
var compare = { `A`, `T`, `G`, `A`, `T` }         // or whatever
var MAX_LOOKAHEAD = compare.length
var n
var c

for(n = data_array.length; n < size; i++) {       // Has runtime O(n)

  for(c = 0; c < MAX_LOOKAHEAD; c++) {            // Maximum O(c)

    if( compare[c] != data[i+c] ) {
      break;
    } else {
      report( "found match at position " + i )
    }

  }
}

很容易看到它运行了O(n*c)次.由于c非常小,因此可以忽略-我认为不能摆脱该常数-这导致O(n)的总运行时间.

It is easy to see that this runs O(n*c) times. Since c is very small it can be ignored - and I think one can not get rid of that constant - which results in a total runtime of O(n).

好消息:

您可以通过并行化来加快速度.例如.您可以按k间隔进行拆分,并通过给它们适当的开始和结束索引让多个线程为您完成工作.这可以为您提供线性加速.

You can speed this up with parallelization. E.g. you could split this up in k intervals and let multiple threads do the job for you by giving them appropriate start and end indices. This could give you a linear speedup.

如果这样做,请确保将相交视为特殊情况,因为如果间隔将匹配一分为二,则可能会丢失匹配.

If you do that make sure that you treat the intersections as special cases since you could miss a match if your intervals split a match in two.

例如n = 50000:

4个线程的分区:(n/10000) - 1 = 4.第五线程没有什么事情要做,因为它只处理交集,这就是为什么我们不需要考虑它的开销(在我们的例子中很小的情况)的原因.

Partition for 4 threads: (n/10000) - 1 = 4. The 5th thread won't have a lot to do since it just handles the intersections which is why we don't need to consider its (in our case tiny) overhead.

1                 10000               20000               40000               50000
|-------------------|-------------------|-------------------|-------------------|
| <-   thread 1  -> | <-   thread 2  -> | <-   thread 3  -> | <-   thread 4  -> |
                  |---|               |---|               |---|              
                    |___________________|___________________|
                                        |
                                     thread 5

这就是它的样子:

var data;
var compare = { `A`, `T`, `G`, `A`, `T` };
var MAX_LOOKAHEAD = compare.length;

thread_function(args[]) {

    var from = args[0];
    var to   = args[1];

    for(n = from ; n < to ; i++) {

      for(c = 0; c < MAX_LOOKAHEAD; c++) {
        if( compare[c] != data[i+c] ) {
          break;
        } else {
          report( "found match at position " + i )
        }
      }
    }
}

main() {
    var data_size     = 50000;
    var thread_count  = 4;
    var interval_size = data_size / ( thread_count + 1) ;

    var tid[]

    // This loop starts the threads for us:

    for( var i = 0; i < thread_count; i++ ) {
        var args = { interval_size * i, (interval_size * i) + interval_size };

        tid.add( create_thread( thread_function, args ) );
    }

    // And this handles the intersections:

    for( var i = 1; i < thread_count - 1; i++ ) {
        var args = { interval_size * i, (interval_size * i) + interval_size };

        from = (interval_size * i) - compare.length + 1;
        to   = (interval_size * i) + compare.length - 1;

        for(j = from; j < to ; j++) {

            for(k = 0; k < MAX_LOOKAHEAD; k++) {
                if( compare[k] != data[j+k] ) {
                    break;
                } else {
                    report( "found match at position " + j )
                }
            }
        }
    }

    wait_for_multiple_threads( tid );
}

这篇关于查找连续重复序列的算法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆