如何匹配的DNA序列模式 [英] how to match dna sequence pattern

查看:170
本文介绍了如何匹配的DNA序列模式的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我得到一个很难找到一种方法来解决这个问题。

I am getting a trouble finding an approach to solve this problem.

输入输出序列如下:

 **input1 :** aaagctgctagag 
 **output1 :** a3gct2ag2

 **input2 :** aaaaaaagctaagctaag 
 **output2 :** a6agcta2ag

输入nsequence可以是10 ^ 6个字符,最大的连续模式将被考虑。

Input nsequence can be of 10^6 characters and largest continuous patterns will be considered.

例如用于输入2agctaagcta输出将不会是agcta2gcta,但它会agcta2。

For example for input2 "agctaagcta" output will not be "agcta2gcta" but it will be "agcta2".

任何帮助AP preciated。

Any help appreciated.

推荐答案

该算法的说明:

  • 有一个序列S与符号S(1)中,s(2),...,S(N)。
  • 令B(ⅰ)与元件s(1)中,s(2),...,序列s(i)的最佳的COM pressed序列。
  • 因此,例如,B(3)将是对于s(1)中,s(2)中,s(3)。
  • 最好的COM pressed序列
  • 我们想知道的是B(N)。
  • Having a sequence S with symbols s(1), s(2),…, s(N).
  • Let B(i) be the best compressed sequence with elements s(1), s(2),…,s(i).
  • So, for example, B(3) will be the best compressed sequence for s(1), s(2), s(3).
  • What we want to know is B(N).

要找到它,我们将继续通过归纳。我们要计算乙第(i + 1),知道B(i)中,B(I-1),B(I-2),...,B(1),B(0),其中,B(0)为空序列,和与B(1)= S(1)。同时,这构成了证明该溶液是最佳的。 ;)

To find it, we will proceed by induction. We want to calculate B(i+1), knowing B(i), B(i-1), B(i-2), …, B(1), B(0), where B(0) is empty sequence, and and B(1) = s(1). At the same time, this constitutes a proof that the solution is optimal. ;)

要计算B(1 + 1),我们将挑选最好的序列的候选人中:

To calculate B(i+1), we will pick the best sequence among the candidates:

  1. 候选序列,其中最后一个块有一个元素:

  1. Candidate sequences where the last block has one element:

B(ⅰ)序列s(i + 1)的1 乙第(i-1)序列s(i + 1)2;仅当序列s(i)=序列s(i + 1)的 乙(ⅰ-2)序列s(i + 1)3;仅当序列s(i-1)= S(i)和序列s(i)=序列s(i + 1)的 ... B(1)序列s(i + 1)的[I-1];只有当s(2)= S(3)和s(3)= S(4)和......和序列s(i)=序列s(i + 1)的 B(0)序列s(i + 1)I =序列s(i + 1)I;只有当s(1)= S(2)和s(2)= S(3)和......和序列s(i)=序列s(i + 1)的

B(i )s(i+1)1 B(i-1)s(i+1)2 ; only if s(i) = s(i+1) B(i-2)s(i+1)3 ; only if s(i-1) = s(i) and s(i) = s(i+1) … B(1)s(i+1)[i-1] ; only if s(2)=s(3) and s(3)=s(4) and … and s(i) = s(i+1) B(0)s(i+1)i = s(i+1)i ; only if s(1)=s(2) and s(2)=s(3) and … and s(i) = s(i+1)

候选序列,其中最后一个块有2个元素:

Candidate sequences where the last block has 2 elements:

乙第(i-1)序列s(i)序列s(i + 1)的1 乙(ⅰ-3)序列s(i)序列s(i + 1)2;仅当序列s(i-2)序列s(i-1)=序列s(i)序列s(i + 1)的 乙(ⅰ-5)序列s(i)序列s(i + 1)3;仅当序列s(i-4)序列s(i-3)= S(I-2)序列s(i-1)和s(ⅰ-2)序列s(i-1)=序列s(i)序列s(i + 1个)

B(i-1)s(i)s(i+1)1 B(i-3)s(i)s(i+1)2 ; only if s(i-2)s(i-1)=s(i)s(i+1) B(i-5)s(i)s(i+1)3 ; only if s(i-4)s(i-3)=s(i-2)s(i-1) and s(i-2)s(i-1)=s(i)s(i+1) …

候选序列,其中最后一块有3个要素:

Candidate sequences where the last block has 3 elements:

...

候选序列,其中最后一块有4个元素:

Candidate sequences where the last block has 4 elements:

...

...

候选序列,其中最后一块有n + 1个元素:

Candidate sequences where last block has n+1 elements:

(1)的S(2)S(3).........序列s(i + 1)的

s(1)s(2)s(3)………s(i+1)

对于每一个可能性,该算法停止时的顺序块不再重复。就是这样。

For each possibility, the algorithm stops when the sequence block is no longer repeated. And that’s it.

该算法将一些像这样的事情在psude-C code:

The algorithm will be some thing like this in psude-c code:

B(0) = ""
for (i=1; i<=N; i++) {
    // Calculate all the candidates for B(i)
    BestCandidate=null
    for (j=1; j<=i; j++) {
        Calculate all the candidates of length (i)

        r=1;
        do {
             Candidadte = B([i-j]*r-1) s(i-j+1)…s(i-1)s(i) r
             If (   (BestCandidate==null) 
                      || (Candidate is shorter that BestCandidate)) 
                 {
            BestCandidate=Candidate.
                 }
             r++;
        } while (  ([i-j]*r <= i) 
             &&(s(i-j*r+1) s(i-j*r+2)…s(i-j*r+j) == s(i-j+1) s(i-j+2)…s(i-j+j))

    }
    B(i)=BestCandidate
}

希望这可以帮助多一点。

Hope that this can help a little more.

执行所需的任务的完整的C程序如下。它运行在为O(n ^ 2)。中心部分是只有30行code。

The full C program performing the required task is given below. It runs in O(n^2). The central part is only 30 lines of code.

修改我已经调整一点点的code,改变了变量的名称,并增加了一些注释,以便更容易阅读。

EDIT I have restructured a little bit the code, changed the names of the variables and added some comment in order to be more readable.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <limits.h>


// This struct represents a compressed segment like atg4, g3,  agc1
    struct Segment {
        char *elements;
        int nElements;
        int count;
    };
         // As an example, for the segment agagt3  elements would be:
         // {
         //      elements: "agagt",
         //      nElements: 5,
         //      count: 3
         // }

    struct Sequence {
        struct Segment lastSegment;
        struct Sequence *prev;      // Points to a sequence without the last segment or NULL if it is the first segment
        int totalLen;  // Total length of the compressed sequence.
    };
       // as an example, for the sequence agt32ta5, the representation will be:
       // {
       //     lastSegment:{"ta" , 2 , 5},
       //     prev: @A,
       //     totalLen: 8
       // }
       // and A will be
       // {
       //     lastSegment{ "agt", 3, 32},
       //     prev: NULL,
       //     totalLen: 5
       // }


// This function converts a sequence to a string.
// You have to free the string after using it.
// The strategy is to construct the string from right to left.

char *sequence2string(struct Sequence *S) {
    char *Res=malloc(S->totalLen + 1);
    char *digits="0123456789";

    int p= S->totalLen;
    Res[p]=0;

    while (S!=NULL) {
            // first we insert the count of the last element.
            // We do digit by digit starting with the units.
            int C = S->lastSegment.count;
            while (C) {
                p--;
                Res[p] = digits[ C % 10 ];
                C /= 10;
            }

            p -= S->lastSegment.nElements;
            strncpy(Res + p , S->lastSegment.elements, S->lastSegment.nElements);

            S = S ->prev;
    }


    return Res;
}


// Compresses a dna sequence.
// Returns a string with the in sequence compressed.
// The returned string must be freed after using it.
char *dnaCompress(char *in) {
    int i,j;

    int N = strlen(in);;            // Number of elements of a in sequence.



    // B is an array of N+1 sequences where B(i) is the best compressed sequence sequence of the first i characters.
    // What we want to return is B[N];
    struct Sequence *B;
    B = malloc((N+1) * sizeof (struct Sequence));

    // We first do an initialization for i=0

    B[0].lastSegment.elements="";
    B[0].lastSegment.nElements=0;
    B[0].lastSegment.count=0;
    B[0].prev = NULL;
    B[0].totalLen=0;

    // and set totalLen of all the sequences to a very HIGH VALUE in this case N*2 will be enougth,  We will try different sequences and keep the minimum one.
    for (i=1; i<=N; i++) B[i].totalLen = INT_MAX;   // A very high value

    for (i=1; i<=N; i++) {

        // at this point we want to calculate B[i] and we know B[i-1], B[i-2], .... ,B[0]
        for (j=1; j<=i; j++) {

            // Here we will check all the candidates where the last segment has j elements

            int r=1;                  // number of times the last segment is repeated
            int rNDigits=1;           // Number of digits of r
            int rNDigitsBound=10;     // We will increment r, so this value is when r will have an extra digit.
                                      // when r = 0,1,...,9 => rNDigitsBound = 10
                                      // when r = 10,11,...,99 => rNDigitsBound = 100
                                      // when r = 100,101,.,999 => rNDigitsBound = 1000 and so on.

            do {

                // Here we analitze a candidate B(i).
                // where the las segment has j elements repeated r times.

                int CandidateLen = B[i-j*r].totalLen + j + rNDigits;
                if (CandidateLen < B[i].totalLen) {
                    B[i].lastSegment.elements = in + i - j*r;
                    B[i].lastSegment.nElements = j;
                    B[i].lastSegment.count = r;
                    B[i].prev = &(B[i-j*r]);
                    B[i].totalLen = CandidateLen;
                }

                r++;
                if (r == rNDigitsBound ) {
                    rNDigits++;
                    rNDigitsBound *= 10;
                }
            } while (   (i - j*r >= 0)
                     && (strncmp(in + i -j, in + i - j*r, j)==0));

        }
    }

    char *Res=sequence2string(&(B[N]));
    free(B);

    return Res;
}

int main(int argc, char** argv) {
    char *compressedDNA=dnaCompress(argv[1]);
    puts(compressedDNA);
    free(compressedDNA);
    return 0;
}

这篇关于如何匹配的DNA序列模式的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆