快速算法独特的集文本两个很长的序列 [英] Fast algorithms for finding unique sets in two very long sequences of text

查看:126
本文介绍了快速算法独特的集文本两个很长的序列的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我需要比较的X和Y染色体的DNA序列,和寻找模式(约50-75个碱基对组成)所特有的Y染色体。请注意,这些序列的部分可以在染色体重复。这需要快速进行(BLAST需要47个天,需要几个小时或更小)。是否有任何特别的算法或方案适合于这种比较?此外,速度是关键在这里。

一个我把这个SO的一个原因是,从特定应用领域以外的人,谁可以提出自己的字符串比较使用在日常使用的算法,可以适用于我们利用获得的视角。所以,不要害羞!

解决方案
  1. 在构建序列X中的后缀树秒。
  2. 对于序列Y每个首发位置我,寻找Y [i..i + 75]在S.如果不匹配,可以发现在首发位置i(即,如果查找失败J&LT后面的字符串; 75个核苷酸配对),那么你已经找到了长度-J字符串唯一为y。
  3. 在最小的这类字符串在所有起始位置我是最短的唯一的字符串(或只是暂停后,你发现任何这样的字符串,如果你不感兴趣的最小化的长度)。

总时间:O(| X | + M | Y |),其中m为最大字符串长度(如:M = 75)

有基于广义后缀树可能是更高效的算法。

I need to compare the DNA sequences of X and Y chromosomes, and find patterns (composed of around 50-75 base pairs) that are unique to the Y chromosome. Note that these sequence parts can repeat in the chromosome. This needs to be done quickly (BLAST takes 47 days, need a few hours or less). Are there any algorithms or programs in particular suited to this kind of comparison? Again, speed is the key here.

One of the reasons I put this on SO was to get perspective from people outside the specific application domain, who can put forth algorithms they use in string comparison in their daily use, that may apply to our use. So don't be shy!

解决方案

  1. Build a suffix tree S on sequence X.
  2. For each starting position i in sequence Y, look for the string Y[i..i+75] in S. If no match can be found starting at position i (i.e. if lookup fails after j < 75 nucleotides matched) then you have found a length-j string unique to Y.
  3. The smallest such string over all starting positions i is the shortest unique string (or just halt after you find any such string if you aren't interested in minimising the length).

Total time: O(|X| + m|Y|) where m is the maximum string length (e.g. m = 75).

There are probably even more efficient algorithms based on generalised suffix trees.

这篇关于快速算法独特的集文本两个很长的序列的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆