搜索字符串,允许在字符串的任何位置出现一个不匹配 [英] Search for string allowing for one mismatch in any location of the string

查看:66
本文介绍了搜索字符串,允许在字符串的任何位置出现一个不匹配的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在处理长度为 25 的 DNA 序列(参见下面的示例).我有一个 230,000 的列表,需要查找整个基因组中的每个序列(弓形虫寄生虫).我不确定基因组有多大,但比 230,000 个序列要长得多.

I am working with DNA sequences of length 25 (see examples below). I have a list of 230,000 and need to look for each sequence in the entire genome (toxoplasma gondii parasite). I am not sure how large the genome is, but much longer than 230,000 sequences.

我需要查找每个 25 个字符的序列,例如 (AGCCTCCCATGATTGAAACAGATCAT).

I need to look for each of my sequences of 25 characters, for example, (AGCCTCCCATGATTGAACAGATCAT).

基因组被格式化为一个连续的字符串,即 (CATGGGAGGCTTGCGGAGCCTGAGGGCGGAGCCTGAGGTGGGAGGCTTGCGGAGTGCGGAGCCTGAGCCTGAGGGCGGAGCCTGAGGTGGGAGGCTT....)

The genome is formatted as a continuous string, i.e. (CATGGGAGGCTTGCGGAGCCTGAGGGCGGAGCCTGAGGTGGGAGGCTTGCGGAGTGCGGAGCCTGAGCCTGAGGGCGGAGCCTGAGGTGGGAGGCTT....)

我不在乎它在哪里或找到了多少次,只关心它是否存在.
我觉得这很简单 --

I don't care where or how many times it is found, only whether it is or not.
This is simple I think --

str.find(AGCCTCCCATGATTGAACAGATCAT)

但我也发现了一个接近匹配的地方,定义为错误(不匹配)的任何位置,但只有一个位置,并在序列中记录该位置.我不确定如何做到这一点.我唯一能想到的是使用通配符并在每个位置使用通配符执行搜索.即,搜索 25 次.

But I also what to find a close match defined as wrong (mismatched) at any location, but only one location, and record the location in the sequence. I am not sure how do do this. The only thing I can think of is using a wildcard and performing the search with a wildcard in each position. I.e., search 25 times.

例如

AGCCTCCCATGATTGAACAGATCAT    
AGCCTCCCATGATAGAACAGATCAT

在第 13 位错配的接近匹配.

A close match with a mismatch at position 13.

速度不是大问题,因为我只做了 3 次,不过如果速度快就好了.

Speed is not a big issue because I am only doing it 3 times, though it would be nice if it was fast.

有些程序可以执行此操作——查找匹配项和部分匹配项——但我正在寻找一种在这些应用程序中无法发现的部分匹配项.

There are programs that do this -- find matches and partial matches -- but I am looking for a type of partial match that is not discoverable with these applications.

这是 perl 的类似帖子,尽管它们只是比较序列而不是搜索连续字符串:

Here is a similar post for perl, although they are only comparing sequences and not searching a continuous string :

相关帖子

推荐答案

在您继续阅读之前,您是否看过 biopython?

您似乎想要找到具有一个替换错误和零插入/删除错误的近似匹配,即汉明距离为 1.

It appears that you want to find approximate matches with one substitution error, and zero insertion/deletion errors i.e. a Hamming distance of 1.

如果您有汉明距离匹配功能(例如参见 Ignacio 提供的链接),您可以像这样使用它来搜索第一个匹配项:

If you have a Hamming distance match function (see e.g. the link provided by Ignacio), you could use it like this to do a search for the first match:

any(Hamming_distance(genome[x:x+25], sequence) == 1 for x in xrange(len(genome)))

但这会相当慢,因为 (1) 汉明距离函数会在第二次替换错误后继续研磨 (2) 失败后,它会将光标向前移动一个,而不是根据它看到的内容向前跳过(例如Boyer-Moore 搜索确实如此).

but this would be rather slow, because (1) the Hamming distance function would keep on grinding after the 2nd substitution error (2) after failure, it advances the cursor by one rather than skipping ahead based on what it saw (like a Boyer-Moore search does).

你可以用这样的函数克服(1):

You can overcome (1) with a function like this:

def Hamming_check_0_or_1(genome, posn, sequence):
    errors = 0
    for i in xrange(25):
        if genome[posn+i] != sequence[i]:
            errors += 1
            if errors >= 2:
                return errors
    return errors 

注意:这不是 Pythonic,而是 Cic,因为您需要使用 C(可能通过 Cython)来获得合理的速度.

Note: that's intentionally not Pythonic, it's Cic, because you'd need to use C (perhaps via Cython) to get reasonable speed.

Navarro 和 Raffinot(谷歌Navarro Raffinot nrgrep")已经完成了一些关于带跳过的位并行近似 Levenshtein 搜索的工作,这可以适用于汉明搜索.请注意,位并行方法对查询字符串的长度和字母大小有限制,但您的方法分别为 25 和 4,因此没有问题.更新:对于字母大小为 4 的情况,跳过可能没有太大帮助.

Some work on bit-parallel approximate Levenshtein searches with skipping has been done by Navarro and Raffinot (google "Navarro Raffinot nrgrep") and this could be adapted to Hamming searches. Note that bit-parallel methods have limitations on length of query string and alphabet size but yours are 25 and 4 respectively so no problems there. Update: skipping probably not much help with an alphabet size of 4.

当您在 google 上搜索汉明距离搜索时,您会注意到很多关于在硬件中实现它的内容,而在软件中却没有太多内容.这是一个很大的提示,也许您想出的任何算法都应该用 C 或其他编译语言实现.

When you google for Hamming distance search, you will notice lots of stuff about implementing it in hardware, and not much in software. This is a big hint that maybe whatever algorithm you come up with ought to be implemented in C or some other compiled language.

更新: 位并行方法的工作代码

我还提供了一种帮助进行正确性检查的简单方法,并且我打包了 Paul 的 re 代码的变体以进行一些比较.请注意,使用 re.finditer() 提供不重叠的结果,这可能会导致距离为 1 的匹配掩盖精确匹配;查看我的最后一个测试用例.

I've also supplied a simplistic method for helping with the correctness checking, and I've packaged a variation of Paul's re code for some comparisons. Note that using re.finditer() delivers non-overlapping results, and this can cause a distance-1 match to shadow an exact match; see my last test case.

位并行方法具有以下特点:保证线性行为 O(N),其中 N 是文本长度.注意天真的方法是 O(NM) 正则表达式方法(M 是模式长度).Boyer-Moore 风格的方法将是最坏的情况 O(NM) 和预期的 O(N).当输入必须被缓冲时,位并行方法也可以很容易地使用:它可以一次输入一个字节或一个兆字节;没有前瞻,缓冲区边界没有问题.最大的优势:用 C 编码时,简单的每输入字节代码的速度.

The bit-parallel method has these features: guaranteed linear behaviour O(N) where N is text length. Note naive method is O(NM) as is the regex method (M is the pattern length). A Boyer-Moore-style method would be worst case O(NM) and expected O(N). Also the bit-parallel method can be used easily when input has to be buffered: it can be fed a byte or a megabyte at a time; no look-ahead, no problems with buffer boundaries. The big advantage: the speed of that simple per-input-byte code when coded in C.

缺点:模式长度实际上受限于快速寄存器中的位数,例如32 或 64.在这种情况下,模式长度为 25;没问题.它使用与模式中不同字符数成正比的额外内存 (S_table).在这种情况下,字母大小"仅为 4;没问题.

Downsides: the pattern length is effectively limited to the number of bits in a fast register e.g. 32 or 64. In this case the pattern length is 25; no problem. It uses extra memory (S_table) proportional to the number of distinct characters in the pattern. In this case, the "alphabet size" is only 4; no problem.

来自本技术报告的详细信息.该算法用于使用 Levenshtein 距离进行近似搜索.为了转换为使用汉明距离,我简单地(!)删除了处理插入和删除的语句 2.1.您会注意到很多对带有d"上标的R"的引用.d"是距离.我们只需要 0 和 1.这些R"成为下面代码中的 R0 和 R1 变量.

Details from this technical report. The algorithm there is for approximate search usin Levenshtein distance. To convert to using Hamming distance, I simply (!) removed the pieces of statement 2.1 that handle insertion and deletion. You'll notice lots of reference to "R" with a "d" superscript. "d" is distance. We need only 0 and 1. These "R"s become the R0 and R1 variables in the code below.

# coding: ascii

from collections import defaultdict
import re

_DEBUG = 0


# "Fast Text Searching with Errors" by Sun Wu and Udi Manber
# TR 91-11, Dept of Computer Science, University of Arizona, June 1991.
# http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.20.8854

def WM_approx_Ham1_search(pattern, text):
    """Generate (Hamming_dist, start_offset)
    for matches with distance 0 or 1"""
    m = len(pattern)
    S_table = defaultdict(int)
    for i, c in enumerate(pattern):
        S_table[c] |= 1 << i
    R0 = 0
    R1 = 0
    mask = 1 << (m - 1)
    for j, c in enumerate(text):
        S = S_table[c]
        shR0 = (R0 << 1) | 1
        R0 = shR0 & S
        R1 = ((R1 << 1) | 1) & S | shR0
        if _DEBUG:
            print "j= %2d msk=%s S=%s R0=%s R1=%s" \
                % tuple([j] + map(bitstr, [mask, S, R0, R1]))
        if R0 & mask: # exact match
            yield 0, j - m + 1
        elif R1 & mask: # match with one substitution
            yield 1, j - m + 1

if _DEBUG:

    def bitstr(num, mlen=8):
       wstr = ""
       for i in xrange(mlen):
          if num & 1:
             wstr = "1" + wstr
          else:
             wstr = "0" + wstr
          num >>= 1
       return wstr

def Ham_dist(s1, s2):
    """Calculate Hamming distance between 2 sequences."""
    assert len(s1) == len(s2)
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

def long_check(pattern, text):
    """Naively and understandably generate (Hamming_dist, start_offset)
    for matches with distance 0 or 1"""
    m = len(pattern)
    for i in xrange(len(text) - m + 1):
        d = Ham_dist(pattern, text[i:i+m])
        if d < 2:
            yield d, i

def Paul_McGuire_regex(pattern, text):
    searchSeqREStr = (
        '('
        + pattern
        + ')|('
        + ')|('.join(
            pattern[:i]
            + "[ACTGN]".replace(c,'')
            + pattern[i+1:]
            for i,c in enumerate(pattern)
            )
        + ')'
        )
    searchSeqRE = re.compile(searchSeqREStr)
    for match in searchSeqRE.finditer(text):
        locn = match.start()
        dist = int(bool(match.lastindex - 1))
        yield dist, locn


if __name__ == "__main__":

    genome1 = "TTTACGTAAACTAAACTGTAA"
    #         01234567890123456789012345
    #                   1         2

    tests = [
        (genome1, "ACGT ATGT ACTA ATCG TTTT ATTA TTTA"),
        ("T" * 10, "TTTT"),
        ("ACGTCGTAAAA", "TCGT"), # partial match can shadow an exact match
        ]

    nfailed = 0
    for genome, patterns in tests:
        print "genome:", genome
        for pattern in patterns.split():
            print pattern
            a1 = list(WM_approx_Ham1_search(pattern, genome))
            a2 = list(long_check(pattern, genome))
            a3 = list(Paul_McGuire_regex(pattern, genome))
            print a1
            print a2
            print a3
            print a1 == a2, a2 == a3
            nfailed += (a1 != a2 or a2 != a3)
    print "***", nfailed

这篇关于搜索字符串,允许在字符串的任何位置出现一个不匹配的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆