Regex Protein Digestion

Problem description

So, I'm digesting a protein sequence with an enzyme (for your curiosity, Asp-N) which cleaves before the residues coded by B or D in a single-letter coded sequence. My actual analysis uses String#scan for the captures. I'm trying to figure out why the following regular expression doesn't digest it correctly...

(\w*?)(?=[BD])|(.*\b)

where the trailing alternative (.*\b) exists to capture the end of the sequence. For:

MTMDKPSQYDKIEAELQDICNDVLELLDSKGDYFRYLSEVASGDN

This should give something like: [MTM, DKPSQY, DKIEAELQ, DICN, DVLELL, DSKG, ... ] but instead misses each D in the sequence.

I've been using http://www.rubular.com for troubleshooting, which runs on Ruby 1.8.7, although I've also tested this regex on 1.9.2 to no avail. It is my understanding that zero-width lookahead assertions are supported in both versions of Ruby. What am I doing wrong with my regex?

Recommended answer

The simplest way to support this is to split on the zero-width lookahead:

s = "MTMDKPSQYDKIEAELQDICNDVLELLDSKG"
p s.split(/(?=[BD])/)
#=> ["MTM", "DKPSQY", "DKIEAELQ", "DICN", "DVLELL", "DSKG"]
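
Applied to the full sequence from the question, the same split can be wrapped in a small helper. This is a minimal sketch: the method name `asp_n_digest` is hypothetical and not part of the original answer, and it assumes the input is a plain one-letter sequence string.

```ruby
# Hypothetical helper (the name is an assumption, not from the answer):
# split a one-letter protein sequence before every B or D residue.
def asp_n_digest(sequence)
  # The lookahead is zero-width, so the B or D stays at the start
  # of the fragment that follows the cut.
  sequence.split(/(?=[BD])/)
end

full = "MTMDKPSQYDKIEAELQDICNDVLELLDSKGDYFRYLSEVASGDN"
p asp_n_digest(full)
#=> ["MTM", "DKPSQY", "DKIEAELQ", "DICN", "DVLELL", "DSKG", "DYFRYLSEVASG", "DN"]
```

Because nothing is consumed by the split pattern, joining the fragments gives back the original sequence.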

To understand what was going wrong with your solution, let's first compare a regex like yours with one that works:

p s.scan(/.*?(?=[BD]|$)/)
#=> ["MTM", "", "KPSQY", "", "KIEAELQ", "", "ICN", "", "VLELL", "", "SKG", ""]

p s.scan(/.+?(?=[BD]|$)/)
#=> ["MTM", "DKPSQY", "DKIEAELQ", "DICN", "DVLELL", "DSKG"]

The problem is that if you can capture zero characters and still match your zero-width lookahead, you succeed without advancing the scanning pointer. Let's look at a simpler-but-similar test case:

s = "abcd"
p s.scan(//)      # Match any position, without advancing
#=> ["", "", "", "", ""]

p s.scan(/(?=.)/) # Anywhere that is followed by a character, without advancing
#=> ["", "", "", ""]

A naive implementation of String#scan might get stuck in an infinite loop, repeatedly matching with the pointer before the first character. It appears that once a match occurs without advancing the pointer the algorithm forcibly advances the pointer by one character. This explains the results in your case:

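  1. First, it matches all the characters up to a B or D,
  2. then it matches the zero-width position right before the B or D, without moving the character pointer,
  3. as a result, the algorithm moves the pointer past the B or D and continues from there.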
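
The three steps above can be mimicked with a hand-written loop. This is only a sketch of the behavior inferred in this answer, not Ruby's actual implementation: `naive_scan` is a hypothetical name, and it ignores capture groups (for which the real String#scan returns arrays).

```ruby
# Hypothetical re-creation of the scanning loop, for illustration only.
def naive_scan(str, regex)
  results = []
  pos = 0
  while pos <= str.length
    m = regex.match(str, pos)   # search from the current pointer
    break unless m
    results << m[0]
    if m.begin(0) == m.end(0)
      # Zero-width match: the pointer would not move, so force it
      # one character forward to avoid looping forever.
      pos = m.end(0) + 1
    else
      pos = m.end(0)
    end
  end
  results
end

s = "MTMDKPSQYDKIEAELQDICNDVLELLDSKG"
p naive_scan(s, /.*?(?=[BD]|$)/)
#=> ["MTM", "", "KPSQY", "", "KIEAELQ", "", "ICN", "", "VLELL", "", "SKG", ""]
```

The forced one-character step after each empty match is what skips every B or D, which is why requiring at least one character with `.+?` avoids the problem.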
