'StringCut' to the left or right of a defined position using Mathematica

Question

On reading this question, I thought the following problem would be simple using StringSplit

Given the following string, I want to 'cut' it to the left of every "D" such that:

  1. I get a list of fragments (with the sequence unchanged)

  2. StringJoin@fragments gives back the original string (but it does not matter if I have to reorder the fragments to obtain this). That is, the sequence within each fragment is important, and I do not want to lose any characters.

(The example I am interested in is a protein sequence (string) where each character represents an amino acid in one-letter code. I want to obtain the theoretical list of ALL fragments obtained by treating with an enzyme known to split before "D")

str = "MTPDKPSQYDKIEAELQDICNDVLELLDSKGDYFRYLSEVASGDN"

The best I can come up with is to insert a space before each "D" using StringReplace and then use StringSplit. This seems quite awkward, to say the least.

frags1 = StringSplit@StringReplace[str, "D" -> " D"]

which gives the output:

{"MTP", "DKPSQY", "DKIEAELQ", "DICN", "DVLELL", "DSKG", "DYFRYLSEVASG", "DN"}
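The two requirements can be checked directly. The following is only a sketch, assuming str contains no whitespace (otherwise the space-based trick would lose characters); it repeats the definitions so that it runs on its own.

```mathematica
str = "MTPDKPSQYDKIEAELQDICNDVLELLDSKGDYFRYLSEVASGDN";
frags1 = StringSplit@StringReplace[str, "D" -> " D"];

(* requirement 2: joining the fragments gives back the original string *)
StringJoin[frags1] === str
(* True *)

(* every fragment after the first should start with the cut residue "D" *)
And @@ (StringTake[#, 1] == "D" & /@ Rest[frags1])
(* True *)
```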

Or, alternatively, using StringReplacePart:

frags1alt = 
 StringSplit@StringReplacePart[str, " D", StringPosition[str, "D"]]

Finally (and more realistically), if I want to split before "D" provided that the residue immediately preceding it is not "P" (i.e. P-D (Pro-Asp) bonds are not cleaved), I do it as follows:

StringSplit@StringReplace[str, (x_ /; x != "P") ~~ "D" -> x ~~ " D"]

Is there a more elegant way?

Speed is not necessarily an issue. I am unlikely to be dealing with strings of greater than, say, 500 characters. I am using Mma 7.

Update

I have added the bioinformatics tag, and I thought it might be of interest to add an example from that field.

The following imports a protein sequence (Bovine serum albumin, accession number 3336842) from the NCBI database using eutils and then generates a (theoretical) trypsin digest. I have assumed that the enzyme trypsin cleaves between residues A1-A2 when A1 is either "R" or "K", provided that A2 is not "R", "K" or "P". If anyone has any suggestions for improvements, please feel free to suggest modifications.

Using a modification of sakra's method (a carriage return after '?db=' may need to be removed):

StringJoin /@ 
   Split[Characters[#], 
    And @@ Function[x, #1 != x] /@ {"R", "K"} || 
      Or @@ Function[xx, #2 == xx] /@ {"R", "K", "P"} &] & @
 StringJoin@
  Rest@Import[
    "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=\
protein&id=3336842&rettype=fasta&retmode=text", "Data"]
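To look at the cleavage rule in isolation, without the network import, here is a sketch on a short made-up peptide. The names toy and noCutQ are introduced only for illustration, and the string is not a real sequence. Split keeps two neighbours together while the test gives True, so the test has to describe the pairs that are not cleaved.

```mathematica
toy = "MKWVRRPAKGKPLRDE";  (* hypothetical peptide, for illustration only *)

(* True when the bond a1-a2 is NOT cleaved:
   a1 is neither "R" nor "K", or a2 is one of "R", "K", "P" *)
noCutQ[a1_, a2_] := ! MemberQ[{"R", "K"}, a1] || MemberQ[{"R", "K", "P"}, a2]

StringJoin /@ Split[Characters[toy], noCutQ]
(* {"MK", "WVRRPAK", "GKPLR", "DE"} *)
```

The R-R, R-P and K-P bonds in the toy string stay intact, while K-W, K-G and R-D are cut.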

My possibly ham-fisted attempt at using the regex method (Sasha/WReach) to do the same thing:

StringSplit[#, RegularExpression["(?![PKR])(?<=[KR])"]] &@
 StringJoin@Rest@Import[...]

Output

{MK,WVTFISLLLLFSSAYSR,GVFRR,<<69>>,CCAADDK,EACFAVEGPK,LVVSTQTALA}
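The same regular expression can be tried offline on a short made-up peptide (toy is a hypothetical name and string, not part of the original data). The pattern matches an empty position that has "K" or "R" behind it and anything other than "P", "K" or "R" ahead of it, so StringSplit cuts there without removing any characters.

```mathematica
toy = "MKWVRRPAKGKPLRDE";  (* hypothetical peptide, for illustration only *)

StringSplit[toy, RegularExpression["(?![PKR])(?<=[KR])"]]
(* {"MK", "WVRRPAK", "GKPLR", "DE"} *)
```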

Recommended answer

Here are some alternate solutions:

Splitting by any occurrence of "D":

In[18]:= StringJoin /@ Split[Characters["MTPDKPSQYDKIEAELQDICNDVLELLDSKGDYFRYLSEVASGDN"], #2!="D" &]
Out[18]:= {"MTP", "DKPSQY", "DKIEAELQ", "DICN", "DVLELL", "DSKG", "DYFRYLSEVASG", "DN"}
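Split[list, test] keeps two neighbouring elements in the same sublist as long as test gives True for them, so the test #2 != "D" & starts a new sublist exactly when the second element of a pair is "D". A small sketch on made-up strings (not from the question) shows this, including the case of two adjacent cut sites:

```mathematica
Split[Characters["ABDCD"], #2 != "D" &]
(* {{"A", "B"}, {"D", "C"}, {"D"}} *)

(* every neighbouring pair is tested, so adjacent "D"s are both cut *)
StringJoin /@ Split[Characters["ADD"], #2 != "D" &]
(* {"A", "D", "D"} *)

(* for comparison: the conditional StringReplace pattern from the question
   consumes the character before each "D", and matches cannot overlap,
   so the second of two adjacent "D"s is not split off *)
StringSplit@StringReplace["ADD", (x_ /; x != "P") ~~ "D" :> x ~~ " D"]
(* {"A", "DD"} *)
```

The sample string in the question contains no "DD", so the difference does not show up there.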

Splitting by any occurrence of "D" provided it is not preceded by "P":

In[19]:= StringJoin /@ Split[Characters["MTPDKPSQYDKIEAELQDICNDVLELLDSKGDYFRYLSEVASGDN"], #2!="D" || #1=="P" &]
Out[19]:= {"MTPDKPSQY", "DKIEAELQ", "DICN", "DVLELL", "DSKG", "DYFRYLSEVASG", "DN"}
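For completeness, the regex method mentioned in the question's update can be applied to the "D" problem as well. The patterns below are an assumption about how such a lookaround-based split might be written, not a quotation of those answers. Lookarounds match a position rather than characters, so nothing is removed from the string.

```mathematica
str = "MTPDKPSQYDKIEAELQDICNDVLELLDSKGDYFRYLSEVASGDN";

(* cut at every position that is followed by "D" *)
StringSplit[str, RegularExpression["(?=D)"]]
(* {"MTP", "DKPSQY", "DKIEAELQ", "DICN", "DVLELL", "DSKG", "DYFRYLSEVASG", "DN"} *)

(* the same, but only where the preceding character is not "P" *)
StringSplit[str, RegularExpression["(?<!P)(?=D)"]]
(* {"MTPDKPSQY", "DKIEAELQ", "DICN", "DVLELL", "DSKG", "DYFRYLSEVASG", "DN"} *)
```

These should reproduce Out[18] and Out[19] respectively.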
