Regex match not working on simple string with Pyteomics parser

Problem description

I am performing an in silico digestion of the human proteome, meaning that I am trying to chop the amino acid sequence of every protein at a certain position. I am using the Pyteomics parser (pyteomics.parser) within a bigger function that I have created.

I am getting this error: PyteomicsError: Pyteomics error, message: "Not a valid modX sequence: {'sequence': 'AKDEVQKN'}"

However, I am unsure how AKDEVQKN fails to match the compiled _modX_sequence pattern:

_modX_sequence = re.compile(r'^([^-]+-)?((?:[^A-Z-]*[A-Z])+)(-[^-]+)?$')

From my understanding of this regex, it should find any string that doesn't start with (-) and is followed by a series of alphabetical characters.

This is the function I am trying to use it on.

import re
import pyteomics
from pyteomics import fasta, parser
def ButcherShop(df, target, rule,min_length=7,exception=None,max_legnth=100, pH=2.0):
    raw = df[target]
    unique_peptides = set()
    for peptide in raw:
        new_peptides = parser.cleave(peptide, rule=rule,min_length=min_length,exception=exception)
        unique_peptides.update(new_peptides)
    print(f'Done,{len(unique_peptides)} sequences of >= 7 amino acids!')
    pep_dic = [{'sequence': i} for i in unique_peptides]
    for peptides in pep_dic:
        pep_dic['parsed_sequence'] = parser.parse(peptides,show_unmodified_termini=False)
        pep_dic['xlength'] = len(peptides)
        pep_dic['charge'] = int(round(electrochem.charge(peptides, pH=pH)))
        pep_dic['mass']=int(round(Peptide_mass(peptides)))
    pep_dic = [peptide for peptide in pep_dic if peptide['length'] <= int(max_length)]
    pep_df = pd.DataFrame.from_dict(pep_dic)
    return unique_peptides,pep_dic,pep_df

Thank you for any insight on how to address this.

Update: If I run it on a different data set, I get the same error, which may suggest that the problem is in the library itself.

Solution

Pyteomics maintainer here.

The error message actually tells you the source of the problem: PyteomicsError: Pyteomics error, message: "Not a valid modX sequence: {'sequence': 'AKDEVQKN'}"

It means that instead of a string 'AKDEVQKN' you passed a dictionary {'sequence': 'AKDEVQKN'}. This actually happens here:

pep_dic = [{'sequence': i} for i in unique_peptides]
for peptides in pep_dic:
    pep_dic['parsed_sequence'] = parser.parse(peptides,show_unmodified_termini=False)
    ...
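
A quick check with the standard re module suggests that the pattern itself is fine. The sketch below copies the _modX_sequence pattern quoted in the question and tries it on the bare string and on the text form of the dict, which is the text that appears inside the error message. Converting the dict with str() is an assumption made here for illustration only, not a statement about how Pyteomics builds the message.

```python
import re

# Pattern copied verbatim from the question.
_modX_sequence = re.compile(r'^([^-]+-)?((?:[^A-Z-]*[A-Z])+)(-[^-]+)?$')

plain = 'AKDEVQKN'
wrapped = {'sequence': 'AKDEVQKN'}

# The bare peptide string matches: every character is an uppercase residue.
plain_match = _modX_sequence.match(plain)
print(plain_match is not None)

# The text form of the dict (assumed for illustration) does not match,
# because it ends in "'}" instead of an uppercase letter.
wrapped_match = _modX_sequence.match(str(wrapped))
print(wrapped_match is None)
```

So the regex is not at fault. What fails is the type of the object handed to it.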

You should pass the sequence itself to parse, not the dict:

pep_dic['parsed_sequence'] = parser.parse(peptides['sequence'], show_unmodified_termini=False)
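
Two further details in the question's loop may need attention once parsing succeeds. The results are assigned to pep_dic, which is a list, so indexing it with a string key would raise a TypeError. The loop also writes 'xlength' while the later filter reads 'length'. Below is a minimal sketch of one way to restructure the loop. The helper name annotate_peptides and its parse_fn argument are hypothetical, introduced only so that the example runs without Pyteomics installed.

```python
def annotate_peptides(sequences, parse_fn):
    """Build one record per peptide and store results on the record itself."""
    records = [{'sequence': seq} for seq in sequences]
    for record in records:
        seq = record['sequence']                   # the string, not the dict
        record['parsed_sequence'] = parse_fn(seq)  # assign to the record, not the list
        record['length'] = len(seq)                # len(record) would count dict keys
    return records


# Stand-in parser for demonstration only (splits the string into characters).
# With Pyteomics installed, one could instead pass something like
# lambda s: parser.parse(s, show_unmodified_termini=False).
records = annotate_peptides(['AKDEVQKN'], parse_fn=list)
print(records)
```

Using one consistent key name ('length' here) keeps the later max-length filter working.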
