Select sequences in a FASTA file with more than 300 aa in which "C" occurs at least 4 times


Problem description

I have a FASTA file that contains protein sequences. I'd like to select the sequences that have more than 300 amino acids and in which the amino acid cysteine (C) appears more than 4 times.

I've used this command to select sequences with more than 300 aa:

 cat 72hDOWN-fasta.fasta | bioawk -c fastx 'length($seq) > 300{ print ">"$name; print $seq }' 

An example sequence:

  >jgi|Triasp1|216614|CE216613_3477
 MPSLYLTSALGLLSLLPAAQAGWNPNSKDNIVVYWGQDAGSIGQNRLSYYCENAPDVDVI
 NISFLVGITDLNLNLANVGNNCTAFAQDPNLLDCPQVAADIVECQQTYGKTIMMSLFGST
 YTESGFSSSSTAVSAAQEIWAMFGPVQSGNSTPRPFGNAVIDGFDFDLEDPIENNMEPFA
 AELRSLTSAATSKKFYLSAAPQCVYPDASDESFLQGEVAFDWLNIQFYNNGCGTSYYPSG
 YNYATWDNWAKTVSANPNTKLLVGTPASVHAVNFANYFPTNDQLAGAISSSKSYDSFAGV
 MLWDMAQLFGNPGYLDLIVADLGGASTPPPPASTTLSTVTRSSTASTGPTSPPPSGGSVP
 QWGQCGGQGYTGPTQCQSPYTCVVESQWWSSCQ* 

Recommended answer

I don't know bioawk, but I believe it behaves like standard awk here.

From the POSIX standard IEEE Std 1003.1-2017:

gsub(ere, repl[, in]): Behave like sub (see below), except that it shall replace all occurrences of the regular expression (like the ed utility global substitute) in $0 or in the in argument, when specified.

sub(ere, repl[, in]): Substitute the string repl in place of the first instance of the extended regular expression ere in string in and return the number of substitutions. An <ampersand> ( & ) appearing in the string repl shall be replaced by the string from in that matches the ERE. An <ampersand> preceded with a <backslash> shall be interpreted as the literal <ampersand> character. An occurrence of two consecutive <backslash> characters shall be interpreted as just a single literal <backslash> character. Any other occurrence of a <backslash> (for example, preceding any other character) shall be treated as a literal <backslash> character. Note that if repl is a string literal (the lexical token STRING; see Grammar), the handling of the <ampersand> character occurs after any lexical processing, including any lexical <backslash>-escape sequence processing. If in is specified and it is not an lvalue (see Expressions in awk), the behavior is undefined. If in is omitted, awk shall use the current record ($0) in its place.
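Because gsub returns the number of substitutions it made, it can double as a character counter. The following is a minimal sketch in plain awk, not bioawk; the helper name count_c is made up for illustration. It runs gsub on a copy of the string, since gsub modifies its target.

```shell
# count_c (hypothetical helper): print how many times "C" occurs in $1.
count_c() {
    awk -v s="$1" 'BEGIN {
        tmp = s                  # work on a copy; gsub modifies its target
        n = gsub(/C/, "", tmp)   # return value = number of replacements made
        print n
    }'
}

count_c "MCCAQC"

# Untested assumption: if bioawk keeps this gsub behaviour, the same test could
# be added to the original filter, replacing each C with itself so that $seq
# is left unchanged:
#   bioawk -c fastx 'length($seq) > 300 && gsub(/C/, "C", $seq) > 4 { print ">"$name; print $seq }' 72hDOWN-fasta.fasta
```

The bioawk line is left as a comment because it has not been checked against a bioawk build.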

Note: BioAwk is based on Brian Kernighan's awk, which is documented in "The AWK Programming Language" by Al Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X). I'm not sure whether this version is compatible with POSIX.
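If bioawk's compatibility is in doubt, the same selection can be sketched in plain POSIX awk. This is an illustrative fallback rather than a bioawk feature: the function name filter_fasta and its two threshold arguments are assumptions. It joins multi-line records by hand, and it counts a trailing "*" as part of the sequence length.

```shell
# filter_fasta MIN_LEN MIN_C < input.fasta   (hypothetical helper)
# Prints records whose sequence is longer than MIN_LEN characters and
# contains more than MIN_C occurrences of "C".
filter_fasta() {
    awk -v min_len="$1" -v min_c="$2" '
        function flush_record(   tmp) {
            if (name == "") return
            tmp = seq                      # gsub modifies its target, so count on a copy
            if (length(seq) > min_len && gsub(/C/, "", tmp) > min_c)
                print name "\n" seq
        }
        /^[ \t]*>/ {                       # a header line starts a new record
            flush_record()
            name = $0
            sub(/^[ \t]+/, "", name)       # drop leading whitespace from the header
            seq = ""
            next
        }
        {
            gsub(/[ \t\r]/, "")            # drop stray whitespace inside sequence lines
            seq = seq $0                   # join wrapped sequence lines
        }
        END { flush_record() }
    '
}
```

With the thresholds from the question this would be called as `filter_fasta 300 4 < 72hDOWN-fasta.fasta`. Both comparisons are strict, so pass 3 as the second argument if "at least 4" is what is wanted.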
