Select sequences in a FASTA file with more than 300 aa in which "C" occurs at least 4 times
Problem description
I have a FASTA file that contains protein sequences. I'd like to select the sequences that have more than 300 amino acids and in which the amino acid cysteine (C) appears more than 4 times.
I've used this command to select sequences with more than 300 aa:
cat 72hDOWN-fasta.fasta | bioawk -c fastx 'length($seq) > 300{ print ">"$name; print $seq }'
An example record:
>jgi|Triasp1|216614|CE216613_3477
MPSLYLTSALGLLSLLPAAQAGWNPNSKDNIVVYWGQDAGSIGQNRLSYYCENAPDVDVI
NISFLVGITDLNLNLANVGNNCTAFAQDPNLLDCPQVAADIVECQQTYGKTIMMSLFGST
YTESGFSSSSTAVSAAQEIWAMFGPVQSGNSTPRPFGNAVIDGFDFDLEDPIENNMEPFA
AELRSLTSAATSKKFYLSAAPQCVYPDASDESFLQGEVAFDWLNIQFYNNGCGTSYYPSG
YNYATWDNWAKTVSANPNTKLLVGTPASVHAVNFANYFPTNDQLAGAISSSKSYDSFAGV
MLWDMAQLFGNPGYLDLIVADLGGASTPPPPASTTLSTVTRSSTASTGPTSPPPSGGSVP
QWGQCGGQGYTGPTQCQSPYTCVVESQWWSSCQ*
Recommended answer
I do not know bioawk, but I assume it is identical to awk with some initial parsing and constant definitions.
I would proceed as follows. Assuming you want to find the sequences that contain the letter C more than 4 times and have a length of more than 300, you could do:
bioawk -c fastx '
(length($seq) > 300) && (gsub("C","C",$seq)>4) {
print ">"$name; print $seq
}' 72hDOWN-fasta.fasta
This assumes, however, that seq is the full character sequence.
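If bioawk is not at hand, or if you want to make the full-sequence assumption explicit, the wrapped sequence lines of each record can be joined in plain awk before the two tests are applied. The following is only a minimal sketch, assuming a POSIX-compatible awk; the helper name filter_fasta and its two optional threshold arguments are hypothetical and not part of bioawk.

```shell
# filter_fasta [MINLEN] [MINC]: hypothetical helper, for illustration only.
# Reads FASTA on stdin and prints every record whose joined sequence is
# longer than MINLEN (default 300) and contains more than MINC (default 4) "C".
filter_fasta() {
    awk -v minlen="${1:-300}" -v minc="${2:-4}" '
    function flush_record() {
        # gsub(/C/, "C", seq) leaves seq unchanged and returns the number of C
        if (name != "" && length(seq) > minlen && gsub(/C/, "C", seq) > minc) {
            print name
            print seq
        }
    }
    /^>/ { flush_record(); name = $0; seq = ""; next }  # header starts a new record
         { seq = seq $0 }                               # append wrapped sequence line
    END  { flush_record() }                             # emit the last record
    '
}

# Usage with the thresholds from the question:
# filter_fasta < 72hDOWN-fasta.fasta
```

Both comparisons are strict, matching "more than"; if "at least 4 times" is what you need, change the second comparison to >=. Note also that a trailing * (stop symbol), as in the example record, is counted as part of the sequence length here.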
The idea behind it is the following. The gsub function performs substitutions in a string and returns the total number of substitutions it made. Hence, if we substitute every character "C" with "C", the string is left unchanged, but we get back the total number of "C"s in the string.
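The counting trick can be tried in isolation with plain awk. This is a small sketch, assuming any POSIX-compatible awk; count_c is a hypothetical helper name used only for this illustration.

```shell
# count_c STRING: hypothetical helper, for illustration only.
# Prints how many times the letter C occurs in STRING.
count_c() {
    # gsub without a third argument works on $0; replacing C by C
    # keeps the record intact and returns the number of matches.
    printf '%s\n' "$1" | awk '{ n = gsub(/C/, "C"); print n }'
}

count_c "MCCKLCAC"   # prints 4
```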
From the POSIX standard IEEE Std 1003.1-2017:
gsub(ere, repl[, in]): Behave like sub (see below), except that it shall replace all occurrences of the regular expression (like the ed utility global substitute) in $0 or in the in argument, when specified.
sub(ere, repl[, in]): Substitute the string repl in place of the first instance of the extended regular expression ere in string in and return the number of substitutions. An <ampersand> (&) appearing in the string repl shall be replaced by the string from in that matches the ERE. An <ampersand> preceded with a <backslash> shall be interpreted as the literal <ampersand> character. An occurrence of two consecutive <backslash> characters shall be interpreted as just a single literal <backslash> character. Any other occurrence of a <backslash> (for example, preceding any other character) shall be treated as a literal <backslash> character. Note that if repl is a string literal (the lexical token STRING; see Grammar), the handling of the <ampersand> character occurs after any lexical processing, including any lexical <backslash>-escape sequence processing. If in is specified and it is not an lvalue (see Expressions in awk), the behavior is undefined. If in is omitted, awk shall use the current record ($0) in its place.
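The ampersand rules quoted above do not matter for the "C" to "C" replacement, but they are easy to check by hand. A short sketch, assuming a POSIX-compatible awk; the input string abc is made up for the illustration.

```shell
# A plain & in repl is replaced by the text that matched the ERE,
# and sub returns the number of substitutions (0 or 1).
printf 'abc\n' | awk '{ n = sub(/b/, "[&]"); print n, $0 }'    # prints: 1 a[b]c

# In a string literal, "\\&" is lexed to \& first, which sub then
# treats as a literal ampersand.
printf 'abc\n' | awk '{ sub(/b/, "[\\&]"); print }'            # prints: a[&]c
```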
Note: BioAwk is based on Brian Kernighan's awk, which is documented in "The AWK Programming Language" by Al Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X). I'm not sure whether this version is compatible with POSIX.