Select sequences in a fasta file with more than 300 aa and "C" occurs at least 4 times

Views: 86 · Published: 2021/4/15 19:46:45 · Tags: linux awk bioinformatics sequences fasta

Problem description

I have a FASTA file that contains protein sequences. I'd like to select the sequences that have more than 300 amino acids and in which the amino acid cysteine (C) appears more than 4 times.

I've used this command to select sequences with more than 300 aa:

 cat 72hDOWN-fasta.fasta | bioawk -c fastx 'length($seq) > 300{ print ">"$name; print $seq }'

An example record:

  >jgi|Triasp1|216614|CE216613_3477
 MPSLYLTSALGLLSLLPAAQAGWNPNSKDNIVVYWGQDAGSIGQNRLSYYCENAPDVDVI
 NISFLVGITDLNLNLANVGNNCTAFAQDPNLLDCPQVAADIVECQQTYGKTIMMSLFGST
 YTESGFSSSSTAVSAAQEIWAMFGPVQSGNSTPRPFGNAVIDGFDFDLEDPIENNMEPFA
 AELRSLTSAATSKKFYLSAAPQCVYPDASDESFLQGEVAFDWLNIQFYNNGCGTSYYPSG
 YNYATWDNWAKTVSANPNTKLLVGTPASVHAVNFANYFPTNDQLAGAISSSKSYDSFAGV
 MLWDMAQLFGNPGYLDLIVADLGGASTPPPPASTTLSTVTRSSTASTGPTSPPPSGGSVP
 QWGQCGGQGYTGPTQCQSPYTCVVESQWWSSCQ*

Recommended answer

I do not know bioawk but I assume it is identical to awk with some initial parsing and constant definitions.

I would proceed as follows. Assuming you want to find the sequences that contain the letter C more than 4 times and are longer than 300 characters, you could do:

bioawk -c fastx '
   (length($seq) > 300) && (gsub("C","C",$seq)>4) {
       print ">"$name; print $seq
   }' 72hDOWN-fasta.fasta

but this assumes that seq is the full character sequence.
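If that assumption is in doubt, or bioawk is not available, the same filter can be sketched in plain POSIX awk by joining the wrapped lines of each record before testing it. This is only an illustration under stated assumptions: `filter_fasta` is a hypothetical helper name, header lines are assumed to start with `>` in the first column, and the two thresholds are exposed as optional arguments because the title ("at least 4") and the question text ("more than 4") disagree.

```shell
# filter_fasta FILE [MIN_LEN] [MIN_C]
# Hypothetical helper: prints every record whose joined sequence is longer
# than MIN_LEN and contains more than MIN_C "C" characters.
# Defaults are 300 and 4, mirroring the thresholds in the question.
# Unlike bioawk's $name, the whole header line (minus ">") is printed.
filter_fasta() {
    awk -v min_len="${2:-300}" -v min_c="${3:-4}" '
        function flush_record(   tmp) {
            if (name == "") return
            tmp = seq   # count on a copy so seq itself is never modified
            # gsub returns the number of replacements, i.e. the C count
            if (length(seq) > min_len && gsub(/C/, "C", tmp) > min_c)
                printf ">%s\n%s\n", name, seq
        }
        /^>/ { flush_record(); name = substr($0, 2); seq = ""; next }
             { gsub(/[ \t\r]/, ""); seq = seq $0 }   # join wrapped lines
        END  { flush_record() }
    ' "$1"
}
```

It could be called as `filter_fasta 72hDOWN-fasta.fasta`, or as `filter_fasta 72hDOWN-fasta.fasta 300 3` if "at least 4" is what is wanted. A trailing `*`, as in the example record, is counted toward the length in this sketch.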

The idea behind it is the following. The gsub function performs substitutions in a string and returns the number of substitutions it made. Hence, if we substitute every "C" with "C", the string is left unchanged, but we get back the total number of "C" characters in it.
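The counting trick can be tried in isolation with plain awk. The snippet below is a minimal sketch; `count_c` is a hypothetical helper name introduced only for this illustration.

```shell
# count_c: print how many times "C" occurs in the string passed as $1.
# gsub with no third argument works on $0 and returns the replacement count.
count_c() {
    printf '%s\n' "$1" | awk '{ n = gsub(/C/, "C"); print n }'
}

count_c "MCCAACKC"   # prints 4
```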

From the POSIX standard IEEE Std 1003.1-2017:

gsub(ere, repl[, in]): Behave like sub (see below), except that it shall replace all occurrences of the regular expression (like the ed utility global substitute) in $0 or in the in argument, when specified.

sub(ere, repl[, in ]): Substitute the string repl in place of the first instance of the extended regular expression ere in string in and return the number of substitutions. An <ampersand> ( & ) appearing in the string repl shall be replaced by the string from in that matches the ERE. An <ampersand> preceded with a <backslash> shall be interpreted as the literal <ampersand> character. An occurrence of two consecutive <backslash> characters shall be interpreted as just a single literal <backslash> character. Any other occurrence of a <backslash> (for example, preceding any other character) shall be treated as a literal <backslash> character. Note that if repl is a string literal (the lexical token STRING; see Grammar), the handling of the <ampersand> character occurs after any lexical processing, including any lexical <backslash>-escape sequence processing. If in is specified and it is not an lvalue (see Expressions in awk), the behavior is undefined. If in is omitted, awk shall use the current record ($0) in its place.
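To make the quoted behaviour concrete, the following sketch contrasts the return values of sub and gsub and shows the `&` expansion in the replacement string. It uses plain awk, and `sub_gsub_demo` is a hypothetical name used only here.

```shell
# sub_gsub_demo: wrap matches of "C" in brackets, once with sub (first
# match only) and once with gsub (all matches), printing count and result.
sub_gsub_demo() {
    printf '%s\n' "$1" | awk '{
        s = $0; n1 = sub(/C/, "[&]", s)    # "&" stands for the matched text
        g = $0; n2 = gsub(/C/, "[&]", g)
        print n1, s
        print n2, g
    }'
}

sub_gsub_demo "ACGCC"
# prints:
# 1 A[C]GCC
# 3 A[C]G[C][C]
```

Both calls pass a variable as the third argument, which satisfies the requirement that in be an lvalue.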

Note: BioAwk is based on Brian Kernighan's awk, which is documented in "The AWK Programming Language" by Al Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X). I'm not sure whether this version is compatible with POSIX.
