Remove duplicated fasta sequence (bash or biopython method)

Problem description

Hello, I have a FASTA file such as:

>sequence1_CP [seq  virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL
DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE

>sequence2 [virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL
DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE

>sequence3
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL
DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE

>sequence4_CP hypothetical protein [another virus]
MLRHSCVMPQQKLKKRFFFLRRLRKILRYFFTCNFLNLFFINREYNIENITLSYLKKERIPVWKTSDMSN
IVRKWWMFHRKTQLEDNIEIKKDIQLYHFFYNGLFIKTNYPYVYHIDKKKKYDFNDMKVIYLPAIHMHSK

>sequence5 hypothetical protein [another virus]
MLRHSCVMPQQKLKKRFFFLRRLRKILRYFFTCNFLNLFFINREYNIENITLSYLKKERIPVWKTSDMSN
IVRKWWMFHRKTQLEDNIEIKKDIQLYHFFYNGLFIKTNYPYVYHIDKKKKYDFNDMKVIYLPAIHMHSK

>sequence6 |hypothetical protein[virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNLD
ITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE

>sequence7 |hypothetical protein[virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNLD
ITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE

In this file I would like to remove the duplicated sequences and get:

>sequence1_CP [seq  virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL
DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE

>sequence4_CP hypothetical protein [another virus]
MLRHSCVMPQQKLKKRFFFLRRLRKILRYFFTCNFLNLFFINREYNIENITLSYLKKERIPVWKTSDMSN
IVRKWWMFHRKTQLEDNIEIKKDIQLYHFFYNGLFIKTNYPYVYHIDKKKKYDFNDMKVIYLPAIHMHSK

>sequence6 |hypothetical protein[virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNLD
ITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE

As you can see, the content after the > name line is the same for sequence1_CP, sequence2 and sequence3, so I want to keep only one of the three. But if one of them has _CP in its name, I want to keep that one specifically. If none of them has _CP in its name, it does not matter which one I keep.

  • For the first group of duplicates (sequence1_CP, sequence2 and sequence3) I keep sequence1_CP
  • For the second group (sequence4_CP and sequence5) I keep sequence4_CP
  • For the third group (sequence6 and sequence7) I keep the first one, sequence6

Does anyone have an idea using Biopython or a bash method? Thanks a lot.

Recommended answer

You could use this awk one-liner, which reads blank-line-separated records (RS="") and keys each record on its two sequence lines ($2 and $3):

$ awk 'BEGIN{FS="\n";RS=""}{if(!seen[$2,$3]++)print}' file

Output:

>sequence1_CP [seq  virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL
DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
>sequence4_CP hypothetical protein [another virus]
MLRHSCVMPQQKLKKRFFFLRRLRKILRYFFTCNFLNLFFINREYNIENITLSYLKKERIPVWKTSDMSN
IVRKWWMFHRKTQLEDNIEIKKDIQLYHFFYNGLFIKTNYPYVYHIDKKKKYDFNDMKVIYLPAIHMHSK
>sequence6 |hypothetical protein[virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNLD
ITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE

The above relies on the observation that, as in the sample, the _CP sequences come before their duplicates. If that is not actually the case, use the following. It stores the first instance of each sequence, which is overwritten if a _CP sequence is found:

$ awk 'BEGIN{FS="\n";RS=""}{if(!(($2,$3) in seen)||$1~/^[^ ]+_CP /)seen[$2,$3]=$0}END{for(i in seen)print (++j>1?ORS:"") seen[i]}' file

Or pretty-printed:

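$ awk '
BEGIN {
    FS="\n"   # each line of a record is one field
    RS=""     # paragraph mode: records are separated by blank lines
}
{
    # store the record if its sequence is new, or if the name ends in _CP (followed by a space)
    if(!(($2,$3) in seen)||$1~/^[^ ]+_CP /)
        seen[$2,$3]=$0
}
END {
    # print the stored records with an empty line between them
    for(i in seen)
        print (++j>1?ORS:"") seen[i]
}' file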

The output order is whatever awk's for(i in seen) loop produces, i.e. it appears random.
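If the input order matters, the same store-and-overwrite idea can be sketched in plain Python, where a dict keeps its keys in insertion order. This is only a minimal sketch, not part of the original answer: the names parse_fasta and dedupe_keep_cp are hypothetical, the sample uses made-up toy sequences, and the key is the tuple of sequence lines, so (like seen[$2,$3]) it is sensitive to line wrapping. With Biopython installed, Bio.SeqIO.parse(handle, "fasta") could presumably replace the hand-written parser.

```python
def parse_fasta(text):
    """Yield (header, sequence_lines) for each record in FASTA-formatted text."""
    header, lines = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, lines
            header, lines = line, []
        elif line and header is not None:
            lines.append(line)
    if header is not None:
        yield header, lines


def dedupe_keep_cp(text):
    """Return one (header, sequence_lines) per distinct sequence, in input order."""
    kept = {}  # dicts preserve insertion order (Python 3.7+)
    for header, lines in parse_fasta(text):
        key = tuple(lines)  # wrapping-sensitive key, like seen[$2,$3] in the awk
        parts = header[1:].split()
        is_cp = bool(parts) and parts[0].endswith("_CP")
        # Store the first instance; a _CP record overwrites it in place,
        # so the group keeps the position of its first member.
        if key not in kept or is_cp:
            kept[key] = (header, lines)
    return list(kept.values())


# Toy input (made-up sequences): the _CP record comes after its duplicate.
sample = """\
>sequence2 [virus]
AAAA
CCCC

>sequence1_CP [seq  virus]
AAAA
CCCC

>sequence6 |hypothetical protein[virus]
GGGG
TTTT

>sequence7 |hypothetical protein[virus]
GGGG
TTTT
"""

for header, lines in dedupe_keep_cp(sample):
    print(header)
    print("\n".join(lines))
```

On this toy input the sketch keeps sequence1_CP (even though sequence2 came first) and sequence6, in that order. As in the awk version, if several duplicates carry _CP, the last one wins.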

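Update: if both of @kvantour's comments apply in this case, use this awk instead. It joins all sequence lines of a record into one key, so sequences are compared regardless of how many lines they span or where the lines are wrapped: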

$ awk '
BEGIN {
    FS="\n"
    RS=""
}
{
    # build the key by joining all sequence lines, so wrapping does not matter
    k=""
    for(i=2;i<=NF;i++)
        k=k $i
    if(!(k in seen)||$1~/^[^ ]+_CP /)
        seen[k]=$0
}
END {
    for(i in seen)
        print (++j>1?ORS:"") seen[i]
}' file

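Output now (sequence6 and sequence7 fall into the sequence1_CP group, since their residues are identical once the lines are joined):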

>sequence1_CP [seq  virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL
DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE

>sequence4_CP hypothetical protein [another virus]
MLRHSCVMPQQKLKKRFFFLRRLRKILRYFFTCNFLNLFFINREYNIENITLSYLKKERIPVWKTSDMSN
IVRKWWMFHRKTQLEDNIEIKKDIQLYHFFYNGLFIKTNYPYVYHIDKKKKYDFNDMKVIYLPAIHMHSK
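
To see why sequence6 no longer appears in this output, it may help to compare the two records directly. The short Python check below copies the sequence lines of sequence1_CP and sequence6 from the sample; the variable names are only illustrative. Line by line the records differ, because the first line is wrapped one residue later in sequence6, but the joined sequences are equal.

```python
# Sequence lines of sequence1_CP, as wrapped in the sample file.
seq1_lines = [
    "MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL",
    "DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE",
]

# Sequence lines of sequence6: same residues, wrapped one position later.
seq6_lines = [
    "MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNLD",
    "ITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE",
]

same_lines = seq1_lines == seq6_lines                      # key like seen[$2,$3]
same_joined = "".join(seq1_lines) == "".join(seq6_lines)   # key like seen[k]

print(same_lines)   # False
print(same_joined)  # True
```

This is why the line-based key treats sequence6 as a separate sequence, while the joined key merges it with sequence1_CP.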
