How to concatenate (merge) AAStringSets by name?


Question

In the bioinformatics/microbial ecology literature, a fairly common practice is to concatenate the multiple sequence alignments of several genes before building phylogenetic trees. In R terminology it may be clearer to say that the sequences are 'merged' by the organism they came from, but I'm sure an example explains it better.

Suppose these are two multiple sequence alignments:

library(Biostrings)

set1<-AAStringSet(c("IVR", "RDG", "LKS"))
names(set1)<-paste("org", 1:3, sep="_")

set2<-AAStringSet(c("VRT", "RKG", "AST"))
names(set2)<-paste("org", 2:4, sep="_")

set1

A AAStringSet instance of length 3
    width seq    names               
[1]     3 IVR    org_1
[2]     3 RDG    org_2
[3]     3 LKS    org_3

set2

A AAStringSet instance of length 3
    width seq    names               
[1]     3 VRT    org_2
[2]     3 RKG    org_3
[3]     3 AST    org_4

The correct concatenation of these sequences is:

A AAStringSet instance of length 4
    width seq    names               
[1]     6 IVR--- org_1
[2]     6 RDGVRT org_2
[3]     6 LKSRKG org_3
[4]     6 ---AST org_4

The "-" denotes a gap (a missing amino acid) at that position or, in this case, a missing gene to concatenate.

I thought there would be a function to do this in Biostrings, MSA, DECIPHER, or another related package, but I have been unable to find one.

I found the following Q&As, but none of them produces the desired output described above.

1: https://support.bioconductor.org/p/38955/

Output:

  A AAStringSet instance of length 6
    width seq names               
[1]     3 IVR org_1
[2]     3 RDG org_2
[3]     3 LKS org_3
[4]     3 VRT org_2
[5]     3 RKG org_3
[6]     3 AST org_4

This is better described as 'appending' the sequences (it joins the two sets vertically).

2: https://support.bioconductor.org/p/39878/

Output:

  A AAStringSet instance of length 2

        width seq
    [1]     9 IVRRDGLKS
    [2]     9 VRTRKGAST

This concatenates the sequences within each set, producing a complete chimera of each set (certainly not desired).

3: How to concatenate each DNAStringSet sequence per sample in R?

Output:

  A AAStringSet instance of length 3
    width seq
[1]     6 IVRVRT
[2]     6 RDGRKG
[3]     6 LKSAST

This creates chimeras by pairing sequences according to their order in each set. It is even worse when the sets contain different numbers of sequences (the shorter set is recycled and concatenated...).
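
The recycling problem can be reproduced without Biostrings. The following is a minimal sketch that uses base R's paste0 as a stand-in for position-wise concatenation (an assumption for illustration; it is not the code from the linked answer), with plain named character vectors in place of AAStringSets:

```r
# Position-wise concatenation ignores names entirely.
# Plain character vectors stand in for the AAStringSets (illustrative assumption).
seqs1 <- c(org_1 = "IVR", org_2 = "RDG", org_3 = "LKS")
seqs2 <- c(org_2 = "VRT", org_3 = "RKG")  # one sequence fewer than seqs1

# paste0 silently recycles the shorter vector, so the org_3 sequence of the
# first set ends up paired with the org_2 sequence of the second set.
chimeras <- paste0(seqs1, seqs2)
print(chimeras)
# [1] "IVRVRT" "RDGRKG" "LKSVRT"
```

No warning is raised, which is why matching by name rather than by position matters here.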

4: https://www.biostars.org/p/115192/

Output:

  A AAStringSet instance of length 2
    width seq
[1]     3 IVR
[2]     3 VRT

This only appends the first sequence from each set; I am not sure why anyone would want this...

I would normally expect this kind of processing to be done with some combination of bash and Python, but I'm using the DECIPHER multiple sequence aligner in R, so it makes sense to do the rest of the processing in R as well. While writing up this question I came up with an answer, which I will post, but I half expect someone to point me to the part of a manual I missed that describes a function for this. Thanks!

Recommended answer

I am a somewhat fanatical user of data.table in R; among many other things, it is great for merging datasets by name. I found that a Biostrings::AAStringSet can be converted to a matrix using as.matrix, and that matrix can be converted to a data.table and merged.

library(data.table)

set1.dt<-data.table(as.matrix(set1), keep.rownames = TRUE)
set2.dt<-data.table(as.matrix(set2), keep.rownames = TRUE)
set12.dt<-merge(set1.dt, set2.dt, by="rn", all=TRUE)
set12.dt
      rn V1.x V2.x V3.x V1.y V2.y V3.y
1: org_1    I    V    R <NA> <NA> <NA>
2: org_2    R    D    G    V    R    T
3: org_3    L    K    S    R    K    G
4: org_4 <NA> <NA> <NA>    A    S    T

This is the correct merge, but it needs more work to reach the final result.

The NA values need to be replaced with "-". I always need to look this up to remember the best way to do it with a data.table.

# slightly modified from the original: added argument "x" (the replacement value)
f_dowle <- function(dt, x) {
  # replace NA with x in every column, modifying dt by reference
  for (j in seq_along(dt))
    set(dt, i = which(is.na(dt[[j]])), j = j, value = x)
}

f_dowle(set12.dt, "-")

Concatenate the sequences (excluding the name column with !"rn"):

set12<-apply(set12.dt[ ,!"rn"], 1, paste, collapse="")

Convert back to an AAStringSet and re-add the names:

set12<-AAStringSet(set12)
names(set12)<-set12.dt$rn

Desired output:

set12
 A AAStringSet instance of length 4
    width seq names               
[1]     6 IVR--- org_1
[2]     6 RDGVRT org_2
[3]     6 LKSRKG org_3
[4]     6 ---AST org_4

This works, but it seems quite cumbersome, especially the conversion between different data formats. It can obviously be wrapped in a function for easier use, but again, it seems like this should already exist as a function in some Bioconductor package...
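
As a rough idea of what such a wrapper could look like, here is a minimal base R sketch. The function name concat_by_name and its gap argument are hypothetical (they are not part of Biostrings or any Bioconductor package), and it works on named character vectors rather than AAStringSets, assuming every sequence within one alignment has the same width:

```r
# Hypothetical helper, not part of any package: concatenate two alignments
# (named character vectors) by name, padding missing entries with gaps.
concat_by_name <- function(a, b, gap = "-") {
  # Assumption: all sequences within one alignment share a single width.
  width_a <- unique(nchar(a))
  width_b <- unique(nchar(b))
  stopifnot(length(width_a) == 1L, length(width_b) == 1L)

  all_names <- union(names(a), names(b))

  # Look every name up in x; names that are absent give NA, which is
  # replaced by a run of gap characters of the alignment's width.
  pad <- function(x, width) {
    out <- x[all_names]
    out[is.na(out)] <- strrep(gap, width)
    out
  }

  res <- paste0(pad(a, width_a), pad(b, width_b))
  names(res) <- all_names
  res
}

seqs1 <- c(org_1 = "IVR", org_2 = "RDG", org_3 = "LKS")
seqs2 <- c(org_2 = "VRT", org_3 = "RKG", org_4 = "AST")

res <- concat_by_name(seqs1, seqs2)
print(res)
#    org_1    org_2    org_3    org_4
# "IVR---" "RDGVRT" "LKSRKG" "---AST"
```

With Biostrings loaded, the same helper could presumably be applied as AAStringSet(concat_by_name(as.character(set1), as.character(set2))), since as.character on an AAStringSet returns a named character vector; that round trip is not tested here.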
