如何使用data.table在多个列(loci)中按组有效地计算等位基因频率(比例) [英] How to use data.table to efficiently calculate allele frequencies (proportions) by group across multiple columns (loci)

查看:75
本文介绍了如何使用data.table在多个列(loci)中按组有效地计算等位基因频率(比例)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个等位基因身份的数据表(行是个人,列是基因座),由单独的列分组。我想按组有效地计算每个基因座的等位基因频率(比例)。数据表示例:

I have a data.table of allele identities (rows are individuals, columns are loci), grouped by a separate column. I want to calculate allele frequencies (proportions) for each locus efficiently, by group. An example data table:

    DT = data.table(Loc1=rep(c("G","T"),each=5), 
      Loc2=c("C","A"), Loc3=c("C","G","G","G",
      "C","G","G","G","G","G"), 
    Group=c(rep("G1",3),rep("G2",4),rep("G3",3)))
    for(i in 1:3)
        set(DT, sample(10,2), i, NA)
    > DT
        Loc1 Loc2 Loc3 Group
     1:    G   NA    C    G1
     2:    G    A    G    G1
     3:    G    C    G    G1
     4:   NA   NA   NA    G2
     5:    G    C   NA    G2
     6:    T    A    G    G2
     7:    T    C    G    G2
     8:    T    A    G    G3
     9:    T    C    G    G3
    10:   NA    A    G    G3

我遇到的问题是,当我尝试按组进行计算时,只能识别组中存在的等位基因ID,所以我努力寻找可以告诉我的代码,例如在所有三个组中,基因座1的G的比例。一个简单的例子,计算每个位点的第一个等位基因的总和(而不是比例):

The problem I have is that when I try to do calculations by group, only the allele i.d.s present in the group are recognized, so I'm struggling to find code that can tell me e.g. the proportion of G's for locus 1 in all 3 groups. Simple example, calculating a sum (not proportion) for the first allele at each locus:

    > fun1<- function(x){sum(na.omit(x==unique(na.omit(x))[1]))}
    > DT[,lapply(.SD,fun1),by=Group,.SDcols=1:3]
       Group Loc1 Loc2 Loc3
    1:    G1    3    1    1
    2:    G2    1    2    2
    3:    G3    2    2    3

对于G1,结果是Loc1有3个G,但是对于G3,它显示Loc1有2个T,而不是G的数量。在这种情况下,我想要两者的G数。因此,关键问题在于等位基因身份是由组而不是整个列确定的。我尝试用要在计算中使用的等位基因身份制作一个单独的表格,但无法弄清楚如何将其包括在fun1中,以便在上面的lapply中引用正确的单元格。等位基因身份表:

For G1 the result is that Loc1 has 3 G's, but for G3 it shows Loc1 has 2 T's, not the number of G's. I want the number of G's for both in this case. So the key problem is that the allele identities are determined by group, not over the whole column. I tried making a separate table with the allele identities I want to use in calculations, but can't figure out how to include it in fun1 so that the correct cells are referenced in lapply above. Allele identities table:

    > fun2<- function(x){sort(na.omit(unique(x)))}
    > allele.id<-data.table(DT[,lapply(.SD,fun2),.SDcols=1:3])
    > allele.id
       Loc1 Loc2 Loc3
    1:    G    A    C
    2:    T    C    G


推荐答案

首先将您的 data.table 转换为长格式可能是明智的。这将使它更易于用于进一步的计算(例如,使用 ggplot2 进行可视化)。使用 data.table melt 函数(其功能与 melt reshape2 包的code>函数),您可以从宽格式转换为长格式:

It's probably wise to transform your data.table into long format first. This will make it easier to use for further calculations (or making visualisations with ggplot2 for example). With the melt function of data.table (which works the same as the melt function of the reshape2 package) you can transform from wide to long format:

DT2 <- melt(DT, id = "Group", variable.name = "loci")

在融化操作期间要删除 NA 值时,可以添加 na.rm = TRUE 在上述调用中(默认行为 na.rm = FALSE )。

When you want to remove the NA-values during the melt-operation, you can add na.rm = TRUE in the above call (na.rm = FALSE is the default behaviour).

然后您可以使计数和比例变量如下:

Then you can make count and proportion variables as follows:

DT2 <- DT2[, .N, by = .(Group, loci, value)][, prop := N/sum(N), by = .(Group, loci)]

得到以下结果:

> DT2
    Group loci value N      prop
 1:    G1 Loc1     G 3 1.0000000
 2:    G2 Loc1    NA 1 0.2500000
 3:    G2 Loc1     G 1 0.2500000
 4:    G2 Loc1     T 2 0.5000000
 5:    G3 Loc1     T 2 0.6666667
 6:    G3 Loc1    NA 1 0.3333333
 7:    G1 Loc2    NA 1 0.3333333
 8:    G1 Loc2     A 1 0.3333333
 9:    G1 Loc2     C 1 0.3333333
10:    G2 Loc2    NA 1 0.2500000
11:    G2 Loc2     C 2 0.5000000
12:    G2 Loc2     A 1 0.2500000
13:    G3 Loc2     A 2 0.6666667
14:    G3 Loc2     C 1 0.3333333
15:    G1 Loc3     C 1 0.3333333
16:    G1 Loc3     G 2 0.6666667
17:    G2 Loc3    NA 2 0.5000000
18:    G2 Loc3     G 2 0.5000000
19:    G3 Loc3     G 3 1.0000000

想要以宽格式返回时,可以在多个变量上使用 dcast

I you want it back in wide format, you can use dcast on multiple variables:

DT3 <- dcast(DT2, Group + loci ~ value, value.var = c("N", "prop"), fill = 0)

其结果是:

> DT3
   Group loci N_A N_C N_G N_T N_NA    prop_A    prop_C    prop_G    prop_T   prop_NA
1:    G1 Loc1   0   0   3   0    0 0.0000000 0.0000000 1.0000000 0.0000000 0.0000000
2:    G1 Loc2   1   1   0   0    1 0.3333333 0.3333333 0.0000000 0.0000000 0.3333333
3:    G1 Loc3   0   1   2   0    0 0.0000000 0.3333333 0.6666667 0.0000000 0.0000000
4:    G2 Loc1   0   0   1   2    1 0.0000000 0.0000000 0.2500000 0.5000000 0.2500000
5:    G2 Loc2   1   2   0   0    1 0.2500000 0.5000000 0.0000000 0.0000000 0.2500000
6:    G2 Loc3   0   0   2   0    2 0.0000000 0.0000000 0.5000000 0.0000000 0.5000000
7:    G3 Loc1   0   0   0   2    1 0.0000000 0.0000000 0.0000000 0.6666667 0.3333333
8:    G3 Loc2   2   1   0   0    0 0.6666667 0.3333333 0.0000000 0.0000000 0.0000000
9:    G3 Loc3   0   0   3   0    0 0.0000000 0.0000000 1.0000000 0.0000000 0.0000000






使用另一种直接的方法一次调用融化 dcast (这是@Frank回答的第一部分的简化版本):


Another and straightforward approach is using melt and dcast in one call (which is a simplified version of the first part of @Frank's answer):

DT2 <- dcast(melt(DT, id="Group"), Group + variable ~ value)

它给出:

> DT2
   Group variable A C G T NA
1:    G1     Loc1 0 0 3 0  0
2:    G1     Loc2 1 1 0 0  1
3:    G1     Loc3 0 1 2 0  0
4:    G2     Loc1 0 0 1 2  1
5:    G2     Loc2 1 2 0 0  1
6:    G2     Loc3 0 0 2 0  2
7:    G3     Loc1 0 0 0 2  1
8:    G3     Loc2 2 1 0 0  0
9:    G3     Loc3 0 0 3 0  0

由于 dcast 中的默认聚合函数为 length ,因此您将自动获得每个值。

Because the default aggregation function in dcast is length, you will automatically get the counts for each of the values.

使用的数据

Used data:

DT <- structure(list(Loc1 = c("G", "G", "G", NA, "G", "T", "T", "T", "T", NA), 
                     Loc2 = c(NA, "A", "C", NA, "C", "A", "C", "A", "C", "A"), 
                     Loc3 = c("C", "G", "G", NA, NA, "G", "G", "G", "G", "G"), 
                     Group = c("G1", "G1", "G1", "G2", "G2", "G2", "G2", "G3", "G3", "G3")), 
                .Names = c("Loc1", "Loc2", "Loc3", "Group"), row.names = c(NA, -10L), class = c("data.table", "data.frame"))

这篇关于如何使用data.table在多个列(loci)中按组有效地计算等位基因频率(比例)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆