如何使用 data.table 跨多个列(基因座)按组有效计算等位基因频率(比例) [英] How to use data.table to efficiently calculate allele frequencies (proportions) by group across multiple columns (loci)

查看:14
本文介绍了如何使用 data.table 跨多个列(基因座)按组有效计算等位基因频率(比例)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个等位基因身份的 data.table(行是个体,列是基因座),按单独的列分组.我想按组有效地计算每个基因座的等位基因频率(比例).示例数据表:

I have a data.table of allele identities (rows are individuals, columns are loci), grouped by a separate column. I want to calculate allele frequencies (proportions) for each locus efficiently, by group. An example data table:

    DT = data.table(Loc1=rep(c("G","T"),each=5), 
      Loc2=c("C","A"), Loc3=c("C","G","G","G",
      "C","G","G","G","G","G"), 
    Group=c(rep("G1",3),rep("G2",4),rep("G3",3)))
    for(i in 1:3)
        set(DT, sample(10,2), i, NA)
    > DT
        Loc1 Loc2 Loc3 Group
     1:    G   NA    C    G1
     2:    G    A    G    G1
     3:    G    C    G    G1
     4:   NA   NA   NA    G2
     5:    G    C   NA    G2
     6:    T    A    G    G2
     7:    T    C    G    G2
     8:    T    A    G    G3
     9:    T    C    G    G3
    10:   NA    A    G    G3

我遇到的问题是,当我尝试按组进行计算时,只有组中存在的等位基因 i.d.s 被识别,所以我很难找到可以告诉我的代码,例如基因座 1 的 G 在所有 3 组中的比例.简单的例子,计算每个基因座的第一个等位基因的总和(不是比例):

The problem I have is that when I try to do calculations by group, only the allele i.d.s present in the group are recognized, so I'm struggling to find code that can tell me e.g. the proportion of G's for locus 1 in all 3 groups. Simple example, calculating a sum (not proportion) for the first allele at each locus:

    > fun1<- function(x){sum(na.omit(x==unique(na.omit(x))[1]))}
    > DT[,lapply(.SD,fun1),by=Group,.SDcols=1:3]
       Group Loc1 Loc2 Loc3
    1:    G1    3    1    1
    2:    G2    1    2    2
    3:    G3    2    2    3

对于 G1,结果是 Loc1 有 3 个 G,但对于 G3,它显示 Loc1 有 2 个 T,而不是 G 的数量.在这种情况下,我想要两者的 G 数.所以关键问题是等位基因身份是由组确定的,而不是整个列.我尝试使用要在计算中使用的等位基因身份制作一个单独的表格,但无法弄清楚如何将其包含在 fun1 中,以便在上面的 lapply 中引用正确的单元格.等位基因身份表:

For G1 the result is that Loc1 has 3 G's, but for G3 it shows Loc1 has 2 T's, not the number of G's. I want the number of G's for both in this case. So the key problem is that the allele identities are determined by group, not over the whole column. I tried making a separate table with the allele identities I want to use in calculations, but can't figure out how to include it in fun1 so that the correct cells are referenced in lapply above. Allele identities table:

    > fun2<- function(x){sort(na.omit(unique(x)))}
    > allele.id<-data.table(DT[,lapply(.SD,fun2),.SDcols=1:3])
    > allele.id
       Loc1 Loc2 Loc3
    1:    G    A    C
    2:    T    C    G

推荐答案

首先将您的 data.table 转换为长格式可能是明智的.这将使它更容易用于进一步的计算(或使用 ggplot2 进行可视化).使用 data.tablemelt 函数(与 reshape2 包的 melt 函数相同)您可以从宽格式转换为长格式:

It's probably wise to transform your data.table into long format first. This will make it easier to use for further calculations (or making visualisations with ggplot2 for example). With the melt function of data.table (which works the same as the melt function of the reshape2 package) you can transform from wide to long format:

DT2 <- melt(DT, id = "Group", variable.name = "loci")

当您想在熔解操作期间删除 NA-值时,您可以在上述调用中添加 na.rm = TRUE (na.rm = FALSE 是默认行为).

When you want to remove the NA-values during the melt-operation, you can add na.rm = TRUE in the above call (na.rm = FALSE is the default behaviour).

然后你可以制作如下计数和比例变量:

Then you can make count and proportion variables as follows:

DT2 <- DT2[, .N, by = .(Group, loci, value)][, prop := N/sum(N), by = .(Group, loci)]

给出以下结果:

> DT2
    Group loci value N      prop
 1:    G1 Loc1     G 3 1.0000000
 2:    G2 Loc1    NA 1 0.2500000
 3:    G2 Loc1     G 1 0.2500000
 4:    G2 Loc1     T 2 0.5000000
 5:    G3 Loc1     T 2 0.6666667
 6:    G3 Loc1    NA 1 0.3333333
 7:    G1 Loc2    NA 1 0.3333333
 8:    G1 Loc2     A 1 0.3333333
 9:    G1 Loc2     C 1 0.3333333
10:    G2 Loc2    NA 1 0.2500000
11:    G2 Loc2     C 2 0.5000000
12:    G2 Loc2     A 1 0.2500000
13:    G3 Loc2     A 2 0.6666667
14:    G3 Loc2     C 1 0.3333333
15:    G1 Loc3     C 1 0.3333333
16:    G1 Loc3     G 2 0.6666667
17:    G2 Loc3    NA 2 0.5000000
18:    G2 Loc3     G 2 0.5000000
19:    G3 Loc3     G 3 1.0000000

如果你希望它以宽格式返回,你可以在多个变量上使用 dcast:

I you want it back in wide format, you can use dcast on multiple variables:

DT3 <- dcast(DT2, Group + loci ~ value, value.var = c("N", "prop"), fill = 0)

导致:

> DT3
   Group loci N_A N_C N_G N_T N_NA    prop_A    prop_C    prop_G    prop_T   prop_NA
1:    G1 Loc1   0   0   3   0    0 0.0000000 0.0000000 1.0000000 0.0000000 0.0000000
2:    G1 Loc2   1   1   0   0    1 0.3333333 0.3333333 0.0000000 0.0000000 0.3333333
3:    G1 Loc3   0   1   2   0    0 0.0000000 0.3333333 0.6666667 0.0000000 0.0000000
4:    G2 Loc1   0   0   1   2    1 0.0000000 0.0000000 0.2500000 0.5000000 0.2500000
5:    G2 Loc2   1   2   0   0    1 0.2500000 0.5000000 0.0000000 0.0000000 0.2500000
6:    G2 Loc3   0   0   2   0    2 0.0000000 0.0000000 0.5000000 0.0000000 0.5000000
7:    G3 Loc1   0   0   0   2    1 0.0000000 0.0000000 0.0000000 0.6666667 0.3333333
8:    G3 Loc2   2   1   0   0    0 0.6666667 0.3333333 0.0000000 0.0000000 0.0000000
9:    G3 Loc3   0   0   3   0    0 0.0000000 0.0000000 1.0000000 0.0000000 0.0000000

<小时>

另一种直接的方法是在一次调用中使用 meltdcast (这是@Frank 答案第一部分的简化版本):


Another and straightforward approach is using melt and dcast in one call (which is a simplified version of the first part of @Frank's answer):

DT2 <- dcast(melt(DT, id="Group"), Group + variable ~ value)

给出:

> DT2
   Group variable A C G T NA
1:    G1     Loc1 0 0 3 0  0
2:    G1     Loc2 1 1 0 0  1
3:    G1     Loc3 0 1 2 0  0
4:    G2     Loc1 0 0 1 2  1
5:    G2     Loc2 1 2 0 0  1
6:    G2     Loc3 0 0 2 0  2
7:    G3     Loc1 0 0 0 2  1
8:    G3     Loc2 2 1 0 0  0
9:    G3     Loc3 0 0 3 0  0

因为dcast中默认的聚合函数是length,你会自动得到每个值的计数.

Because the default aggregation function in dcast is length, you will automatically get the counts for each of the values.

使用过的数据:

DT <- structure(list(Loc1 = c("G", "G", "G", NA, "G", "T", "T", "T", "T", NA), 
                     Loc2 = c(NA, "A", "C", NA, "C", "A", "C", "A", "C", "A"), 
                     Loc3 = c("C", "G", "G", NA, NA, "G", "G", "G", "G", "G"), 
                     Group = c("G1", "G1", "G1", "G2", "G2", "G2", "G2", "G3", "G3", "G3")), 
                .Names = c("Loc1", "Loc2", "Loc3", "Group"), row.names = c(NA, -10L), class = c("data.table", "data.frame"))

这篇关于如何使用 data.table 跨多个列(基因座)按组有效计算等位基因频率(比例)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆