Using data.table to run 100,000 Fisher's exact tests is slower than apply

Question

Good morning,

I'm trying to use R to run 100,000 Fisher's exact tests on simulated genetic data very quickly, preferably in under 30 seconds per pass. I need to permute the case-control labels and repeat the process 1,000 times, so the whole job has to fit into an overnight run.

I tried using data.table on melted, tidy data containing about 200,000,000 rows and four columns: subject ID, disease status, position, and 'value' (the number of wild-type alleles, a variable with three levels). The call groups by position, then performs Fisher's exact test of value against disease within each group.

> head(casecontrol3)
   ident disease position value
1:     1       0    36044     2
2:     2       0    36044     2
3:     3       0    36044     1
4:     4       0    36044     1
5:     5       0    36044     2
6:     6       0    36044     1

> setkey(casecontrol3,position)
> system.time(casecontrol4  <- casecontrol3[,list(p=fisher.test(value,
+     factor(disease))$p.value), by=position])
   user  system elapsed 
215.430  11.878 229.148

> head(casecontrol4)
   position            p
1:    36044 6.263228e-40
2:    36495 1.155289e-68
3:    38411 7.842216e-19
4:    41083 1.272841e-69
5:    41866 2.264452e-09
6:    41894 9.833324e-10

However, this is really slow compared with using a simple apply call on a flattened, messy matrix of case-control counts. That matrix has 100,000 rows, one per position, and its columns hold the counts for each combination of disease status and number of wild-type alleles. The function passed to apply first converts each row into a 2x3 case-control table and then uses the matrix form of fisher.test. Converting the data from its previous (unmelted) form into this matrix (not shown) takes about 20 seconds.

> head(cctab)
     control_aa control_aA control_AA case_aa case_aA case_AA
[1,]        291        501        208     521     432      47
[2,]        213        518        269      23     392     585
[3,]        170        499        331     215     628     157
[4,]        657        308         35     269     619     112
[5,]        439        463         98     348     597      55
[6,]        410        480        110     323     616      61

> myfisher <- function(row){
+     contab <- matrix(as.integer(row),nrow=2,byrow=TRUE)
+     pval <- fisher.test(contab)$p.value
+     return(pval)
+ }

> system.time(tab <- apply(cctab,1,"myfisher"))
   user  system elapsed 
 28.846  10.989  40.173

> head(tab)
[1] 6.263228e-40 1.155289e-68 7.842216e-19 1.272841e-69 2.264452e-09 9.833324e-10

As you can see, apply is much faster than data.table here (about 40 seconds elapsed versus about 229), which really surprises me. The results are exactly the same:

> identical(casecontrol4$p,tab)
[1] TRUE
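
The relationship between the two layouts can be sketched on a small scale. The following is a minimal illustration, not the asker's code: it uses base R only so that it runs without extra packages, and the data, sizes, and names (`melted`, `counts`, `p_long`, `p_wide`) are all assumptions made up for the example. It runs the test once per position on the raw melted vectors, then again on 2x3 count tables built in a single counting pass, and compares the p-values.

```r
set.seed(1)

# Hypothetical toy data in the melted layout: 5 positions, 200 subjects each
n_pos  <- 5L
n_subj <- 200L
melted <- data.frame(
  ident    = rep(seq_len(n_subj), times = n_pos),
  disease  = rep(rep(0:1, each = n_subj / 2L), times = n_pos),
  position = rep(seq_len(n_pos), each = n_subj),
  value    = sample(0:2, n_pos * n_subj, replace = TRUE)
)

# Layout 1: test the raw vectors position by position
# (fisher.test builds the contingency table internally each time)
p_long <- vapply(
  split(melted, melted$position),
  function(d) fisher.test(d$value, factor(d$disease))$p.value,
  numeric(1)
)

# Layout 2: count everything once, giving a positions x disease x value array,
# then test each 2x3 slice using the matrix form of fisher.test
counts <- table(melted$position, melted$disease, melted$value)
p_wide <- apply(counts, 1, function(tab) fisher.test(tab)$p.value)

# The two layouts should give the same p-values (up to floating point)
print(all.equal(unname(p_long), unname(p_wide)))
```

Whether counting first would close the timing gap on 200,000,000 rows is not established in this thread; the sketch only shows that the test itself does not depend on which layout the counts come from.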

Does anyone with data.table expertise know how I could use it to speed up my code? Or is the data simply too big to work with in melted form (which would rule out data.table, dplyr, etc.)? Note that I haven't tried dplyr, as I've heard that data.table is faster for big data sets like this.

Thanks.

Recommended answer

I would suggest another route: adding an HPC (high-performance computing) element to your approach.

You can use multiple CPU or GPU cores, scale up a free cluster of computers on AWS EC2, connect to AWS EMR, or use any of a plethora of great HPC tools to run your existing code faster.

Check out the CRAN HPC Task View and this tutorial.
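
As one hedged illustration of the multi-core option, the sketch below spreads the asker's row-wise test over worker processes with `mclapply` from the `parallel` package that ships with R. The matrix holds only the first four rows of `cctab` shown in the question, and the core count of 2 is an assumption; `mclapply` relies on forking, so the sketch falls back to a single core on Windows.

```r
library(parallel)

# Same helper as in the question: reshape one row of counts into a 2x3 table
myfisher <- function(row) {
  contab <- matrix(as.integer(row), nrow = 2, byrow = TRUE)
  fisher.test(contab)$p.value
}

# First four rows of cctab from the question (control counts, then case counts)
cctab <- matrix(
  c(291, 501, 208, 521, 432,  47,
    213, 518, 269,  23, 392, 585,
    170, 499, 331, 215, 628, 157,
    657, 308,  35, 269, 619, 112),
  ncol = 6, byrow = TRUE
)

# Assumption: 2 workers; forking is unavailable on Windows, so use 1 there
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# One task per row, distributed across the workers
p_par <- unlist(mclapply(
  seq_len(nrow(cctab)),
  function(i) myfisher(cctab[i, ]),
  mc.cores = n_cores
))

# Sequential reference, as in the question
p_seq <- apply(cctab, 1, myfisher)
print(all.equal(p_par, p_seq))
```

Each row is an independent test, so the work divides cleanly across workers; how much it helps depends on the number of cores available and on the overhead of distributing 100,000 small tasks.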
