Applying sapply or other apply functions instead of a nested for loop for lists of data frames

Problem Description

I have two data frames / lists of data, humanSplit and ratSplit, which are of the form:

> ratSplit$Kidney_F_GSM1328570
  ratGene        ratReplicate alignment RNAtype
1    Crot Kidney_F_GSM1328570         7     REV
2    Crot Kidney_F_GSM1328570        12     REV
3    Crot Kidney_F_GSM1328570         4     REV

> humanSplit$Fetal_Brain_408_AGTCAA_L009_R1_report.txt
   humanGene                            humanReplicate alignment RNAtype
53     ZFP28 Fetal_Brain_408_AGTCAA_L009_R1_report.txt         5     reg
55     RC3H1 Fetal_Brain_408_AGTCAA_L009_R1_report.txt         9     reg
56     IFI27 Fetal_Brain_408_AGTCAA_L009_R1_report.txt         4     reg

Another file used below is geneList, which has the form:

ABAT,Abat
ABCA1,Abca1
ABCA12,Abca12
ABCA2,Abca2
ABCA3,Abca17
ABCA4,Abca4
ABCA5,Abca5

Now I want to run Fisher's exact test on every pair of elements between ratSplit and humanSplit, after some data manipulation, and ultimately write the results of the tests to a CSV file. Right now I am using a double for loop, but I am wondering how I can use sapply or other related functions to make it more efficient.

Currently I am doing the following: I first create a data.frame called result, where at each step I store/append all the information obtained from Fisher's test for the current pair. When the whole loop is done, I write the final result to a CSV file. My understanding is that to use sapply I need to turn the inside of the loop into a function and then call sapply, but I am not sure of the best way to optimize this. Any help would be much appreciated.

result <- data.frame(humanReplicate = "human_replicate", ratReplicate = "rat_replicate", pvalue = "p-value", alternative = "alternative_hypothesis", 
                     Conf.int1 = "conf.int1", Conf.int2 ="conf.int2", oddratio = "Odd_Ratio")
for(i in 1:length(ratSplit)) {
  for(j in 1:length(humanSplit)) {
    ratReplicateName <- names(ratSplit[i])
    humanReplicateName <- names(humanSplit[j])

    #merging above two based on the one-to-one gene mapping as in geneList defined above.
    mergedHumanData <- merge(geneList, humanSplit[[j]], by.x = "human", by.y = "humanGene")
    mergedRatData <- merge(geneList, ratSplit[[i]], by.x = "rat", by.y = "ratGene")

    mergedHumanData <- mergedHumanData[,c(1,2,4,5)] #rearrange column
    mergedRatData <- mergedRatData[,c(2,1,4,5)]  #rearrange column
    mergedHumanRatData <- rbind(mergedHumanData,mergedRatData) #now the columns are "human", "rat", "alignment", "RNAtype"

    agg <- aggregate(RNAtype ~ human+rat, data= mergedHumanRatData, FUN=getGeneType) #agg to make HmYn form
    HmRnTable <- table(agg$RNAtype) #table of HmRn ie RNAtype in human and rat.

    #now assign these numbers to variables HmYn. Consider cases when some form of HmRy is not present in the table. That's why
    #is.integer0 function is used
    HyRy <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HyRy"]), 0, HmRnTable[names(HmRnTable) == "HyRy"][[1]])
    HnRn <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HnRn"]), 0, HmRnTable[names(HmRnTable) == "HnRn"][[1]])
    HyRn <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HyRn"]), 0, HmRnTable[names(HmRnTable) == "HyRn"][[1]])
    HnRy <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HnRy"]), 0, HmRnTable[names(HmRnTable) == "HnRy"][[1]])

    contingencyTable <- matrix(c(HnRn,HnRy,HyRn,HyRy), nrow = 2)

    fisherTest <- fisher.test(contingencyTable) 
    newLine <- data.frame(t(c(humanReplicate = humanReplicateName, ratReplicate = ratReplicateName, pvalue = fisherTest$p,
                              alternative = fisherTest$alternative, Conf.int1 = fisherTest$conf.int[1], Conf.int2 =fisherTest$conf.int[2], 
                              oddratio = fisherTest$estimate[[1]])))


    result <- rbind(result, newLine)
  }
}

write.table(result, file = "newData5.csv", row.names = FALSE, append = FALSE, col.names = TRUE, sep = ",")
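
(is.integer0 and getGeneType are helper functions not shown above. getGeneType collapses the alignment evidence into labels such as "HyRy" and is not reconstructed here; is.integer0 is presumably the common idiom below, though that is an assumption.)

## assumed definition: TRUE for a zero-length integer vector, such as an
## empty table subset like HmRnTable[names(HmRnTable) == "HyRy"]
is.integer0 <- function(x) {
  is.integer(x) && length(x) == 0L
}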

Answer

It's difficult to test this since we're missing geneList, but I'm inferring that the code works correctly, so you merely want to speed things up. Here are some pointers to help:

  1. Don't predefine result like this. By setting the first entry of each column to a string containing the column name, you ensure that all subsequent entries will be coerced to strings. This doesn't necessarily bite you here, since you're eventually writing to a CSV anyway, but it's bad form. It will bite you if you intend to read the file back into R: the first row will be used as the column names, but the second row will be all strings, forcing the subsequent data to be strings as well. (You'll then have to clean your own data, which is wasteful.)
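
A tiny illustration of that coercion (hypothetical values, not from the question's data):

bad <- data.frame(pvalue = "p-value", stringsAsFactors = FALSE)  # seeded with the column name
rbind(bad, data.frame(pvalue = 0.031))$pvalue
## [1] "p-value" "0.031"
## the appended numeric p-value has been coerced to a character string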

  2. At the end of every loop iteration you call rbind. This might do alright for a while, but repeated calls to rbind cause the entire data.frame to be copied each time; as more rows are appended, this slows your code down noticeably. This can be fixed by one of the methods described below.
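
As a generic illustration of the point (this is just the usual idiom for avoiding repeated rbind, separate from the approaches developed below): build each row, store it in a pre-allocated list, and bind once at the end.

rows <- vector("list", length(ratSplit) * length(humanSplit))
k <- 0
for (i in seq_along(ratSplit)) {
  for (j in seq_along(humanSplit)) {
    k <- k + 1
    ## ... the existing loop body, ending with the one-row data.frame newLine ...
    rows[[k]] <- newLine
  }
}
result <- do.call(rbind, rows)  # one rbind instead of one per iteration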

  3. Since you use each names(HmRnTable) == "HyRy" comparison twice, my technique is to first save it to a vector (or a scalar if you use which(...)) and then use that variable to subset HmRnTable. It will likely speed things up a little, and it also makes the code easier to read. In fact, you can shorten each of those four assignments to something like (untested):

idx <- names(HmRnTable) == "HyRy"
HyRy <- if (any(idx)) HmRnTable[idx][[1]] else 0
## repeat for HyRn, HnRy, and HnRn
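
An even more compact alternative (a sketch, not from the original answer) looks the four counts up in one vectorized step:

## assumes the table only ever contains the four HxRx labels
counts <- setNames(rep(0L, 4), c("HnRn", "HnRy", "HyRn", "HyRy"))
counts[names(HmRnTable)] <- as.integer(HmRnTable)  # absent categories stay 0
contingencyTable <- matrix(counts, nrow = 2)        # same layout as the original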

  4. I strongly urge you to take the majority of the code and put it in a function that takes either two arguments (i and j) or four arguments (list1, index1, list2, index2). (Which you choose depends on how particular you are about variable scoping.) I'll assume that the function literally returns the result of fisher.test with no massaging; this makes it much easier to test while writing the function, and to include later in this script. I'll refer to the function as myfisher(i, j) below; a sketch follows.
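
    A sketch of such a myfisher(i, j), assembled from the loop body in the question (it assumes geneList, getGeneType, humanSplit, and ratSplit are in scope, and reuses the compact count lookup sketched above; it is not the answer's literal code):

    myfisher <- function(i, j) {
        ## merge each split against the one-to-one gene mapping
        mergedHumanData <- merge(geneList, humanSplit[[j]],
                                 by.x = "human", by.y = "humanGene")
        mergedRatData <- merge(geneList, ratSplit[[i]],
                               by.x = "rat", by.y = "ratGene")
        mergedHumanData <- mergedHumanData[, c(1, 2, 4, 5)]
        mergedRatData <- mergedRatData[, c(2, 1, 4, 5)]
        mergedHumanRatData <- rbind(mergedHumanData, mergedRatData)

        ## classify each gene pair, then count the HxRx combinations
        agg <- aggregate(RNAtype ~ human + rat, data = mergedHumanRatData,
                         FUN = getGeneType)
        HmRnTable <- table(agg$RNAtype)
        counts <- setNames(rep(0L, 4), c("HnRn", "HnRy", "HyRn", "HyRy"))
        counts[names(HmRnTable)] <- as.integer(HmRnTable)

        ## return the fisher.test result untouched, as suggested above
        fisher.test(matrix(counts, nrow = 2))
    }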

    I'm inferring that you have a large number of comparisons to run (since each iteration really shouldn't take that long). @Floo0's comment about outer can work, as can expand.grid. Either way, you are comparing each element of list1 with each element of list2. If you start with:

    (eg <- expand.grid(i = 1:length(ratSplit),
                       j = 1:length(humanSplit)))
    ##    i j
    ## 1  1 1
    ## 2  2 1
    ## 3  3 1
    ## 4  4 1
    ## ...
    

    This gives us an easy data.frame on which we can use apply. Honestly, though, I like the elegance of ddply in this case, since it goes straight from a data.frame to a data.frame:

    library(plyr)
    ret <- ddply(eg, .(i, j), function(df) {
        with(myfisher(df$i, df$j),
             data.frame(pv = p.value, ci1 = conf.int[1], ci2 = conf.int[2],
                        alt = alternative))
    })
    

    Note that I expressly did not include humanReplicateName or ratReplicateName, since those can be added before the ddply (or after it, referencing ret$ instead of eg$) with:

    eg$ratReplicateName <- names(ratSplit[ eg$i ])
    eg$humanReplicateName <- names(humanSplit[ eg$j ])
    

    and the names will be in the output too (if you add them to eg before the ddply, include them in the grouping, e.g. .(i, j, ratReplicateName, humanReplicateName), so they are carried through). That leaves less to deal with inside the loop.

    Up to this point, the result is a single data.frame that you can then save to a CSV.
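
    For example, reusing the output file name from the question:

    write.csv(ret, "newData5.csv", row.names = FALSE)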

    I'll make one more recommendation, which may be overkill depending on how long this will be running. I've found at times that my long-running scripts get interrupted: perhaps I had to adjust something on the fly, I found a bug, or the computer had to reboot. There is no easy way to allow mid-stream continuation, but I've worked out a few techniques to mitigate the problem.

    Instead of ddply, use d_ply, which does not assume the type of the return value. Instead of returning a (single-row) data.frame, immediately save that one row (with or without headers, your call) to its own file. Though the code below is not truly atomic, and race conditions may therefore occur, it's robust enough to meet most needs:

    library(plyr)
    savedir <- './output'
    d_ply(eg, .(i, j), function(df) {
        fn <- file.path(savedir, sprintf('%s-%s.csv', df$i, df$j))
        if (! file.exists(fn)) {
            ret <- with(myfisher(df$i, df$j),
                        data.frame(pv = p.value, ci1 = conf.int[1], ci2 = conf.int[2],
                                   alt = alternative))
            write.table(ret, file = fn, sep = ",", append = FALSE,
                        row.names = FALSE, col.names = TRUE)
        }
    })
    

    One benefit of this is that if/when the run is interrupted, the most you will have to do is look at all of the files within "./output/", delete any that are 0 bytes, and re-run; it will then only execute for the missing files.
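
    A small sketch of that clean-up step (assuming the same savedir as above):

    leftover <- list.files(savedir, pattern = "\\.csv$", full.names = TRUE)
    file.remove(leftover[file.size(leftover) == 0])  # drop empty files, then re-run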

    Oh, and it gets better: parallelize it. If you need to go this far (and some functions don't see as much improvement as others), you can use multiple cores on your system. You could do something like this:

    library(parallel)
    cl <- makeCluster(detectCores() - 1) # I like to keep one core free
    clusterEvalQ(cl, {
        load('myTwoLists.rda') # with ratSplit and humanSplit variables
        source('mycode.R') # with myfisher(i,j)
        ## any libraries you may want/need to add, if you get more advanced
    })
    savedir <- './output'
    clusterExport(cl, 'savedir') # make the output directory visible on the workers
    eg <- expand.grid(i = 1:length(ratSplit),
                      j = 1:length(humanSplit))
    eg$ratReplicateName <- names(ratSplit[ eg$i ])
    eg$humanReplicateName <- names(humanSplit[ eg$j ])
    ign <- parApply(cl, eg, 1, function(r) {
        ## apply() hands us a character vector, so convert the indices back
        i <- as.integer(r[1]) ; j <- as.integer(r[2])
        fn <- file.path(savedir, sprintf('%s-%s.csv', i, j))
        if (! file.exists(fn)) {
            ret <- with(myfisher(i, j),
                        data.frame(ratName = r[3], humanName = r[4],
                                   pv = p.value, ci1 = conf.int[1], ci2 = conf.int[2],
                                   alt = alternative))
            write.table(ret, file = fn, sep = ",", append = FALSE,
                        row.names = FALSE, col.names = TRUE)
        }
    })
    stopCluster(cl)
    

    Notice that we are no longer referencing the row as a data.frame. I'd love it if Hadley's code worked natively with parallel. (For all I know it does and I just haven't found it yet!) None of my past projects using parallel used ddply, and vice versa, so I have never played with ddply(..., .parallel = TRUE), which uses foreach.
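
    Whichever version you run, the per-pair files can afterwards be stitched back into the single CSV the question asks for; a minimal sketch (assuming each file carries the same header):

    files <- list.files(savedir, pattern = "\\.csv$", full.names = TRUE)
    combined <- do.call(rbind, lapply(files, read.csv, stringsAsFactors = FALSE))
    write.csv(combined, "newData5.csv", row.names = FALSE)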

    Caveat emptor: this code has not been tested in this exact context, though it is almost copy/pasted from working code. Because of the edits, I may be off by one or missing a paren somewhere. Hope it helps!
