应用sapply或其他应用函数而不是嵌套循环来获取数据框的列表 [英] applying sapply or other apply function instead of nested for loop for lists of data frames

查看:169
本文介绍了应用sapply或其他应用函数而不是嵌套循环来获取数据框的列表的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有两个数据框/数据列表'humanSplit ratSplit`,它们的形式是

 > ratSplit $ Kidney_F_GSM1328570 
ratGene ratReplicate alignment RNAtype
1 Crot Kidney_F_GSM1328570 7 REV
2 Crot Kidney_F_GSM1328570 12 REV
3 Crot Kidney_F_GSM1328570 4 REV



 > humanSplit $ Fetal_Brain_408_AGTCAA_L009_R1_report.txt 
humanGene humanReplicate对准RNAtype
53 ZFP28 Fetal_Brain_408_AGTCAA_L009_R1_report.txt 5章第
55 RC3H1 Fetal_Brain_408_AGTCAA_L009_R1_report.txt 9章第
56 IFI27 Fetal_Brain_408_AGTCAA_L009_R1_report.txt 4章第

下面使用的另一个文件是以下形式的geneList:

<$ ABAT,Abat
ABCA1,Abca1
ABCA12,Abca12
ABCA2,Abca2
ABCA3,Abca17
ABCA4,Abca4
ABCA5,abca5

现在我想做fisher之间所有元素之间的精确测试, code> ratSplit 和 humanSplit 。最终想要在 csv 文件中写下fisher测试的结果。现在我正在做双循环。但我想知道如何使用 sapply 或其他相关的东西来提高效率。



目前我是做以下事情:在这里我首先制作一个 data.frame result 从每个步骤中的一对渔夫的测试。最后当整个循环完成时,我在 csv 文件中写入最后的 result 。我的理解是使用 sapply 我需要将循环内部转换为函数,然后调用sapply。但我不确定优化它的最佳方法是什么。任何帮助将不胜感激

 结果<  -  data.frame(humanReplicate =human_replicate,ratReplicate =rat_replicate, $ value =p-value,alternative =alternative_hypothesis,
Conf.int1 =conf.int1,Conf.int2 =conf.int2,oddratio =Odd_Ratio (in 1:length(humanSplit)){
ratReplicateName< - names(ratSplit [i])
humanReplicateName< - 名称(长度分段)){
(humanSplit [j])

#基于上面定义的基因列表中的一对一基因映射,合并成两个以上。
mergedHumanData< -merge(geneList,humanSplit [[j]],by.x =human,by.y =humanGene)
mergedRatData< - merge(geneList,ratSplit [ i]],by.x =rat,by.y =ratGene)

mergedHumanData< - mergedHumanData [,c(1,2,4,5)] #rearrange column
mergedRatData< - mergedRatData [,c(2,1,4,5)] #rearrange column
mergedHumanRatData< - rbind(mergedHumanData,mergedRatData)#现在列是人类,老鼠 ,比对,RNAtype

agg < - 聚合(RNAtype_human + rat,data = mergedHumanRatData,FUN = getGeneType)#agg使得HmYn形式为
HmRnTable <表(agg $ RNAtype)HmRn的#表,即人和大鼠的RNAtype。

#现在将这些数字分配给变量HmYn。考虑一些HmRy不存在的情况。这就是为什么使用
#is.integer0函数
HyRy < - ifelse(is.integer0(HmRnTable [names(HmRnTable)==HyRy]),0,HmRnTable [names(HmRnTable)= (HmRnTable)[HnRnTable] =HnRn]),0,HmRnTable [names(HmRnTable)== (HmRnTable)==HyRn] [[1]])
HyRn < - ifelse(is.integer0(HmRnTable)==HyRn]),0,HmRnTable [names(HmRnTable)==HyRn (HmRnTable)==HnRy]),0,HmRnTable [names(HmRnTable)==HnRy] [[HmRnTable] == [HnRy] (c(HnRn,HnRy,HyRn,HyRy),nrow = 2)


contingencyTable < - fisher.test( - 应急表)
newLine < - data.frame(t(c(humanReplicate = humanReplicateName,ratReplicate = ratReplicateName,pvalue = fisherTest $ p,
alternative = fisherTest $ alternative,Conf.int1 = fisherTest $ conf。 int [1],Conf.int2 = fisherTest $ conf.int [2],
oddratio = fisherTest $ estimate [[1]])))


结果< -rbind(result,newLine)
}
}

write.table(result,file =newData5.csv,row.names = FALSE,append = FALSE,col.names = TRUE,sep =,)


解决方案

这很难测试,因为我们缺少 geneList ,但我推断代码正常工作,所以你只是想加快速度。这里有一些帮助指针:
$ b $ ol
不要预定义 result 喜欢这个。通过将每列的第一个条目设置为列名称的字符串,确保所有后续条目将被强制转换为字符串。虽然这不一定会咬你,因为你最终还是写一个CSV,这是不好的形式。它会咬你,如果你打算把它读回R,因为第一行将被用作列名,但第二行将全部是强制后续数据为字符串的字符串。 (您将必须清理自己的数据,浪费资源。)

  • 在脚本结尾处,您可以调用 rbind 。这可能会好一段时间,但重复调用 rbind 将导致整个data.frame每次被复制。随着更多的行被追加,这将导致在你的代码不小的放缓。这可以通过下面列出的两种方法之一来解决。

  • 由于您使用了每个名称(HmRnTable)==HyRy 两次,我的技术是先保存到一个向量(或标量,如果您使用哪(...)),然后使用这个变量在 HmRnTable 的子集中。它可能会加速一些 ,但也可能使代码更容易阅读。事实上,你可以将这四个任务中的每一个都缩短为(未经测试):

      idx<  -  is.integer0( HmRnTable [Name(HmRnTable)=='HyRy'])
    HyRy < - HmRnTable [idx] [[1]]
    HyRy [idx] < - 0
    ## repeat for HyRn,HnRy和HnRn


  • 我强烈建议您将大部分代码把它放在一个函数中,该函数可以接受两个参数( i j )或者四个参数(list1,index1,list2 ,index2)。 (你所做的取决于你在变量范围引用方面有多少OCD)。我假定函数将从字面上返回结果 fisher.test 按摩。这将使得在制作该功能的过程中进行更简单的测试,并且稍后将包含在该脚本中。我将在下面引用函数 myfisher(i,j)

  • 我推断你有大量的比较运行(因为每个迭代真的不应该花那么长时间)。 @ Floo0关于 outer 的注释可以工作,就像 expand.grid 一样。无论哪种方式,您都将list1的每个元素与list2的每个元素进行比较。如果您从以下开始:

    $ p $ (例如< - expand.grid(i = 1:length(ratSplit),
    j = 1:length(humanSplit)))
    ## ij
    ## 1 1 1
    ## 2 2 1
    ## 3 3 1
    ## 4 4 1
    ## ...

    这给我们提供了一个简单的data.frame我们可以使用 apply 。老实说,在这种情况下,我喜欢 ddply 的优雅,因为它很容易从 data.frame >数据帧,很容易。

      library(plyr)
    ret < - ddply(例如。 i,j),函数(df){
    (myfisher(df $ i,df $ j),
    data.frame(pv = p.value,ci1 = conf.int [1], ci2 = conf.int [2],
    alt = alternative))
    })



    请注意,我明确没有包含 humanReplicateName ratReplicateName ,因为这些可以在ddply之前添加(或之后,引用 ret $ 而不是例如$ ):

    < $ rat $复制名称名称(ratSplit [例如$ i])
    例如$ humanReplicateName< - 名称(humanSplit [例如$ i])
    / code>

    并且名称也会奇迹般地出现在输出中。少于在循环内处理。

    直到现在,这将导致一个data.frame,然后您可以保存到CSV。



    我会再提出一个建议,这个建议可能会过度,这取决于运行的时间。我有时发现,我的长时间运行的脚本被打断了,也许是因为我不得不在飞行中调整某些东西;我发现了一个bug或者如果电脑不得不重启。没有简单的方法允许中游延续,但我已经采取了一些技巧来缓解这种情况。 ,使用 d_ply ,它不假定返回值的类型。而不是返回(单行)data.frame,立即保存一行(带或不带标题,您的调用)到一个文件。尽管下面的代码并不是真正的原子,因此竞态条件可能会出现,但它足够强大,可以满足我们大多数的需求:

    $ $ p (例如,。(i,j),函数(df){(p,j))。
    fn < - file.path(savedir,sprintf('%s-%s.csv',df $ i,df $ j))
    if(!file.exists(fn)){
    ret < - with(myfisher(df $ i,df $ j),
    data.frame(pv = p.value,ci1 = conf.int [1],ci2 = conf.int [ 2],
    alt = alternative))
    write.table(ret,file = fn,sep =,,append = FALSE,
    row.names = FALSE,col.names = TRUE)

    })

    当它被中断时,你要做的就是查看./output/中的所有文件,删除0字节的文件,然后重新运行,只执行在丢失的文件上。哦,它会变得更好。

  • 并行化。如果你需要这么做(有些功能并没有像其他功能那么强大),你可以改用系统上的多个内核。你可以这样做:

      library(parallel)
    cl < - makeCluster(detectCores() - 1 )#我喜欢保留一个空的
    clusterEvalQ(cl,{
    load('myTwoLists.rda')#with ratSplit和humanSplit变量
    source('mycode.R')#with myfisher (i,j)
    ##如果你得到更高级的
    ),
    例如< - expand.grid(i = 1:length( ratSplit),
    j = 1:length(humanSplit))
    例如$ ratReplicateName< - 名称(ratSplit [例如$ i])
    例如$ humanReplicateName< - (b)例如1,函数(r){
    i < - r [1]; j < - r [2]
    fn< ; - file.path(savedir,sprintf('%s-%s.csv',i,j)
    if(!file.exists(fn)){
    ret< - with(myfisher (i,j),
    data.frame(ratName = r [3],humanName = r [4],
    pv = p.value,ci1 = conf.int [1],ci2 = conf .int [2],
    alt = alternative))
    write.table(ret,file = fn,sep =,,append = FALSE,
    row.names = FALSE,col.names = TRUE)
    }

    stopCluster(cl)

    注意到我们没有引用该行作为data.frame了。如果Hadley的代码与本地平行工作,我会喜欢它。 编辑:我的过去的项目都没有使用 parallel code>使用 ddply ,反之亦然,所以我从来没有玩过 ddply(...,.parallel = TRUE),它使用 foreach


  • 警告Emptor :这个代码在这个上下文中没有经过测试,尽管它几乎是从工作代码复制/粘贴的。由于编辑的原因,我可能会错过或错过了一个教诲。希望它有帮助!


    I have two data frames/ lists of datas 'humanSplitandratSplit` and they are of the form

    > ratSplit$Kidney_F_GSM1328570
      ratGene        ratReplicate alignment RNAtype
    1    Crot Kidney_F_GSM1328570         7     REV
    2    Crot Kidney_F_GSM1328570        12     REV
    3    Crot Kidney_F_GSM1328570         4     REV
    

    and

    > humanSplit$Fetal_Brain_408_AGTCAA_L009_R1_report.txt
       humanGene                            humanReplicate alignment RNAtype
    53     ZFP28 Fetal_Brain_408_AGTCAA_L009_R1_report.txt         5     reg
    55     RC3H1 Fetal_Brain_408_AGTCAA_L009_R1_report.txt         9     reg
    56     IFI27 Fetal_Brain_408_AGTCAA_L009_R1_report.txt         4     reg
    

    And another file used below is geneList of the form:

    ABAT,Abat
    ABCA1,Abca1
    ABCA12,Abca12
    ABCA2,Abca2
    ABCA3,Abca17
    ABCA4,Abca4
    ABCA5,Abca5
    

    now I want to do fisher's exact test between all the elements pair combination between ratSplit and humanSplit after some data manipulation. And ultimately want to write the result of fisher's test in a csv file. Right now I am doing double for loop. But I am wondering how I can use sapply or other related things to make it more efficient.

    currently I am doing the following thing: Here I am first making a data.frame result where I store/append all the information got from fisher's test in a pair in each step. Then finally when the whole loop is done, I write the final result in a csv file. my understanding is to use sapply I need to transform the inside of the loop into a function and then call sapply. But I am not sure what's the best way to optimize it. Any help would be much appreciated

    result <- data.frame(humanReplicate = "human_replicate", ratReplicate = "rat_replicate", pvalue = "p-value", alternative = "alternative_hypothesis", 
                         Conf.int1 = "conf.int1", Conf.int2 ="conf.int2", oddratio = "Odd_Ratio")
    for(i in 1:length(ratSplit)) {
      for(j in 1:length(humanSplit)) {
        ratReplicateName <- names(ratSplit[i])
        humanReplicateName <- names(humanSplit[j])
    
        #merging above two based on the one-to-one gene mapping as in geneList defined above.
        mergedHumanData <-merge(geneList,humanSplit[[j]], by.x = "human", by.y = "humanGene")
        mergedRatData <- merge(geneList, ratSplit[[i]], by.x = "rat", by.y = "ratGene")
    
        mergedHumanData <- mergedHumanData[,c(1,2,4,5)] #rearrange column
        mergedRatData <- mergedRatData[,c(2,1,4,5)]  #rearrange column
        mergedHumanRatData <- rbind(mergedHumanData,mergedRatData) #now the columns are "human", "rat", "alignment", "RNAtype"
    
        agg <- aggregate(RNAtype ~ human+rat, data= mergedHumanRatData, FUN=getGeneType) #agg to make HmYn form
        HmRnTable <- table(agg$RNAtype) #table of HmRn ie RNAtype in human and rat.
    
        #now assign these numbers to variables HmYn. Consider cases when some form of HmRy is not present in the table. That's why
        #is.integer0 function is used
        HyRy <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HyRy"]), 0, HmRnTable[names(HmRnTable) == "HyRy"][[1]])
        HnRn <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HnRn"]), 0, HmRnTable[names(HmRnTable) == "HnRn"][[1]])
        HyRn <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HyRn"]), 0, HmRnTable[names(HmRnTable) == "HyRn"][[1]])
        HnRy <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HnRy"]), 0, HmRnTable[names(HmRnTable) == "HnRy"][[1]])
    
        contingencyTable <- matrix(c(HnRn,HnRy,HyRn,HyRy), nrow = 2)
    
        fisherTest <- fisher.test(contingencyTable) 
        newLine <- data.frame(t(c(humanReplicate = humanReplicateName, ratReplicate = ratReplicateName, pvalue = fisherTest$p,
                                  alternative = fisherTest$alternative, Conf.int1 = fisherTest$conf.int[1], Conf.int2 =fisherTest$conf.int[2], 
                                  oddratio = fisherTest$estimate[[1]])))
    
    
        result <-rbind(result,newLine)
      }
    }
    
    write.table(result, file = "newData5.csv", row.names = FALSE, append = FALSE, col.names = TRUE, sep = ",")
    

    解决方案

    It's difficult to test this since we're missing geneList, but I'm inferring that the code works correctly, so you merely want to speed things up. Here are some pointers to help:

    1. Don't predefine result like this. By setting the first entry of each column to be a string of the column name, you ensure that all subsequent entries will be coerced into strings. Though this doesn't necessarily bite you since you're eventually writing to a CSV anyway, it's bad form. It will bite you if you intend to read it back in to R, since the first row will be used as the column names, but the second row will all be strings forcing subsequent data to be strings as well. (You'll then have to clean your own data, wasteful.)

    2. At the end of your script, you call rbind. This might do alright for a while, but repeated calls to rbind will result in the whole data.frame being copied each time. With more rows being appended, this will cause a not-insignificant slowdown in your code. This can be fixed by one of two methods, listed below.

    3. Since you use each of names(HmRnTable) == "HyRy" twice, my technique is to first save that to a vector (or scalar if you use which(...)), and then use this variable in the subsetting of HmRnTable. It will likely speed things up a little but may also make the code a little easier to read. In fact, you can shorten each of those four assignments to something like (untested):

      idx <- is.integer0(HmRnTable[names(HmRnTable) == 'HyRy'])
      HyRy <- HmRnTable[idx][[1]]
      HyRy[idx] <- 0
      ## repeat for HyRn, HnRy, and HnRn
      

    4. I strongly urge you to take the majority of the code and put it in a function that takes either two arguments (i and j) or four arguments (list1, index1, list2, index2). (Which you do depends on how much OCD you have with regards to variable scope references.) I'll assume that the function will literally return the results from fisher.test with no massaging. This will make for much easier testing while making the function, and for inclusion later on in this script. I'll reference function as myfisher(i,j) below.

    5. I'm inferring that you have a high number of comparisons to run (since each iteration really shouldn't take that long). @Floo0's comment about outer can work, as can expand.grid. Either way, you are comparing each element of list1 with each element of list2. If you start with:

      (eg <- expand.grid(i = 1:length(ratSplit),
                         j = 1:length(humanSplit)))
      ##    i j
      ## 1  1 1
      ## 2  2 1
      ## 3  3 1
      ## 4  4 1
      ## ...
      

      This gives us an easy data.frame on which we can use apply. Honestly, though, I like the elegance of ddply in this case, since it easily goes from a data.frame to a data.frame, easily.

      library(plyr)
      ret <- ddply(eg, .(i, j), function(df) {
          with(myfisher(df$i, df$j),
               data.frame(pv = p.value, ci1 = conf.int[1], ci2 = conf.int[2],
                          alt = alternative))
      })
      

      Note that I expressly did not include humanReplicateName or ratReplicateName since those can be added before the ddply (or after, referencing ret$ instead of eg$) with:

      eg$ratReplicateName <- names(ratSplit[ eg$i ])
      eg$humanReplicateName <- names(humanSplit[ eg$i ])
      

      and the names will magically be in the output, too. Less to deal with inside the loop.

      Up until now, this will result in one data.frame that you can then save to a CSV.

      I'll make one more recommendation that may be overkill depending on how long this will be running. I've found at times that my long-running scripts are interrupted, perhaps because I had to adjust something on-the-fly; I found a bug; or if the computer had to reboot. There is no easy way to allow mid-stream continuation, but I've worked some techniques to mitigate that.

    6. Instead of ddply, use d_ply which does not assume the type of the return value. Instead of returning a (single-row) data.frame, immediately save that one row (with or without headers, your call) to a file. Though the below code is not truly "atomic" and therefore race-conditions may occur, it's robust enough to meet most of our needs:

      library(plyr)
      savedir <- './output'
      ret <- ddply(eg, .(i, j), function(df) {
          fn <- file.path(savedir, sprintf('%s-%s.csv', df$i, df$j))
          if (! file.exists(fn)) {
              ret <- with(myfisher(df$i, df$j),
                          data.frame(pv = p.value, ci1 = conf.int[1], ci2 = conf.int[2],
                                     alt = alternative))
              write.table(ret, file = fn, sep = ",", append = FALSE, 
                          row.names = FALSE, col.names = TRUE)
          }
      })
      

      One goodness of this is that if/when it is interrupted, the most you will have to do is look at all of the files within "./output/", delete files with 0 bytes, and then re-run and it will only execute on the missing files. Oh, and it gets better.

    7. Parallelizing it. If you need to go this far (and some functions don't see as much improvement as others), you can instead use multiple cores on your system. You could do something like this:

      library(parallel)
      cl <- makeCluster(detectCores() - 1) # I like to keep one free
      clusterEvalQ(cl, {
          load('myTwoLists.rda') # with ratSplit and humanSplit variables
          source('mycode.R') # with myfisher(i,j)
          ## any libraries you may want/need to add, if you get more advanced
      })
      eg <- expand.grid(i = 1:length(ratSplit),
                        j = 1:length(humanSplit))
      eg$ratReplicateName <- names(ratSplit[ eg$i ])
      eg$humanReplicateName <- names(humanSplit[ eg$i ])
      ign <- parApply(cl, eg, 1, function(r) {
          i <- r[1] ; j <- r[2]
          fn <- file.path(savedir, sprintf('%s-%s.csv', i, j)
          if (! file.exists(fn)) {
              ret <- with(myfisher(i, j),
                          data.frame(ratName = r[3], humanName = r[4],
                                     pv = p.value, ci1 = conf.int[1], ci2 = conf.int[2],
                                     alt = alternative))
              write.table(ret, file = fn, sep = ",", append = FALSE, 
                          row.names = FALSE, col.names = TRUE)
          }
      })
      stopCluster(cl)
      

      Notice that we aren't referencing the row as a data.frame anymore. I'd love it if Hadley's code worked natively with parallel. (For all I know, it does and I just haven't found it yet!) Edit: none of my past projects using parallel were using ddply, and vice versa, so I have never played with ddply(..., .parallel = TRUE) which uses foreach.

    Caveat Emptor: this code has not been tested in this context, though it is almost copy/pasted from working code. Because of edits, I may be off-by-one or missing a paren. Hope it helps!

    这篇关于应用sapply或其他应用函数而不是嵌套循环来获取数据框的列表的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

    查看全文
    登录 关闭
    扫码关注1秒登录
    发送“验证码”获取 | 15天全站免登陆