Apply function by row in data.table using columns as arguments


Problem description

I am trying to apply a function by row using data.table, with columns as arguments. I am currently using apply, as suggested here.

However, my data.table has 27 million rows and 7 columns, so the apply operation takes a very long time. When I run it in a loop over many input files, the job takes up all available RAM (32 GB). It's likely that I am copying the data.table multiple times, though I'm not sure about that.

I would like help making this code more memory efficient given that each input file will be ~30 million rows by 7 columns and there are 30 input files to process. I am fairly sure that the lines using apply are slowing down the whole code so alternatives that are more memory efficient or use vectorized functions would probably be better options.

I've had a lot of trouble trying to write a vectorized function that takes in 4 columns as arguments and operates on a row by row basis, using data.table. The apply solution in my example code works but it's very slow. One alternative I tried is:

cols=c("C","T","A","G")
func1<-function(x)x[max1(x)]
datU[,high1a:=func1(cols),by=1:nrow(datU)]

but the first 6 rows of the datU data.table output look like this:

    Cycle   Tab ID  colA    colB    colC    colG    high1   high1a
1   0   45513   -233.781    -84.087 -3.141  3740.916    3740.916    colC
2   0   45513   -103.561    -347.382    2900.866    357.071 2900.866    colC
3   0   45513   153.383 4036.636    353.479 -42.736 4036.636    colC
4   0   45513   -147.941    28.994  4354.994    384.945 4354.994    colC
5   0   45513   -89.719 -504.643    1298.476    131.32  1298.476    colC
6   0   45513   -250.11 -30.862 1877.049    -184.772    1877.049    colC

Here is my code using apply that works (it produced the high1 column above), but is too slow and memory intensive:

#Get input files from top directory, searching through all subdirectories
    file_list <- list.files(pattern = "*.test.txt", recursive=TRUE, full.names=TRUE)

#Make a loop to recursively read files from subdirectories, determine highest and second highest values in specified columns, create new column with those values

    savelist=NULL
    for (i in file_list) {

    datU <- fread(i)
    name=dirname(i)

    #Compute highest and second highest for each row (cols 4,5,6,7) and the difference between highest and second highest values
    maxn <- function(n) function(x) order(x, decreasing = TRUE)[n]
    max1 <- maxn(1)
    max2 <- maxn(2)
    colNum=c(4,5,6,7)
    datU[,high1:=apply(datU[,colNum,with=FALSE],1,function(x)x[max1(x)])]
    datU[,high2:=apply(datU[,colNum,with=FALSE],1,function(x)x[max2(x)])]
    datU[,difference:=high1-high2,by=1:nrow(datU)]
    datU[,folder:=name]
    savelist[[i]]<-datU

}

#Create loop to iterate over folders and output data

sigout=NULL
for (i in savelist) {

   # Do some stuff to manipulate data frames, then merge them for output
setkey(i,Cycle,folder)
# Value columns as named in the sample output above (colA, colB, colC, colG)
Sums<-i[,sum(colA,colB,colC,colG),by=list(Cycle,folder)]
MeanTot<-Sums[,round(mean(V1),3),by=list(Cycle,folder)]
MeanTotsd<-Sums[,round(sd(V1),3),by=list(Cycle,folder)]
Meandiff<-i[,list(meandiff=mean(difference)),by=list(Cycle,folder)]
Meandiffsd<-i[,list(meandiff=sd(difference)),by=list(Cycle,folder)]

# merge() expects the join columns as a character vector of names
df1out<-merge(MeanTot,MeanTotsd,by=c("Cycle","folder"))
df2out<-merge(Meandiff,Meandiffsd,by=c("Cycle","folder"))
sigout<-merge(df1out,df2out)

#Output values 
write.table(sigout,"Sigout.txt",append=TRUE,quote=FALSE,sep=",",row.names=FALSE,col.names=TRUE)
}

I would love some examples concerning alternative functions to apply that will give me the highest and second highest values for each row for columns 4,5,6,7 which can be identified by index or alternatively by column name.

Thank you!

Solution

You could do something like this:

DF <- read.table(text = "    Cycle   Tab ID  colA    colB    colC    colG    high1   high1a
1   0   45513   -233.781    -84.087 -3.141  3740.916    3740.916    colC
                 2   0   45513   -103.561    -347.382    2900.866    357.071 2900.866    colC
                 3   0   45513   153.383 4036.636    353.479 -42.736 4036.636    colC
                 4   0   45513   -147.941    28.994  4354.994    384.945 4354.994    colC
                 5   0   45513   -89.719 -504.643    1298.476    131.32  1298.476    colC
                 6   0   45513   -250.11 -30.862 1877.049    -184.772    1877.049    colC", header = TRUE)

library(data.table)
setDT(DF)

maxTwo <- function(x) {
  ind <- length(x) - (1:0) #the index is equal for all rows,
                           #so it could be made a function parameter
                           #for better efficiency
  as.list(sort.int(x, partial = ind)[ind]) #partial sorting
}

DF[, paste0("max", 1:2) := maxTwo(unlist(.SD)), 
    by = seq_len(nrow(DF)), .SDcols = 4:7]
DF[, diffMax := max2 - max1]

#   Cycle Tab    ID     colA     colB     colC     colG    high1 high1a    max1     max2  diffMax
#1:     1   0 45513 -233.781  -84.087   -3.141 3740.916 3740.916   colC  -3.141 3740.916 3744.057
#2:     2   0 45513 -103.561 -347.382 2900.866  357.071 2900.866   colC 357.071 2900.866 2543.795
#3:     3   0 45513  153.383 4036.636  353.479  -42.736 4036.636   colC 353.479 4036.636 3683.157
#4:     4   0 45513 -147.941   28.994 4354.994  384.945 4354.994   colC 384.945 4354.994 3970.049
#5:     5   0 45513  -89.719 -504.643 1298.476  131.320 1298.476   colC 131.320 1298.476 1167.156
#6:     6   0 45513 -250.110  -30.862 1877.049 -184.772 1877.049   colC -30.862 1877.049 1907.911
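
The comment inside maxTwo notes that the index could be made a function parameter. A minimal sketch of that variant follows, using the first three rows of the example data. The names maxTwoAt and valueCols are hypothetical and only serve the illustration; this is one possible reading of the comment, not code from the original answer.

```r
library(data.table)

# First three rows of the example data (value columns only)
DF <- data.table(
  colA = c(-233.781, -103.561, 153.383),
  colB = c(-84.087, -347.382, 4036.636),
  colC = c(-3.141, 2900.866, 353.479),
  colG = c(3740.916, 357.071, -42.736)
)

# Hypothetical variant of maxTwo(): the positions to extract are passed in
# instead of being recomputed on every call
maxTwoAt <- function(x, ind) as.list(sort.int(x, partial = ind)[ind])

valueCols <- c("colA", "colB", "colC", "colG")
ind <- length(valueCols) - (1:0)  # c(3, 4): second-highest and highest positions

# Still one function call per row; only the index computation is hoisted out
DF[, c("max1", "max2") := maxTwoAt(unlist(.SD, use.names = FALSE), ind),
   by = seq_len(nrow(DF)), .SDcols = valueCols]
DF[, diffMax := max2 - max1]
```

As in the original code, max1 holds the second-highest value and max2 the highest.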

However, you'd still be looping over the rows, which means nrow(DF) calls to the function. You could try Rcpp to do the looping in compiled code.
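
Because there are exactly four value columns, the per-row loop can also be avoided without compiled code by combining pmax and pmin, which work on whole columns at once. This is a sketch of a different technique from the one in the answer above, and it assumes the four columns are named colA, colB, colC and colG as in the sample data; the intermediate names (hiAB, loAB, hiCG, loCG) are hypothetical.

```r
library(data.table)

# First three rows of the example data (value columns only)
DT <- data.table(
  colA = c(-233.781, -103.561, 153.383),
  colB = c(-84.087, -347.382, 4036.636),
  colC = c(-3.141, 2900.866, 353.479),
  colG = c(3740.916, 357.071, -42.736)
)

DT[, c("high1", "high2") := {
  # Split the four columns into two pairs and order each pair row-wise
  hiAB <- pmax(colA, colB); loAB <- pmin(colA, colB)
  hiCG <- pmax(colC, colG); loCG <- pmin(colC, colG)
  list(
    # Highest value: the larger of the two pair maxima
    pmax(hiAB, hiCG),
    # Second highest: either the smaller pair maximum or the larger pair minimum
    pmax(pmin(hiAB, hiCG), pmax(loAB, loCG))
  )
}]
DT[, difference := high1 - high2]
```

The intermediate vectors live only inside the braces, so no temporary columns are added to the table. The approach is tied to four columns; a different column count would need a different comparison scheme.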
