R convert data frame to input file - improve performance


Problem description



I'm trying to convert a data frame from R to a text file.

The data set is about 1500 x 700, and looping through the data frame takes a while. I'm wondering if there's any way to speed up the process.

My data frame is like this:

>train2
score   x1    x2    x3     x4     x5 ...  x700
  0     0      1     1      1     0        0
  1     0      1     0      0     0        0
  0     1      0     1      1     1        0
  3     0      1     1      1     0        0
  1     0      1     0      1     0        0
  2     1      1     1      1     0        1
  0     0      1     1      0     0        0
 ...    .      .     .      .     .        .

In the created file I only include cells that are non-zero.

So the output for rows 1-3 would be:

0 | x2:1 x3:1 x4:1
1 | x2:1
0 | x1:1 x3:1 x4:1 x5:1

My current code runs like this:

pt1 <- paste(train2$score, " | ", sep = "")
collect1 <- c()
for (j in 1:nrow(train2)) {
  word1 <- pt1[j]
  for (i in 10:ncol(train2)) {
    if (train2[j, i] != 0) {
      word1 <- paste(word1, colnames(train2)[i], ":", train2[j, i], " ", sep = "")
    }
  }
  collect1 <- c(collect1, word1)
  if (j %% 100 == 0) {
    print(j); flush.console()
    gc()
  }
}

Each run takes about 3-4 minutes. Is there an obvious way to improve the performance?

EDIT: after the loops are completed, the resulting character vector collect1 is used to create a text file using:

  write(collect1, file="outPut1.txt")

Solution

Try vectorizing the operation as follows (I put 'score' in a separate variable and removed it from 'train3' so I wouldn't have to subset the data frame in the anonymous function):

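# Keep the score separately and drop it from the feature columns
score  <- train2$score
train3 <- train2[, -1]
cols   <- colnames(train3)
# For each row, keep only the non-zero cells and format them as "name:value"
res <- apply(train3, 1, function(x) {
  idx  <- x != 0   # logical mask of the non-zero cells in this row
  nms  <- cols[idx]
  vals <- x[idx]
  paste(nms, vals, sep=":", collapse=" ")
})

# Prefix each row with its score, e.g. "0 | x2:1 x3:1 x4:1"
out <- paste(score, "|", as.vector(res))
print(out)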
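
To try this end to end, here is a minimal sketch on toy data that mirrors the first three rows of the example table (only x1 to x5 are included). It assumes every column is numeric and that 'score' is the first column. The names 'feats' and 'out_file' are illustrative and not from the original post, and writeLines() to a temporary file stands in for the write() call in the question.

```r
# Toy data mirroring the first three rows of the example (x1..x5 only).
train2 <- data.frame(
  score = c(0, 1, 0),
  x1 = c(0, 0, 1),
  x2 = c(1, 1, 0),
  x3 = c(1, 0, 1),
  x4 = c(1, 0, 1),
  x5 = c(0, 0, 1)
)

# Convert the feature columns to a matrix once, so rows are plain numeric
# vectors instead of repeated data frame lookups.
feats <- as.matrix(train2[, -1])
cols  <- colnames(feats)

# Build the "name:value" part for each row, keeping only non-zero cells.
res <- apply(feats, 1, function(x) {
  idx <- x != 0
  paste(cols[idx], x[idx], sep = ":", collapse = " ")
})

# Prefix each row with its score.
out <- paste(train2$score, "|", res)

# out_file is a hypothetical path; a temporary file is used here.
out_file <- tempfile(fileext = ".txt")
writeLines(out, out_file)
```

With this toy input, 'out' holds one string per row, and the file contains those same lines. The loop in the question grows 'collect1' and 'word1' one element at a time and indexes the data frame cell by cell, which is likely where most of the time goes; building each row's string with a single paste() call avoids both.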
