How can I speed up a function in tidyr?


Question

I have data like this:

    n <- 1e5
    set.seed(24)
    df1 <- data.frame(
      query_string = sample(sprintf("%06d", 100:1000), n, replace = TRUE),
      id.x         = sample(1:n),
      s_val        = sample(paste0("F", 400:700), n, replace = TRUE),
      id.y         = sample(100:3000, n, replace = TRUE),
      ID_col_n     = sample(100:1e6, n, replace = TRUE),
      total_id     = 1:n
    )

I use the spread function to put each query_string into a column named after its s_val:

    library(tidyr)

    res <- spread(df1, s_val, value = query_string, fill = NA)

This works perfectly, but when the data is huge it seems never to finish. I can't tell whether my computer has hung or is still running, because after two hours there is still no result.

Can anyone suggest another function, or some other approach, that works faster than spread?

Recommended answer

Based on benchmarks with 1e5 rows, dcast from data.table is faster:

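    library(tidyr)
    library(dplyr)       # provides %>% and arrange(), used below
    library(data.table)

    system.time({res1 <- spread(df1, s_val, value = query_string, fill = NA)})
    #   user  system elapsed 
    #   1.50    0.25    1.75 

    # setDT() converts df1 to a data.table by reference (no copy is made)
    system.time({res2 <- dcast(setDT(df1), id.x + id.y + ID_col_n + total_id ~ s_val,
                               value.var = "query_string")})
    #   user  system elapsed 
    #   0.61    0.03    0.61 

    # Sort both results the same way before comparing them
    res11 <- res1 %>%
               arrange(id.x)
    res21 <- res2[order(id.x)]

    all.equal(as.data.frame(res11), as.data.frame(res21), check.attributes = FALSE)
    # [1] TRUE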
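To make it easier to see what the two calls compute, here is a minimal sketch on a tiny table. The data frame toy and its column id are hypothetical and only stand in for the four identifier columns of df1; the point is that the left-hand side of the dcast formula lists the columns that spread keeps implicitly.

```r
library(tidyr)
library(data.table)

# Hypothetical toy data: one row per (id, s_val) pair
toy <- data.frame(
  id           = c(1, 1, 2, 3),
  s_val        = c("F400", "F401", "F400", "F402"),
  query_string = c("000100", "000200", "000300", "000400"),
  stringsAsFactors = FALSE
)

# tidyr: every column other than key and value identifies a row
wide_tidyr <- spread(toy, s_val, value = query_string, fill = NA)

# data.table: identifier columns are named explicitly, left of the ~
wide_dt <- dcast(as.data.table(toy), id ~ s_val, value.var = "query_string")

# Both give one row per id and one column per s_val
same <- all.equal(wide_tidyr, as.data.frame(wide_dt), check.attributes = FALSE)
print(same)
```

With the real data, the formula needs all four identifier columns on the left-hand side, as in the benchmark above.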

The gap widens as the number of rows grows, for example after changing n to 1e6:

    system.time({res1 <- spread(df1, s_val, value = query_string, fill = NA)})
    #   user  system elapsed 
    #  28.64    3.17   31.91 

    system.time({res2 <- dcast(setDT(df1), id.x + id.y + ID_col_n + total_id ~ s_val,
                               value.var = "query_string")})
    #   user  system elapsed 
    #   5.22    1.08    6.21 
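To check how the two approaches scale on your own machine, the comparison can be wrapped in a small helper and run at several sizes. This is only a sketch: make_data and time_both are hypothetical names, the sizes are kept small so it finishes quickly, and as.data.table (which copies) is used instead of setDT, so the dcast timing here includes the conversion cost.

```r
library(tidyr)
library(data.table)

# Hypothetical helper: same generator as the answer's data, for any n
make_data <- function(n) {
  set.seed(24)
  data.frame(
    query_string = sample(sprintf("%06d", 100:1000), n, replace = TRUE),
    id.x         = sample(1:n),
    s_val        = sample(paste0("F", 400:700), n, replace = TRUE),
    id.y         = sample(100:3000, n, replace = TRUE),
    ID_col_n     = sample(100:1e6, n, replace = TRUE),
    total_id     = 1:n
  )
}

# Hypothetical helper: elapsed seconds for both reshapes at size n
time_both <- function(n) {
  d <- make_data(n)
  t_spread <- system.time(
    spread(d, s_val, value = query_string, fill = NA)
  )[["elapsed"]]
  t_dcast <- system.time(
    dcast(as.data.table(d), id.x + id.y + ID_col_n + total_id ~ s_val,
          value.var = "query_string")
  )[["elapsed"]]
  c(n = n, spread = t_spread, dcast = t_dcast)
}

# One row per size; raise the sizes (e.g. 1e5, 1e6) to reproduce the trend
timings <- do.call(rbind, lapply(c(1e3, 1e4), time_both))
print(timings)
```

A single system.time call is noisy, so repeating each measurement a few times gives a more reliable picture.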

Data

    n <- 1e5
    set.seed(24)
    df1 <- data.frame(
      query_string = sample(sprintf("%06d", 100:1000), n, replace = TRUE),
      id.x         = sample(1:n),
      s_val        = sample(paste0("F", 400:700), n, replace = TRUE),
      id.y         = sample(100:3000, n, replace = TRUE),
      ID_col_n     = sample(100:1e6, n, replace = TRUE),
      total_id     = 1:n
    )
