What is the fastest way to update a data set in R?

Question

I have a 20000 * 5 data set. It is currently processed iteratively, and the data set is updated on every iteration.

Cells in the data.frame are updated on every iteration, and I am looking for help making this run faster. Since this is a small data.frame, I'm not sure whether data.table would help.

Here are the benchmarks for data.frame subassignment:

sessionInfo()
R version 3.2.4 Revised (2016-03-16 r70336)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)

set.seed(1234)
test <- data.frame(A = rep(LETTERS, 800), B = rep(1:26, 800),
                   C = runif(20800), D = runif(20800), E = rnorm(20800))
microbenchmark::microbenchmark(test[765, "C"] <- test[765, "C"] + 25)
Unit: microseconds
                                  expr     min       lq     mean   median       uq      max neval
 test[765, "C"] <- test[765, "C"] + 25 112.306 130.8485 979.4584 186.3025 197.7565 44556.15   100

Is there a faster way to do this than what I have posted?

Recommended answer

Interestingly enough, data.table doesn't seem to be faster at first glance. Perhaps it gets faster when the assignment is done inside a loop.

library(data.table)
library(microbenchmark)
dt <- data.table(test)

# Accessing the entry
dt[765, "C", with = FALSE]

# Replacing the value with the new one
# Basic data.table syntax
dt[i = 765, C := C + 25]

# Replacing the value with the new one
# using set() from data.table
set(dt, i = 765L, j = "C", value = dt[765L, C] + 25)

# Compare both data.table variants with plain data.frame subassignment
microbenchmark(
  a = set(dt, i = 765L, j = "C", value = dt[765L, C] + 25),
  b = dt[i = 765, C := C + 25],
  c = test[765, "C"] <- test[765, "C"] + 25,
  times = 1000
)

Results of the microbenchmark:

 expr                                                         min      lq     mean  median       uq      max neval
 a = set(dt, i = 765L, j = "C", value = dt[765L, C] + 25) 236.357  46.621 266.4188 250.847 260.2050  572.630  1000
 b = dt[i = 765, `:=`(C, C + 25)]                         333.556 345.329 375.8690 351.668 362.6860 1603.482  1000
 c = test[765, "C"] <- test[765, "C"] + 25                 73.051  81.805 129.1665  84.220  87.6915 1749.281  1000
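
The guess that things may look different inside a loop can be checked directly. The following is a minimal sketch and not part of the original answer: n_iter, rows, and the three helper functions are hypothetical names chosen for illustration, and the vec_loop variant (updating a plain numeric vector and writing the column back once) is an extra technique added for comparison. It assumes the data.table package is installed.

```r
library(data.table)

set.seed(1234)
test <- data.frame(A = rep(LETTERS, 800), B = rep(1:26, 800),
                   C = runif(20800), D = runif(20800), E = rnorm(20800))
dt <- as.data.table(test)

n_iter <- 1000L                                      # hypothetical iteration count
rows   <- sample(nrow(test), n_iter, replace = TRUE) # hypothetical rows to update

# Variant 1: data.frame subassignment, one cell per iteration
df_loop <- function(df, rows) {
  for (r in rows) df[r, "C"] <- df[r, "C"] + 25
  df
}

# Variant 2: data.table set(), which updates by reference.
# The old value is read with [[ ]] so that `[.data.table` is not called.
dt_loop <- function(dt, rows) {
  for (r in rows) set(dt, i = r, j = "C", value = dt[["C"]][r] + 25)
  dt
}

# Variant 3 (assumption, not from the answer): work on a plain vector
# and write the whole column back once after the loop.
vec_loop <- function(df, rows) {
  col <- df[["C"]]
  for (r in rows) col[r] <- col[r] + 25
  df[["C"]] <- col
  df
}

# copy() keeps the original dt untouched, since set() modifies in place
t_df  <- system.time(res_df  <- df_loop(test, rows))
t_dt  <- system.time(res_dt  <- dt_loop(copy(dt), rows))
t_vec <- system.time(res_vec <- vec_loop(test, rows))

# All three variants should produce the same column C
stopifnot(isTRUE(all.equal(res_df$C, res_dt$C)),
          isTRUE(all.equal(res_df$C, res_vec$C)))

print(rbind(data.frame = t_df, set = t_dt, vector = t_vec)[, "elapsed"])
```

Timings depend on the machine, the R version, and the data.table version, so the elapsed times printed here are only meaningful relative to each other on the same system.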
