Fast and efficient way to expand a dataset in R


Problem description


I am trying to expand a dataset in R using the values from different columns (Key2 - KeyX) and then use the column number in a formula to compute some value.

Example of a part of the dataset I want to expand

Year Key2 Key3 Key4 Key5 ...
2001  150  105  140  140
2002  130   70   55   80
2003  590  375  355  385
...

Preferred result:

  • i = index number
  • col = column number (Key2 = 1, Key3 = 2, etc.)
  • p = random number
  • value = value calculated with the column number and p

    year   i col         p     value
    2001   1   1 0.7481282 4.0150810
    2001   2   1 0.8449366 2.0735090
    2001 ...   1 0.1906882 0.9534411
    2001 150   1 0.8030162 3.7406410
    2001   1   2 0.4147019 4.2246831
    2001   2   2 0.3716995 1.8584977
    2001 ...   2 0.5280272 2.6401361
    2001 105   2 0.8030162 3.7406410
    2001   1   3 0.7651376 3.8256881
    2001   2   3 0.2298984 1.1494923
    2001 ...   3 0.5607825 2.8039128
    2001 140   3 0.7222644 3.6113222
    etc.
    
    2002   1   1 0.1796613 0.8983065
    2002   2   1 0.6390833 3.1954165
    2002 ...   1 0.5280272 2.6401367
    2002 130   1 0.4238842 2.1194210
    2002   1   2 0.7651376 3.8256889
    2002   2   2 0.2298984 1.1494928
    2002 ...   2 0.5607825 2.8039125
    2002  70   2 0.7222644 3.6113227
    2002   1   3 0.7512801 3.7564000
    2002   2   3 0.4484248 2.2421240
    2002 ...   3 0.5662704 2.8313520
    2002  55   3 0.7685377 3.8426884
    etc.
    

I use the following code in R, but it is very slow with a large dataset. I tried to keep the use of loops to a minimum by using rep(), but I still have two for-loops in the code.

Is there a faster / more efficient way to do this in R? Using data.table?

library(data.table)  # for setDT()

# Accumulators; growing vectors with c() inside nested loops copies the
# accumulated data on every iteration, which is the main reason this is slow.
val <- c(); idx <- c(); cols <- c(); p <- c(); years <- c()
for (yr in 1:10) {            # one iteration per year (row of dataset1)
  for (n in 2:25) {           # one iteration per Key column
    col_no <- n - 1
    pu <- runif(dataset1[[yr, n]])   # one random number per unit to expand
    p <- c(p, pu)
    tmp <- (col_no - 1) * 5 + 5 * pu
    val <- c(val, tmp)
    ##
    idx <- c(idx, 1:dataset1[[yr, n]])
    cols <- c(cols, rep(col_no, dataset1[[yr, n]]))
    years <- c(years, rep(dataset1[[yr, 1]], dataset1[[yr, n]]))
  }
}
res.df <- data.frame(year = years, i = idx, cols = cols, p = p, val = val)
res.df <- setDT(res.df)

Solution

The core of the problem is the expansion of the values in the Key columns into i.
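
To make the snippets below self-contained, the sample data from the question can be recreated like this (a sketch; dataset1 is assumed to contain exactly the values shown in the question):

# Recreate the OP's sample data (values copied from the question)
dataset1 <- data.frame(
  Year = c(2001L, 2002L, 2003L),
  Key2 = c(150L, 130L, 590L),
  Key3 = c(105L,  70L, 375L),
  Key4 = c(140L,  55L, 355L),
  Key5 = c(140L,  80L, 385L)
)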

Here is another data.table solution employing melt() but differing in implementation details from David's comment:

library(data.table)
DT <- data.table(dataset1)
expanded <- melt(DT, id.vars = "Year", variable = "col")[, col := rleid(col)][
  , .(i = seq_len(value)), by = .(Year, col)]
expanded

      Year col   i
   1: 2001   1   1
   2: 2001   1   2
   3: 2001   1   3
   4: 2001   1   4
   5: 2001   1   5
  ---             
2571: 2003   4 381
2572: 2003   4 382
2573: 2003   4 383
2574: 2003   4 384
2575: 2003   4 385
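
For reference, rleid() assigns consecutive integer ids to runs of identical values; because melt() stacks the Key columns in order, this maps the factor levels Key2, Key3, ... to the column numbers 1, 2, .... A minimal illustration:

library(data.table)
rleid(c("Key2", "Key2", "Key3", "Key5"))
# [1] 1 1 2 3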

The remaining computations can be done like this (if I have understood the OP's intention right):

set.seed(123L) # make results reproducible
res.df <- expanded[, p := runif(.N)][, value := 5 * (col - 1L + p)][]
res.df

      Year col   i         p     value
   1: 2001   1   1 0.2875775  1.437888
   2: 2001   1   2 0.7883051  3.941526
   3: 2001   1   3 0.4089769  2.044885
   4: 2001   1   4 0.8830174  4.415087
   5: 2001   1   5 0.9404673  4.702336
  ---                                 
2571: 2003   4 381 0.4711072 17.355536
2572: 2003   4 382 0.5323359 17.661680
2573: 2003   4 383 0.3953954 16.976977
2574: 2003   4 384 0.4544372 17.272186
2575: 2003   4 385 0.1149009 15.574505
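
Note that value := 5 * (col - 1L + p) is algebraically identical to the OP's tmp <- (c - 1) * 5 + 5 * pu; for the first row above, 5 * (1 - 1 + 0.2875775) = 1.437888, matching the printed value.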


Benchmarking the different approaches

As the OP asks for a faster / more efficient way, the three different approaches proposed so far (David's comment in two variants, this answer, and ycw's tidyverse solution) are benchmarked:

Benchmark code

For benchmarking, the microbenchmark package is used.

library(magrittr)
bm <- microbenchmark::microbenchmark(
  david1 = {
    expanded_david1 <-
      setorder(
        melt(DT, id = "Year", value = "i", variable = "col")[rep(1:.N, i)], Year, col
      )[, i := seq_len(.N), by = .(Year, col)]
  },
  david2 = {
    expanded_david2 <-
      setorder(
        melt(DT, id = "Year", value = "i", variable = "col")[, col := as.integer(col)][
          rep(1:.N, i)], Year, col)[, i := seq_len(.N), by = .(Year, col)]
  },
  uwe = {
    expanded_uwe <- 
      melt(DT, id.vars = "Year", variable = "col")[, col := rleid(col)][
        , .(i = seq_len(value)), by = .(Year, col)]
  },
  ycw = {
    expanded_ycw <- DT %>%
      tidyr::gather(col, i, - Year) %>%
      dplyr::mutate(col = as.integer(sub("Key", "", col)) - 1L) %>%
      dplyr::rowwise() %>%
      dplyr::do(tibble::data_frame(Year = .$Year, col = .$col, i = seq(1L, .$i, 1L))) %>%
      dplyr::select(Year, i, col) %>%
      dplyr::arrange(Year, col, i)
  },
  times = 100L
)
bm

Note that the tidyverse functions are referenced with explicit namespace prefixes in order to avoid name conflicts from a cluttered namespace. The modified david2 variant converts the factor column col to its integer level codes before expanding.

Timing the small sample data set

With the small sample data set provided by the OP (3 years and 4 Key columns), the timings are as follows:

Unit: microseconds
   expr       min         lq        mean    median         uq        max neval
 david1   993.418  1161.4415   1260.4053  1244.320   1350.987   2000.805   100
 david2  1261.500  1393.2760   1624.5298  1568.097   1703.837   5233.280   100
    uwe   825.772   865.4175    979.2129   911.860   1084.226   1409.890   100
    ycw 93063.262 97798.7005 100423.5148 99226.525 100599.600 205695.817   100

Even for this small problem size, the data.table solutions are orders of magnitude faster than the tidyverse approach, with a slight advantage for solution uwe.

The results are checked to be equal:

all.equal(expanded_david1[, col := as.integer(col)][order(col, Year)], expanded_uwe)
#[1] TRUE
all.equal(expanded_david2[order(col, Year)], expanded_uwe)
#[1] TRUE
all.equal(expanded_ycw, expanded_uwe)
#[1] TRUE

Except for david1, which returns factors instead of integers and a different row order (both adjusted for in the checks above), all four results are identical.

Larger benchmark case

From the OP's code it can be concluded that the production data set consists of 10 years and 24 Key columns. In the sample data set, the overall mean of the Key values is 215. With these parameters, a larger data set is created:

n_yr <- 10L
n_col <- 24L
avg_key <- 215L
col_names <- sprintf("Key%02i", 1L + seq_len(n_col))
DT <- data.table(Year = seq(2001L, by = 1L, length.out = n_yr))
DT[, (col_names) := avg_key]
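
As a quick sanity check under these assumptions (every Key column filled with the constant 215), the expected number of expanded rows is the product of the three parameters:

n_yr * n_col * avg_key
# [1] 51600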

The larger data set expands to 51,600 rows, which is still a rather moderate size but 20 times larger than the small sample. Timings are as follows:

Unit: milliseconds
   expr         min          lq        mean      median          uq         max neval
 david1    2.512805    2.648735    2.726743    2.697065    2.698576    3.076535     5
 david2    2.791838    2.816758    2.998828    3.068605    3.075780    3.241160     5
    uwe    1.329088    1.453312    1.585390    1.514857    1.634551    1.995142     5
    ycw 1641.527166 1643.979936 1646.004905 1645.091158 1646.599219 1652.827047     5

For this problem size, uwe is nearly twice as fast as the other data.table implementations. The tidyverse approach is still orders of magnitude slower.
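
For completeness, the expansion and the value computation can be chained into a single pipeline (a sketch that only rearranges code shown above):

set.seed(123L)
res.df <- melt(DT, id.vars = "Year", variable = "col")[, col := rleid(col)][
    , .(i = seq_len(value)), by = .(Year, col)][
    , p := runif(.N)][, value := 5 * (col - 1L + p)][]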
