Fast and efficient way to expand a dataset in R


Problem description


I am trying to expand a dataset in R using the values from different columns (Key2 - KeyX) and then use the column number in a formula to compute some value.

Example of a part of the dataset I want to expand

Year Key2 Key3 Key4 Key5 ...
2001  150  105  140  140
2002  130   70   55   80
2003  590  375  355  385
...

Preferred result:

  • i = index number
  • col = column number (Key2 = 1, Key3 = 2, etc.)
  • p = random number
  • value = value calculated with the column number and p

    year   i col         p     value
    2001   1   1 0.7481282 4.0150810
    2001   2   1 0.8449366 2.0735090
    2001 ...   1 0.1906882 0.9534411
    2001 150   1 0.8030162 3.7406410
    2001   1   2 0.4147019 4.2246831
    2001   2   2 0.3716995 1.8584977
    2001 ...   2 0.5280272 2.6401361
    2001 105   2 0.8030162 3.7406410
    2001   1   3 0.7651376 3.8256881
    2001   2   3 0.2298984 1.1494923
    2001 ...   3 0.5607825 2.8039128
    2001 140   3 0.7222644 3.6113222
    etc.
    
    2002   1   1 0.1796613 0.8983065
    2002   2   1 0.6390833 3.1954165
    2002 ...   1 0.5280272 2.6401367
    2002 130   1 0.4238842 2.1194210
    2002   1   2 0.7651376 3.8256889
    2002   2   2 0.2298984 1.1494928
    2002 ...   2 0.5607825 2.8039125
    2002  70   2 0.7222644 3.6113227
    2002   1   3 0.7512801 3.7564000
    2002   2   3 0.4484248 2.2421240
    2002 ...   3 0.5662704 2.8313520
    2002  55   3 0.7685377 3.8426884
    etc.
    

I use the following code in R, but it is very slow with a large dataset. I tried to keep the use of loops to a minimum by using rep(), but I still have two for-loops in the code.

Is there a faster / more efficient way to do this in R? Using data.table?

library(data.table)  # for setDT()

# Accumulators; growing vectors with c() inside nested loops copies the
# accumulated data on every iteration, which is the main reason this is slow.
val <- c(); idx <- c(); cols <- c(); p <- c(); years <- c()
for (yr in 1:10) {            # one iteration per year (row of dataset1)
  for (n in 2:25) {           # one iteration per Key column
    col_no <- n - 1
    pu <- runif(dataset1[[yr, n]])   # one random number per unit to expand
    p <- c(p, pu)
    tmp <- (col_no - 1) * 5 + 5 * pu
    val <- c(val, tmp)
    ##
    idx <- c(idx, 1:dataset1[[yr, n]])
    cols <- c(cols, rep(col_no, dataset1[[yr, n]]))
    years <- c(years, rep(dataset1[[yr, 1]], dataset1[[yr, n]]))
  }
}
res.df <- data.frame(year = years, i = idx, cols = cols, p = p, val = val)
res.df <- setDT(res.df)

Solution

The core of the problem is the expansion of the values in the Key columns into i.
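
To make the snippets below self-contained, the sample data from the question can be recreated like this (a sketch; dataset1 is assumed to contain exactly the values shown in the question):

# Recreate the OP's sample data (values copied from the question)
dataset1 <- data.frame(
  Year = c(2001L, 2002L, 2003L),
  Key2 = c(150L, 130L, 590L),
  Key3 = c(105L,  70L, 375L),
  Key4 = c(140L,  55L, 355L),
  Key5 = c(140L,  80L, 385L)
)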

Here is another data.table solution employing melt() but differing in implementation details from David's comment:

library(data.table)
DT <- data.table(dataset1)
expanded <- melt(DT, id.vars = "Year", variable = "col")[, col := rleid(col)][
  , .(i = seq_len(value)), by = .(Year, col)]
expanded

      Year col   i
   1: 2001   1   1
   2: 2001   1   2
   3: 2001   1   3
   4: 2001   1   4
   5: 2001   1   5
  ---             
2571: 2003   4 381
2572: 2003   4 382
2573: 2003   4 383
2574: 2003   4 384
2575: 2003   4 385
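
For reference, rleid() assigns consecutive integer ids to runs of identical values; because melt() stacks the Key columns in order, this maps the factor levels Key2, Key3, ... to the column numbers 1, 2, .... A minimal illustration:

library(data.table)
rleid(c("Key2", "Key2", "Key3", "Key5"))
# [1] 1 1 2 3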

The remaining computations can be done like this (if I have understood the OP's intention right):

set.seed(123L) # make results reproducible
res.df <- expanded[, p := runif(.N)][, value := 5 * (col - 1L + p)][]
res.df

      Year col   i         p     value
   1: 2001   1   1 0.2875775  1.437888
   2: 2001   1   2 0.7883051  3.941526
   3: 2001   1   3 0.4089769  2.044885
   4: 2001   1   4 0.8830174  4.415087
   5: 2001   1   5 0.9404673  4.702336
  ---                                 
2571: 2003   4 381 0.4711072 17.355536
2572: 2003   4 382 0.5323359 17.661680
2573: 2003   4 383 0.3953954 16.976977
2574: 2003   4 384 0.4544372 17.272186
2575: 2003   4 385 0.1149009 15.574505
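
Note that value := 5 * (col - 1L + p) is algebraically identical to the OP's tmp <- (c - 1) * 5 + 5 * pu; for the first row above, 5 * (1 - 1 + 0.2875775) = 1.437888, matching the printed value.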


Benchmarking the different approaches

As the OP asks for a faster / more efficient way, the three different approaches proposed so far (David's comment in two variants, this answer, and ycw's tidyverse solution) are benchmarked:

Benchmark code

For benchmarking, the microbenchmark package is used.

library(magrittr)
bm <- microbenchmark::microbenchmark(
  david1 = {
    expanded_david1 <-
      setorder(
        melt(DT, id = "Year", value = "i", variable = "col")[rep(1:.N, i)], Year, col
      )[, i := seq_len(.N), by = .(Year, col)]
  },
  david2 = {
    expanded_david2 <-
      setorder(
        melt(DT, id = "Year", value = "i", variable = "col")[, col := as.integer(col)][
          rep(1:.N, i)], Year, col)[, i := seq_len(.N), by = .(Year, col)]
  },
  uwe = {
    expanded_uwe <- 
      melt(DT, id.vars = "Year", variable = "col")[, col := rleid(col)][
        , .(i = seq_len(value)), by = .(Year, col)]
  },
  ycw = {
    expanded_ycw <- DT %>%
      tidyr::gather(col, i, - Year) %>%
      dplyr::mutate(col = as.integer(sub("Key", "", col)) - 1L) %>%
      dplyr::rowwise() %>%
      dplyr::do(tibble::data_frame(Year = .$Year, col = .$col, i = seq(1L, .$i, 1L))) %>%
      dplyr::select(Year, i, col) %>%
      dplyr::arrange(Year, col, i)
  },
  times = 100L
)
bm

Note that the tidyverse functions are referenced with explicit namespace prefixes in order to avoid name conflicts from a cluttered namespace. The modified david2 variant converts the factor column col to its integer level codes before expanding.

Timing the small sample data set

With the small sample data set provided by the OP (3 years and 4 Key columns), the timings are as follows:

Unit: microseconds
   expr       min         lq        mean    median         uq        max neval
 david1   993.418  1161.4415   1260.4053  1244.320   1350.987   2000.805   100
 david2  1261.500  1393.2760   1624.5298  1568.097   1703.837   5233.280   100
    uwe   825.772   865.4175    979.2129   911.860   1084.226   1409.890   100
    ycw 93063.262 97798.7005 100423.5148 99226.525 100599.600 205695.817   100

Even for this small problem size, the data.table solutions are orders of magnitude faster than the tidyverse approach, with a slight advantage for solution uwe.

The results are checked to be equal:

all.equal(expanded_david1[, col := as.integer(col)][order(col, Year)], expanded_uwe)
#[1] TRUE
all.equal(expanded_david2[order(col, Year)], expanded_uwe)
#[1] TRUE
all.equal(expanded_ycw, expanded_uwe)
#[1] TRUE

Except for david1, which returns factors instead of integers and a different row order (both adjusted for in the checks above), all four results are identical.

Larger benchmark case

From the OP's code it can be concluded that the production data set consists of 10 years and 24 Key columns. In the sample data set, the overall mean of the Key values is 215. With these parameters, a larger data set is created:

n_yr <- 10L
n_col <- 24L
avg_key <- 215L
col_names <- sprintf("Key%02i", 1L + seq_len(n_col))
DT <- data.table(Year = seq(2001L, by = 1L, length.out = n_yr))
DT[, (col_names) := avg_key]
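
As a quick sanity check under these assumptions (every Key column filled with the constant 215), the expected number of expanded rows is the product of the three parameters:

n_yr * n_col * avg_key
# [1] 51600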

The larger data set expands to 51,600 rows, which is still a rather moderate size but 20 times larger than the small sample. Timings are as follows:

Unit: milliseconds
   expr         min          lq        mean      median          uq         max neval
 david1    2.512805    2.648735    2.726743    2.697065    2.698576    3.076535     5
 david2    2.791838    2.816758    2.998828    3.068605    3.075780    3.241160     5
    uwe    1.329088    1.453312    1.585390    1.514857    1.634551    1.995142     5
    ycw 1641.527166 1643.979936 1646.004905 1645.091158 1646.599219 1652.827047     5

For this problem size, uwe is nearly twice as fast as the other data.table implementations. The tidyverse approach is still orders of magnitude slower.
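
For completeness, the expansion and the value computation can be chained into a single pipeline (a sketch that only rearranges code shown above):

set.seed(123L)
res.df <- melt(DT, id.vars = "Year", variable = "col")[, col := rleid(col)][
    , .(i = seq_len(value)), by = .(Year, col)][
    , p := runif(.N)][, value := 5 * (col - 1L + p)][]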
