Fast and efficient way to expand a dataset in R
I am trying to expand a dataset in R using the values from different columns (Key2 - KeyX) and then use the column number in a formula to compute some value.
Example of a part of the dataset I want to expand
Year Key2 Key3 Key4 Key5 ...
2001 150 105 140 140
2002 130 70 55 80
2003 590 375 355 385
...
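For reproducibility, the sample shown above can be recreated along these lines (a sketch; only the four Key columns shown are included, the real data has more):

dataset1 <- data.frame(
  Year = c(2001L, 2002L, 2003L),
  Key2 = c(150L, 130L, 590L),
  Key3 = c(105L,  70L, 375L),
  Key4 = c(140L,  55L, 355L),
  Key5 = c(140L,  80L, 385L)
)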
Preferred result:
- i = index number
- col = column number (Key2 = 1, Key3 = 2, etc.)
- p = random number
- value = value calculated with the column number and p
year   i  col         p     value
2001   1    1 0.7481282 4.0150810
2001   2    1 0.8449366 2.0735090
2001 ...    1 0.1906882 0.9534411
2001 150    1 0.8030162 3.7406410
2001   1    2 0.4147019 4.2246831
2001   2    2 0.3716995 1.8584977
2001 ...    2 0.5280272 2.6401361
2001 105    2 0.8030162 3.7406410
2001   1    3 0.7651376 3.8256881
2001   2    3 0.2298984 1.1494923
2001 ...    3 0.5607825 2.8039128
2001 140    3 0.7222644 3.6113222
etc.
2002   1    1 0.1796613 0.8983065
2002   2    1 0.6390833 3.1954165
2002 ...    1 0.5280272 2.6401367
2002 130    1 0.4238842 2.1194210
2002   1    2 0.7651376 3.8256889
2002   2    2 0.2298984 1.1494928
2002 ...    2 0.5607825 2.8039125
2002  70    2 0.7222644 3.6113227
2002   1    3 0.7512801 3.7564000
2002   2    3 0.4484248 2.2421240
2002 ...    3 0.5662704 2.8313520
2002  55    3 0.7685377 3.8426884
etc.
I use the following code in R, but it is very slow on a large dataset. I tried to keep the use of loops to a minimum by using rep(), but I still have two for-loops in the code. Is there a faster / more efficient way to do this in R? Using data.table?
library(data.table)  # for setDT() below

val <- c(); i <- c(); cols <- c(); p <- c(); year <- c()
for (y in 1:10) {      # loop over the year rows (loop variable renamed to y
                       # so it does not overwrite the year accumulator)
  for (n in 2:25) {    # loop over the 24 Key columns
    c <- n - 1         # column number (Key2 = 1, Key3 = 2, etc.)
    pu <- runif(dataset1[[y, n]])   # one random number per expanded row
    p <- c(p, pu)
    tmp <- (c - 1) * 5 + 5 * pu     # value computed from column number and p
    val <- c(val, tmp)
    ##
    i <- c(i, 1:dataset1[[y, n]])
    cols <- c(cols, rep(c, dataset1[[y, n]]))
    year <- c(year, rep(dataset1[[y, 1]], dataset1[[y, n]]))
  }
}
res.df <- data.frame(year = year, i = i, cols = cols, p = p, val = val)
res.df <- setDT(res.df)
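Much of the slowness comes from growing vectors with c() inside the loops, which copies the accumulated result on every append. For reference, the same expansion can be written without explicit loops in base R; a sketch, assuming dataset1 has a Year column followed by 24 Key columns as in the loops above (the data.table answer below takes a similar but tidier route):

# cell counts in column-major order: all years of Key2 first, then Key3, ...
counts <- as.vector(as.matrix(dataset1[, 2:25]))
cols   <- rep(rep(1:24, each = nrow(dataset1)), counts)  # column numbers
year   <- rep(rep(dataset1[[1]], times = 24), counts)    # years
i      <- sequence(counts)                               # 1:count per cell
p      <- runif(length(i))                               # random numbers
val    <- (cols - 1) * 5 + 5 * p                         # same formula as above
res.df <- data.frame(year = year, i = i, cols = cols, p = p, val = val)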
The core of the problem is the expansion of the values in the Key columns into i.
Here is another data.table solution employing melt() but differing in implementation details from David's comment:
library(data.table)
DT <- data.table(dataset1)
# melt to long format, convert the factor column "col" to consecutive
# integer ids, then expand each (Year, col) row into `value` rows
expanded <- melt(DT, id.vars = "Year", variable = "col")[, col := rleid(col)][
  , .(i = seq_len(value)), by = .(Year, col)]
expanded
      Year col   i
   1: 2001   1   1
   2: 2001   1   2
   3: 2001   1   3
   4: 2001   1   4
   5: 2001   1   5
  ---
2571: 2003   4 381
2572: 2003   4 382
2573: 2003   4 383
2574: 2003   4 384
2575: 2003   4 385
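For reference, rleid() assigns consecutive run ids, which is what maps the factor levels Key2 ... Key5 to the column numbers 1 ... 4 above (a minimal illustration):

library(data.table)
rleid(c("Key2", "Key2", "Key3", "Key5"))
# [1] 1 1 2 3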
The remaining computations can be done like this (if I have understood the OP's intention right):
set.seed(123L) # make results reproducible
res.df <- expanded[, p := runif(.N)][, value := 5 * (col - 1L + p)][]
res.df
      Year col   i         p     value
   1: 2001   1   1 0.2875775  1.437888
   2: 2001   1   2 0.7883051  3.941526
   3: 2001   1   3 0.4089769  2.044885
   4: 2001   1   4 0.8830174  4.415087
   5: 2001   1   5 0.9404673  4.702336
  ---
2571: 2003   4 381 0.4711072 17.355536
2572: 2003   4 382 0.5323359 17.661680
2573: 2003   4 383 0.3953954 16.976977
2574: 2003   4 384 0.4544372 17.272186
2575: 2003   4 385 0.1149009 15.574505
Benchmarking the different approaches
As the OP is asking for a faster / more efficient way, the three different approaches proposed so far are being benchmarked:
- David's data.table solution plus a modification which ensures the result is identical to the expected result
- ycw's tidyverse solution
- my data.table solution
Benchmark code
For benchmarking, the microbenchmark package is used.
library(magrittr)
bm <- microbenchmark::microbenchmark(
david1 = {
expanded_david1 <-
setorder(
melt(DT, id = "Year", value = "i", variable = "col")[rep(1:.N, i)], Year, col
)[, i := seq_len(.N), by = .(Year, col)]
},
david2 = {
expanded_david2 <-
setorder(
melt(DT, id = "Year", value = "i", variable = "col")[, col := as.integer(col)][
rep(1:.N, i)], Year, col)[, i := seq_len(.N), by = .(Year, col)]
},
uwe = {
expanded_uwe <-
melt(DT, id.vars = "Year", variable = "col")[, col := rleid(col)][
, .(i = seq_len(value)), by = .(Year, col)]
},
ycw = {
expanded_ycw <- DT %>%
tidyr::gather(col, i, - Year) %>%
dplyr::mutate(col = as.integer(sub("Key", "", col)) - 1L) %>%
dplyr::rowwise() %>%
dplyr::do(tibble::data_frame(Year = .$Year, col = .$col, i = seq(1L, .$i, 1L))) %>%
dplyr::select(Year, i, col) %>%
dplyr::arrange(Year, col, i)
},
times = 100L
)
bm
Note that references to tidyverse functions are made explicit in order to avoid name conflicts due to a cluttered namespace. The modified david2 variant converts the factor column col to its underlying integer level codes.
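As a small illustration of that conversion (not part of the benchmark), as.integer() on a factor returns the underlying level codes rather than anything parsed from the labels:

f <- factor(c("Key2", "Key3", "Key5"))
as.integer(f)
# [1] 1 2 3   (level codes, not 2, 3, 5)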
Timing the small sample data set
With the small sample data set provided by the OP (3 years and 4 Key columns), the timings are as follows:
Unit: microseconds
   expr       min         lq        mean    median         uq        max neval
 david1   993.418  1161.4415   1260.4053  1244.320   1350.987   2000.805   100
 david2  1261.500  1393.2760   1624.5298  1568.097   1703.837   5233.280   100
    uwe   825.772   865.4175    979.2129   911.860   1084.226   1409.890   100
    ycw 93063.262 97798.7005 100423.5148 99226.525 100599.600 205695.817   100
Even for this small problem size, the data.table solutions are orders of magnitude faster than the tidyverse approach, with a slight advantage for solution uwe.
The results are checked to be equal:
all.equal(expanded_david1[, col := as.integer(col)][order(col, Year)], expanded_uwe)
#[1] TRUE
all.equal(expanded_david2[order(col, Year)], expanded_uwe)
#[1] TRUE
all.equal(expanded_ycw, expanded_uwe)
#[1] TRUE
Except for david1, which returns factors instead of integers and uses a different ordering (both adjusted for in the checks above), all four results are identical.
Larger benchmark case
From the OP's code it can be concluded that his production data set consists of 10 years and 24 Key columns. In the sample data set, the overall mean of the Key values is 215. With these parameters, a larger data set is created:
n_yr    <- 10L   # years in the production data set
n_col   <- 24L   # number of Key columns
avg_key <- 215L  # overall mean of the Key values in the sample

col_names <- sprintf("Key%02i", 1L + seq_len(n_col))  # "Key02" ... "Key25"
DT <- data.table(Year = seq(2001L, by = 1L, length.out = n_yr))
DT[, (col_names) := avg_key]  # every Key column is set to the constant mean
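If variation around the mean matters for a test, the constant could be replaced by random draws, for example (a variation, not used for the timings below; the +/- 50 range is an arbitrary choice):

set.seed(42L)  # arbitrary seed, just to make this sketch reproducible
DT2 <- data.table(Year = seq(2001L, by = 1L, length.out = n_yr))
DT2[, (col_names) := lapply(seq_len(n_col), function(j)
  sample((avg_key - 50L):(avg_key + 50L), n_yr, replace = TRUE))]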
The larger data set expands to 51600 rows (10 x 24 x 215), which is still rather moderate in size but 20 times larger than the small sample. Timings are as follows:
Unit: milliseconds
   expr         min          lq        mean      median          uq         max neval
 david1    2.512805    2.648735    2.726743    2.697065    2.698576    3.076535     5
 david2    2.791838    2.816758    2.998828    3.068605    3.075780    3.241160     5
    uwe    1.329088    1.453312    1.585390    1.514857    1.634551    1.995142     5
    ycw 1641.527166 1643.979936 1646.004905 1645.091158 1646.599219 1652.827047     5
For this problem size, uwe is nearly twice as fast as the other data.table implementations. The tidyverse approach is still orders of magnitude slower.