Summing a dataframe based on another dataframe
Question
I have daily data of rainfall from 10 locations across 10 years:
set.seed(123)
df <- data.frame(loc.id = rep(1:10, each = 10*365),years = rep(rep(2001:2010,each = 365),times = 10),
day = rep(rep(1:365,times = 10),times = 10), rain = runif(min = 0 , max = 35, 10*10*365))
I have a separate data frame that has certain days, which I want to use to sum the rainfall in df:
df.ref <- data.frame(loc.id = rep(1:10, each = 10),
years = rep(2001:2010,times = 10),
index1 = rep(250,times = 10*10),
index2 = sample(260:270, size = 10*10,replace = T),
index3 = sample(280:290, size = 10*10,replace = T),
index4 = sample(291:300, size= 10*10,replace = T))
df.ref
loc.id years index1 index2 index3 index4
1: 1 2001 250 264 280 296
2: 1 2002 250 269 284 298
3: 1 2003 250 268 289 293
4: 1 2004 250 266 281 295
5: 1 2005 250 260 289 293
What I want is, for each row in df.ref, to use the index values in df.ref and sum the rainfall in df between index1 and index2, index1 and index3, and index1 and index4. For example:
Using df.ref, for loc.id == 1 and years == 2001, sum the rainfall in df from day 250 to 264, 250 to 280, and 250 to 296 (as shown in df.ref).
Similarly, for years == 2002 and loc.id == 1, sum the rainfall from day 250 to 269, 250 to 284, and 250 to 298.
This is my current dplyr approach:
library(dplyr)
ptm <- proc.time()
dat <- df.ref %>% left_join(df)
index1.cal <- dat %>% group_by(loc.id,years) %>% filter(day >= index1 & day <= index2) %>% summarise(sum.rain1 = sum(rain))
index2.cal <- dat %>% group_by(loc.id,years) %>% filter(day >= index1 & day <= index3) %>% summarise(sum.rain2 = sum(rain))
index3.cal <- dat %>% group_by(loc.id,years) %>% filter(day >= index1 & day <= index4) %>% summarise(sum.rain3 = sum(rain))
all.index <- index1.cal %>% left_join(index2.cal) %>% left_join(index3.cal)
proc.time() - ptm
user system elapsed
2.36 0.64 3.06
I am looking to make my code faster since my actual df.ref is quite large. Could anyone advise me on how to make this quicker?
Answer
A non-equi join from the data.table package is much faster than dplyr::left_join (slides | video).
For each row in df.ref, find all the rain values in df that have day between index1 and index2. Then calculate the sum of rain grouped by loc.id and years.
library(data.table)
setDT(df)       # the non-equi join below requires data.tables
setDT(df.ref)

df1 <- unique(df[df.ref
, .(rain)
, on = .(loc.id, years, day >= index1, day <= index2)
, by = .EACHI][
, c("sum_1") := .(sum(rain)), by = .(loc.id, years)][
# remove all redundant columns; the two non-equi conditions each
# produce a column named "day", hence day := NULL twice
, day := NULL][
, day := NULL][
, rain := NULL])
df2 <- unique(df[df.ref
, .(rain)
, on = .(loc.id, years, day >= index1, day <= index3)
, by = .EACHI][
, c("sum_2") := .(sum(rain)), by = .(loc.id, years)][
, day := NULL][
, day := NULL][
, rain := NULL])
df3 <- unique(df[df.ref
, .(rain)
, on = .(loc.id, years, day >= index1, day <= index4)
, by = .EACHI][
, c("sum_3") := .(sum(rain)), by = .(loc.id, years)][
, day := NULL][
, day := NULL][
, rain := NULL])
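As a side note, the same per-index sum can be written more compactly by aggregating inside j during the join itself, which avoids the duplicated day columns and the unique() call (a sketch of the same logic as above, shown here for index2 only):

```r
# Aggregate per row of df.ref during the non-equi join itself:
# by = .EACHI evaluates sum(rain) once per (loc.id, years) row of df.ref,
# then we keep only the grouping keys and the computed sum.
df1 <- df[df.ref,
          .(sum_1 = sum(rain)),
          on = .(loc.id, years, day >= index1, day <= index2),
          by = .EACHI][, .(loc.id, years, sum_1)]
```

Each row of df.ref holds exactly one (loc.id, years) pair, so grouping with by = .EACHI already yields one sum per location-year.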
Merge all three data.tables together:
df1[df2, on = .(loc.id, years)][
df3, on = .(loc.id, years)]
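If more index columns are added later, the chained join can be generalized with Reduce instead of writing one join per table (a sketch; df1, df2, df3 are the summary tables built above):

```r
# Fold a list of summary tables into one result by joining
# each table onto the accumulated result on the shared keys.
result <- Reduce(function(x, y) x[y, on = .(loc.id, years)],
                 list(df1, df2, df3))
```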
loc.id years sum_1 sum_2 sum_3
1: 1 1950 104159.11 222345.4 271587.1
2: 1 1951 118689.90 257450.2 347624.3
3: 1 1952 99262.27 212923.7 280877.6
4: 1 1953 72435.50 192072.7 251593.6
5: 1 1954 104021.19 242525.3 326463.4
6: 1 1955 93436.32 232653.1 304921.4
7: 1 1956 89122.79 190424.4 255535.0
8: 1 1957 135658.11 262918.7 346361.4
9: 1 1958 80064.18 220454.8 292966.4
10: 1 1959 114231.19 273181.0 349489.2
11: 2 1950 94360.69 238296.8 301751.8
12: 2 1951 93845.50 195273.7 289686.0
13: 2 1952 107692.53 245019.4 308093.7
14: 2 1953 86650.14 257225.1 332674.1
15: 2 1954 104085.83 238859.4 286350.7
16: 2 1955 101602.16 223107.3 300958.4
17: 2 1956 73912.77 198087.2 276590.1
18: 2 1957 117780.86 228299.8 305348.5
19: 2 1958 98625.45 220902.6 291583.7
20: 2 1959 109851.38 266745.2 324246.8
[ reached getOption("max.print") -- omitted 81 rows ]
Compare processing time and memory used:
> time_dplyr; time_datatable
user system elapsed
2.17 0.27 2.61
user system elapsed
0.45 0.00 0.69
rowname Class MB
1 dat data.frame 508
2 df3 data.table 26
3 df2 data.table 20
4 df1 data.table 9
When testing with about 100 years of data, dplyr used more than 50 GB of memory while data.table consumed only 5 GB. dplyr also took about 4 times longer to finish.
'data.frame': 3650000 obs. of 4 variables:
$ loc.id: int 1 1 1 1 1 1 1 1 1 1 ...
$ years : int 1860 1860 1860 1860 1860 1860 1860 1860 1860 1860 ...
$ day : int 1 2 3 4 5 6 7 8 9 10 ...
$ rain : num 10.1 27.6 14.3 30.9 32.9 ...
'data.frame': 3650000 obs. of 6 variables:
$ loc.id: int 1 1 1 1 1 1 1 1 1 1 ...
$ years : int 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 ...
$ index1: num 250 250 250 250 250 250 250 250 250 250 ...
$ index2: int 270 265 262 267 266 265 262 268 260 268 ...
$ index3: int 290 287 286 289 281 285 286 285 284 283 ...
$ index4: int 298 297 296 295 298 294 296 298 298 300 ...
> time_dplyr; time_datatable
user system elapsed
95.010 33.704 128.722
user system elapsed
26.175 3.147 29.312
rowname Class MB
1 dat data.frame 50821
2 df3 data.table 2588
3 df2 data.table 2004
4 df1 data.table 888
5 df.ref data.table 97
6 df data.table 70
If I increased the number of years to 150, dplyr broke, even on an HPC cluster node with 256 GB of RAM:
Error in left_join_impl(x, y, by_x, by_y, aux_x, aux_y, na_matches) :
negative length vectors are not allowed
Calls: %>% ... left_join -> left_join.tbl_df -> left_join_impl -> .Call
Execution halted