根据另一个数据框汇总一个数据框 [英] Summing a dataframe based on another dataframe

查看:87
本文介绍了根据另一个数据框汇总一个数据框的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有10年中10个地点的每日降雨量数据

I have daily data of rainfall from 10 locations across 10 years

set.seed(123)

df <- data.frame(loc.id = rep(1:10, each = 10*365),years = rep(rep(2001:2010,each = 365),times = 10),
             day = rep(rep(1:365,times = 10),times = 10), rain = runif(min = 0 , max = 35, 10*10*365))

我有一个单独的数据框,其中有某些日子,我希望使用这些日子来汇总 df

I have a separate data frame that has certain days using which I want to sum the rainfall in df

df.ref <- data.frame(loc.id = rep(1:10, each = 10), 
                 years = rep(2001:2010,times = 10),
                 index1 = rep(250,times = 10*10),
                 index2 = sample(260:270, size = 10*10,replace = T),
                 index3 = sample(280:290, size = 10*10,replace = T),
                 index4 = sample(291:300, size= 10*10,replace = T))

df.ref

    loc.id years index1 index2 index3 index4
1:      1  2001    250    264    280    296
2:      1  2002    250    269    284    298
3:      1  2003    250    268    289    293
4:      1  2004    250    266    281    295
5:      1  2005    250    260    289    293

我想要的是 df.ref ,使用 df.ref 中的索引值和
汇总<$中的降雨量索引1到索引2,索引1到索引3和索引1到索引4之间的c $ c> df
。例如:

What I want to is for row in in df.ref, use the index values in df.ref and sum the rainfall in df between index1 to index2, index1 to index3 and index1 to index4. For example:

使用 df.ref ,对于loc.id = 1和year == 2001,求和 df 的降雨量从250到264、250到280、250到296(如 df.ref 所示) b $ b同样,对于2002年,对于loc.id = 1,降雨的总和从250到269、250到284、250到298。

Using df.ref, for loc.id = 1 and year == 2001, sum the rainfall in df from 250 to 264, 250 to 280, 250 to 296 (as shown in df.ref) Similarly, for year 2002, for loc.id = 1, sum the rainfall from 250 to 269, 250 to 284, 250 to 298.

library(dplyr)  

ptm <- proc.time()

dat <- df.ref  %>% left_join(df)

index1.cal <- dat %>% group_by(loc.id,years) %>% filter(day >= index1 & day <= index2) %>% summarise(sum.rain1  = sum(rain))

index2.cal <- dat %>% group_by(loc.id,years) %>% filter(day >= index1 & day <= index3) %>% summarise(sum.rain2 = sum(rain))

index3.cal <- dat %>% group_by(loc.id,years) %>% filter(day >= index1 & day <= index4) %>% summarise(sum.rain3 = sum(rain))

all.index <- index1.cal %>% left_join(index2.cal) %>% left_join(index3.cal))

 proc.time() - ptm

user  system elapsed 
2.36    0.64    3.06 

自从我实际使用 df.ref 很大。谁能告诉我如何更快地这样做。

I am looking to make my code faster since my actual df.ref is quite large. Could anyone advise me how to make this quicker.

推荐答案

data.table 包比 dplyr :: left_join 幻灯片 | 视频

每个值在 df 中,找到 df.ref rain 值>在索引1 索引2 之间的 。然后根据 loc.id years rain 的总和c>。

For each value in df, find all the rain values in df.ref that have day in between index 1 and index 2. Then calculate the summation of rain based on loc.id and years.

df1 <- unique(df[df.ref
                , .(rain)
                , on = .(loc.id, years, day >= index1, day <= index2)
                , by = .EACHI][
                  ][
                  , c("sum_1") := .(sum(rain)), by = .(loc.id, years)][
                  # remove all redundant columns
                  , day := NULL][
                  , day := NULL][
                  , rain := NULL])

df2 <- unique(df[df.ref
                , .(rain)
                , on = .(loc.id, years, day >= index1, day <= index3)
                , by = .EACHI][
                  ][
                  , c("sum_2") := .(sum(rain)), by = .(loc.id, years)][
                  , day := NULL][
                  , day := NULL][
                  , rain := NULL])

df3 <- unique(df[df.ref
                , .(rain)
                , on = .(loc.id, years, day >= index1, day <= index4)
                , by = .EACHI][
                  ][
                  , c("sum_3") := .(sum(rain)), by = .(loc.id, years)][
                  , day := NULL][
                  , day := NULL][
                  , rain := NULL])

将所有三个data.tables合并在一起

Merge all three data.tables together

df1[df2, on = .(loc.id, years)][
  df3, on = .(loc.id, years)]

     loc.id years     sum_1    sum_2    sum_3
  1:      1  1950 104159.11 222345.4 271587.1
  2:      1  1951 118689.90 257450.2 347624.3
  3:      1  1952  99262.27 212923.7 280877.6
  4:      1  1953  72435.50 192072.7 251593.6
  5:      1  1954 104021.19 242525.3 326463.4
  6:      1  1955  93436.32 232653.1 304921.4
  7:      1  1956  89122.79 190424.4 255535.0
  8:      1  1957 135658.11 262918.7 346361.4
  9:      1  1958  80064.18 220454.8 292966.4
 10:      1  1959 114231.19 273181.0 349489.2
 11:      2  1950  94360.69 238296.8 301751.8
 12:      2  1951  93845.50 195273.7 289686.0
 13:      2  1952 107692.53 245019.4 308093.7
 14:      2  1953  86650.14 257225.1 332674.1
 15:      2  1954 104085.83 238859.4 286350.7
 16:      2  1955 101602.16 223107.3 300958.4
 17:      2  1956  73912.77 198087.2 276590.1
 18:      2  1957 117780.86 228299.8 305348.5
 19:      2  1958  98625.45 220902.6 291583.7
 20:      2  1959 109851.38 266745.2 324246.8
 [ reached getOption("max.print") -- omitted 81 rows ]

比较处理时间和使用的内存

Compare processing time and memory used

> time_dplyr; time_datatable
   user  system elapsed 
   2.17    0.27    2.61 
   user  system elapsed 
   0.45    0.00    0.69 

  rowname      Class  MB
1     dat data.frame 508
2     df3 data.table  26
3     df2 data.table  20
4     df1 data.table   9

测试约100年的数据时, dplyr 使用了超过50 GB的内存,而 data.table 仅消耗5 GB。 dplyr 也花费了大约四倍的时间来完成。

When testing for about 100 years of data, dplyr used more than 50 GB of memory while data.table consumed only 5 GB. dplyr also took about 4 times longer to finish.

'data.frame':   3650000 obs. of  4 variables:
 $ loc.id: int  1 1 1 1 1 1 1 1 1 1 ...
 $ years : int  1860 1860 1860 1860 1860 1860 1860 1860 1860 1860 ...
 $ day   : int  1 2 3 4 5 6 7 8 9 10 ...
 $ rain  : num  10.1 27.6 14.3 30.9 32.9 ...
'data.frame':   3650000 obs. of  6 variables:
 $ loc.id: int  1 1 1 1 1 1 1 1 1 1 ...
 $ years : int  1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 ...
 $ index1: num  250 250 250 250 250 250 250 250 250 250 ...
 $ index2: int  270 265 262 267 266 265 262 268 260 268 ...
 $ index3: int  290 287 286 289 281 285 286 285 284 283 ...
 $ index4: int  298 297 296 295 298 294 296 298 298 300 ...

> time_dplyr; time_datatable
   user  system elapsed
 95.010  33.704 128.722
   user  system elapsed
 26.175   3.147  29.312

  rowname      Class    MB
1     dat data.frame 50821
2     df3 data.table  2588
3     df2 data.table  2004
4     df1 data.table   888
5  df.ref data.table    97
6      df data.table    70

如果我将年数增加到150,则 dplyr 在具有256 GB RAM的HPC群集节点上达到收支平衡

If I increased the number of years to 150, dplyr broke even on a HPC cluster node with 256 GB RAM

Error in left_join_impl(x, y, by_x, by_y, aux_x, aux_y, na_matches) :
  negative length vectors are not allowed
Calls: %>% ... left_join -> left_join.tbl_df -> left_join_impl -> .Call
Execution halted

这篇关于根据另一个数据框汇总一个数据框的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
相关文章
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆