Groupby and transform two columns in df into matrix (R)


Problem description


I would like to convert the following data.frame into a matrix that counts how many times each start station id occurs in each hour.


> dim(test)
[1] 80623     5

head(test, n = 10)
   bikeid end.station.id start.station.id diff.time hour
1   16052            244              322      6544   14
2   16052            284              432      3406   21
3   16052            461              519     33416    3
4   16052            228              519     26876   13
5   16052             72              435       388   17
6   16052            319              127     27702   11
7   16052            282             2002     33882    4
8   16052            524             2021      2525   10
9   16052            387              351      2397   12
10  16052            388              526     32507   13


The output should look like this.

> sample2
   start.station.id  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
1                72 44  1 42 22  9 33 39 47 12 30 39 52 43 45 40 62  9 35 24 43 65 59 58 34
2                79 21 11  2 42  5 18 57 64 32 47 61 43 65 38 46 61 48 29 58 22 35  4 50 31
3                82 19 44  7 52 14 19  3 30 25 60 33 60 48 54 25 24 42 62 13 51 23 43 54  7
4                83 45 60 64  5  0  3 54 16 48 67 49 20 59 21 24 38 42 62 38 24  1 35 16  4
5               116 27 62 64 44 55 65 23 13 36  0 62 54 61  6 16  7 58 41 29  1 34 58 35 67
6               119 45 30 41 26  7 39 16 55 28 53 42  9  5 31 18 16 14 37 17 14 16 17 23 50
7               120  3  2  7 53 21 33 31 48 19 50 35 47  8 17 30  9 49  4 48 28 52  9 57 55
8               127 33 44 47 42  6 46 39 30 39 28 19 57 53 41 45 55  9 27 42 19 43 24 37 55
9               137 53 11 60  1 66 37 16  5  2 58  0 46 33  0 60 54 25 66 65 40 36 47 58 40
10              143 61  1 50 62 57 33 12 15 27 19 65 48 12 55 64 14 22 13 12 57 45 13 66 56 66 56

I was advised to use something similar to:

matrix <- test %>% 
  group_by(start.station.id, hour)%>%
  summarise(sum = nrow) %>%
  spread(hour, nrow) 

but I do not know how to write it properly.

Solution

Using data.table:

library(data.table) #1.9.6+
setDT(test)
dcast(test[ , .N, by = .(start.station.id, hour)],
      start.station.id ~ hour, value.var = "N")
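To make the two steps concrete, here is a minimal sketch on a tiny table. The `toy` data and its values are made up for illustration and are not the asker's data. It also adds `fill = 0L`, which is my addition rather than part of the answer: with pre-aggregated counts, a station/hour pair that never occurs has no row to cast, so `dcast` would otherwise leave that cell as `NA`.

```r
library(data.table)

# Toy data: hypothetical values chosen for illustration only
toy <- data.table(
  start.station.id = c(72, 72, 72, 79, 79, 82),
  hour             = c(1, 1, 2, 1, 3, 2)
)

# Step 1: one row per (station, hour) pair, with its count in column N
counts <- toy[, .N, by = .(start.station.id, hour)]

# Step 2: spread the hours into columns; without `fill`, pairs that
# never occur (e.g. station 82 at hour 1) would come out as NA
wide <- dcast(counts, start.station.id ~ hour, value.var = "N", fill = 0L)
print(wide)
```

Whether missing pairs should be 0 or `NA` depends on what the matrix is used for; for counts, 0 is usually the intended value.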

Alternatively (slower, though cleaner):

dcast(test, start.station.id ~ hour, fun.aggregate = length, value.var = "hour")
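Both calls return a data.table, while the question asks for a matrix. A possible last step, sketched here on made-up `toy` data, is `as.matrix()` with its `rownames` argument, which I am assuming is available (it exists in reasonably recent data.table releases, not necessarily in 1.9.6).

```r
library(data.table)

# Toy data: hypothetical values chosen for illustration only
toy <- data.table(
  start.station.id = c(72, 72, 72, 79, 79, 82),
  hour             = c(1, 1, 2, 1, 3, 2)
)

# Count rows per station/hour while casting; empty cells get length(0-row) = 0
wide <- dcast(toy, start.station.id ~ hour,
              fun.aggregate = length, value.var = "hour")

# Move the id column into the row names so only the counts remain as cells
# (assumes a data.table version whose as.matrix() method accepts `rownames`)
m <- as.matrix(wide, rownames = "start.station.id")
print(m)
```

The result can then be indexed by station and hour labels, for example `m["79", "3"]`.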

Testing on some fake data:

set.seed(10932)
NN <- 1e6
test <- data.table(start.station.id = sample(1000, NN, replace = TRUE),
                   hour = sample(24, NN, replace = TRUE))

library(microbenchmark)

microbenchmark(times = 100L,
               preagg = dcast(test[ , .N, by = .(start.station.id, hour)],
                              start.station.id ~ hour, value.var = "N"),
               postagg = dcast(test, start.station.id ~ hour, 
                               fun.aggregate = length, value.var = "hour"))

Unit: milliseconds
    expr      min       lq      mean   median        uq      max neval
  preagg 55.83240 59.88939  66.56289 61.37408  64.37049 166.8902   100
 postagg 91.16012 93.68588 101.17297 96.04823 101.20717 203.4270   100

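The first version is faster because the grouped count test[ , .N, by = vars] is optimized internally in data.table, so dcast only has to reshape the already aggregated counts rather than aggregate the raw rows itself.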
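For readers who want to stay with the dplyr pipeline from the question, one way it could be repaired is sketched below. This is not part of the original answer: it swaps in `count()` for the `group_by()`/`summarise()` pair and `pivot_wider()` (tidyr's current replacement for `spread()`) for the reshaping step. The `toy` data is hypothetical, and `names_sort` assumes tidyr 1.1.0 or later.

```r
library(dplyr)
library(tidyr)

# Toy data: hypothetical values chosen for illustration only
toy <- data.frame(
  start.station.id = c(72, 72, 72, 79, 79, 82),
  hour             = c(1, 1, 2, 1, 3, 2)
)

wide <- toy %>%
  count(start.station.id, hour) %>%        # adds the count column `n`
  pivot_wider(names_from  = hour,          # one column per hour
              values_from = n,
              values_fill = 0L,            # pairs that never occur become 0
              names_sort  = TRUE)          # order hour columns by value

# Optional: turn the wide tibble into a matrix with station ids as row names
m <- as.matrix(wide[, -1])
rownames(m) <- wide$start.station.id
print(m)
```

The original attempt failed mainly because `nrow` was passed as a bare function name rather than a count of rows, and because `spread()` was pointed at a column (`nrow`) that the pipeline never created.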
