Groupby and transform two columns in df into matrix R

Views: 267  Published: 2018/5/30 14:28:18  Tags: r, matrix, group-by, transform

Problem description

I would like to convert the following data.frame into a matrix that counts how many times each bike station id occurs per hour.

    > dim(test)
    [1] 80623     5
    > head(test, n = 10)
       bikeid end.station.id start.station.id diff.time hour
     1  16052            244              322      6544   14
     2  16052            284              432      3406   21
     3  16052            461              519     33416    3
     4  16052            228              519     26876   13
     5  16052             72              435       388   17
     6  16052            319              127     27702   11
     7  16052            282             2002     33882    4
     8  16052            524             2021      2525   10
     9  16052            387              351      2397   12
    10  16052            388              526     32507   13

The output should look like this.

    > sample2
       start.station.id  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
     1               72 44  1 42 22  9 33 39 47 12 30 39 52 43 45 40 62  9 35 24 43 65 59 58 34
     2               79 21 11  2 42  5 18 57 64 32 47 61 43 65 38 46 61 48 29 58 22 35  4 50 31
     3               82 19 44  7 52 14 19  3 30 25 60 33 60 48 54 25 24 42 62 13 51 23 43 54  7
     4               83 45 60 64  5  0  3 54 16 48 67 49 20 59 21 24 38 42 62 38 24  1 35 16  4
     5              116 27 62 64 44 55 65 23 13 36  0 62 54 61  6 16  7 58 41 29  1 34 58 35 67
     6              119 45 30 41 26  7 39 16 55 28 53 42  9  5 31 18 16 14 37 17 14 16 17 23 50
     7              120  3  2  7 53 21 33 31 48 19 50 35 47  8 17 30  9 49  4 48 28 52  9 57 55
     8              127 33 44 47 42  6 46 39 30 39 28 19 57 53 41 45 55  9 27 42 19 43 24 37 55
     9              137 53 11 60  1 66 37 16  5  2 58  0 46 33  0 60 54 25 66 65 40 36 47 58 40
    10              143 61  1 50 62 57 33 12 15 27 19 65 48 12 55 64 14 22 13 12 57 45 13 66 56 66 56

I was advised to use a formula similar to:

    matrix <- test %>%
      group_by(start.station.id, hour) %>%
      summarise(sum = nrow) %>%
      spread(hour, nrow)

but I do not know how to code it properly.

Solution

Using data.table:

    library(data.table)  # 1.9.6+
    setDT(test)          # convert test to a data.table by reference
    dcast(test[ , .N, by = .(start.station.id, hour)],
          start.station.id ~ hour, value.var = "N")

Alternatively (slower, though cleaner):

    dcast(test, start.station.id ~ hour,
          fun.aggregate = length, value.var = "hour")

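Both calls return a data.table rather than a matrix, and the pre-aggregated version leaves NA (not 0) for station/hour pairs that never occur. The following is a minimal sketch of one way to get a zero-filled count matrix. The toy values are made up for illustration, and it assumes a data.table version whose as.matrix method accepts the rownames argument.

```r
library(data.table)

# Toy data standing in for the question's `test` (values are made up).
test <- data.table(
  start.station.id = c(72, 72, 72, 79, 79, 82),
  hour             = c(1, 1, 3, 2, 2, 3)
)

# Count rows per (station, hour) pair, then pivot the hours into columns.
# fill = 0L replaces the NA that dcast would otherwise leave for pairs
# with no rows.
wide <- dcast(test[ , .N, by = .(start.station.id, hour)],
              start.station.id ~ hour, value.var = "N", fill = 0L)

# Move the id column into the row names so only the counts remain.
m <- as.matrix(wide, rownames = "start.station.id")
print(m)
```

Here m has one row per station and one column per hour present in the data, so m["72", "1"] looks up the count for station 72 in hour 1.
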
Testing on some fake data:

    set.seed(10932)
    NN <- 1e6
    test <- data.table(start.station.id = sample(1000, NN, TRUE),
                       hour = sample(24, NN, TRUE))

    library(microbenchmark)

    microbenchmark(times = 100L,
                   preagg = dcast(test[ , .N, by = .(start.station.id, hour)],
                                  start.station.id ~ hour, value.var = "N"),
                   postagg = dcast(test, start.station.id ~ hour,
                                   fun.aggregate = length, value.var = "hour"))

    Unit: milliseconds
        expr      min       lq      mean   median        uq      max neval
      preagg 55.83240 59.88939  66.56289 61.37408  64.37049 166.8902   100
     postagg 91.16012 93.68588 101.17297 96.04823 101.20717 203.4270   100

The reason the first is faster is that the operation test[ , .N, by = vars] has been optimized in data.table.
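
For readers who prefer the dplyr/tidyr pipeline from the question, here is a hedged sketch of how it could be repaired. The attempt fails because summarise(sum = nrow) passes the function nrow instead of a count, and spread(hour, nrow) refers to a column that does not exist. The toy data below is made up, and count() with spread() is only one possible fix (newer tidyr versions offer pivot_wider() in place of spread()).

```r
library(dplyr)
library(tidyr)

# Toy data standing in for the question's `test` (values are made up).
test <- data.frame(
  start.station.id = c(72, 72, 72, 79, 79, 82),
  hour             = c(1, 1, 3, 2, 2, 3)
)

wide <- test %>%
  count(start.station.id, hour) %>%  # one row per pair, count in column `n`
  spread(hour, n, fill = 0)          # hours become columns; absent pairs get 0

# Drop the id column and keep it as row names to obtain a plain matrix.
m <- as.matrix(wide[, -1])
rownames(m) <- wide$start.station.id
print(m)
```
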
