Summarize data.table by group

Problem description

I am working with a huge data table in R containing monthly measurements of temperature for multiple locations, taken by different sources.

The dataset looks like this:

library(data.table)

# Generate random data:
loc <- 1:10
dates <- seq(as.Date("2000-01-01"), as.Date("2004-12-31"), by="month")
mods <- c("A","B", "C", "D", "E")
temp <- runif(length(loc)*length(dates)*length(mods), min=0, max=30)
df <- data.table(expand.grid(Location=loc,Date=dates,Model=mods),Temperature=temp)

So basically, for location 1, I have measurements from January 2000 to December 2004 taken by model A. Then I have the measurements made by model B, and so on for models C, D and E. The same pattern repeats for locations 2 through 10.

What I need to do is, instead of having five different temperature measurements (from the models), to take the mean temperature for all the models.

As a result, I would have, for each location and each date, not five but ONLY ONE temperature measurement (that would be a multi-model mean).

I tried:

df2 <- df[, Mean:=mean(Temperature), by=list(Model, Location, Date)]

which didn't work as I expected. I would at least expect the resulting data table to be 1/5th the number of rows of the original table, since I am summarizing five measurements into a single one.

What am I doing wrong?

Recommended answer

I don't think you generated your test data correctly. The function expand.grid() takes the Cartesian product of all its arguments. I'm not sure why you included the Temperature=temp argument in the expand.grid() call; that duplicates each temperature value for every single key combination, resulting in a data.table with 9 million rows (that is, (10*60*5)^2). I think you intended one temperature value per key, which should result in 10*60*5 = 3000 rows:

df <- data.table(expand.grid(Location=loc,Date=dates,Model=mods),Temperature=temp)
df
##       Location       Date Model Temperature
##    1:        1 2000-01-01     A    2.469751
##    2:        2 2000-01-01     A   16.103135
##    3:        3 2000-01-01     A    7.147051
##    4:        4 2000-01-01     A   10.301937
##    5:        5 2000-01-01     A   16.760238
##   ---
## 2996:        6 2004-12-01     E   26.293968
## 2997:        7 2004-12-01     E    8.446528
## 2998:        8 2004-12-01     E   29.003001
## 2999:        9 2004-12-01     E   12.076765
## 3000:       10 2004-12-01     E   28.410980
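
To see the difference in row counts without building 9 million rows, here is a small sketch using base R only. The vector sizes below are stand-ins chosen for readability, not the sizes from the question, and the names crossed, keys and paired are made up for this illustration.

```r
# Small stand-ins for the question's vectors (hypothetical sizes)
loc   <- 1:2
dates <- seq(as.Date("2000-01-01"), by = "month", length.out = 3)
mods  <- c("A", "B")
temp  <- runif(length(loc) * length(dates) * length(mods), min = 0, max = 30)

# Temperature inside expand.grid(): it becomes a fourth factor of the product
crossed <- expand.grid(Location = loc, Date = dates, Model = mods, Temperature = temp)

# Temperature alongside the grid: one value per Location/Date/Model key
keys   <- expand.grid(Location = loc, Date = dates, Model = mods)
paired <- data.frame(keys, Temperature = temp)

nrow(crossed)  # 2 * 3 * 2 * 12 = 144
nrow(paired)   # 2 * 3 * 2 = 12
```

The crossed table has the square of the intended row count, which is the same effect as the (10*60*5)^2 figure above.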

If this is correct, you can generate the means across models with this:

df[,.(Mean=mean(Temperature)),.(Location,Date)]
##      Location       Date      Mean
##   1:        1 2000-01-01  9.498497
##   2:        2 2000-01-01 11.744622
##   3:        3 2000-01-01 15.691228
##   4:        4 2000-01-01 11.457154
##   5:        5 2000-01-01  8.897931
##  ---
## 596:        6 2004-12-01 17.587000
## 597:        7 2004-12-01 19.555963
## 598:        8 2004-12-01 15.710465
## 599:        9 2004-12-01 15.322790
## 600:       10 2004-12-01 20.240392

Note that the := operator does not actually aggregate. It only adds, modifies, or deletes columns in the original data.table. It is possible to add a new column (or overwrite an old column) with duplications of an aggregated calculation (e.g. see http://www.r-bloggers.com/two-of-my-favorite-data-table-features/), but that's not what you want.

In general, when you aggregate a table of data, you are necessarily producing a new table that is reduced to one row per aggregation key. The := operator does not do this.
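
A minimal sketch of that difference, on a made-up toy table rather than the question's data (the names dt, with_mean and agg are chosen only for this illustration):

```r
library(data.table)

# Toy table (hypothetical values): 2 locations x 1 date x 3 models
dt <- data.table(
  Location    = rep(1:2, each = 3),
  Date        = as.Date("2000-01-01"),
  Model       = rep(c("A", "B", "C"), times = 2),
  Temperature = c(10, 20, 30, 5, 15, 25)
)

# := adds a column by reference; every row of a group receives the group mean.
# copy() keeps the original dt untouched for the comparison.
with_mean <- copy(dt)[, Mean := mean(Temperature), by = .(Location, Date)]

# .() in j builds a new, smaller table with one row per group
agg <- dt[, .(Mean = mean(Temperature)), by = .(Location, Date)]

nrow(with_mean)  # 6, same as the input
nrow(agg)        # 2, one per Location/Date pair
```

Note that the grouping here leaves Model out of by; grouping by Model as well, as in the question's attempt, would put every row in its own group.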

Instead, we need to run a normal index operation on the data.table, grouping by the required aggregation key (which will automatically be included in the output data.table), and add to that the j argument which will be evaluated once for each group. The result will be a reduced version of the original table, with the results of all j argument evaluations merged with their respective aggregation keys. Since our j argument results in a scalar value for each group, our result will be one row per Location/Date aggregation key.
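
As a rough illustration of why the scalar result matters, the sketch below compares a j expression that returns one value per group with one that returns two. The toy table and the column names N and Extreme are assumptions made for this example only.

```r
library(data.table)

# Toy table (hypothetical values): 2 locations x 3 measurements on one date
dt <- data.table(
  Location    = rep(1:2, each = 3),
  Date        = as.Date("2000-01-01"),
  Temperature = c(10, 20, 30, 5, 15, 25)
)

# Scalar j expressions: one output row per group; .N is the group's row count
one_row <- dt[, .(Mean = mean(Temperature), N = .N), by = .(Location, Date)]

# Vector-valued j: range() returns two values, so each group yields two rows
two_rows <- dt[, .(Extreme = range(Temperature)), by = .(Location, Date)]

nrow(one_row)   # 2
nrow(two_rows)  # 4
```

In both cases the grouping columns are carried into the result automatically; only the number of values j returns per group changes the row count.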
