Aggregation with data.table in R

Problem description

The exercise consists of aggregating a numeric vector of values by a combination of factors with data.table in R. Take the following data table as an example:

require (data.table)
require (plyr)
dtb <- data.table (cbind (expand.grid (month = rep (month.abb[1:3], each = 3),
                                       fac = letters[1:3]),
                          value = rnorm (27)))

Notice that every unique combination of 'month' and 'fac' shows up three times. So, when I average the values by both of these factors, I expect a data frame with 9 unique rows:

(agg1 <- ddply (dtb, c ("month", "fac"), function (dfr) mean (dfr$value)))
  month fac          V1
1   Jan   a -0.36030953
2   Jan   b -0.58444588
3   Jan   c -0.15472876
4   Feb   a -0.05674483
5   Feb   b  0.26415972
6   Feb   c -1.62346772
7   Mar   a  0.24560510
8   Mar   b  0.82548140
9   Mar   c  0.18721114

However, when aggregating with data.table, I keep getting one result for every redundant combination of the two factors:

(agg2 <- dtb[, value := mean (value), by = list (month, fac)])
    month fac       value
 1:   Jan   a -0.36030953
 2:   Jan   a -0.36030953
 3:   Jan   a -0.36030953
 4:   Feb   a -0.05674483
 5:   Feb   a -0.05674483
 6:   Feb   a -0.05674483
 7:   Mar   a  0.24560510
 8:   Mar   a  0.24560510
 9:   Mar   a  0.24560510
10:   Jan   b -0.58444588
11:   Jan   b -0.58444588
12:   Jan   b -0.58444588
13:   Feb   b  0.26415972
14:   Feb   b  0.26415972
15:   Feb   b  0.26415972
16:   Mar   b  0.82548140
17:   Mar   b  0.82548140
18:   Mar   b  0.82548140
19:   Jan   c -0.15472876
20:   Jan   c -0.15472876
21:   Jan   c -0.15472876
22:   Feb   c -1.62346772
23:   Feb   c -1.62346772
24:   Feb   c -1.62346772
25:   Mar   c  0.18721114
26:   Mar   c  0.18721114
27:   Mar   c  0.18721114
    month fac       value

Is there an elegant way to collapse these results to one row per unique combination of factors with data.table?

Recommended answer

The issue (and the reasoning behind it) is that the aggregated value is being assigned, not just calculated.

It is easier to observe this in action if you look at a data.table with more columns than just the ones being used for the computation.

# Therefore, let's add a new column
dtb[, newCol := LETTERS[seq(length(value))]

Note that if we just want to output the computed value, then the expression on the RHS as you have it is just fine.

# This gives the expected results
dtb[, mean (value), by = list (month, fac)]

# This on the other hand assigns the respective values to *each* row
dtb[, value := mean (value), by = list (month, fac)]

In other words, the computed result is reduced to one row per unique combination of the grouping columns.
However, if you save this value back into the SAME data table (which is what happens when using the := operator), then all rows identified in i (all rows by default) are assigned a value. This makes sense when you look at the output with additional columns.
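
A minimal sketch of that difference, assuming the same sample data as in the question. The names `dt` and `agg` are illustrative only, and `set.seed()` is added here just to make the run reproducible:

```r
library(data.table)

set.seed(1)  # assumption: fixed seed, only so the run is reproducible
dt <- data.table(expand.grid(month = rep(month.abb[1:3], each = 3),
                             fac = letters[1:3]),
                 value = rnorm(27))

# Computing only: j returns one row per (month, fac) group
agg <- dt[, mean(value), by = list(month, fac)]
nrow(agg)  # 9

# Assigning: := writes the group mean into every row of dt, by reference
dt[, value := mean(value), by = list(month, fac)]
nrow(dt)   # 27
```

The row count of `dt` is unchanged because `:=` modifies columns in place; it never drops rows.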

Then copying this data.table to agg2 still carries over all the rows.

Therefore, if you want to copy only the unique rows of your original table to a new table, you can either:

a.  wrap the original table inside `unique()` before assigning it
b.  assign the table that is returned when you compute the RHS
    without `:=` (which is what @Arun suggested)

An example of (a) is:

 agg2 <- unique(dtb[, value := mean (value), by = list (month, fac)])
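
For comparison, option (b) might look like the sketch below. It rebuilds the question's sample data so that it runs on its own; naming the result column `value` is a choice made here, not something the original prescribes:

```r
library(data.table)

dtb <- data.table(expand.grid(month = rep(month.abb[1:3], each = 3),
                              fac = letters[1:3]),
                  value = rnorm(27))

# No := here, so dtb is left untouched and the result has one row per group
agg2 <- dtb[, list(value = mean(value)), by = list(month, fac)]

nrow(agg2)  # 9
nrow(dtb)   # 27
```

Unlike option (a), this does not overwrite the original `value` column in `dtb`.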






The following example may help illustrate. (Copy and paste it to run it yourself, since the output is omitted here.)

  # SAMPLE DATA, as above
  library(data.table)
  dtb.bak <- data.table (expand.grid (month = rep (month.abb[1:3], each = 3), fac = letters[1:3]), value = rnorm (27))

  #  METHOD 1  # 
  #------------#
  dtb <- copy(dtb.bak)  # restore, from sample data.


  dtb[, value := mean (value), by = list (month, fac)]
  dtb

  # this is what you would like to assign
  unique(dtb)


  #  METHOD 2  # 
  #------------#
  dtb <- copy(dtb.bak)  # restore, from sample data.

  # this is what you would like to assign
  # next two lines are the same, only difference is column name
  dtb[, mean (value), by = list (month, fac)]
  dtb[, list("mean" = mean (value)), by = list (month, fac)]  # quote marks added for clarity

  # dtb is unchanged. 
  dtb



  # NOW COMPARE THE SAME TWO METHODS, BUT IF THERE IS AN ADDITIONAL COLUMN
  dtb.bak[, newCol := rep(c("A", "B", "A"), length(value)/3)]


  dtb1 <- copy(dtb.bak)  # restore, from sample data.
  dtb2 <- copy(dtb.bak)  # restore, from sample data.


  # Method 1
  dtb1[, value := mean (value), by = list (month, fac)]
  dtb1
  unique(dtb1)

  #  METHOD 2  # 
  dtb2[, list("mean" = mean (value)), by = list (month, fac)]  # quote marks added for clarity
  dtb2

  # METHOD 2, WITH ADDED COLUMNS IN list() in `j`
  dtb2[, list("mean" = mean (value), newCol), by = list (month, fac)]  # quote marks added for clarity
  # notice how this differs from
  unique(dtb1)
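
To make the comparison with the additional column concrete, here is a small sketch that only counts rows. The names `src`, `m1` and `m2` are illustrative, and `set.seed()` is an addition for reproducibility:

```r
library(data.table)

set.seed(1)  # assumption: fixed seed, only so the run is reproducible
src <- data.table(expand.grid(month = rep(month.abb[1:3], each = 3),
                              fac = letters[1:3]),
                  value = rnorm(27))
src[, newCol := rep(c("A", "B", "A"), length(value) / 3)]

# Method 1: assign with :=, then de-duplicate
m1 <- copy(src)
m1[, value := mean(value), by = list(month, fac)]
nrow(unique(m1))  # 18: newCol takes two distinct values inside each group

# Method 2: compute without :=
m2 <- copy(src)[, list(mean = mean(value)), by = list(month, fac)]
nrow(m2)          # 9
```

With an extra column that varies within a group, `unique()` no longer collapses to one row per group, while the plain aggregation in `j` still does.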
