R data.table - group by column includes list


Question

I am trying to use the group-by functionality of the data.table package in R.

library(data.table)

start <- as.Date('2014-1-1')
end <- as.Date('2014-1-6')
time.span <- seq(start, end, "days")
a <- data.table(date = time.span, value=c(1,2,3,4,5,6), group=c('a','a','b','b','a','b'))

        date value group
1 2014-01-01     1     a
2 2014-01-02     2     a
3 2014-01-03     3     b
4 2014-01-04     4     b
5 2014-01-05     5     a
6 2014-01-06     6     b

a[,mean(value),by=group]
   group     V1
1:     a 2.6667
2:     b 4.3333

This works fine.

Since I am working with dates, a particular date can belong to two groups rather than just one.

a <- data.table(date = time.span, value=c(1,2,3,4,5,6), group=list('a',c('a','b'),'b','b','a','b'))

        date value       group
1 2014-01-01     1           a
2 2014-01-02     2 c("a", "b")
3 2014-01-03     3           b
4 2014-01-04     4           b
5 2014-01-05     5           a
6 2014-01-06     6           b

a[,mean(value),by=group]
> Error in `[.data.table`(a, , mean(value), by = group) : 
  The items in the 'by' or 'keyby' list are length (1,2,1,1,1,1). Each must be same length as rows in x or number of rows returned by i (6).
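
The lengths reported in the error message, (1,2,1,1,1,1), match the lengths of the individual elements of the list column. This suggests that `by` treats the list as a list of separate grouping vectors rather than as a single column. A minimal check, reusing the data from the question (the exact wording of the error may differ between data.table versions):

```r
library(data.table)

time.span <- seq(as.Date("2014-01-01"), as.Date("2014-01-06"), by = "days")
a <- data.table(date = time.span, value = c(1, 2, 3, 4, 5, 6),
                group = list("a", c("a", "b"), "b", "b", "a", "b"))

# 'group' is a list column: one list element per row
is.list(a$group)   # TRUE

# Element lengths; these are the numbers quoted in the error message
lengths(a$group)   # 1 2 1 1 1 1
```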

I would like the date that belongs to both groups to be used when calculating the mean of group a as well as the mean of group b.

Expected result:

mean a: 2.6667
mean b: 3.75
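
These values follow from counting the row for 2014-01-02 (value 2) in both groups. A quick arithmetic check in base R, with the values copied by hand from the table above (the variable names are only illustrative):

```r
# Values belonging to each group; the value 2 (2014-01-02) appears in both
values.a <- c(1, 2, 5)
values.b <- c(2, 3, 4, 6)

mean.a <- mean(values.a)  # 8/3, about 2.6667
mean.b <- mean(values.b)  # 15/4 = 3.75
```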

Is that possible with the data.table package?

Update

Thanks to akrun, my initial issue is solved. After "splitting" the data.table and, in my case, calculating different factors (based on the groups), I need the data.table back in its "original" form, with unique rows based on the date. My solution so far:

a <- data.table(date = time.span, value=c(1,2,3,4,5,6), group=list('a',c('a','b'),'b','b','a','b'))
b <- a[rep(1:nrow(a), lengths(group))][, group:=unlist(a$group)]

       date   value  group
1   2014-01-01  1   a
2   2014-01-02  2   a
3   2014-01-02  2   b
4   2014-01-03  3   b
5   2014-01-04  4   b
6   2014-01-05  5   a
7   2014-01-06  6   b

# creates new column with mean based on group
b[,factor := mean(value), by=group] 

#creates new data.table c without duplicate rows (based on date) + if a row has group a & b it creates the product of their factors
c <- b[,.(value = unique(value), group = list(group), factor = prod(factor)),by=date]

date     value  group       factor
01/01/14    1   a           2.666666667
02/01/14    2   c("a", "b") 10
03/01/14    3   b           3.75
04/01/14    4   b           3.75
05/01/14    5   a           2.666666667
06/01/14    6   b           3.75

I guess it is not the perfect way to do it, but it works. Any suggestions on how I could do it better?

Alternative solution (really slow!):

d <- a[rep(1:nrow(a), lengths(group))][,group:=unlist(a$group)][, mean(value), by = group]
for(i in 1:NROW(a)){
   y1 <- 1
   for(j in a[i,group][[1]]){
       y1 <- y1 * d[group==j, V1]
   }
   a[i, factor := y1]
}

My fastest solution so far:

# split rows that have more than one group
b <- a[rep(1:nrow(a), lengths(group))][, group:=unlist(a$group)]
# calculate mean of different groups
b <- b[,factor := mean(value), by=group]
# only keep date + factor columns
b <- b[,.(date, factor)]
# summarise rows by date 
b <- b[,lapply(.SD,prod), by=date]
# add summarised factor column to initial data.table
c <- merge(a,b,by='date')

Is there any chance to make it faster?

Recommended answer

One option would be to group by the row sequence: we unlist the list column ('group'), paste the list elements together (toString(..)), use cSplit from splitstackshape with direction='long' to reshape it into 'long' format, and then get the mean of the 'value' column using 'grp' as the grouping variable.

library(data.table)
library(splitstackshape)
a[, grp:= toString(unlist(group)), 1:nrow(a)]
cSplit(a, 'grp', ', ', 'long')[, mean(value), grp]
#  grp       V1
#1:   a 2.666667
#2:   b 3.750000
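
To make the intermediate 'grp' column visible, the sketch below repeats the toString step with data.table alone. The splitting step uses base R's strsplit purely as a stand-in for cSplit, so it illustrates the idea rather than how cSplit works internally; `long` and `res` are names chosen here for illustration.

```r
library(data.table)

time.span <- seq(as.Date("2014-01-01"), as.Date("2014-01-06"), by = "days")
a <- data.table(date = time.span, value = c(1, 2, 3, 4, 5, 6),
                group = list("a", c("a", "b"), "b", "b", "a", "b"))

# Collapse each row's list element into one comma-separated string
a[, grp := toString(unlist(group)), by = seq_len(nrow(a))]
a$grp
# "a"    "a, b" "b"    "b"    "a"    "b"

# Stand-in for cSplit(..., direction = 'long'): one row per split piece
long <- a[, .(grp = unlist(strsplit(grp, ", ", fixed = TRUE))),
          by = .(date, value)]

# Mean of 'value' per group, as in the answer
res <- long[, .(V1 = mean(value)), by = grp]
```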

Just realized that another option using splitstackshape would be listCol_l, which unlists a list column into long form. As the output is a data.table, we can use the data.table methods to calculate the mean. This is a much more compact way to get the mean.

 listCol_l(a, 'group')[, mean(value), group_ul]
 #  group_ul       V1
 #1:        a 2.666667
 #2:        b 3.750000


Or another option without using splitstackshape would be to replicate the rows of the dataset by the length of each list element. lengths is a convenient equivalent of sapply(group, length) and is much faster. Then, we replace the 'group' column by unlisting the original 'group' from the 'a' dataset and get the mean of 'value', grouped by 'group'.
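
The equivalence between lengths and sapply(group, length), and the row index that rep builds from it, can be checked on the list from the question. This is a small sketch in base R; `n.fast` and `n.slow` are illustrative names only.

```r
group <- list("a", c("a", "b"), "b", "b", "a", "b")

n.fast <- lengths(group)          # element lengths, computed internally
n.slow <- sapply(group, length)   # same result via an R-level loop

identical(n.fast, n.slow)         # TRUE

# Row indices after replication: row 2 is repeated once per group it has
rep(seq_along(group), n.fast)     # 1 2 2 3 4 5 6
```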

 a[rep(1:nrow(a), lengths(group))][,
        group:=unlist(a$group)][, mean(value), by = group]
 #  group       V1
 #1:     a 2.666667
 #2:     b 3.750000
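
For the follow-up in the question's update (a per-date product of the group means), the same row-replication idea can be extended. The sketch below is not part of the original answer: it swaps the question's merge() for a data.table update join, and the names `idx`, `long` and `per.date` are made up for illustration. Whether it is actually faster on real data would need benchmarking.

```r
library(data.table)

time.span <- seq(as.Date("2014-01-01"), as.Date("2014-01-06"), by = "days")
a <- data.table(date = time.span, value = c(1, 2, 3, 4, 5, 6),
                group = list("a", c("a", "b"), "b", "b", "a", "b"))

# Long form: one row per (date, group) pair
idx  <- rep(seq_len(nrow(a)), lengths(a$group))
long <- a[idx][, group := unlist(a$group)]

# Mean of 'value' per group, stored on every row of that group
long[, factor := mean(value), by = group]

# Product of the group means for each date
per.date <- long[, .(factor = prod(factor)), by = date]

# Update join: add 'factor' to the original table by reference
a[per.date, on = "date", factor := i.factor]
```

The result keeps one row per date, with the list column 'group' untouched and 'factor' equal to the product of the means of the groups that date belongs to.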
