R data.table - group by column includes list
Problem description
I am trying to use the group-by feature of the data.table package in R.
library(data.table)
start <- as.Date('2014-1-1')
end <- as.Date('2014-1-6')
time.span <- seq(start, end, "days")
a <- data.table(date = time.span, value=c(1,2,3,4,5,6), group=c('a','a','b','b','a','b'))
date value group
1 2014-01-01 1 a
2 2014-01-02 2 a
3 2014-01-03 3 b
4 2014-01-04 4 b
5 2014-01-05 5 a
6 2014-01-06 6 b
a[,mean(value),by=group]
> group V1
1: a 2.6667
2: b 4.3333
This works fine.
Since I am working with dates, it can happen that a particular date belongs to two groups rather than just one.
a <- data.table(date = time.span, value=c(1,2,3,4,5,6), group=list('a',c('a','b'),'b','b','a','b'))
date value group
1 2014-01-01 1 a
2 2014-01-02 2 c("a", "b")
3 2014-01-03 3 b
4 2014-01-04 4 b
5 2014-01-05 5 a
6 2014-01-06 6 b
a[,mean(value),by=group]
> Error in `[.data.table`(a, , mean(value), by = group) :
The items in the 'by' or 'keyby' list are length (1,2,1,1,1,1). Each must be same length as rows in x or number of rows returned by i (6).
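The lengths quoted in the error message can be reproduced directly, which suggests that data.table is treating the list column as a list of separate grouping vectors rather than as one grouping value per row. The following is a minimal sketch using the sample data from the question; the name `cell_lengths` is only illustrative.

```r
library(data.table)

a <- data.table(date  = seq(as.Date("2014-01-01"), as.Date("2014-01-06"), by = "days"),
                value = c(1, 2, 3, 4, 5, 6),
                group = list("a", c("a", "b"), "b", "b", "a", "b"))

# number of elements stored in each cell of the list column
cell_lengths <- lengths(a$group)
print(cell_lengths)

# the column is a list, not an atomic vector
print(is.list(a$group))
```

The second row holds two elements, which matches the `(1,2,1,1,1,1)` reported in the error.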
I would like the date that belongs to both groups to be used when calculating the mean of group a as well as the mean of group b.
Expected result:
mean a: 2.6667
mean b: 3.75
Is that possible with the data.table package?
Update
Thanks to akrun, my initial issue is solved. After "splitting" the data.table and, in my case, calculating different factors (based on the groups), I need the data.table back in its "original" form, with unique rows based on the date. My solution so far:
a <- data.table(date = time.span, value=c(1,2,3,4,5,6), group=list('a',c('a','b'),'b','b','a','b'))
b <- a[rep(1:nrow(a), lengths(group))][, group:=unlist(a$group)]
date value group
1 2014-01-01 1 a
2 2014-01-02 2 a
3 2014-01-02 2 b
4 2014-01-03 3 b
5 2014-01-04 4 b
6 2014-01-05 5 a
7 2014-01-06 6 b
# creates new column with mean based on group
b[,factor := mean(value), by=group]
# creates a new data.table c without duplicate rows (based on date); if a row has groups a and b, its factor is the product of both group factors
c <- b[,.(value = unique(value), group = list(group), factor = prod(factor)),by=date]
date value group factor
01/01/14 1 a 2.666666667
02/01/14 2 c("a", "b") 10
03/01/14 3 b 3.75
04/01/14 4 b 3.75
05/01/14 5 a 2.666666667
06/01/14 6 b 3.75
I guess it is not the perfect way to do it, but it works. Any suggestions on how I could do it better?
Alternative solution (really slow!):
d <- a[rep(1:nrow(a), lengths(group))][,group:=unlist(a$group)][, mean(value), by = group]
for (i in 1:NROW(a)) {
  y1 <- 1
  # multiply the means of every group this row belongs to
  for (j in a[i, group][[1]]) {
    y1 <- y1 * d[group == j, V1]
  }
  a[i, factor := y1]
}
My fastest solution so far:
# split rows that have more than one group
b <- a[rep(1:nrow(a), lengths(group))][, group:=unlist(a$group)]
# calculate mean of different groups
b <- b[,factor := mean(value), by=group]
# only keep date + factor columns
b <- b[,.(date, factor)]
# summarise rows by date
b <- b[,lapply(.SD,prod), by=date]
# add summarised factor column to initial data.table
c <- merge(a,b,by='date')
Is there any chance to make it faster?
Recommended answer
One option would be to group by the row sequence: we unlist the list column ('group'), paste the list elements together (toString(..)), and store the result in a new 'grp' column. We then use cSplit from splitstackshape with direction='long' to reshape the data into 'long' format, and get the mean of the 'value' column using 'grp' as the grouping variable.
library(data.table)
library(splitstackshape)
a[, grp:= toString(unlist(group)), 1:nrow(a)]
cSplit(a, 'grp', ', ', 'long')[, mean(value), grp]
# grp V1
#1: a 2.666667
#2: b 3.750000
Just realized that another option using splitstackshape would be listCol_l, which unlists a list column into long form. As the output is a data.table, we can use the data.table methods to calculate the mean. This is a much more compact way to get the mean.
listCol_l(a, 'group')[, mean(value), group_ul]
# group_ul V1
#1: a 2.666667
#2: b 3.750000
Or another option, without using splitstackshape, would be to replicate the rows of the dataset by the length of each list element. lengths is a convenient and much faster equivalent of sapply(group, length). Then we replace the 'group' column by unlisting the original 'group' column from the 'a' dataset, and get the mean of 'value', grouped by 'group'.
a[rep(1:nrow(a), lengths(group))][,
group:=unlist(a$group)][, mean(value), by = group]
# group V1
#1: a 2.666667
#2: b 3.750000
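For anyone who wants to run this last approach end to end, here is a minimal self-contained sketch built on the sample data from the question. The names `long`, `res` and `mean_value` are mine and are not part of the original answer, and `lengths` assumes a reasonably recent version of R.

```r
library(data.table)

time.span <- seq(as.Date("2014-01-01"), as.Date("2014-01-06"), by = "days")
a <- data.table(date  = time.span,
                value = c(1, 2, 3, 4, 5, 6),
                group = list("a", c("a", "b"), "b", "b", "a", "b"))

# repeat each row once per group it belongs to,
# then replace the list column with the flattened group labels
long <- a[rep(seq_len(nrow(a)), lengths(group))][, group := unlist(a$group)]

# group means computed on the long table
res <- long[, .(mean_value = mean(value)), by = group]
print(res)
```

The row for 2014-01-02 appears twice in `long`, once per group, so its value contributes to both means, which is what the question asks for.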