R data.table - group by column includes list
I am trying to use the grouping (by) feature of the data.table package in R.

    library(data.table)

    start <- as.Date('2014-1-1')
    end <- as.Date('2014-1-6')
    time.span <- seq(start, end, "days")
    a <- data.table(date = time.span, value = c(1,2,3,4,5,6), group = c('a','a','b','b','a','b'))

             date value group
    1  2014-01-01     1     a
    2  2014-01-02     2     a
    3  2014-01-03     3     b
    4  2014-01-04     4     b
    5  2014-01-05     5     a
    6  2014-01-06     6     b

    a[, mean(value), by = group]

    >  group     V1
    1:     a 2.6667
    2:     b 4.3333
This works fine.
Since I am working with dates, a particular date can belong to two groups rather than just one.
    a <- data.table(date = time.span, value = c(1,2,3,4,5,6), group = list('a', c('a','b'), 'b', 'b', 'a', 'b'))

             date value       group
    1  2014-01-01     1           a
    2  2014-01-02     2 c("a", "b")
    3  2014-01-03     3           b
    4  2014-01-04     4           b
    5  2014-01-05     5           a
    6  2014-01-06     6           b

    a[, mean(value), by = group]

    > Error in `[.data.table`(a, , mean(value), by = group) :
      The items in the 'by' or 'keyby' list are length (1,2,1,1,1,1). Each must be same length as rows in x or number of rows returned by i (6).
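The lengths reported in the error message above can be checked directly with base R's lengths(). This is only a small sketch; the table is rebuilt here so the snippet runs on its own.

```r
library(data.table)

# Same table as in the question, with 'group' stored as a list column.
a <- data.table(
  date  = seq(as.Date("2014-01-01"), as.Date("2014-01-06"), by = "days"),
  value = c(1, 2, 3, 4, 5, 6),
  group = list("a", c("a", "b"), "b", "b", "a", "b")
)

# Number of group labels attached to each row; the second row carries two.
n_labels <- lengths(a$group)
print(n_labels)
```

Because the second row carries two labels, the list column cannot be used as a grouping key as-is; it has to be expanded to one row per label first.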
I would like the date that belongs to both groups to be used when calculating the mean of group a as well as the mean of group b.
Expected results:
    mean a: 2.6667
    mean b: 3.75
Is that possible with the data.table package?
Update
Thanks to akrun, my initial issue is solved. After "splitting" the data.table and, in my case, calculating different factors (based on the groups), I need the data.table back in its "original" form, with unique rows based on the date. My solution so far:
    a <- data.table(date = time.span, value = c(1,2,3,4,5,6), group = list('a', c('a','b'), 'b', 'b', 'a', 'b'))
    b <- a[rep(1:nrow(a), lengths(group))][, group := unlist(a$group)]

             date value group
    1  2014-01-01     1     a
    2  2014-01-02     2     a
    3  2014-01-02     2     b
    4  2014-01-03     3     b
    5  2014-01-04     4     b
    6  2014-01-05     5     a
    7  2014-01-06     6     b

    # creates new column with mean based on group
    b[, factor := mean(value), by = group]

    # creates new data.table c without duplicate rows (based on date);
    # if a row has groups a & b, it takes the product of their factors
    c <- b[, .(value = unique(value), group = list(group), factor = prod(factor)), by = date]

        date value       group      factor
    01/01/14     1           a 2.666666667
    02/01/14     2 c("a", "b")          10
    03/01/14     3           b        3.75
    04/01/14     4           b        3.75
    05/01/14     5           a 2.666666667
    06/01/14     6           b        3.75
I guess it is not the perfect way to do it, but it works. Any suggestions on how I could do it better?
Alternative solution (really slow!!!):
    d <- a[rep(1:nrow(a), lengths(group))][, group := unlist(a$group)][, mean(value), by = group]
    for (i in 1:NROW(a)) {
      y1 <- 1
      for (j in a[i, group][[1]]) {
        y1 <- y1 * d[group == j, V1]
      }
      a[i, factor := y1]
    }
My fastest solution so far:
    # split rows that have more than one group
    b <- a[rep(1:nrow(a), lengths(group))][, group := unlist(a$group)]
    # calculate mean of the different groups
    b <- b[, factor := mean(value), by = group]
    # only keep date + factor columns
    b <- b[, .(date, factor)]
    # summarise rows by date
    b <- b[, lapply(.SD, prod), by = date]
    # add summarised factor column to the initial data.table
    c <- merge(a, b, by = 'date')
Any chance to make it faster?
Solution

One option would be to group by the row sequence: we unlist the list column ('group'), paste the list elements together (toString(..)), use cSplit from splitstackshape with direction='long' to reshape it into 'long' format, and then get the mean of the 'value' column using 'grp' as the grouping variable.

    library(data.table)
    library(splitstackshape)
    a[, grp := toString(unlist(group)), 1:nrow(a)]
    cSplit(a, 'grp', ', ', 'long')[, mean(value), grp]
    #   grp       V1
    #1:   a 2.666667
    #2:   b 3.750000
Just realized that another option using splitstackshape would be listCol_l, which unlists a list column into long form. As the output is a data.table, we can use the data.table methods to calculate the mean, which makes this much more compact.

    listCol_l(a, 'group')[, mean(value), group_ul]
    #   group_ul       V1
    #1:        a 2.666667
    #2:        b 3.750000
Or another option without using splitstackshape would be to replicate the rows of the dataset by the length of each list element. lengths is a convenient wrapper for sapply(group, length) and is much faster. Then, we change the 'group' column by unlisting the original 'group' from the 'a' dataset and get the mean of 'value', grouped by 'group'.

    a[rep(1:nrow(a), lengths(group))][, group := unlist(a$group)][, mean(value), by = group]
    #   group       V1
    #1:     a 2.666667
    #2:     b 3.750000
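For the follow-up in the question's update (one factor per date, taken as the product of the means of that date's groups), the same expand-then-aggregate idea can be carried one step further. The following is only a sketch under the question's setup, not a benchmarked answer; the names long, per_date, and res are made up for illustration, and grouping by (date, value) assumes each date appears on exactly one row.

```r
library(data.table)

# Table from the question, with 'group' stored as a list column.
a <- data.table(
  date  = seq(as.Date("2014-01-01"), as.Date("2014-01-06"), by = "days"),
  value = c(1, 2, 3, 4, 5, 6),
  group = list("a", c("a", "b"), "b", "b", "a", "b")
)

# Long form: one row per (date, group label) pair.
long <- a[, .(group = unlist(group)), by = .(date, value)]

# Mean of 'value' per group, written onto every row of that group.
long[, factor := mean(value), by = group]

# Back to one row per date: multiply the factors of all groups on that date.
per_date <- long[, .(factor = prod(factor)), by = date]

# Attach the per-date factor to the original table; the list column is kept.
res <- merge(a, per_date, by = "date")
print(res)
```

This avoids the explicit loop and keeps the original list column intact; whether it is faster than the merge-based version in the question would need to be measured on real data.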