R data.table - group by column includes list


Problem description


I am trying to use the group-by feature (by) of the data.table package in R.

library(data.table)

start <- as.Date('2014-1-1')
end <- as.Date('2014-1-6')
time.span <- seq(start, end, "days")
a <- data.table(date = time.span, value=c(1,2,3,4,5,6), group=c('a','a','b','b','a','b'))

        date  value group
1   2014-01-01  1   a
2   2014-01-02  2   a
3   2014-01-03  3   b
4   2014-01-04  4   b
5   2014-01-05  5   a
6   2014-01-06  6   b

a[,mean(value),by=group]
> group      V1
 1:   a    2.6667
 2:   b    4.3333

This works fine.

Since I am working with dates, it can happen that a particular date belongs to two groups rather than just one.

a <- data.table(date = time.span, value=c(1,2,3,4,5,6), group=list('a',c('a','b'),'b','b','a','b'))

        date   value  group
1   2014-01-01  1   a
2   2014-01-02  2   c("a", "b")
3   2014-01-03  3   b
4   2014-01-04  4   b
5   2014-01-05  5   a
6   2014-01-06  6   b

a[,mean(value),by=group]
> Error in `[.data.table`(a, , mean(value), by = group) : 
  The items in the 'by' or 'keyby' list are length (1,2,1,1,1,1). Each must be same length as rows in x or number of rows returned by i (6).

I would like the date that belongs to both groups to be used when calculating the mean of group a as well as the mean of group b.

Expected results:

mean a: 2.6667
mean b: 3.75

Is that possible with the data.table package?

Update

Thanks to akrun, my initial issue is solved. After "splitting" the data.table and, in my case, calculating different factors (based on the groups), I need the data.table back in its "original" form, with unique rows based on the date. My solution so far:

a <- data.table(date = time.span, value=c(1,2,3,4,5,6), group=list('a',c('a','b'),'b','b','a','b'))
b <- a[rep(1:nrow(a), lengths(group))][, group:=unlist(a$group)]

       date   value  group
1   2014-01-01  1   a
2   2014-01-02  2   a
3   2014-01-02  2   b
4   2014-01-03  3   b
5   2014-01-04  4   b
6   2014-01-05  5   a
7   2014-01-06  6   b

# create a new column holding the mean of value per group
b[,factor := mean(value), by=group]

# create a new data.table c without duplicate rows (based on date);
# if a row has groups a & b, its factor is the product of both group factors
c <- b[,.(value = unique(value), group = list(group), factor = prod(factor)),by=date]

date     value  group       factor
01/01/14    1   a           2.666666667
02/01/14    2   c("a", "b") 10
03/01/14    3   b           3.75
04/01/14    4   b           3.75
05/01/14    5   a           2.666666667
06/01/14    6   b           3.75

I guess this is not the perfect way to do it, but it works. Any suggestions on how I could do it better?

Alternative solution (really slow):

d <- a[rep(1:nrow(a), lengths(group))][,group:=unlist(a$group)][, mean(value), by = group]
for(i in 1:NROW(a)){
   y1 <- 1
   for(j in a[i,group][[1]]){
       y1 <- y1 * d[group==j, V1]
   }
   a[i, factor := y1]
}

My fastest solution so far:

# split rows that have more than one group
b <- a[rep(1:nrow(a), lengths(group))][, group:=unlist(a$group)]
# calculate mean of different groups
b <- b[,factor := mean(value), by=group]
# only keep date + factor columns
b <- b[,.(date, factor)]
# summarise rows by date 
b <- b[,lapply(.SD,prod), by=date]
# add summarised factor column to initial data.table
c <- merge(a,b,by='date')

Is there any way to make it faster?

Solution

One option is to group by the row sequence. Within each row, we unlist the list column ('group') and paste its elements together (toString(..)). We then use cSplit from splitstackshape with direction='long' to reshape the result into 'long' format, and get the mean of the 'value' column using 'grp' as the grouping variable.

library(data.table)
library(splitstackshape)
a[, grp:= toString(unlist(group)), 1:nrow(a)]
cSplit(a, 'grp', ', ', 'long')[, mean(value), grp]
#  grp       V1
#1:   a 2.666667
#2:   b 3.750000
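
To make the intermediate 'grp' strings easier to picture, here is a minimal base R sketch of the same collapse-then-split idea. It is only an illustration and not how cSplit is implemented: the standalone vectors group and value and the helper names pieces, long and means are assumptions introduced for this example.

```r
# Same data as in the question, kept as plain vectors for the illustration
group <- list("a", c("a", "b"), "b", "b", "a", "b")
value <- c(1, 2, 3, 4, 5, 6)

# Collapse each list element into a single comma-separated string,
# which is what toString(unlist(group)) does for each row
grp <- vapply(group, toString, character(1))

# Split the strings again: one piece per group label
pieces <- strsplit(grp, ", ", fixed = TRUE)

# Long form: repeat each value once per label it belongs to
long <- data.frame(
  value = rep(value, lengths(pieces)),
  grp   = unlist(pieces)
)

# Mean of value per group label
means <- tapply(long$value, long$grp, mean)
print(means)
```

The row for 2014-01-02 contributes its value to both labels, which is why the mean for b differs from the result on the original data.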

Another option using splitstackshape is listCol_l, which unlists a list column into long form. As the output is a data.table, we can use the data.table methods to calculate the mean, which makes the code much more compact.

 listCol_l(a, 'group')[, mean(value), group_ul]
 #  group_ul       V1
 #1:        a 2.666667
 #2:        b 3.750000


Another option, without splitstackshape, is to replicate the rows of the dataset by the length of each list element. lengths is a convenient wrapper for sapply(group, length) and is much faster. We then replace the 'group' column with the unlisted 'group' column from the original 'a' dataset and get the mean of 'value', grouped by 'group'.

 a[rep(1:nrow(a), lengths(group))][,
        group:=unlist(a$group)][, mean(value), by = group]
 #  group       V1
 #1:     a 2.666667
 #2:     b 3.750000
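
The replicate-and-unlist idea can also be carried through to the per-date factor asked about in the update. The following is a sketch under assumptions rather than a benchmarked answer: the names idx, long and per_date are introduced here for illustration, and the update join at the end is one possible substitute for merge, with no claim that it is faster.

```r
library(data.table)

a <- data.table(
  date  = seq(as.Date("2014-01-01"), as.Date("2014-01-06"), by = "days"),
  value = c(1, 2, 3, 4, 5, 6),
  group = list("a", c("a", "b"), "b", "b", "a", "b")
)

# One row index per list element, so row 2 appears twice
idx <- rep(seq_len(nrow(a)), lengths(a$group))

# Long form: one row per (date, group) pair
long <- a[idx, .(date, value)][, group := unlist(a$group)]

# Mean of value per group, stored on every row of that group
long[, factor := mean(value), by = group]

# Product of the group factors for each date
per_date <- long[, .(factor = prod(factor)), by = date]

# Update join: add the per-date factor to the original table by reference
a[per_date, on = "date", factor := i.factor]
print(a)
```

Because the join assigns by reference, a keeps its list column and its one-row-per-date shape, and only gains the factor column.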
