Summarize a data.table across multiple columns


Problem description

How do I summarize a data.table with unreliable data across multiple columns?

Specifically, given

fields <- c("country","language")
dt <- data.table(user=c(rep(3, 5), rep(4, 5)),
                 behavior=c(rep(FALSE,5),rep(TRUE,5)),
                 country=c(rep(1,4),rep(2,6)),
                 language=c(rep(6,6),rep(5,4)),
                 event=1:10, key=c("user",fields))
dt
#     user behavior country language event
#  1:    3    FALSE       1        6     1
#  2:    3    FALSE       1        6     2
#  3:    3    FALSE       1        6     3
#  4:    3    FALSE       1        6     4
#  5:    3    FALSE       2        6     5
#  6:    4     TRUE       2        5     7
#  7:    4     TRUE       2        5     8
#  8:    4     TRUE       2        5     9
#  9:    4     TRUE       2        5    10
# 10:    4     TRUE       2        6     6

I want to get

#    user behavior country.name country.support language.name language.support
# 1:    3    FALSE            1             0.8             6              1.0
# 2:    4     TRUE            2             1.0             5              0.8

(here x.name is the most common value of x for the user, and x.support is the share of that user's events in which this top x was observed)

without having to go through both fields by hand like this:

users <- dt[, sum(behavior) > 0, by=user] # have behavior at least once
setnames(users, "V1", "behavior")
dt.out <- dt[, .N, by=list(user,country)
             ][, list(country[which.max(N)],max(N)/sum(N)), by=user]
setnames(dt.out, c("V1", "V2"),  paste0("country",c(".name", ".support")))
users <- users[dt.out]
dt.out <- dt[, .N, by=list(user,language)
             ][, list(language[which.max(N)], max(N)/sum(N)), by=user]
setnames(dt.out, c("V1", "V2"),  paste0("language",c(".name", ".support")))
users <- users[dt.out]
users
#    user behavior country.name country.support language.name language.support
# 1:    3    FALSE            1             0.8             6              1.0
# 2:    4     TRUE            2             1.0             5              0.8

The actual number of fields is 5, and I want to avoid repeating the same code for each field separately and having to edit this function whenever I modify fields. Please note that this is the substance of this question; the support computation was kindly explained to me elsewhere.

As in the referenced question, my data set has about 10^7 rows, so I really need a solution that scales; it would also be nice if I could avoid unnecessary copying like in users <- users[dt.out].

Solution

Does this solve your problem?

fields <- c("country","language")
dt <- data.table(user=c(rep(3, 5), rep(4, 5)),
           behavior=c(rep(FALSE,5),rep(TRUE,5)),
           country=c(rep(1,4),rep(2,6)),
           language=c(rep(6,6),rep(5,4)),
           event=1:10, key=c("user",fields))

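CalculateSupport <- function(dt, name) {
  # Count events per (user, value of the field given in `name`)
  x <- dt[, .N, by = c("user", name)]
  # Give the field column a generic name so the next step works for any field
  setnames(x, name, "value")
  # Per user: the most frequent value and its share of that user's events
  x <- x[, list(value[which.max(N)], max(N)/sum(N)), by = user]
  setnames(x, c("V1", "V2"), paste0(name, c(".name", ".support")))
  x
}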

users <- dt[, sum(behavior) > 0, by=user] 
setnames(users, "V1", "behavior")

Reduce(function(x, name) x[CalculateSupport(dt, name)], fields, users)

results in

   user behavior country.name country.support language.name language.support
1:    3    FALSE            1             0.8             6              1.0
2:    4     TRUE            2             1.0             5              0.8
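
The question also asks about avoiding copies such as users <- users[dt.out]. One possible variation, offered as a sketch rather than as part of the original answer, is to loop over fields and attach each pair of summary columns to users by reference with an update join. The intermediate column names top_value and support and the variable new_cols are illustrative choices, and the sketch assumes a data.table version that supports the on= argument.

```r
library(data.table)

fields <- c("country", "language")
dt <- data.table(user = c(rep(3, 5), rep(4, 5)),
                 behavior = c(rep(FALSE, 5), rep(TRUE, 5)),
                 country = c(rep(1, 4), rep(2, 6)),
                 language = c(rep(6, 6), rep(5, 4)),
                 event = 1:10)

# One row per user; for a logical column any() matches sum(behavior) > 0
users <- dt[, .(behavior = any(behavior)), by = user]

for (f in fields) {
  # Count events per (user, value of field f)
  counts <- dt[, .N, by = c("user", f)]
  setnames(counts, f, "value")
  # Most frequent value per user and its share of that user's events
  top <- counts[, .(top_value = value[which.max(N)],
                    support = max(N) / sum(N)), by = user]
  # Update join: add both columns to `users` in place instead of reassigning it
  new_cols <- paste0(f, c(".name", ".support"))
  users[top, on = "user", (new_cols) := .(i.top_value, i.support)]
}

users
```

Because := modifies users in place, adding a sixth field only means extending fields; whether this is measurably faster than the Reduce version on 10^7 rows would need to be benchmarked on the real data.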

P.S. Please take Ricardo's comment on your question seriously. SO is full of wonderful people who are willing to help, but you have to treat them nicely and with respect.
