Summarize a data.table across multiple columns
Problem description
How do I summarize a data.table with unreliable data across multiple columns?
Specifically, given
library(data.table)

fields <- c("country","language")
dt <- data.table(user=c(rep(3, 5), rep(4, 5)),
                 behavior=c(rep(FALSE,5),rep(TRUE,5)),
                 country=c(rep(1,4),rep(2,6)),
                 language=c(rep(6,6),rep(5,4)),
                 event=1:10, key=c("user",fields))
dt
#     user behavior country language event
#  1:    3    FALSE       1        6     1
#  2:    3    FALSE       1        6     2
#  3:    3    FALSE       1        6     3
#  4:    3    FALSE       1        6     4
#  5:    3    FALSE       2        6     5
#  6:    4     TRUE       2        5     7
#  7:    4     TRUE       2        5     8
#  8:    4     TRUE       2        5     9
#  9:    4     TRUE       2        5    10
# 10:    4     TRUE       2        6     6
I want to get
#    user behavior country.name country.support language.name language.support
# 1:    3    FALSE            1             0.8             6              1.0
# 2:    4     TRUE            2             1.0             5              0.8
(Here x.name is the most common x for the user, and x.support is the share of events in which this top x was observed.) I would like to do this without having to go through both fields by hand like this:
users <- dt[, sum(behavior) > 0, by=user] # have behavior at least once
setnames(users, "V1", "behavior")
dt.out <- dt[, .N, by=list(user,country)
][, list(country[which.max(N)],max(N)/sum(N)), by=user]
setnames(dt.out, c("V1", "V2"), paste0("country",c(".name", ".support")))
users <- users[dt.out]
dt.out <- dt[, .N, by=list(user,language)
][, list(language[which.max(N)], max(N)/sum(N)), by=user]
setnames(dt.out, c("V1", "V2"), paste0("language",c(".name", ".support")))
users <- users[dt.out]
users
#    user behavior country.name country.support language.name language.support
# 1:    3    FALSE            1             0.8             6              1.0
# 2:    4     TRUE            2             1.0             5              0.8
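To make the two definitions concrete, here is a small base R sketch that computes the name and support for a single user's values. The helper top_with_support() is hypothetical and only for illustration; it is not part of the original post. Note that it returns the name as a character string because table() labels are character.

```r
# Hypothetical helper: most common value in x and its share of all observations
top_with_support <- function(x) {
  counts <- table(x)                           # frequency of each distinct value
  list(name = names(counts)[which.max(counts)],  # first most frequent value (ties: first wins)
       support = max(counts) / length(x))        # share of observations with that value
}

# Country values observed for user 3 in the example data
res <- top_with_support(c(1, 1, 1, 1, 2))
res$name     # "1"
res$support  # 0.8
```

Four of the five events for user 3 have country 1, which is where country.support = 0.8 in the desired output comes from.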
The actual number of fields is 5, and I want to avoid repeating the same code for each field separately and having to edit this function whenever I modify fields.
Please note that this is the substance of this question; the support computation was kindly explained to me elsewhere.
As in the referenced question, my data set has about 10^7 rows, so I really need a solution that scales; it would also be nice if I could avoid unnecessary copying like in users <- users[dt.out].
Does this solve your problem?
fields <- c("country","language")
dt <- data.table(user=c(rep(3, 5), rep(4, 5)),
behavior=c(rep(FALSE,5),rep(TRUE,5)),
country=c(rep(1,4),rep(2,6)),
language=c(rep(6,6),rep(5,4)),
event=1:10, key=c("user",fields))
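# Most common value of column `name` per user, plus its share of that user's events
CalculateSupport <- function(dt, name) {
  # Count events per (user, value) pair; `by` accepts a character vector
  x <- dt[, .N, by = c("user", name)]
  setnames(x, name, "name")
  # Per user: the top value and the fraction of events it accounts for
  x <- x[, list(name[which.max(N)], max(N)/sum(N)), by = user]
  setnames(x, c("V1", "V2"), paste0(name, c(".name", ".support")))
  x
}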
users <- dt[, sum(behavior) > 0, by=user]
setnames(users, "V1", "behavior")
# Join each field's summary onto `users` in turn; on = "user" avoids relying on keys
Reduce(function(x, name) x[CalculateSupport(dt, name), on = "user"], fields, users)
results in
   user behavior country.name country.support language.name language.support
1:    3    FALSE            1             0.8             6              1.0
2:    4     TRUE            2             1.0             5              0.8
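On the question's point about avoiding copies such as users <- users[dt.out]: one possible variant, sketched here as an assumption rather than as part of the original answer, is to add each field's two columns to users by reference with an update join (:= combined with on=). The column names value, name and support and the loop variable f are illustrative choices. This assumes a data.table version that supports on= (1.9.6 or later).

```r
library(data.table)

fields <- c("country", "language")
dt <- data.table(user = c(rep(3, 5), rep(4, 5)),
                 behavior = c(rep(FALSE, 5), rep(TRUE, 5)),
                 country = c(rep(1, 4), rep(2, 6)),
                 language = c(rep(6, 6), rep(5, 4)),
                 event = 1:10)

# One row per user; any(behavior) is equivalent to sum(behavior) > 0 here
users <- dt[, .(behavior = any(behavior)), by = user]

for (f in fields) {
  # Count events per (user, value), then keep the top value and its share
  x <- dt[, .N, by = c("user", f)]
  setnames(x, f, "value")
  x <- x[, .(name = value[which.max(N)], support = max(N) / sum(N)), by = user]
  # Update join: add the two columns to `users` in place, matched on user
  cols <- paste0(f, c(".name", ".support"))
  users[x, on = "user", (cols) := list(i.name, i.support)]
}
users
```

Because := modifies users in place, no new copy of users is created per field; only the small per-field summary x is allocated on each iteration. Adding or removing an entry in fields requires no other change.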
P.S. Please take Ricardo's comment on your question seriously. SO is full of wonderful people who are willing to help, but you have to treat them nicely and with respect.