如何对data.table中的.SD列执行逐行操作 [英] How to do row wise operations on .SD columns in data.table
问题描述
虽然我以前想过这个,但我仍然发现自己在stackoverflow上搜索(而无法找到)这个语法,所以...
使用 .SD
和 .SDcols
对data.table列的子集执行行操作。我永远不会记得操作是否需要 sapply
, lapply
,或者如果属于 .SD
。
例如,假设您有两个季度的10名学生的数据。在两个季度,他们有两个考试和期末考试。您将如何对从q1开始的列取直接平均值?
由于过于琐碎的例子很讨厌,我还想计算一个加权平均值, q2? (权重= q2的25%25%和50%)
library(data.table)
set.seed(10)
dt< - data.table(id = paste0(student_,sprintf(%02.f,1:10)),
q1_exam1 = round rnorm(10,.78,.05),2),
q1_exam2 = round(rnorm(10,.68,.02),2),
q1_final = round(rnorm ,.08),2),
q2_exam1 = round(rnorm(10,.78,.05),2),
q2_exam2 = round(rnorm(10,.68,.10),2 ),
q2_final = round(rnorm(10,.88,.04),2))
dt
# dt
#id q1_exam1 q1_exam2 q1_final q2_exam1 q2_exam2 q2_final
#1:student_01 0.78 0.70 0.83 0.69 0.79 0.86
#2:student_02 0.77 0.70 0.71 0.78 0.60 0.87
#3:student_03 0.71 0.68 0.83 0.83 0.60 0.93
#4:student_04 0.75 0.70 0.71 0.79 0.76 0.97
#5:student_05 0.79 0.69 0.78 0.71 0.58 0.90
#6:student_06 0.80 0.68 0.85 0.71 0.68 0.91
#7:student_07 0.72 0.66 0.82 0.80 0.70 0.84
#8:student_08 0.76 0.68 0.81 0.69 0.65 0.90
#9:student_09 0.70 0.70 0.87 0.76 0.61 0.85
#10:student_10 0.77 0.69 0.86 0.75 0.75 0.89
,大部分来自评论:
沿行
应用 b
$ b
OP的方法对于逐行操作使用 apply(。,1,。)
,但是这是不鼓励的,因为它不必要地强制将data.table转换为矩阵。 lapply
/ sapply
也不合适,因为它们被设计为分别在每个列上工作,而不是组合它们。
rowMeans
,类似命名的函数也强制转换为矩阵。
按行拆分
如@Jaap所说,您可以使用 by = 1:nrow
这种方法来自eddi 可能是最有效的,如果你必须保持你的宽格式的数据:
jwts = list(
q1_AVG = c(q1_exam1 = 1,q1_exam2 = 1,q1_final = 1)/ 3,
q2_WAVG = c(q1_exam1 = 1/4,q2_exam2 = 1/4,q2_final = 1/2)
)
(newj in names(jwts)){
w = jwts [[newj]]
dt [,(newj):= Reduce [[x]] * w [x])]
}
到矩阵并允许不同的加权规则(不像 rowMeans
)。
>
如@alexis_laz建议,您可以使用不同的结构,例如
获得清晰度和效率。 #reshape
m = melt(dt,id.vars =id,value.name =score)[,
c(quarter,exam): = tstrsplit(variable,_)] [,variable:= NULL]
#输入你的加权规则
w = unique(m [,c(quarter,exam) ,with = FALSE])
w [quarter ==q1,wt:= 1 / .N]
w [quarter ==q2& exam ==final,wt:= .5]
w [quarter ==q2&考试!=final,wt:=(1-.5)/.N]
#合并和计算
m [w,on = c )] [,sum(score * wt),by =。(id,quarter)]
在任何情况下,你应该将你的权重规则存储在明确的地方而不是输入到如果你想扩大季度的数量。
Although I've figured this out before, I still find myself searching (and unable to find) this syntax on stackoverflow, so...
I want to do row wise operations on a subset of the data.table's columns, using .SD
and .SDcols
. I can never remember if the operations need an sapply
, lapply
, or if the belong inside the brackets of .SD
.
As an example, say you have data for 10 students over two quarters. In both quarters they have two exams and a final exam. How would you take a straight average of the columns starting with q1?
Since overly trivial examples are annoying, I'd also like to calculate a weighted average for columns starting with q2? (weights = 25% 25% and 50% for q2)
library(data.table)
set.seed(10)
dt <- data.table(id = paste0("student_", sprintf("%02.f" , 1:10)),
q1_exam1 = round(rnorm(10, .78, .05), 2),
q1_exam2 = round(rnorm(10, .68, .02), 2),
q1_final = round(rnorm(10, .88, .08), 2),
q2_exam1 = round(rnorm(10, .78, .05), 2),
q2_exam2 = round(rnorm(10, .68, .10), 2),
q2_final = round(rnorm(10, .88, .04), 2))
dt
# > dt
# id q1_exam1 q1_exam2 q1_final q2_exam1 q2_exam2 q2_final
# 1: student_01 0.78 0.70 0.83 0.69 0.79 0.86
# 2: student_02 0.77 0.70 0.71 0.78 0.60 0.87
# 3: student_03 0.71 0.68 0.83 0.83 0.60 0.93
# 4: student_04 0.75 0.70 0.71 0.79 0.76 0.97
# 5: student_05 0.79 0.69 0.78 0.71 0.58 0.90
# 6: student_06 0.80 0.68 0.85 0.71 0.68 0.91
# 7: student_07 0.72 0.66 0.82 0.80 0.70 0.84
# 8: student_08 0.76 0.68 0.81 0.69 0.65 0.90
# 9: student_09 0.70 0.70 0.87 0.76 0.61 0.85
# 10: student_10 0.77 0.69 0.86 0.75 0.75 0.89
Here are a few thoughts on your options, largely gathered from the comments:
apply
along rows
The OP's approach uses apply(.,1,.)
for the by-row operation, but this is discouraged because it unnecessarily coerces the data.table into a matrix. lapply
/sapply
also are not suitable, since they are designed to work on each columns separately, not to combine them.
rowMeans
and similarly-named functions also coerce to a matrix.
Split by rows
As @Jaap said, you can use by=1:nrow(dt)
for any rowwise operation, but it may be slow.
Efficiently create new columns
This approach taken from eddi is probably the most efficient if you must keep your data in wide format:
jwts = list(
q1_AVG = c(q1_exam1 = 1 , q1_exam2 = 1 , q1_final = 1)/3,
q2_WAVG = c(q1_exam1 = 1/4, q2_exam2 = 1/4, q2_final = 1/2)
)
for (newj in names(jwts)){
w = jwts[[newj]]
dt[, (newj) := Reduce("+", lapply(names(w), function(x) dt[[x]] * w[x]))]
}
This avoids coercion to a matrix and allows for different weighting rules (unlike rowMeans
).
Go long
As @alexis_laz suggested, you might gain clarity and efficiency with a different structure, like
# reshape
m = melt(dt, id.vars="id", value.name="score")[,
c("quarter","exam") := tstrsplit(variable, "_")][, variable := NULL]
# input your weighting rules
w = unique(m[,c("quarter","exam"), with=FALSE])
w[quarter=="q1" , wt := 1/.N]
w[quarter=="q2" & exam=="final", wt := .5]
w[quarter=="q2" & exam!="final", wt := (1-.5)/.N]
# merge and compute
m[w, on=c("quarter","exam")][, sum(score*wt), by=.(id,quarter)]
This is what I would do.
In any case, you should have your weighting rules stored somewhere explicitly rather than entered on the fly if you want to scale up the number of quarters.
这篇关于如何对data.table中的.SD列执行逐行操作的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!