如何对data.table中的.SD列执行逐行操作 [英] How to do row wise operations on .SD columns in data.table

查看:124
本文介绍了如何对data.table中的.SD列执行逐行操作的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

虽然我以前想过这个,但我仍然发现自己在stackoverflow上搜索(而无法找到)这个语法,所以...



使用 .SD .SDcols 对data.table列的子集执行行操作。我永远不会记得操作是否需要 sapply lapply ,或者如果属于 .SD



例如,假设您有两个季度的10名学生的数据。在两个季度,他们有两个考试和期末考试。您将如何对从q1开始的列取直接平均值?



由于过于琐碎的例子很讨厌,我还想计算一个加权平均值, q2? (权重= q2的25%25%和50%)

  library(data.table)

set.seed(10)
dt< - data.table(id = paste0(student_,sprintf(%02.f,1:10)),
q1_exam1 = round rnorm(10,.78,.05),2),
q1_exam2 = round(rnorm(10,.68,.02),2),
q1_final = round(rnorm ,.08),2),
q2_exam1 = round(rnorm(10,.78,.05),2),
q2_exam2 = round(rnorm(10,.68,.10),2 ),
q2_final = round(rnorm(10,.88,.04),2))

dt
# dt
#id q1_exam1 q1_exam2 q1_final q2_exam1 q2_exam2 q2_final
#1:student_01 0.78 0.70 0.83 0.69 0.79 0.86
#2:student_02 0.77 0.70 0.71 0.78 0.60 0.87
#3:student_03 0.71 0.68 0.83 0.83 0.60 0.93
#4:student_04 0.75 0.70 0.71 0.79 0.76 0.97
#5:student_05 0.79 0.69 0.78 0.71 0.58 0.90
#6:student_06 0.80 0.68 0.85 0.71 0.68 0.91
#7:student_07 0.72 0.66 0.82 0.80 0.70 0.84
#8:student_08 0.76 0.68 0.81 0.69 0.65 0.90
#9:student_09 0.70 0.70 0.87 0.76 0.61 0.85
#10:student_10 0.77 0.69 0.86 0.75 0.75 0.89


解决方案

,大部分来自评论:



沿行应用 b
$ b

OP的方法对于逐行操作使用 apply(。,1,。),但是这是不鼓励的,因为它不必要地强制将data.table转换为矩阵。 lapply / sapply 也不合适,因为它们被设计为分别在每个列上工作,而不是组合它们。



rowMeans ,类似命名的函数也强制转换为矩阵。



按行拆分



如@Jaap所说,您可以使用 by = 1:nrow

有效创建新列



这种方法来自eddi 可能是最有效的,如果你必须保持你的宽格式的数据:

  jwts = list(
q1_AVG = c(q1_exam1 = 1,q1_exam2 = 1,q1_final = 1)/ 3,
q2_WAVG = c(q1_exam1 = 1/4,q2_exam2 = 1/4,q2_final = 1/2)



(newj in names(jwts)){
w = jwts [[newj]]
dt [,(newj):= Reduce [[x]] * w [x])]
}

到矩阵并允许不同的加权规则(不像 rowMeans )。



>



如@alexis_laz建议,您可以使用不同的结构,例如

获得清晰度和效率。

 #reshape 
m = melt(dt,id.vars =id,value.name =score)[,
c(quarter,exam): = tstrsplit(variable,_)] [,variable:= NULL]

#输入你的加权规则
w = unique(m [,c(quarter,exam) ,with = FALSE])
w [quarter ==q1,wt:= 1 / .N]
w [quarter ==q2& exam ==final,wt:= .5]
w [quarter ==q2&考试!=final,wt:=(1-.5)/.N]

#合并和计算
m [w,on = c )] [,sum(score * wt),by =。(id,quarter)]






在任何情况下,你应该将你的权重规则存储在明确的地方而不是输入到如果你想扩大季度的数量。


Although I've figured this out before, I still find myself searching (and unable to find) this syntax on stackoverflow, so...

I want to do row wise operations on a subset of the data.table's columns, using .SD and .SDcols. I can never remember if the operations need an sapply, lapply, or if the belong inside the brackets of .SD.

As an example, say you have data for 10 students over two quarters. In both quarters they have two exams and a final exam. How would you take a straight average of the columns starting with q1?

Since overly trivial examples are annoying, I'd also like to calculate a weighted average for columns starting with q2? (weights = 25% 25% and 50% for q2)

library(data.table)

set.seed(10)
dt <- data.table(id = paste0("student_", sprintf("%02.f" , 1:10)),
                 q1_exam1 = round(rnorm(10, .78, .05), 2),
                 q1_exam2 = round(rnorm(10, .68, .02), 2),
                 q1_final = round(rnorm(10, .88, .08), 2),
                 q2_exam1 = round(rnorm(10, .78, .05), 2),
                 q2_exam2 = round(rnorm(10, .68, .10), 2),
                 q2_final = round(rnorm(10, .88, .04), 2))

dt
# > dt
#             id q1_exam1 q1_exam2 q1_final q2_exam1 q2_exam2 q2_final
#  1: student_01     0.78     0.70     0.83     0.69     0.79     0.86
#  2: student_02     0.77     0.70     0.71     0.78     0.60     0.87
#  3: student_03     0.71     0.68     0.83     0.83     0.60     0.93
#  4: student_04     0.75     0.70     0.71     0.79     0.76     0.97
#  5: student_05     0.79     0.69     0.78     0.71     0.58     0.90
#  6: student_06     0.80     0.68     0.85     0.71     0.68     0.91
#  7: student_07     0.72     0.66     0.82     0.80     0.70     0.84
#  8: student_08     0.76     0.68     0.81     0.69     0.65     0.90
#  9: student_09     0.70     0.70     0.87     0.76     0.61     0.85
# 10: student_10     0.77     0.69     0.86     0.75     0.75     0.89

解决方案

Here are a few thoughts on your options, largely gathered from the comments:

apply along rows

The OP's approach uses apply(.,1,.) for the by-row operation, but this is discouraged because it unnecessarily coerces the data.table into a matrix. lapply/sapply also are not suitable, since they are designed to work on each columns separately, not to combine them.

rowMeans and similarly-named functions also coerce to a matrix.

Split by rows

As @Jaap said, you can use by=1:nrow(dt) for any rowwise operation, but it may be slow.

Efficiently create new columns

This approach taken from eddi is probably the most efficient if you must keep your data in wide format:

jwts = list( 
  q1_AVG  = c(q1_exam1 = 1  , q1_exam2 = 1  , q1_final =   1)/3, 
  q2_WAVG = c(q1_exam1 = 1/4, q2_exam2 = 1/4, q2_final = 1/2)
)


for (newj in names(jwts)){
  w = jwts[[newj]]
  dt[, (newj) := Reduce("+", lapply(names(w), function(x) dt[[x]] * w[x]))]
}

This avoids coercion to a matrix and allows for different weighting rules (unlike rowMeans).

Go long

As @alexis_laz suggested, you might gain clarity and efficiency with a different structure, like

# reshape
m = melt(dt, id.vars="id", value.name="score")[,
  c("quarter","exam") := tstrsplit(variable, "_")][, variable := NULL]

# input your weighting rules
w = unique(m[,c("quarter","exam"), with=FALSE])
w[quarter=="q1"                , wt := 1/.N]
w[quarter=="q2" & exam=="final", wt := .5]
w[quarter=="q2" & exam!="final", wt := (1-.5)/.N]

# merge and compute
m[w, on=c("quarter","exam")][, sum(score*wt), by=.(id,quarter)]

This is what I would do.


In any case, you should have your weighting rules stored somewhere explicitly rather than entered on the fly if you want to scale up the number of quarters.

这篇关于如何对data.table中的.SD列执行逐行操作的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆