Using mean with .SD and .SDcols in data.table
Question
I am writing a very simple function to summarize columns of data.tables. I am passing one column at a time to the function, and then doing some diagnostics to figure out the options for summarization, and then doing the summarization. I am doing this in data.table to allow for some very large datasets.
So, I am using .SDcols to pass in the column to summarize, and using functions on .SD in the j part of a data.table expression. Since I am passing in one column at a time, I am not using lapply. And what I am finding is that some functions work and others do not. Below is a test dataset I am working with and the results I see:
library(data.table)

dt <- data.table(
  a = 1:10,
  b = as.factor(letters[1:10]),
  c = c(TRUE, FALSE),
  d = runif(10, 0.5, 100),
  e = c(0, 1),
  f = as.integer(c(0, 1)),
  g = as.numeric(1:10),
  h = c("cat1", "cat2", "cat3", "cat4", "cat5"))
mean(dt$a)
# [1] 5.5
dt[, mean(.SD), .SDcols = "a"]
# [1] NA
# Warning message:
# In mean.default(.SD) : argument is not numeric or logical: returning NA
dt[, sum(.SD), .SDcols = "a"]
# [1] 55
dt[, max(.SD), .SDcols = "a"]
# [1] 10
dt[, colMeans(.SD), .SDcols = "a"]
#   a
# 5.5
dt[, lapply(.SD, mean), .SDcols = "a"]
#      a
# 1: 5.5
Interestingly, weighted.mean gives the wrong answer (55, the sum) when I use weighted.mean(.SD) in j. But when I use lapply(.SD, weighted.mean) in j, it gives the right answer (5.5, the mean).
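A likely explanation, assuming the default method from base R's stats package is the one dispatched here: when no weights are supplied, weighted.mean() computes sum(x) / length(x). For a one-column table, sum() adds up every cell (55) while length() counts columns (1), so the result is 55 rather than 5.5. The sketch below uses a small stand-in table, not the full dt from above.

```r
library(data.table)

# Stand-in table with the same column `a` as in the question
dt <- data.table(a = 1:10)

# length() of a data.table is its number of columns, not its number of rows
n_cols <- dt[, length(.SD), .SDcols = "a"]

# sum() over a data.table adds up every cell
total <- dt[, sum(.SD), .SDcols = "a"]

# With no weights, weighted.mean() on the whole table is sum / number of columns
wm_table <- dt[, weighted.mean(.SD), .SDcols = "a"]

# On the column vector itself, length() is the number of rows
wm_column <- dt[, weighted.mean(.SD[[1]]), .SDcols = "a"]

print(c(n_cols = n_cols, total = total, wm_table = wm_table, wm_column = wm_column))
```

With lapply(.SD, weighted.mean), each column reaches weighted.mean() as a plain vector, which is why that form gives 5.5.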
I tried turning off data.table optimizations to see if it was the internal data.table mean function, but that didn't change things.
Maybe this is just a problem with using mean() on a list (which seems to be what .SD returns)? I guess there is never a reason NOT to use the lapply paradigm with .SD? It seems that only the lapply option returns a data.table. The others seem to return vectors, except for colMeans, which is returning something else (list?).
My main question is why mean(.SD) does not work. And the corollary is whether .SD can be used in the absence of one of the apply functions.
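One way to look at this, sketched under the assumption that .SD inside j behaves like an ordinary one-column data.table: mean() expects a numeric or logical vector, as the warning says, and a table is a list of columns, so the default method warns and returns NA. Extracting the column with [[ gives a plain vector, which suggests .SD can be used without an apply function. The table below is a small stand-in for the dt above.

```r
library(data.table)

dt <- data.table(a = 1:10)

# .SD is itself a data.table (a list of columns), not a numeric vector
sd_is_table   <- dt[, is.data.table(.SD), .SDcols = "a"]
sd_is_numeric <- dt[, is.numeric(.SD), .SDcols = "a"]

# mean() on the table warns and returns NA; the warning is suppressed for the demo
mean_of_table <- suppressWarnings(dt[, mean(.SD), .SDcols = "a"])

# Pulling out the single column with [[ gives an ordinary vector
mean_of_column <- dt[, mean(.SD[[1]]), .SDcols = "a"]

# lapply() hands each column to mean() as a vector and returns a data.table
mean_via_lapply <- dt[, lapply(.SD, mean), .SDcols = "a"]
```

sum() and max() have data.frame methods that operate on the whole table, which is one reason they appear to work where mean() does not.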
Thanks.
Recommended answer
I think the appropriate way of approaching what you want is to just use the standard syntax:
dt[ , lapply(.SD, mean), .SDcols = "a"]
Alternatively, you can pass a variable by name as follows:
col_to_pass = "a"
dt[ , mean(get(col_to_pass)) ]
Finally, you can generalize this approach to multiple columns as follows:
col_to_pass = c("a", "d")
dt[ , lapply( mget(col_to_pass), mean) ]
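Since the goal in the question is a function that receives one column at a time, the idea could be wrapped roughly as follows. This is only a sketch: summarize_col and col_name are hypothetical names, and returning NA for non-numeric columns is an assumption made for illustration, not something the question specifies.

```r
library(data.table)

# Hypothetical helper: mean of one column of a data.table, chosen by name
summarize_col <- function(dt, col_name) {
  stopifnot(is.data.table(dt), length(col_name) == 1L, col_name %in% names(dt))
  dt[, {
    x <- .SD[[1]]  # the single selected column as a plain vector
    if (is.numeric(x) || is.logical(x)) mean(x) else NA_real_
  }, .SDcols = col_name]
}

# Small stand-in table (values chosen for illustration)
dt <- data.table(
  a = 1:10,
  d = seq(0.5, 5, by = 0.5),
  h = rep(c("cat1", "cat2"), 5)
)

mean_a <- summarize_col(dt, "a")  # numeric column
mean_h <- summarize_col(dt, "h")  # character column, falls through to NA
```

Extracting the column once as a vector keeps the diagnostics (type checks) and the summary in the same j expression, without needing lapply for the single-column case.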