Using mean with .SD and .SDcols in data.table

Question

I am writing a very simple function to summarize columns of data.tables. I am passing one column at a time to the function, and then doing some diagnostics to figure out the options for summarization, and then doing the summarization. I am doing this in data.table to allow for some very large datasets.

So, I am using .SDcols to pass in the column to summarize, and using functions on .SD in the j part of a data.table expression. Since I am passing in one column at a time, I am not using lapply. And what I am finding is that some functions work and others do not. Below is a test dataset I am working with and the results I see:

library(data.table)

dt <- data.table(
  a=1:10, 
  b=as.factor(letters[1:10]), 
  c=c(TRUE, FALSE), 
  d=runif(10, 0.5, 100), 
  e=c(0,1), 
  f=as.integer(c(0,1)), 
  g=as.numeric(1:10), 
  h=c("cat1", "cat2", "cat3", "cat4", "cat5"))

mean(dt$a)
[1] 5.5

dt[, mean(.SD), .SDcols = "a"]

[1] NA
Warning message:
In mean.default(.SD) : argument is not numeric or logical: returning NA

dt[, sum(.SD), .SDcols = "a"]
[1] 55

dt[, max(.SD), .SDcols = "a"]
[1] 10

dt[, colMeans(.SD), .SDcols = "a"]
  a 
5.5 

dt[, lapply(.SD, mean), .SDcols = "a"]
     a
1: 5.5

Interestingly, weighted.mean gives the wrong answer (55, the sum) when I use weighted.mean(.SD) in j. But when I use lapply(.SD, weighted.mean) in j, it gives the right answer (5.5, the mean).

I tried turning off data.table optimizations to see if it was the internal data.table mean function, but that didn't change things.

Maybe this is just a problem with using mean() on a list (which seems to be what .SD returns)? I guess there is never a reason to NOT use the lapply paradigm with .SD? It seems that only the lapply option returns a data.table. The others seem to return vectors, except for colMeans which is returning something else (list?).

My main question is why mean(.SD) does not work. And the corollary is whether .SD can be used in the absence of one of the apply functions.

Thanks.

Recommended answer

I think the appropriate way of approaching what you want is to just use the standard syntax:

dt[ , lapply(.SD, mean), .SDcols = "a"]
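
As for why mean(.SD) returns NA in the first place, a quick check suggests the cause: with .SDcols = "a", .SD is still a one-column data.table (a list), not a numeric vector, and mean.default() warns and returns NA for input that is not numeric or logical. The sketch below is only an illustration on a cut-down version of the test data; the variable names are made up for the example.

```r
library(data.table)

dt <- data.table(a = 1:10)

# Inspect what j actually receives when .SDcols = "a".
sd_class   <- dt[, class(.SD), .SDcols = "a"]
sd_is_list <- dt[, is.list(.SD), .SDcols = "a"]
sd_is_num  <- dt[, is.numeric(.SD), .SDcols = "a"]

# colMeans() converts the table to a matrix and returns a named numeric vector.
res_colmeans <- dt[, colMeans(.SD), .SDcols = "a"]

# lapply() returns a list, which data.table turns into a one-row data.table.
res_lapply <- dt[, lapply(.SD, mean), .SDcols = "a"]

print(sd_class)
print(c(is_list = sd_is_list, is_numeric = sd_is_num))
print(res_colmeans)
print(res_lapply)
```

So colMeans(.SD) appears to return a named numeric vector rather than a list, and lapply(.SD, mean) is the only form here that comes back as a data.table.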

Alternatively, you can pass a variable by name as follows:

col_to_pass = "a"
dt[ , mean(get(col_to_pass)) ]
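
If you would rather keep .SDcols, extracting the column with .SD[[1L]] also hands the function a plain vector. This seems to account for the weighted.mean result in the question as well: as far as I can tell, when no weights are given weighted.mean.default() computes sum(x) / length(x), and length() of a one-column table is 1 (the number of columns), so the sum comes back unchanged. A small sketch, with illustrative variable names:

```r
library(data.table)

dt <- data.table(a = 1:10)

# length() of a one-column table counts columns, not rows.
n_cols <- dt[, length(.SD), .SDcols = "a"]

# On the table: sum of all cells divided by the number of columns.
wm_table <- dt[, weighted.mean(.SD), .SDcols = "a"]

# On the extracted vector: the usual mean.
wm_vector   <- dt[, weighted.mean(.SD[[1L]]), .SDcols = "a"]
mean_vector <- dt[, mean(.SD[[1L]]), .SDcols = "a"]

print(c(n_cols = n_cols, wm_table = wm_table,
        wm_vector = wm_vector, mean_vector = mean_vector))
```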

Finally, you can generalize this approach to multiple columns as follows:

col_to_pass = c("a", "d")
dt[ , lapply( mget(col_to_pass), mean) ]
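
Since the question is about a function that receives one column at a time, the pieces above could be wrapped up roughly as follows. summarise_col() is a hypothetical helper written for this answer, not something data.table provides, and it assumes the table has no columns literally named fun or col.

```r
library(data.table)

# Hypothetical helper (not part of data.table): apply `fun` to one column.
summarise_col <- function(dt, col, fun = mean) {
  stopifnot(is.data.table(dt), length(col) == 1L, col %in% names(dt))
  # .SD[[1L]] pulls the single selected column out as a plain vector,
  # so any vector-based summary function can be passed in.
  dt[, fun(.SD[[1L]]), .SDcols = col]
}

dt <- data.table(a = 1:10, g = as.numeric(1:10))

mean_a <- summarise_col(dt, "a")
max_g  <- summarise_col(dt, "g", max)

print(c(mean_a = mean_a, max_g = max_g))
```

The result is a plain vector; switch to lapply(.SD, fun) inside the helper if you want a data.table back.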
