Using mean with .SD and .SDcols in data.table

Question

I am writing a very simple function to summarize the columns of a data.table. I pass one column at a time to the function, run some diagnostics to decide how it should be summarized, and then do the summarization. I am doing this in data.table so that it can handle some very large datasets.

So I am using .SDcols to pass in the column to summarize, and applying functions to .SD in the j part of a data.table expression. Since I pass in one column at a time, I am not using lapply. What I am finding is that some functions work and others do not. Below is the test dataset I am working with and the results I see:

library(data.table)

dt <- data.table(
  a = 1:10,
  b = as.factor(letters[1:10]),
  c = c(TRUE, FALSE),
  d = runif(10, 0.5, 100),
  e = c(0, 1),
  f = as.integer(c(0, 1)),
  g = as.numeric(1:10),
  h = c("cat1", "cat2", "cat3", "cat4", "cat5"))

mean(dt$a)
[1] 5.5

dt[, mean(.SD), .SDcols = "a"]

[1] NA
Warning message:
In mean.default(.SD) : argument is not numeric or logical: returning NA

dt[, sum(.SD), .SDcols = "a"]
[1] 55

dt[, max(.SD), .SDcols = "a"]
[1] 10

dt[, colMeans(.SD), .SDcols = "a"]
  a 
5.5 

dt[, lapply(.SD, mean), .SDcols = "a"]
     a
1: 5.5

Interestingly, weighted.mean gives the wrong answer (55, the sum) when I use weighted.mean(.SD) in j. But when I use lapply(.SD, weighted.mean) in j, it gives the right answer (5.5, the mean).
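The 55 reported above can be reproduced and traced with a minimal sketch. It assumes that no weighted.mean method exists for data.table or data.frame, so the call falls through to weighted.mean.default, which (when no weights are supplied) computes sum(x) / length(x). For a one-column table, length() counts columns, not rows, so the divisor is 1. The variable names are illustrative only.

```r
library(data.table)

dt <- data.table(a = 1:10)

# length() of a data.table is its number of columns, not its number of rows
n_cols <- dt[, length(.SD), .SDcols = "a"]

# sum() has a data.frame method, so it adds up every cell in the table
total <- dt[, sum(.SD), .SDcols = "a"]

# With no weights, the default method divides that sum by the column count
wm_table <- dt[, weighted.mean(.SD), .SDcols = "a"]

# Passing the column vector itself gives the expected mean
wm_column <- dt[, weighted.mean(.SD[[1]]), .SDcols = "a"]

print(c(n_cols = n_cols, total = total, wm_table = wm_table, wm_column = wm_column))
```

Under those assumptions, wm_table equals total / n_cols, which is the sum of the column, while wm_column is the ordinary mean.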

I tried turning off data.table optimizations to see whether the internal data.table mean function was responsible, but that did not change anything.

Maybe this is just a problem with calling mean() on a list (which seems to be what .SD returns)? I guess there is never a reason NOT to use the lapply paradigm with .SD? It seems that only the lapply option returns a data.table. The others seem to return vectors, except for colMeans, which returns something else (a list?).

My main question is why mean(.SD) does not work. The corollary is whether .SD can be used without one of the apply functions.

Thanks.

Answer

I think the appropriate way to approach what you want is to just use the standard syntax:

dt[ , lapply(.SD, mean), .SDcols = "a"]
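As a rough illustration of why the lapply form works where mean(.SD) does not: .SD is itself a data.table, that is, a list of column vectors, and mean has no method for that class, so mean.default receives a non-numeric object and returns NA with the warning quoted in the question. lapply hands each column vector to mean in turn. The sketch below uses a small deterministic table in place of the question's data, and `.SD[[1]]` is shown as one possible way to use .SD without an apply function when exactly one column is selected.

```r
library(data.table)

dt <- data.table(a = 1:10, d = seq(0.5, 5, by = 0.5))

# .SD is a data.table (a list of columns), not a numeric vector
cls     <- dt[, class(.SD), .SDcols = "a"]
is_list <- dt[, is.list(.SD), .SDcols = "a"]
is_num  <- dt[, is.numeric(.SD), .SDcols = "a"]

# lapply passes each column vector to mean and returns a one-row data.table
via_lapply <- dt[, lapply(.SD, mean), .SDcols = "a"]

# With a single column, extracting the vector avoids lapply altogether
via_extract <- dt[, mean(.SD[[1]]), .SDcols = "a"]

print(via_lapply)
print(via_extract)
```

sum and max behave differently from mean because base R provides data.frame methods for the Summary group of functions, which data.table inherits; mean has no such method.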

Alternatively, you can pass a variable by name as follows:

col_to_pass = "a"
dt[ , mean(get(col_to_pass)) ]
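Since the question is about a function that receives one column at a time, the get() form could be wrapped roughly as follows. This is only a sketch: `summarize_col` and its `fun` argument are hypothetical names introduced here, not part of data.table, and the sketch assumes the table has no column that is itself named `col_to_pass`, which would make the lookup inside `[` ambiguous.

```r
library(data.table)

# Hypothetical helper: apply `fun` to one column of DT, selected by name
summarize_col <- function(DT, col_to_pass, fun = mean) {
  stopifnot(is.data.table(DT),
            length(col_to_pass) == 1L,
            col_to_pass %in% names(DT))
  # get() resolves the column name to the column vector inside j
  DT[, fun(get(col_to_pass))]
}

dt <- data.table(a = 1:10, g = as.numeric(1:10))

mean_a <- summarize_col(dt, "a")        # mean of column a
max_g  <- summarize_col(dt, "g", max)   # any vector function can be passed

# Plain list-style extraction gives the same result without get()
mean_a_alt <- mean(dt[["a"]])

print(c(mean_a = mean_a, max_g = max_g, mean_a_alt = mean_a_alt))
```

The `dt[["a"]]` variant sidesteps name lookup inside `[` entirely, which may be preferable when column names are supplied by callers.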

Finally, you can generalize this approach to multiple columns as follows:

col_to_pass = c("a", "d")
dt[ , lapply( mget(col_to_pass), mean) ]
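For comparison, here is a small sketch showing the mget() form next to the .SDcols form on the same column names. It uses a deterministic column d in place of the question's runif() values so that the results are reproducible; both calls are expected to return a one-row data.table with one column per name.

```r
library(data.table)

dt <- data.table(a = 1:10, d = seq(0.5, 5, by = 0.5))

col_to_pass <- c("a", "d")

# mget() returns a named list of the requested column vectors
via_mget <- dt[, lapply(mget(col_to_pass), mean)]

# .SDcols restricts .SD to the same columns
via_sdcols <- dt[, lapply(.SD, mean), .SDcols = col_to_pass]

print(via_mget)
print(via_sdcols)
```

Which form to use is largely a matter of taste here; .SDcols is the one the standard syntax above is built around.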
