Summarizing multiple columns with data.table

Question

I'm trying to use data.table to speed up processing of a large data.frame (300k x 60) made up of several smaller merged data.frames. I'm new to data.table. The code so far is as follows:

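library(data.table)
# Two small tables that share only some of their columns
a = data.table(index=1:5,a=rnorm(5,10),b=rnorm(5,10),z=rnorm(5,10))
b = data.table(index=6:10,a=rnorm(5,10),b=rnorm(5,10),c=rnorm(5,10),d=rnorm(5,10))
# Full outer merge on the shared columns; columns missing from one table become NA
dt = merge(a,b,by=intersect(names(a),names(b)),all=TRUE)
# Assign a random grouping category to each of the 10 rows
dt$category = sample(letters[1:3],10,replace=TRUE)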

I wondered whether there is a more efficient way than the following to summarize the data:

summ = dt[i=T,j=list(a=sum(a,na.rm=T),b=sum(b,na.rm=T),c=sum(c,na.rm=T),
                     d=sum(d,na.rm=T),z=sum(z,na.rm=T)),by=category]

I don't really want to type all 50 column calculations by hand, and an eval(paste(...)) seems clunky somehow.

I had a look at the example below, but it seems a bit complicated for my needs. Thanks.

How to summarize a data.table across multiple columns

Recommended answer

You can use a simple lapply statement with .SD:

dt[, lapply(.SD, sum, na.rm=TRUE), by=category ]

   category index        a        b        z         c        d
1:        c    19 51.13289 48.49994 42.50884  9.535588 11.53253
2:        b     9 17.34860 20.35022 10.32514 11.764105 10.53127
3:        a    27 25.91616 31.12624  0.00000 29.197343 31.71285


If you only want to summarize certain columns, you can add the .SDcols argument:

#  note that .SDcols also allows reordering of the columns
dt[, lapply(.SD, sum, na.rm=TRUE), by=category, .SDcols=c("a", "c", "z") ] 

   category        a         c        z
1:        c 51.13289  9.535588 42.50884
2:        b 17.34860 11.764105 10.32514
3:        a 25.91616 29.197343  0.00000
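Since the question involves about 50 columns, the names passed to .SDcols can also be built programmatically rather than typed out. The following is a minimal sketch on a small made-up table; the helper name value_cols and the toy data are assumptions for illustration, not part of the original question.

```r
library(data.table)

# Hypothetical toy data: one grouping column, one index column, two value columns
dt <- data.table(
  category = rep(c("a", "b"), each = 3),
  index    = 1:6,
  a        = c(1, 2, NA, 4, 5, 6),
  b        = c(10, NA, 30, 40, 50, 60)
)

# Every column except the grouping column and the row index
value_cols <- setdiff(names(dt), c("category", "index"))

# Sum each selected column within each category, ignoring NA values
summ <- dt[, lapply(.SD, sum, na.rm = TRUE), by = category, .SDcols = value_cols]
print(summ)
```

Excluding index here avoids summing a column that is only a row identifier, which the unrestricted call above does.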


This, of course, is not limited to sum: you can use any function with lapply, including anonymous functions (i.e., it is a regular lapply statement).
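As a rough illustration of the anonymous-function case, the sketch below computes a group mean and a count of non-missing values per column. The toy table and the result names res and cnt are assumptions made up for this example.

```r
library(data.table)

# Hypothetical toy data with one missing value in column a
dt <- data.table(
  category = c("a", "a", "b", "b"),
  a        = c(1, 3, NA, 5),
  b        = c(2, 4, 6, 8)
)

# Anonymous function: per-group mean of every non-grouping column, ignoring NA
res <- dt[, lapply(.SD, function(x) mean(x, na.rm = TRUE)), by = category]

# Anonymous function: per-group count of non-missing values in each column
cnt <- dt[, lapply(.SD, function(x) sum(!is.na(x))), by = category]

print(res)
print(cnt)
```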

Lastly, there is no need to write i=T and j= <..> explicitly. Personally, I think that makes the code less readable, but it is just a style preference.

See ?.SD, ?data.table and its .SDcols argument, and the vignette Using .SD for Data Analysis.

Also have a look at data.table FAQ 2.1.
