R summarizing multiple columns with data.table

Problem description

I'm trying to use data.table to speed up processing of a large data.frame (300k x 60) made of several smaller merged data.frames. I'm new to data.table. The code so far is as follows:

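library(data.table)
# Two small tables that share only some of their columns
a = data.table(index=1:5,a=rnorm(5,10),b=rnorm(5,10),z=rnorm(5,10))
b = data.table(index=6:10,a=rnorm(5,10),b=rnorm(5,10),c=rnorm(5,10),d=rnorm(5,10))
# Full outer join on the shared columns; columns missing from one table become NA
dt = merge(a,b,by=intersect(names(a),names(b)),all=TRUE)
# Random grouping column
dt$category = sample(letters[1:3],10,replace=TRUE)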

I wondered if there is a more efficient way than the following to summarize the data.

summ = dt[i=T,j=list(a=sum(a,na.rm=T),b=sum(b,na.rm=T),c=sum(c,na.rm=T),
                     d=sum(d,na.rm=T),z=sum(z,na.rm=T)),by=category]

I don't really want to type all 50 column calculations by hand, and an eval(paste(...)) seems clunky somehow.

I had a look at the example below, but it seems a bit complicated for my needs. Thanks.

How to summarize a data.table across multiple columns

Recommended answer

You can use a simple lapply statement with .SD:

dt[, lapply(.SD, sum, na.rm=TRUE), by=category ]

   category index        a        b        z         c        d
1:        c    19 51.13289 48.49994 42.50884  9.535588 11.53253
2:        b     9 17.34860 20.35022 10.32514 11.764105 10.53127
3:        a    27 25.91616 31.12624  0.00000 29.197343 31.71285

If you only want to summarize over certain columns, you can add the .SDcols argument:

#  note that .SDcols also allows reordering of the columns
dt[, lapply(.SD, sum, na.rm=TRUE), by=category, .SDcols=c("a", "c", "z") ] 

   category        a         c        z
1:        c 51.13289  9.535588 42.50884
2:        b 17.34860 11.764105 10.32514
3:        a 25.91616 29.197343  0.00000
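
With 50 or so columns, the .SDcols vector can also be built programmatically instead of typed out. A minimal sketch, assuming the dt from the question; the name value_cols is only illustrative, and leaving out index is an assumption about which columns are worth summing:

```r
library(data.table)

# Rebuild the example data from the question
a  <- data.table(index = 1:5,  a = rnorm(5, 10), b = rnorm(5, 10), z = rnorm(5, 10))
b  <- data.table(index = 6:10, a = rnorm(5, 10), b = rnorm(5, 10),
                 c = rnorm(5, 10), d = rnorm(5, 10))
dt <- merge(a, b, by = intersect(names(a), names(b)), all = TRUE)
dt[, category := sample(letters[1:3], 10, replace = TRUE)]

# Every column except the row index and the grouping column
# (value_cols is a hypothetical helper name, not part of data.table)
value_cols <- setdiff(names(dt), c("index", "category"))

# Sum each selected column within each category
summ <- dt[, lapply(.SD, sum, na.rm = TRUE), by = category, .SDcols = value_cols]
print(summ)
```

Because the values are random, the printed numbers will differ from the output above; only the set of columns is the point here.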

This, of course, is not limited to sum: you can use any function with lapply, including anonymous functions (i.e., it's a regular lapply statement).
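
As an illustration of the anonymous-function case, here is a small sketch on a made-up table (the columns x and y and their values are invented for this example, not taken from the question):

```r
library(data.table)

# Small hypothetical table with some missing values
dt <- data.table(
  category = c("a", "a", "b", "b"),
  x = c(1, NA, 3, 5),
  y = c(10, 20, NA, NA)
)

# Anonymous function: count the non-missing values in each column, per group
counts <- dt[, lapply(.SD, function(col) sum(!is.na(col))), by = category]

# Any named function works the same way; extra arguments are passed through lapply
means <- dt[, lapply(.SD, mean, na.rm = TRUE), by = category]

print(counts)
print(means)
```

Note that a group with no non-missing values gives NaN for mean, whereas sum with na.rm=TRUE gives 0, as in the z column of the output above.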

Lastly, there is no need to use i=T and j= <..>. Personally, I think that makes the code less readable, but it is just a style preference.

You will find the documentation for .SD and several other special variables under the help section of ?"[.data.table" (in the Arguments section, look under the info for by).

Also have a look at data.table FAQ 2.1:

http://datatable.r-forge.r-project.org/datatable-faq.pdf
