ddply for sum by group in R

Problem description

I have a sample dataframe "data" as follows:

X        Y       Month  Year  income
2281205  228120  3      2011  1000
2281212  228121  9      2010  1100
2281213  228121  12     2010  900
2281214  228121  3      2011  9000
2281222  228122  6      2010  1111
2281223  228122  9      2010  3000
2281224  228122  12     2010  1889
2281225  228122  3      2011  778
2281243  228124  12     2010  1111
2281244  228124  3      2011  200
2281282  228128  9      2010  7889
2281283  228128  12     2010  2900
2281284  228128  3      2011  3400
2281302  228130  9      2010  1200
2281303  228130  12     2010  2000
2281304  228130  3      2011  1900
2281352  228135  9      2010  2300
2281353  228135  12     2010  1333
2281354  228135  3      2011  2340

I want to use ddply to compute the total income for each Y (not X), but only if that Y has four observations (for example, Y = 228122, with months 6, 9, and 12 of 2010 and month 3 of 2011). If a Y has fewer than four observations (for example, Y = 228130), I want to simply ignore it. I use the following commands in R for this purpose:

library(plyr)
# the data are in the file data.csv
data <- read.csv("data.csv")
# convert Y (integers) into a factor
data$Y <- as.factor(data$Y)
# get the count of each unique Y
count <- ddply(data, .(Y), summarize, freq = length(Y))
# get the sum of income for each unique Y
sum <- ddply(data, .(Y), summarize, tot = sum(income))
# combine the two results and keep only the Ys with more than 3 observations
colbind <- cbind(count, sum)
finalsum <- subset(colbind, freq > 3)

My output is as follows:

>colbind
       Y freq      Y   tot
1 228120    1 228120  1000
2 228121    3 228121 11000
3 228122    4 228122  6778
4 228124    2 228124  1311
5 228128    3 228128 14189
6 228130    3 228130  5100
7 228135    3 228135  5973
>finalsum
       Y freq    Y.1  tot
3 228122    4 228122 6778

The above code works, but it requires many steps. So I would like to know whether there is a simpler way of performing the above task (using the plyr package).

Recommended answer

As pointed out in a comment, you can do multiple operations inside the summarize.

This reduces your code to one line of ddply() and one line of subsetting, which is easy enough with the [ operator:

x <- ddply(data, .(Y), summarize, freq=length(Y), tot=sum(income))
x[x$freq > 3, ]

       Y freq  tot
3 228122    4 6778
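
For anyone who wants to try this without a data.csv file, here is a minimal self-contained sketch. It assumes the sample data from the question is typed inline with read.table(); the names totals and complete are made up for illustration and are not part of the original answer.

```r
library(plyr)

# Sample data copied from the question, read inline instead of from data.csv.
data <- read.table(header = TRUE, text = "
X Y Month Year income
2281205 228120 3 2011 1000
2281212 228121 9 2010 1100
2281213 228121 12 2010 900
2281214 228121 3 2011 9000
2281222 228122 6 2010 1111
2281223 228122 9 2010 3000
2281224 228122 12 2010 1889
2281225 228122 3 2011 778
2281243 228124 12 2010 1111
2281244 228124 3 2011 200
2281282 228128 9 2010 7889
2281283 228128 12 2010 2900
2281284 228128 3 2011 3400
2281302 228130 9 2010 1200
2281303 228130 12 2010 2000
2281304 228130 3 2011 1900
2281352 228135 9 2010 2300
2281353 228135 12 2010 1333
2281354 228135 3 2011 2340
")

# One ddply() call computes both the row count and the income total per Y.
totals <- ddply(data, .(Y), summarize, freq = length(Y), tot = sum(income))

# Keep only the groups that have more than three observations.
complete <- totals[totals$freq > 3, ]
print(complete)
```

Only Y = 228122 has four rows in the sample, so that is the only group that should survive the filter. Note that Y does not have to be converted to a factor first for the grouping to work.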

This is also exceptionally easy with the data.table package:

library(data.table)
data.table(data)[, list(freq=length(income), tot=sum(income)), by=Y][freq > 3]
        Y freq  tot
1: 228122    4 6778

In fact, the operation to calculate the length of a vector has its own shortcut in data.table - use the .N shortcut:

data.table(data)[, list(freq=.N, tot=sum(income)), by=Y][freq > 3]
        Y freq  tot
1: 228122    4 6778
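
The data.table version can be tried the same way. This is a sketch under the assumption that only the Y and income columns from the question are needed, so the other columns are left out; the names dt and result are illustrative.

```r
library(data.table)

# Y and income columns from the question's sample data, read inline.
dt <- as.data.table(read.table(header = TRUE, text = "
Y income
228120 1000
228121 1100
228121 900
228121 9000
228122 1111
228122 3000
228122 1889
228122 778
228124 1111
228124 200
228128 7889
228128 2900
228128 3400
228130 1200
228130 2000
228130 1900
228135 2300
228135 1333
228135 2340
"))

# First [ ]: count rows (.N) and sum income within each Y.
# Second [ ]: filter the aggregated table to groups with more than three rows.
result <- dt[, list(freq = .N, tot = sum(income)), by = Y][freq > 3]
print(result)
```

Chaining the two bracket calls keeps the aggregation and the filter in a single expression, which mirrors the two-line ddply() approach.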