根据多个其他因子列求和的列 [英] Sum columns based on multiple other factor columns

查看:82
本文介绍了根据多个其他因子列求和的列的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有以下数据框:

df<-structure(list(totprivland = c(175L, 50L, 100L, 14L, 4L, 240L, 
10L, 20L, 20L, 58L), ncushr8d1 = c(0L, 0L, 0L, 0L, 0L, 30L, 5L, 
0L, 0L, 50L), ncu_CENREG1 = structure(c(4L, 4L, 4L, 4L, 1L, 3L, 
3L, 3L, 4L, 4L), .Label = c("Northeast", "Midwest", "South", 
"West"), class = "factor"), ncushr8d2 = c(75L, 50L, 100L, 14L, 
2L, 30L, 5L, 20L, 20L, 8L), ncu_CENREG2 = structure(c(4L, 4L, 
4L, 4L, 1L, 2L, 1L, 4L, 3L, 4L), .Label = c("Northeast", "Midwest", 
"South", "West"), class = "factor"), ncushr8d3 = c(100L, NA, 
NA, NA, 2L, 180L, 0L, NA, NA, NA), ncu_CENREG3 = structure(c(4L, 
NA, NA, NA, 1L, 1L, 3L, NA, NA, NA), .Label = c("Northeast", 
"Midwest", "South", "West"), class = "factor"), ncushr8d4 = c(NA, 
NA, NA, NA, 0L, NA, NA, NA, NA, NA), ncu_CENREG4 = structure(c(NA, 
NA, NA, NA, 1L, NA, NA, NA, NA, NA), .Label = c("Northeast", 
"Midwest", "South", "West"), class = "factor")), .Names = c("totprivland", 
"ncushr8d1", "ncu_CENREG1", "ncushr8d2", "ncu_CENREG2", "ncushr8d3", 
"ncu_CENREG3", "ncushr8d4", "ncu_CENREG4"), row.names = c(27404L, 
27525L, 27576L, 27822L, 28099L, 28238L, 28306L, 28312L, 28348L, 
28379L), class = "data.frame")

=======

以下基本思想是 dput

Total    VariableA  LocationA    VariableB     LocationB
30            20       East          10         East
20            20       South         NA         West
115           15       East         100         South
100           50       West          50         West 
35            10       East          25         South  

总数(或dput示例中的totprivland)是变量(ncushr8d1,ncushr8d2,ncushr8d3)的总和,以及ncushr8d4),并且每个变量都有一个对应的因子l位置变量(ncu_CENREG1等)。在同一模式中,还有6个其他变量和位置。对于多个数字变量,位置变量通常是相同的值(例如,多个东位置值,例如示例的第一行)。

The total (or totprivland in dput example) is a sum of the variables (ncushr8d1, ncushr8d2, ncushr8d3, and ncushr8d4) and each variable has a corresponding factor location variable (ncu_CENREG1, etc.). There are 6 additional variables and locations in this same pattern. The location variable is oftentimes the same value for multiple numeric variables (e.g. multiple 'East' location values like the first row of the example).

我想获取由公共位置系数得出的每一行的值之和,从而为每个位置的总和创建一个新列。看起来像这样,可以忽略NA值:

I'd like to get a sum of the values for each row by the common location factor, creating a new column for each location's sum. It would look something like this, with the ability to ignore NA values:

Total    VariableA  LocationA    VariableB     LocationB   TotalWest  TotalEast TotalSouth
30            20       East          10         East          0          30          0
20            20       South         NA         NA            0           0         20
115           15       East         100         South         0          15        100
100           50       West          50         West        100           0          0 
35            10       East          25         South         0          10         25

我已经研究了聚合和分裂,但似乎无法弄清楚如何使它们跨如此多的列工作。我也在考虑一个冗长的 if语句,该语句将遍历所有8个变量及其对应的位置,但我认为对此必须有更好的解决方案。观测值经过加权后可以在调查工具包中使用,我希望避免重复观测值,并使它们与整形工具包长在一起,尽管也许以后我可以重新组合它们。任何建议表示赞赏!

I've looked into aggregate and splitting but can't seem to figure out how to get them to work across so many columns. I'm also considering a lengthy "if" statement that would rotate through all 8 variables and their corresponding locations, but feel that there must be a better solution for this. The observations are weighted for use in the survey package and I'd like to avoid duplicating the observations and making them "long" with the reshape package, although maybe I can re-combine them later. Any suggestions appreciated!

非常感谢
卢克

Much thanks, Luke

推荐答案

首先,我将数据框转换为一个长格式,其中有3列:值,位置,大小写。案例应指明数据来自哪个案例(例如行)。顺序无关紧要。因此您的数据框将类似于:

First, I'd convert the data frame to a long form in which you have 3 columns: value, location, case. case should indicate from which case (e.g. row) the data came from. order doesn't matter. so your data frame will look something like:

Value    Loc    Case
20       East   1
20       South  2
...
10       East   1

依此类推...
这样做的一种方法是堆叠您的值和位置,然后手动(轻松地)添加案例编号。假设原始数据帧称为df,并且在第2,4列中有值,在第3,5列中有位置

and so forth... one way to do that is to stack your values and locations, and then manually (and easily) add case numbers. suppose your original dataframe is called df, and has values in columns 2,4 and locations in columns 3,5

v.col = stack(df[,c(2,4)])[,1]
v.loc = stack(df[,c(3,5)])[,1]
v.case = rep(1:nrow(df),2)
long.data = data.frame(v.col,v.loc,v.case)    # this is not actually needed, but just so you can view it

现在使用tapply创建所需的列

now use tapply to create the columns you need

s = tapply(v.col,list(v.case,v.loc),sum,na.rm=T)
new.df = cbind(df,s)

您可能需要将NA调整为0或类似的值,但这应该很容易

You'll probably need to adjust NA to 0 or something, but this should be easy.

使用plyr / reshape软件包也可能有更简单的方法,但是我不是这些专家。

There are also probably easier ways to do this using the plyr/reshape packages, but I am not expert on these.

希望这会有所帮助

这篇关于根据多个其他因子列求和的列的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆