Efficient method to filter and add based on certain conditions (3 conditions in this case)
Question
I have a data frame which looks like this:
a b c d
1 1 1 0
1 1 1 200
1 1 1 300
1 1 2 0
1 1 2 600
1 2 3 0
1 2 3 100
1 2 3 200
1 3 1 0
I would like to get a data frame which looks like this:
a b c d
1 1 1 250
1 1 2 600
1 2 3 150
1 3 1 0
I am currently doing it like this (this is the body of my loop):
{
n=nrow(subset(Wallmart, a==i & b==j & c==k ))
sum=subset(Wallmart, a==i & b==j & c==k )
#sum
sum1=append(sum1,sum(sum$d)/(n-1))
}
I would like to add up the 'd' column and take the average over the number of rows, not counting the rows where it is 0. For example, the first row is (200+300)/2 = 250. Currently I am building a list that stores the 'd' column, but ideally I want it in the format above. For example, the first row would look like:
a b c d
1 1 1 250
This is a very inefficient way to do this work. The code takes a long time to run in a loop, so any help that makes it run faster is appreciated. The original data frame has about a million rows.
Recommended answer
You may try aggregate:
aggregate(d ~ a + b + c, data = df, sum)
# a b c d
# 1 1 1 1 500
# 2 1 3 1 0
# 3 1 1 2 600
# 4 1 2 3 300
As noted by @Roland, for bigger data sets, you may try data.table or dplyr instead, e.g.:
library(dplyr)
df %>%
  group_by(a, b, c) %>%
  summarise(sum_d = sum(d))
# Source: local data frame [4 x 4]
# Groups: a, b
#
# a b c sum_d
# 1 1 1 1 500
# 2 1 1 2 600
# 3 1 2 3 300
# 4 1 3 1 0
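The answer names data.table but only shows dplyr. A minimal sketch of the same grouped sum with data.table might look like the following; the data frame is rebuilt here from the question's example, and the names `dt` and `sum_dt` are chosen for illustration only.

```r
library(data.table)

# Example data from the question (assumed to be stored in `df`)
df <- data.frame(
  a = c(1, 1, 1, 1, 1, 1, 1, 1, 1),
  b = c(1, 1, 1, 1, 1, 2, 2, 2, 3),
  c = c(1, 1, 1, 2, 2, 3, 3, 3, 1),
  d = c(0, 200, 300, 0, 600, 0, 100, 200, 0)
)

# Work on a data.table copy so that `df` itself is left unchanged
dt <- as.data.table(df)

# Sum d within each combination of a, b and c
sum_dt <- dt[, .(sum_d = sum(d)), by = .(a, b, c)]
print(sum_dt)
```

With `by`, data.table returns the groups in order of first appearance, so the row order can differ from that of aggregate.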
Edit following the updated question: if you want to calculate the group-wise mean, excluding rows that are zero, you may try this:
aggregate(d ~ a + b + c, data = df, function(x) mean(x[x > 0]))
# a b c d
# 1 1 1 1 250
# 2 1 3 1 NaN
# 3 1 1 2 600
# 4 1 2 3 150
# dplyr equivalent; note that groups containing only zeros are dropped
df %>%
  filter(d != 0) %>%
  group_by(a, b, c) %>%
  summarise(mean_d = mean(d))
# a b c mean_d
# 1 1 1 1 250
# 2 1 1 2 600
# 3 1 2 3 150
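Neither variant above reproduces the 0 that the question's desired output shows for the group a = 1, b = 3, c = 1: aggregate returns NaN there, and the filter approach drops the group. If a 0 is wanted for groups that contain only zeros, one possible sketch in base R is shown below. The helper name `mean_nonzero` is hypothetical, and the data frame is rebuilt from the question's example.

```r
# Example data from the question (assumed to be stored in `df`)
df <- data.frame(
  a = c(1, 1, 1, 1, 1, 1, 1, 1, 1),
  b = c(1, 1, 1, 1, 1, 2, 2, 2, 3),
  c = c(1, 1, 1, 2, 2, 3, 3, 3, 1),
  d = c(0, 200, 300, 0, 600, 0, 100, 200, 0)
)

# Hypothetical helper: mean of the non-zero values,
# falling back to 0 when the group has no non-zero values
mean_nonzero <- function(x) {
  nz <- x[x != 0]
  if (length(nz) == 0) 0 else mean(nz)
}

res <- aggregate(d ~ a + b + c, data = df, FUN = mean_nonzero)
print(res)
```

Whether 0 or NaN is the better result for an all-zero group depends on how the zeros are meant to be interpreted.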
However, because it seems that you wish to treat your zeros as missing values rather than numeric zeros, I think it would be better to convert them to NA when preparing your data set, before the calculations.
df$d[df$d == 0] <- NA
df %>%
  group_by(a, b, c) %>%
  summarise(mean_d = mean(d, na.rm = TRUE))
# a b c mean_d
# 1 1 1 1 250
# 2 1 1 2 600
# 3 1 2 3 150
# 4 1 3 1 NaN