averageif() equivalent in R
Question
I'm resorting to R to handle averageif()- and maxif()-like functions because my data set is too large and Excel keeps crashing.
I'm looking for a way to find the average wind based on status, for dates between the Begin_Date and End_Date columns of df. I'd imagine this would involve the between function in R.
library(dplyr)  # provides the storms data set and the %>% pipe

status <- c("tropical depression", "tropical depression", "tropical storm")
Begin_Date <- as.Date(c("1974/06/01", "1980/06/05", "1990/06/07"))
End_Date <- as.Date(c("1975/06/01", "1981/07/05", "1991/08/07"))
# data.frame() keeps the Date and character column types; cbind() would coerce them
df <- data.frame(status, Begin_Date, End_Date, stringsAsFactors = FALSE)
# Build a Date column in storms from its year, month and day columns
storms$date <- as.Date(with(storms, paste(year, month, day, sep = "-")), "%Y-%m-%d")
(Note: my actual problem is billing data, so I would need a solution to gather an average based on IDs, between a start billing date and end billing date).
Take the storms data set from dplyr as an example. Starting from my df below:
status               Begin_Date  End_Date
tropical depression  1974-06-01  1975-06-01
tropical depression  1980-06-05  1981-07-05
tropical storm       1990-06-07  1991-08-07
What I'd like is:
status               Begin_Date  End_Date    Avg Wind
tropical depression  1974-06-01  1975-06-01  44.3
tropical depression  1980-06-05  1981-07-05  66.7
tropical storm       1990-06-07  1991-08-07  56
I'm trying to get this to be dplyr compliant as well.
My attempt throws an error:
df$Average_wind < - storms %>% filter (status = df$status, date >= df$Begin_Date,date<df$End_Date) %>% summarise(avg = mean(wind))
Error in FUN(left) : invalid argument to unary operator
The average wind and max wind values in the "I'd like" example are not accurate; they are there simply for formatting purposes.
Recommended answer
Exactly as described in the comments: just left_join(storms, your_data) (on status) and filter out the rows where the date isn't in your range.
If you're open to other tools, data.table supports non-equi joins, which will be significantly more efficient on large data.
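For the data.table route, a minimal sketch of a non-equi join might look like the following. It is not from the original answer: the tables obs and ranges and their values are made up for illustration (obs stands in for storms), and only the join pattern itself is the point.

```r
library(data.table)

# Toy observations standing in for storms (hypothetical values)
obs <- data.table(
  status = c("tropical depression", "tropical depression",
             "tropical storm", "tropical storm"),
  date   = as.Date(c("1980-07-01", "1980-08-01", "1990-07-01", "1992-01-01")),
  wind   = c(20, 30, 50, 70)
)

# Date ranges to average over, same layout as df in the question
ranges <- data.table(
  status     = c("tropical depression", "tropical depression", "tropical storm"),
  Begin_Date = as.Date(c("1974-06-01", "1980-06-05", "1990-06-07")),
  End_Date   = as.Date(c("1975-06-01", "1981-07-05", "1991-08-07"))
)

# Non-equi join: for each row of ranges, match obs rows with the same status
# whose date lies in [Begin_Date, End_Date]; by = .EACHI computes j once per
# row of ranges, so unmatched ranges stay in the result with NA
res <- obs[ranges,
           .(avg_wind = mean(wind)),
           on = .(status, date >= Begin_Date, date <= End_Date),
           by = .EACHI]

# The two join columns both come back named "date"; give them clearer names
setnames(res, c("status", "Begin_Date", "End_Date", "avg_wind"))
```

Because the join never materialises every status match before filtering, it should scale better than join-then-filter when each status has many rows.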
# dplyr version: join on status, keep rows inside each range, then average
left_join(storms, df, by = "status") %>%
  filter(Begin_Date <= date & date <= End_Date) %>%
  group_by(Begin_Date, End_Date, status) %>%
  summarize(avg_wind = mean(wind))
# # A tibble: 2 x 4
# # Groups: Begin_Date, End_Date [?]
# Begin_Date End_Date status avg_wind
# <date> <date> <chr> <dbl>
# 1 1980-06-05 1981-07-05 tropical depression 26.9
# 2 1990-06-07 1991-08-07 tropical storm 45.4
There are only 2 rows in the results because apparently there are no tropical depressions in the storms data between 1974-06-01 and 1975-06-01. In fact, the smallest date in storms is 1975-06-27.
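If every range should appear in the output even when nothing matches, one option is to start from the ranges and join the observations onto them. The sketch below assumes dplyr 1.1.0 or later, where join_by() accepts inequality conditions; the toy tables obs and ranges are hypothetical stand-ins for storms and df.

```r
library(dplyr)

# Toy observations standing in for storms (hypothetical values)
obs <- tibble(
  status = c("tropical depression", "tropical depression",
             "tropical storm", "tropical storm"),
  date   = as.Date(c("1980-07-01", "1980-08-01", "1990-07-01", "1992-01-01")),
  wind   = c(20, 30, 50, 70)
)

# Date ranges to average over, same layout as df in the question
ranges <- tibble(
  status     = c("tropical depression", "tropical depression", "tropical storm"),
  Begin_Date = as.Date(c("1974-06-01", "1980-06-05", "1990-06-07")),
  End_Date   = as.Date(c("1975-06-01", "1981-07-05", "1991-08-07"))
)

# Left join from ranges so unmatched ranges survive with NA wind;
# in join_by(), the left side of each condition refers to ranges, the right to obs
res <- ranges %>%
  left_join(obs, by = join_by(status, Begin_Date <= date, End_Date >= date)) %>%
  group_by(status, Begin_Date, End_Date) %>%
  summarize(avg_wind = mean(wind), .groups = "drop")
```

Here the 1974 to 1975 range is kept with an NA average instead of being dropped.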
You seem very keen on using between. If you want to, you can use it inside filter() instead of my code. It won't change the results.
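As a sketch of the between() variant, assuming dplyr 1.1.0 or later (where between() accepts vector bounds and left_join() has a relationship argument), the filter step could be written as below. The toy tables are hypothetical, and max() is added alongside mean() to cover the maxif() case from the question.

```r
library(dplyr)

# Toy observations standing in for storms (hypothetical values)
obs <- tibble(
  status = c("tropical depression", "tropical depression",
             "tropical storm", "tropical storm"),
  date   = as.Date(c("1980-07-01", "1980-08-01", "1990-07-01", "1992-01-01")),
  wind   = c(20, 30, 50, 70)
)

# Date ranges to average over, same layout as df in the question
ranges <- tibble(
  status     = c("tropical depression", "tropical depression", "tropical storm"),
  Begin_Date = as.Date(c("1974-06-01", "1980-06-05", "1990-06-07")),
  End_Date   = as.Date(c("1975-06-01", "1981-07-05", "1991-08-07"))
)

res <- obs %>%
  # Each observation may match several ranges and vice versa, so say so explicitly
  left_join(ranges, by = "status", relationship = "many-to-many") %>%
  # between() is inclusive on both ends: Begin_Date <= date <= End_Date
  filter(between(date, Begin_Date, End_Date)) %>%
  group_by(status, Begin_Date, End_Date) %>%
  summarize(avg_wind = mean(wind), max_wind = max(wind), .groups = "drop")
```

As with the answer's code, ranges without any matching observation are dropped by the filter.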