averageif() equivalent in R

Question

I'm resorting to R to handle averageif()- and maxif()-like functions because my data set is too large and Excel keeps crashing.

I'm looking for a way to find the average wind for each status, restricted to dates between the Begin_Date and End_Date columns of df. I'd imagine this would involve the between function in R.

library(dplyr)  # provides the storms data set and %>%

status <- c("tropical depression", "tropical depression", "tropical storm")
Begin_Date <- as.Date(c("1974/06/01", "1980/06/05", "1990/06/07"))
End_Date <- as.Date(c("1975/06/01", "1981/07/05", "1991/08/07"))
# data.frame() keeps the Date columns as dates; cbind() would coerce them
df <- data.frame(status, Begin_Date, End_Date, stringsAsFactors = FALSE)
# build a date column in storms from its year, month and day columns
storms$date <- as.Date(with(storms, paste(year, month, day, sep = "-")), "%Y-%m-%d")

(Note: my actual problem is billing data, so I would need a solution to gather an average based on IDs, between a start billing date and end billing date).

Using the storms data set from dplyr as an example.

Starting with my df below:

             status Begin_Date   End_Date
tropical depression 1974-06-01 1975-06-01
tropical depression 1980-06-05 1981-07-05
     tropical storm 1990-06-07 1991-08-07

I'd like:

             status Begin_Date   End_Date Avg Wind
tropical depression 1974-06-01 1975-06-01     44.3
tropical depression 1980-06-05 1981-07-05     66.7
     tropical storm 1990-06-07 1991-08-07       56

I'm trying to get this to be dplyr compliant as well.

My attempt throws an error:

df$Average_wind < - storms %>% filter (status = df$status, date >= df$Begin_Date,date<df$End_Date) %>% summarise(avg = mean(wind))

Error in FUN(left) : invalid argument to unary operator

The average wind values in the "I'd like" example are not accurate; they are there only to show the format.

Recommended answer

Exactly as described in the comments: just left_join(storms, your_data) on status, then filter out the rows where date isn't in your range.

If you're open to other tools, data.table supports non-equi joins, which will be significantly more efficient on large data.

    left_join(storms, df, by = "status") %>%
        filter(Begin_Date <= date & date <= End_Date) %>%
        group_by(Begin_Date, End_Date, status) %>%
        summarize(avg_wind = mean(wind))
    # # A tibble: 2 x 4
    # # Groups: Begin_Date, End_Date [?]
    #   Begin_Date End_Date   status              avg_wind
    #   <date>     <date>     <chr>                  <dbl>
    # 1 1980-06-05 1981-07-05 tropical depression     26.9
    # 2 1990-06-07 1991-08-07 tropical storm          45.4

There are only 2 rows in the results because apparently there are no tropical depressions in the storms data between 1974-06-01 and 1975-06-01. In fact, the smallest date in storms is 1975-06-27.
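
The data.table non-equi join mentioned above might look roughly like the sketch below. It is not part of the original answer: `obs` is hypothetical toy data standing in for storms (one wind reading per date), and `periods` mirrors df from the question. The join matches on status and on date falling inside each period, so no oversized intermediate table has to be filtered afterwards.

```r
library(data.table)

# Hypothetical toy readings standing in for storms.
obs <- data.table(
  status = c("tropical depression", "tropical depression",
             "tropical depression", "tropical storm", "tropical storm"),
  date   = as.Date(c("1974-07-01", "1980-07-01", "1981-01-15",
                     "1990-07-01", "1995-01-01")),
  wind   = c(10, 20, 30, 50, 70)
)

# Periods to average over, shaped like df in the question.
periods <- data.table(
  status     = c("tropical depression", "tropical depression", "tropical storm"),
  Begin_Date = as.Date(c("1974-06-01", "1980-06-05", "1990-06-07")),
  End_Date   = as.Date(c("1975-06-01", "1981-07-05", "1991-08-07"))
)

# Non-equi join: equal status, and date within [Begin_Date, End_Date].
# by = .EACHI evaluates mean(wind) once per row of periods.
res <- obs[periods,
           .(avg_wind = mean(wind)),
           on = .(status, date >= Begin_Date, date <= End_Date),
           by = .EACHI]

print(res$avg_wind)
```

For billing data, the same pattern should apply with the ID column in place of status and the billing dates in place of Begin_Date and End_Date.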

You seem very keen on using between. If you want to, you can use it inside filter() instead of my code. It won't change the results.
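
A minimal sketch of the between() variant, assuming dplyr 1.1.0 or later, where between() accepts vector bounds and left_join() takes a relationship argument (older versions may behave differently). The `obs` data frame is hypothetical toy data standing in for storms; `periods` mirrors df from the question.

```r
library(dplyr)

# Hypothetical toy readings standing in for storms.
obs <- data.frame(
  status = c("tropical depression", "tropical depression",
             "tropical depression", "tropical storm", "tropical storm"),
  date   = as.Date(c("1974-07-01", "1980-07-01", "1981-01-15",
                     "1990-07-01", "1995-01-01")),
  wind   = c(10, 20, 30, 50, 70)
)

# Periods to average over, shaped like df in the question.
periods <- data.frame(
  status     = c("tropical depression", "tropical depression", "tropical storm"),
  Begin_Date = as.Date(c("1974-06-01", "1980-06-05", "1990-06-07")),
  End_Date   = as.Date(c("1975-06-01", "1981-07-05", "1991-08-07"))
)

# Each reading can match several periods with the same status, so the join
# is declared many-to-many; between() then keeps the in-range rows.
res <- left_join(obs, periods, by = "status",
                 relationship = "many-to-many") %>%
  filter(between(date, Begin_Date, End_Date)) %>%
  group_by(status, Begin_Date, End_Date) %>%
  summarize(avg_wind = mean(wind), .groups = "drop")

print(res)
```

between() is inclusive on both ends, so it matches the `Begin_Date <= date & date <= End_Date` condition used in the answer.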
