Cumulative mean with conditionals

Question

I'm new to R. Here is a small reproducible sample of my data frame:

PTS_TeamHome <- c(101,87,94,110,95)
PTS_TeamAway <- c(95,89,105,111,121)
TeamHome <- c("LAL", "HOU", "SAS", "MIA", "LAL")
TeamAway <- c("IND", "LAL", "LAL", "HOU", "NOP")
df <- data.frame(cbind(TeamHome, TeamAway,PTS_TeamHome,PTS_TeamAway))
df

TeamHome TeamAway PTS_TeamHome PTS_TeamAway
  LAL      IND          101           95
  HOU      LAL           87           89
  SAS      LAL           94          105
  MIA      HOU          110          111
  LAL      NOP           95          121

Imagine these are the first five games of a season with 1230 games. I want to calculate the cumulative points per game (the running mean) at any given point in the season, for both the home team and the visiting team.

The output would look like this:

  TeamHome TeamAway PTS_TeamHome PTS_TeamAway HOMETEAM_AVGCUMPTS ROADTEAM_AVGCUMPTS
1  LAL      IND          101           95                101                 95
2  HOU      LAL           87           89                 87                 95
3  SAS      LAL           94          105                 94              98.33
4  MIA      HOU          110          111                110                 99
5  LAL      NOP           95          121               97.5                121

Note what the formula does for the home team in the fifth game. Since LAL is the home team, it looks up every score LAL has posted so far, whether at home or on the road. In this case: (101 + 89 + 105 + 95) / 4 = 97.5
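The arithmetic for LAL can be checked directly in R. This is only a minimal sketch: the four scores are copied from the table above, and the names `lal_pts` and `lal_running_mean` are introduced here purely for illustration.

```r
# LAL's points in games 1 (home), 2 (away), 3 (away) and 5 (home)
lal_pts <- c(101, 89, 105, 95)

# Running mean after each LAL game: cumulative sum divided by games played
lal_running_mean <- cumsum(lal_pts) / seq_along(lal_pts)
lal_running_mean
```

The last element is the 97.5 expected for game 5, and the earlier elements correspond to the LAL entries in rows 1 to 3 of the desired output.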

Here is what I tried, without much success:

lst <- list()
for(i in 1:nrow(df)) lst[[i]] <- ( cumsum(df[which(df$TEAM1[1:i]==df$TEAM1[i]),df$PTS_TeamAway,0]) 
                                 + cumsum(df[which(df$TEAM2[1:i]==df$TEAM1[i]),df$PTS_TeamHome,0]) ) 
                             / #divided by number of games
  df$HOMETEAM_AVGCUMPTS <- unlist(lst)

I wanted to compute the cumulative points and then the number of games to divide by, but none of this worked.

Recommended answer

lst <- list()
for(i in 1:nrow(df)) lst[[i]] <- mean(c(df$PTS_TeamHome[1:i][df$TeamHome[1:i] == df$TeamHome[i]],
                                        df$PTS_TeamAway[1:i][df$TeamAway[1:i] == df$TeamHome[i]]))
df$HOMETEAM_AVGCUMPTS <- unlist(lst)


lst2 <- list()
for(i in 1:nrow(df)) lst2[[i]] <- mean(c(df$PTS_TeamAway[1:i][df$TeamAway[1:i] == df$TeamAway[i]],
                                        df$PTS_TeamHome[1:i][df$TeamHome[1:i] == df$TeamAway[i]]))
df$ROADTEAM_AVGCUMPTS <- unlist(lst2)


df
#   TeamHome TeamAway PTS_TeamHome PTS_TeamAway HOMETEAM_AVGCUMPTS ROADTEAM_AVGCUMPTS
# 1      LAL      IND          101           95                101                 95
# 2      HOU      LAL           87           89                 87                 95
# 3      SAS      LAL           94          105                 94           98.33333
# 4      MIA      HOU          110          111                110                 99
# 5      LAL      NOP           95          121               97.5                121


The approach uses two loops, one per output column. In each iteration we take the mean of two vectors, combined as mean(c(vec1, vec2)).
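Because the two loops differ only in which column names the team of interest, they could be folded into one helper. The following is a sketch rather than part of the original answer: `cum_team_mean` is a hypothetical name, and it assumes the team columns are character and the points columns are numeric.

```r
# Sample data from the question, built with numeric point columns
df <- data.frame(
  TeamHome = c("LAL", "HOU", "SAS", "MIA", "LAL"),
  TeamAway = c("IND", "LAL", "LAL", "HOU", "NOP"),
  PTS_TeamHome = c(101, 87, 94, 110, 95),
  PTS_TeamAway = c(95, 89, 105, 111, 121),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each row i, average every score the team named in
# teams[i] has posted in games 1..i, whether it played at home or away.
cum_team_mean <- function(df, teams) {
  vapply(seq_len(nrow(df)), function(i) {
    rows <- seq_len(i)
    mean(c(df$PTS_TeamHome[rows][df$TeamHome[rows] == teams[i]],
           df$PTS_TeamAway[rows][df$TeamAway[rows] == teams[i]]))
  }, numeric(1))
}

df$HOMETEAM_AVGCUMPTS <- cum_team_mean(df, df$TeamHome)
df$ROADTEAM_AVGCUMPTS <- cum_team_mean(df, df$TeamAway)
df
```

Using vapply with numeric(1) instead of filling a list and calling unlist makes the result type explicit. Each row still rescans all earlier rows, so for a full 1230-game season this is quadratic in the number of games, which should still be manageable at that size.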

The first vector is the set of points the home team scored while playing at home (team in column 1, points in column 3). The second vector is the set of points the same team scored while playing away (team in column 2, points in column 4). We use a for loop because it makes it easy to control how many rows are considered in the subset. With df$PTS_TeamHome[1:i], the set is limited to games already played plus the current game. We subset that vector with [df$TeamHome[1:i] == df$TeamHome[i]]. In plain language, that expression means "entries in the TeamHome column, up to the current game, that are equal to the home team currently playing". With those constraints, "future" games cannot corrupt the analysis.
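To make the subsetting concrete, here is one iteration (i = 5) unrolled step by step. It is an illustrative sketch: the intermediate names `current`, `home_mask`, `away_mask`, `home_pts`, `away_pts` and `game5_mean` do not appear in the original answer.

```r
PTS_TeamHome <- c(101, 87, 94, 110, 95)
PTS_TeamAway <- c(95, 89, 105, 111, 121)
TeamHome <- c("LAL", "HOU", "SAS", "MIA", "LAL")
TeamAway <- c("IND", "LAL", "LAL", "HOU", "NOP")

i <- 5                    # index of the current game
current <- TeamHome[i]    # "LAL"

# Logical masks over games 1..i only, so later games cannot leak in
home_mask <- TeamHome[1:i] == current   # TRUE FALSE FALSE FALSE TRUE
away_mask <- TeamAway[1:i] == current   # FALSE TRUE TRUE FALSE FALSE

home_pts <- PTS_TeamHome[1:i][home_mask]   # 101 95
away_pts <- PTS_TeamAway[1:i][away_mask]   # 89 105

game5_mean <- mean(c(home_pts, away_pts))  # 97.5
game5_mean
```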

For the data, I set the stringsAsFactors argument to FALSE and converted the points columns to class numeric. See below.

Data

PTS_TeamHome <- c(101,87,94,110,95)
PTS_TeamAway <- c(95,89,105,111,121)
TeamHome <- c("LAL", "HOU", "SAS", "MIA", "LAL")
TeamAway <- c("IND", "LAL", "LAL", "HOU", "NOP")
df <- data.frame(cbind(TeamHome, TeamAway, PTS_TeamHome, PTS_TeamAway), stringsAsFactors = FALSE)
df[3:4] <- lapply(df[3:4], as.numeric)
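The numeric conversion is needed because cbind() on a mix of character and numeric vectors produces a character matrix, so every column reaches data.frame() as text. An alternative sketch, not part of the original answer, passes the vectors to data.frame() directly so the points columns stay numeric. The names `m` and `df_direct` are hypothetical.

```r
PTS_TeamHome <- c(101, 87, 94, 110, 95)
PTS_TeamAway <- c(95, 89, 105, 111, 121)
TeamHome <- c("LAL", "HOU", "SAS", "MIA", "LAL")
TeamAway <- c("IND", "LAL", "LAL", "HOU", "NOP")

# cbind() coerces all four vectors to a single character matrix
m <- cbind(TeamHome, TeamAway, PTS_TeamHome, PTS_TeamAway)
is.character(m)

# Passing the vectors directly keeps each column's own type
df_direct <- data.frame(TeamHome, TeamAway, PTS_TeamHome, PTS_TeamAway,
                        stringsAsFactors = FALSE)
sapply(df_direct, class)
```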
