Compare values in data.frame from different rows

Problem description

I have an R data.frame of college football data, with two entries for each game (one for each team, with stats and whatnot). I would like to compare points from these to create a binary Win/Loss variable, but I have no idea how (I'm not very experienced with R). Is there a way I can iterate through the columns and try to match them up against another column (I have a game ID variable, so I'd match on that) and create aforementioned binary Win/Loss variable by comparing points values?

Excerpt of dataframe (many variables left out):

Team Code  Name      Game Code            Date          Site    Points
5         Akron      5050320051201     12/1/2005        NEUTRAL   32
5         Akron     404000520051226    12/26/2005       NEUTRAL   23
8         Alabama   419000820050903    9/3/2005         TEAM      37
8         Alabama   664000820050910    9/10/2005        TEAM      43

What I want is to append a new column, a binary variable that's assigned 1 or 0 based on if the team won or lost. To figure this out, I need to take the game code, say 5050320051201, find the other row with that same game code (there's only one other row with that same game code, for the other team in that game), and compare the points value for the two, and use that to assign the 1 or 0 for the Win/Loss variable.

Solution

Assuming that your data has exactly two rows (one per team) for each unique Game Code and that there are no tie games, as in the following example:

df <- structure(list(`Team Code` = c(5L, 6L, 5L, 5L, 8L, 9L, 9L, 8L
), Name = c("Akron", "St. Joseph", "Akron", "Miami(Ohio)", "Alabama", 
"Florida", "Tennessee", "Alabama"), `Game Code` = structure(c(1L, 
1L, 2L, 2L, 3L, 3L, 4L, 4L), .Label = c("5050320051201", "404000520051226", 
"419000820050903", "664000820050910"), class = "factor"), Date = structure(c(13118, 
13118, 13143, 13143, 13029, 13029, 13036, 13036), class = "Date"), 
Site = c("NEUTRAL", "NEUTRAL", "NEUTRAL", "NEUTRAL", "TEAM", 
"AWAY", "AWAY", "TEAM"), Points = c(32L, 25L, 23L, 42L, 37L, 
45L, 42L, 43L)), .Names = c("Team Code", "Name", "Game Code", 
"Date", "Site", "Points"), row.names = c(NA, -8L), class = "data.frame")

print(df)
##  Team Code        Name       Game Code       Date    Site Points
##1         5       Akron   5050320051201 2005-12-01 NEUTRAL     32
##2         6  St. Joseph   5050320051201 2005-12-01 NEUTRAL     25
##3         5       Akron 404000520051226 2005-12-26 NEUTRAL     23
##4         5 Miami(Ohio) 404000520051226 2005-12-26 NEUTRAL     42
##5         8     Alabama 419000820050903 2005-09-03    TEAM     37
##6         9     Florida 419000820050903 2005-09-03    AWAY     45
##7         9   Tennessee 664000820050910 2005-09-10    AWAY     42
##8         8     Alabama 664000820050910 2005-09-10    TEAM     43
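
If you are not sure that both assumptions hold for the full data set, a quick check in base R might look like the sketch below. The data frame `games` and its column names `game_code` and `points` are made up for illustration (they reuse the first four rows of the example); substitute your own names.

```r
# Hypothetical miniature of the example data: one row per team per game.
games <- data.frame(
  game_code = c("5050320051201", "5050320051201",
                "404000520051226", "404000520051226"),
  points    = c(32L, 25L, 23L, 42L)
)

# Count the rows for each game code; every game should have exactly 2.
rows_per_game <- table(games$game_code)
all_pairs <- all(rows_per_game == 2)

# Count the distinct scores within each game; 2 distinct scores means no tie.
distinct_scores <- tapply(games$points, games$game_code,
                          function(p) length(unique(p)))
no_ties <- all(distinct_scores == 2)

print(all_pairs)
print(no_ties)
```

If either value comes back FALSE, inspecting `rows_per_game` or `distinct_scores` shows which game codes need attention before the Win/Loss column is computed.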

You can use dplyr to generate what you want:

library(dplyr)
result <- df %>% group_by(`Game Code`) %>% 
                 mutate(`Win/Loss`=if(first(Points) > last(Points)) as.integer(c(1,0)) else as.integer(c(0,1)))
print(result)
##Source: local data frame [8 x 7]
##Groups: Game Code [4]
##
##  Team Code        Name       Game Code       Date    Site Points Win/Loss
##      <int>       <chr>          <fctr>     <date>   <chr>  <int>    <int>
##1         5       Akron   5050320051201 2005-12-01 NEUTRAL     32        1
##2         6  St. Joseph   5050320051201 2005-12-01 NEUTRAL     25        0
##3         5       Akron 404000520051226 2005-12-26 NEUTRAL     23        0
##4         5 Miami(Ohio) 404000520051226 2005-12-26 NEUTRAL     42        1
##5         8     Alabama 419000820050903 2005-09-03    TEAM     37        0
##6         9     Florida 419000820050903 2005-09-03    AWAY     45        1
##7         9   Tennessee 664000820050910 2005-09-10    AWAY     42        0
##8         8     Alabama 664000820050910 2005-09-10    TEAM     43        1

Here, we first group_by the Game Code and then use mutate to create the Win/Loss column for each group. The logic is simple: if the first Points value in a group is greater than the last (there are only two by assumption), we set the column to c(1,0); otherwise, we set it to c(0,1). Note that this logic does not handle ties: when the two scores are equal, the comparison is FALSE, so the second team would be marked as the winner. It can easily be extended to handle them. Note also that we surround the column names with back-quotes because they contain special characters such as a space and /.
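
One possible way to extend this to ties is sketched below. It is not the only option: it assumes each game still has exactly two rows, and it marks a tie as NA, which is a choice made here for illustration. The data frame `games`, its columns, and the tied game "C" are made up; they are not part of the original data.

```r
library(dplyr)

# Hypothetical data: games "A" and "B" reuse scores from the example,
# game "C" is an invented tie.
games <- data.frame(
  game   = c("A", "A", "B", "B", "C", "C"),
  team   = c("Akron", "St. Joseph", "Akron", "Miami(Ohio)", "Team X", "Team Y"),
  points = c(32L, 25L, 23L, 42L, 21L, 21L)
)

result <- games %>%
  group_by(game) %>%
  mutate(
    # With exactly two rows per game, reversing the group's scores
    # lines each team up with its opponent's score.
    opp_points = rev(points),
    win = case_when(
      points > opp_points ~ 1L,
      points < opp_points ~ 0L,
      TRUE                ~ NA_integer_  # tie: neither a win nor a loss
    )
  ) %>%
  ungroup()

print(result)
```

Because the comparison is done row by row against the opponent's score, this version does not depend on which team happens to be listed first within a game.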
