Creating a running counting variable in R?


Question

I have a dataset of soccer match results, and I am hoping to learn R by creating a running set of ratings similar to the World Football Elo formula. I am running into trouble because things that seem simple in Excel aren't exactly intuitive in R. For instance, here are the first 15 of 4270 observations, with the necessary variables:

       date t.1  t.2 m.result
1  19960406  DC   SJ      0.0
2  19960413 COL   KC      0.0
3  19960413  NE   TB      0.0
4  19960413 CLB   DC      1.0
5  19960413 LAG NYRB      1.0
6  19960414 FCD   SJ      0.5
7  19960418 FCD   KC      1.0
8  19960420  NE NYRB      1.0
9  19960420  DC  LAG      0.0
10 19960420 CLB   TB      0.0
11 19960421 COL  FCD      1.0
12 19960421  SJ   KC      0.5
13 19960427 CLB NYRB      1.0
14 19960427  DC   NE      0.5
15 19960428 FCD   TB      1.0

I want to be able to create new variables that hold a running count of the total matches played by t.1 and t.2 (i.e., the number of times, up to the date in question, that "DC" occurs in column t.1 or t.2):

           date t.1  t.2 m.result  ##t.1m    ##t.2m
    1  19960406  DC   SJ      0.0       1         1
    2  19960413 COL   KC      0.0       1         1
    3  19960413  NE   TB      0.0       1         1
    4  19960413 CLB   DC      1.0       1         2
    5  19960413 LAG NYRB      1.0       1         1
    6  19960414 FCD   SJ      0.5       1         2
    7  19960418 FCD   KC      1.0       2         2
    8  19960420  NE NYRB      1.0       2         2
    9  19960420  DC  LAG      0.0       3         2
    10 19960420 CLB   TB      0.0       2         2
    11 19960421 COL  FCD      1.0       2         3
    12 19960421  SJ   KC      0.5       3         3
    13 19960427 CLB NYRB      1.0       3         3
    14 19960427  DC   NE      0.5       4         3
    15 19960428 FCD   TB      1.0       4         3

In Excel, this is a (relatively) simple =SUMPRODUCT formula, e.g.:

E4=SUMPRODUCT((A:A<=A4)*(B:B=B4))+SUMPRODUCT((A:A<=A4)*(C:C=B4))

where E4 is t.1m for observation 4, A:A is date, B:B is t.1, C:C is t.2, and so on.

In R, I can get the total sumproduct printed (i.e., "DC" has played 576 games across my dataset), but for some reason (probably that I'm new, impatient, and rattled by trial and error) I'm just lost on how to make a running count over the observations, and especially how to store that running count as a variable, which is vital for any game rating index. I know 'PlayerRatings' exists, but I feel that for my R education I should be able to do this in R without that package. plyr or dplyr is okay, of course.

For reference, here is my data for you to copy/paste into your R session.

date<-c(19960406,19960413,19960413,19960413,19960413,19960414,19960418,19960420,19960420,19960420,19960421,19960421,19960427,19960427,19960428)
t.1<-c("DC","COL","NE","CLB","LAG","FCD","FCD","NE","DC","CLB","COL","SJ","CLB","DC","FCD")
t.2<-c("SJ","KC","TB","DC","NYRB","SJ","KC","NYRB","LAG","TB","FCD","KC","NYRB","NE","TB")
m.result<-c(0.0,0.0,0.0,1.0,1.0,0.5,1.0,1.0,0.0,0.0,1.0,0.5,1.0,0.5,1.0)
mtable<-data.frame(date,t.1,t.2,m.result)
mtable

Recommended answer

Here's a very straightforward solution that isn't pretty but does the job.

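First, a small change to your data to make comparisons easier: store the team names as character strings rather than factors (before R 4.0.0, data.frame() converted strings to factors by default):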

mtable<-data.frame(date,t.1,t.2,m.result, stringsAsFactors = FALSE)


Edit:

If you want to ensure the matches are ordered by date, you can use order, as pointed out by @eipi10:

mtable = mtable[order(mtable$date), ]

Note that if the dates are stored in a format whose numeric order differs from chronological order, you can first convert them to Date format using as.Date().
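As a minimal sketch of that conversion, assuming the dates are yyyymmdd numbers as in the question (the names date_parsed and other are illustrative, and the dd/mm/yyyy string is a hypothetical example of a format whose sort order is not chronological):

```r
# yyyymmdd numbers, as in the question's data
date <- c(19960406, 19960413, 19960428)

# as.Date() needs a character input plus a matching format string
date_parsed <- as.Date(as.character(date), format = "%Y%m%d")

# Hypothetical dd/mm/yyyy strings: sorting these as text would not be chronological
other <- as.Date(c("13/04/1996", "06/04/1996"), format = "%d/%m/%Y")

# Date vectors sort chronologically
sort(other)
```

Once the column is a Date, order(mtable$date) sorts the matches chronologically regardless of how the dates were originally written.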

For each row, we take a subset of the data frame containing the columns t.1 and t.2 and all rows from 1 up to that row: 1:1, 1:2, 1:3, and so on. In each subset, we count the number of times the row's team has appeared, and use that count as the value for the new column.

mtable$t.1m <- sapply(1:nrow(mtable),
             function(i) sum(mtable[1:i, c("t.1", "t.2")] == mtable$t.1[i]))

That was done for the teams in t.1. With a small change to the argument after ==, we can do the same for t.2:

mtable$t.2m <- sapply(1:nrow(mtable),
             function(i) sum(mtable[1:i, c("t.1", "t.2")] == mtable$t.2[i]))
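The two sapply calls count by row position, so they rely on the rows already being in date order. A variant that mirrors the Excel SUMPRODUCT formula more closely compares dates instead. This is only a sketch: matches_to_date is a hypothetical helper name, and it assumes no team plays twice on the same date.

```r
date <- c(19960406,19960413,19960413,19960413,19960413,19960414,19960418,19960420,
          19960420,19960420,19960421,19960421,19960427,19960427,19960428)
t.1 <- c("DC","COL","NE","CLB","LAG","FCD","FCD","NE","DC","CLB","COL","SJ","CLB","DC","FCD")
t.2 <- c("SJ","KC","TB","DC","NYRB","SJ","KC","NYRB","LAG","TB","FCD","KC","NYRB","NE","TB")
m.result <- c(0.0,0.0,0.0,1.0,1.0,0.5,1.0,1.0,0.0,0.0,1.0,0.5,1.0,0.5,1.0)
mtable <- data.frame(date, t.1, t.2, m.result, stringsAsFactors = FALSE)

# Hypothetical helper: number of matches `team` has played on or before date `d`,
# counting appearances in either team column
matches_to_date <- function(team, d) {
  sum(mtable$date <= d & (mtable$t.1 == team | mtable$t.2 == team))
}

# Apply the helper to each (team, date) pair; USE.NAMES = FALSE keeps the result unnamed
mtable$t.1m <- mapply(matches_to_date, mtable$t.1, mtable$date, USE.NAMES = FALSE)
mtable$t.2m <- mapply(matches_to_date, mtable$t.2, mtable$date, USE.NAMES = FALSE)
```

Because the count depends on the date rather than the row index, this version gives the same result even if the rows have not been sorted first.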

Now our data frame looks like this:

> mtable
       date t.1  t.2 m.result t.1m t.2m
1  19960406  DC   SJ      0.0    1    1
2  19960413 COL   KC      0.0    1    1
3  19960413  NE   TB      0.0    1    1
4  19960413 CLB   DC      1.0    1    2
5  19960413 LAG NYRB      1.0    1    1
6  19960414 FCD   SJ      0.5    1    2
7  19960418 FCD   KC      1.0    2    2
8  19960420  NE NYRB      1.0    2    2
9  19960420  DC  LAG      0.0    3    2
10 19960420 CLB   TB      0.0    2    2
11 19960421 COL  FCD      1.0    2    3
12 19960421  SJ   KC      0.5    3    3
13 19960427 CLB NYRB      1.0    3    3
14 19960427  DC   NE      0.5    4    3
15 19960428 FCD   TB      1.0    4    3
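The row-by-row subsetting above rescans all earlier rows for every match, which may get slow on the full 4270-row dataset. One possible alternative, not part of the original answer, is to interleave the two team columns into a single vector and let base R's ave() number each team's appearances in order. The sketch below assumes the rows are already sorted by date and that a team never plays itself; teams, running and counts are illustrative names.

```r
date <- c(19960406,19960413,19960413,19960413,19960413,19960414,19960418,19960420,
          19960420,19960420,19960421,19960421,19960427,19960427,19960428)
t.1 <- c("DC","COL","NE","CLB","LAG","FCD","FCD","NE","DC","CLB","COL","SJ","CLB","DC","FCD")
t.2 <- c("SJ","KC","TB","DC","NYRB","SJ","KC","NYRB","LAG","TB","FCD","KC","NYRB","NE","TB")
m.result <- c(0.0,0.0,0.0,1.0,1.0,0.5,1.0,1.0,0.0,0.0,1.0,0.5,1.0,0.5,1.0)
mtable <- data.frame(date, t.1, t.2, m.result, stringsAsFactors = FALSE)

# Interleave the team columns: t.1[1], t.2[1], t.1[2], t.2[2], ...
teams <- c(rbind(mtable$t.1, mtable$t.2))

# Within each team's group, replace the values with 1, 2, 3, ... in order of appearance
running <- ave(seq_along(teams), teams, FUN = seq_along)

# Fold back into two columns: column 1 belongs to t.1, column 2 to t.2
counts <- matrix(running, ncol = 2, byrow = TRUE)
mtable$t.1m <- counts[, 1]
mtable$t.2m <- counts[, 2]
```

On the 15-row sample this produces the same t.1m and t.2m columns as the sapply approach.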
