Replace NA in an R vector with adjacent values

Problem description

I have a data frame that merges player and team data for soccer seasons. For a particular player in a specific season, the data look like this:

df <- data.frame(team=c(NA,"CRP",NA,"CRP","CRP",NA),
             player=c(NA,"Ed",NA,"Ed","Ed",NA),
             playerGame= c(NA,1,NA,2,3,NA),
             teamGame =c(1,2,3,4,5,6)) 

The NAs indicate that the player did not appear in that specific team game.

How can I most efficiently replace the team and player NAs with "CRP" and "Ed" respectively, and get a playerGame output of, in this instance, 0, 1, 1, 2, 3, 3?

Edit

Sorry, I wrote this when I woke up in the middle of the night and may have over-simplified my problem. Only one person seems to have picked up on the fact that this is a subset of a much larger data set, and even he/she did not follow through on the point that a straight hardcoded replacement of player and team is insufficient. Thanks for the replies. Dsee's hint about na.locf in the zoo package and the first line of AK's answer appear to offer the best way forward:

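library(zoo)  # provides na.locf
# Seed a leading NA with 0, then carry the last observed value forward
df$playerGame[df$teamGame == min(df$teamGame) & is.na(df$playerGame)] <- 0
df$playerGame <- na.locf(df$playerGame)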

This covers the possibility of more than one NA at the start of the sequence. In my case min(df$teamGame) will always be 1, so hardcoding that may speed things up.

A more realistic example is here:

library(zoo)
library(plyr)

newdf <- data.frame(team=c("CRP","CRP","CRP","CRP","CRP","CRP","TOT","TOT","TOT"),
             player=c(NA,"Ed",NA,"Bill","Bill",NA,NA,NA,"Tom"),
             playerGame= c(NA,1,NA,1,2,NA,NA,NA,1),
             teamGame =c(1,2,3,1,2,3,1,2,3))

I can now show the team for every row. Each team plays three games in a season. Ed and Bill play for CRP and appear in game 2 and in games 1 and 2 respectively. Tom plays for TOT in game 3 only. Assume that player names are unique (even in real-world data).

It seems to me that I need to create another column, 'playerTeam':

# Number each block of three consecutive rows (one player per team season)
newdf$playerTeam <- ceiling(seq_len(nrow(newdf)) / 3)
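Dividing the row number by 3 only works while every block has exactly three rows. As a hedged alternative, and assuming (as stated earlier) that each block starts with teamGame equal to 1, the same playerTeam index can be derived from the data itself. The bare teamGame and playerTeam vectors here are illustrative stand-ins for the data frame columns:

```r
# Assumption: every player/team block begins with teamGame == 1
teamGame <- c(1, 2, 3, 1, 2, 3, 1, 2, 3)

# Each TRUE (a new block start) bumps the running count by one
playerTeam <- cumsum(teamGame == 1)
print(playerTeam)
```

With blocks of unequal length this still yields one id per block, whereas the division by 3 would not.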

I can then use this value to fill in the player gaps. I have used the sort function, which omits NAs:

newdf <- ddply(newdf,.(playerTeam),transform,player=sort(player)[1])

I can then use the aforementioned code:

newdf$playerGame[newdf$teamGame == 1 & is.na(newdf$playerGame) == TRUE] <- 0
newdf$playerGame <- na.locf(newdf$playerGame)

   team player playerGame teamGame playerTeam
1  CRP     Ed          0        1          1
2  CRP     Ed          1        2          1
3  CRP     Ed          1        3          1
4  CRP   Bill          1        1          2
5  CRP   Bill          2        2          2
6  CRP   Bill          2        3          2
7  TOT    Tom          0        1          3
8  TOT    Tom          0        2          3
9  TOT    Tom          1        3          3

I will need to build in the season as well, but that should not be a problem.

Am I missing anything here?

I have several hundred thousand rows to process, so any speed-ups would be helpful. For instance, I would probably want to avoid ddply and use a data.table approach or another apply function, right?

Recommended answer

There seem to be two parts to what you want:

  1. You want to replace the missing player names and teams with predetermined values
  2. You want to carry the game count forward through the playerGame column

For (1), you can do:

df$team[is.na(df$team)] <- 'CRP' 

Similarly, you can alter the other components of the data frame.
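As a rough sketch of doing this for both columns without typing the values in, the helper below (fill_first is a hypothetical name, not part of any package) replaces the NAs in a column with that column's first non-missing value. It assumes each column holds only one distinct non-missing value, as in the small example, and it sets stringsAsFactors = FALSE so the columns are plain character vectors:

```r
df <- data.frame(team = c(NA, "CRP", NA, "CRP", "CRP", NA),
                 player = c(NA, "Ed", NA, "Ed", "Ed", NA),
                 playerGame = c(NA, 1, NA, 2, 3, NA),
                 teamGame = c(1, 2, 3, 4, 5, 6),
                 stringsAsFactors = FALSE)

# Hypothetical helper: replace NAs with the first non-missing value in x
fill_first <- function(x) {
  x[is.na(x)] <- x[!is.na(x)][1]
  x
}

# Apply the helper to both label columns at once
cols <- c("team", "player")
df[cols] <- lapply(df[cols], fill_first)
print(df)
```

For the larger data set described in the question, the same helper would have to be applied per player/team group rather than to the whole column.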

For (2) you could do this:

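# A leading NA has no previous value, so start the count at 0
if (is.na(df$playerGame[1])) {
    df$playerGame[1] <- 0
}
# Carry the previous value forward over each remaining NA
for (i in 2:length(df$playerGame)) {
    if (is.na(df$playerGame[i])) {
        df$playerGame[i] <- df$playerGame[i-1]
    }
}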

Then df$playerGame is:

[1] 0 1 1 2 3 3

Perhaps there is a very nifty way to do this, but this is clearly readable...
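One possible vectorised alternative in base R, offered as a sketch rather than a benchmarked solution: count the non-missing values seen so far with cumsum and use that count to index into the observed values, with 0 prepended for any leading NAs. The name locf0 is made up for this example. The second half applies it per group with ave to the more realistic data from the question, assuming blocks of three rows as described there:

```r
# Hypothetical helper: carry the last observed value forward, starting from `start`
locf0 <- function(x, start = 0) {
  seen <- !is.na(x)
  # cumsum(seen) is 0 before the first observation, so shift the index by 1
  c(start, x[seen])[cumsum(seen) + 1]
}

simple <- locf0(c(NA, 1, NA, 2, 3, NA))
print(simple)

newdf <- data.frame(team = c("CRP", "CRP", "CRP", "CRP", "CRP", "CRP", "TOT", "TOT", "TOT"),
                    player = c(NA, "Ed", NA, "Bill", "Bill", NA, NA, NA, "Tom"),
                    playerGame = c(NA, 1, NA, 1, 2, NA, NA, NA, 1),
                    teamGame = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
                    stringsAsFactors = FALSE)
newdf$playerTeam <- ceiling(seq_len(nrow(newdf)) / 3)

# ave() applies FUN within each group and keeps the original row order
newdf$player <- ave(newdf$player, newdf$playerTeam,
                    FUN = function(p) p[!is.na(p)][1])
newdf$playerGame <- ave(newdf$playerGame, newdf$playerTeam, FUN = locf0)
print(newdf)
```

Whether this is faster than ddply on several hundred thousand rows has not been measured here; it mainly avoids the explicit loop and the extra package dependency.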
