Replace missing values (NA) with most recent non-NA by group
Question
I would like to solve the following problem with dplyr, preferably with one of its window functions. I have a data frame with houses and buying prices. The following is an example:
houseID year price
1 1995 NA
1 1996 100
1 1997 NA
1 1998 120
1 1999 NA
2 1995 NA
2 1996 NA
2 1997 NA
2 1998 30
2 1999 NA
3 1995 NA
3 1996 44
3 1997 NA
3 1998 NA
3 1999 NA
I would like to make a data frame like this:
houseID year price
1 1995 NA
1 1996 100
1 1997 100
1 1998 120
1 1999 120
2 1995 NA
2 1996 NA
2 1997 NA
2 1998 30
2 1999 30
3 1995 NA
3 1996 44
3 1997 44
3 1998 44
3 1999 44
Here are some data in the right format:
# Number of houses
N <- 15
# Data frame: 10 years per house; roughly 85% of the prices are set to NA
df <- data.frame(houseID = rep(1:N, each = 10), year = 1995:2004,
                 price = ifelse(runif(10 * N) > 0.15, NA, exp(rnorm(10 * N))))
Is there a dplyr-way to do that?
Recommended answer
These all use na.locf from the zoo package. Also note that na.locf0 (also defined in zoo) is like na.locf except that it defaults to na.rm = FALSE and requires a single vector argument. na.locf2, defined in the first solution, is also used in some of the others.
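The difference between the two defaults matters here because a result shorter than its input cannot be assigned back to a column. The following is a minimal sketch on a made-up vector (the name x is only for illustration), assuming zoo is installed:

```r
library(zoo)

# A made-up price vector with a leading NA, like house 1 in the question
x <- c(NA, 100, NA, 120, NA)

# Default na.rm = TRUE: the leading NA is dropped, so the result is shorter
dropped <- na.locf(x)

# na.rm = FALSE keeps the leading NA and preserves the input length
kept <- na.locf(x, na.rm = FALSE)

# na.locf0 gives the length-preserving behaviour for a single vector
kept0 <- na.locf0(x)

length(x)        # 5
length(dropped)  # 4
kept0            # NA 100 100 120 120
```

Because kept and kept0 have the same length as x, either form can be used inside a grouped column assignment.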
dplyr
library(dplyr)
library(zoo)
na.locf2 <- function(x) na.locf(x, na.rm = FALSE)
df %>% group_by(houseID) %>% do(na.locf2(.)) %>% ungroup
giving:
Source: local data frame [15 x 3]
Groups: houseID
houseID year price
1 1 1995 NA
2 1 1996 100
3 1 1997 100
4 1 1998 120
5 1 1999 120
6 2 1995 NA
7 2 1996 NA
8 2 1997 NA
9 2 1998 30
10 2 1999 30
11 3 1995 NA
12 3 1996 44
13 3 1997 44
14 3 1998 44
15 3 1999 44
A variation of this is:
df %>% group_by(houseID) %>% mutate(price = na.locf0(price)) %>% ungroup
Other solutions below give output which is quite similar so we won't repeat it except where the format differs substantially.
Another possibility is to combine the by solution (shown further below) with dplyr:
df %>% by(df$houseID, na.locf2) %>% bind_rows
by
library(zoo)
do.call(rbind, by(df, df$houseID, na.locf2))
ave
library(zoo)
transform(df, price = ave(price, houseID, FUN = na.locf0))
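The grouping argument is what keeps a price from one house from being carried into the next. A small sketch on a made-up two-house data frame (the names toy, filled and ungrouped are hypothetical, chosen only for this illustration) shows the contrast:

```r
library(zoo)

# Two made-up houses; house 2 starts with missing prices
toy <- data.frame(houseID = c(1, 1, 1, 2, 2, 2),
                  price   = c(NA, 100, NA, NA, NA, 30))

# Fill within each houseID: the leading NAs of house 2 stay NA
filled <- transform(toy, price = ave(price, houseID, FUN = na.locf0))

# For contrast, filling the whole column at once leaks 100 into house 2
ungrouped <- na.locf0(toy$price)

filled$price  # NA 100 100  NA  NA  30
ungrouped     # NA 100 100 100 100  30
```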
data.table
library(data.table)
library(zoo)
data.table(df)[, na.locf2(.SD), by = houseID]
zoo
This solution uses zoo alone. It returns a wide rather than long result:
library(zoo)
z <- read.zoo(df, index = 2, split = 1, FUN = identity)
na.locf2(z)
giving:
1 2 3
1995 NA NA NA
1996 100 NA 44
1997 100 NA 44
1998 120 30 44
1999 120 30 44
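If the long layout from the question is needed again, the wide zoo object can be unrolled by hand. This is one possible sketch rather than part of the original answer; it rebuilds the example input inline, and the names wide and long are hypothetical. It relies on the columns of the wide result being named after the houseID values, as in the output above.

```r
library(zoo)

# The example input, rebuilt inline so the block is self-contained
df <- data.frame(houseID = rep(1:3, each = 5),
                 year    = rep(1995:1999, times = 3),
                 price   = c(NA, 100, NA, 120, NA,
                             NA, NA, NA, 30, NA,
                             NA, 44, NA, NA, NA))

# Wide zoo object: one row per year, one column per house, NAs filled down
wide <- na.locf(read.zoo(df, index = 2, split = 1, FUN = identity),
                na.rm = FALSE)

# Unroll column by column: coredata() is a matrix, stored column-major
long <- data.frame(houseID = rep(colnames(wide), each = nrow(wide)),
                   year    = rep(index(wide), times = ncol(wide)),
                   price   = as.vector(coredata(wide)))
```

Note that houseID comes back as character in this sketch, since it is taken from the column names.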
This solution could be combined with dplyr like this:
library(dplyr)
library(zoo)
df %>% read.zoo(index = 2, split = 1, FUN = identity) %>% na.locf2
Input
Here is the input used for the examples above:
df <- structure(list(houseID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 3L, 3L), year = c(1995L, 1996L, 1997L, 1998L,
1999L, 1995L, 1996L, 1997L, 1998L, 1999L, 1995L, 1996L, 1997L,
1998L, 1999L), price = c(NA, 100L, NA, 120L, NA, NA, NA, NA,
30L, NA, NA, 44L, NA, NA, NA)), .Names = c("houseID", "year",
"price"), class = "data.frame", row.names = c(NA, -15L))
REVISED: Re-arranged and added more solutions. Revised the dplyr/zoo solution to conform to the latest changes in dplyr. Applied fixes and factored out na.locf2 from all solutions.