Replace missing values (NA) with most recent non-NA by group
Problem description
I would like to solve the following problem with dplyr, preferably with one of the window functions. I have a data frame with houses and buying prices. The following is an example:
houseID year price
1 1995 NA
1 1996 100
1 1997 NA
1 1998 120
1 1999 NA
2 1995 NA
2 1996 NA
2 1997 NA
2 1998 30
2 1999 NA
3 1995 NA
3 1996 44
3 1997 NA
3 1998 NA
3 1999 NA
I would like to make a data frame like this:
houseID year price
1 1995 NA
1 1996 100
1 1997 100
1 1998 120
1 1999 120
2 1995 NA
2 1996 NA
2 1997 NA
2 1998 30
2 1999 30
3 1995 NA
3 1996 44
3 1997 44
3 1998 44
3 1999 44
Here are some data in the right format:
# Number of houses
N <- 15
# Data frame
df <- data.frame(
  houseID = rep(1:N, each = 10),
  year = 1995:2004,
  price = ifelse(runif(10 * N) > 0.15, NA, exp(rnorm(10 * N)))
)
Is there a dplyr way to do this?
Recommended answer
These all use na.locf from the zoo package. Also note that na.locf0 (also defined in zoo) is like na.locf except that it defaults to na.rm = FALSE and requires a single vector argument. na.locf2, defined in the first solution, is also used in some of the others.
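The difference between the defaults matters because a grouped fill has to return a vector of the same length as its input. The following is a minimal sketch of that behaviour on a made-up vector x (the vector and the variable names are illustrative and not part of the original answer):

```r
library(zoo)

# Hypothetical price series with a leading NA
x <- c(NA, 100, NA, 120, NA)

# Default na.locf (na.rm = TRUE) drops the leading NA, so the result is shorter than x
short <- na.locf(x)

# With na.rm = FALSE the leading NA is kept and the length is preserved
kept <- na.locf(x, na.rm = FALSE)

# na.locf0 preserves the length without any extra argument
kept0 <- na.locf0(x)

length(x)      # 5
length(short)  # 4
length(kept0)  # 5
```

Under these assumptions, kept and kept0 should hold the same values, which is why either form can be used inside a per-group fill.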
dplyr
library(dplyr)
library(zoo)
na.locf2 <- function(x) na.locf(x, na.rm = FALSE)
df %>% group_by(houseID) %>% do(na.locf2(.)) %>% ungroup
giving:
Source: local data frame [15 x 3]
Groups: houseID
houseID year price
1 1 1995 NA
2 1 1996 100
3 1 1997 100
4 1 1998 120
5 1 1999 120
6 2 1995 NA
7 2 1996 NA
8 2 1997 NA
9 2 1998 30
10 2 1999 30
11 3 1995 NA
12 3 1996 44
13 3 1997 44
14 3 1998 44
15 3 1999 44
A variation of this is:
df %>% group_by(houseID) %>% mutate(price = na.locf0(price)) %>% ungroup
The other solutions below give very similar output, so we won't repeat it except where the format differs substantially.
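One way to check that claim on a small input is to compare the filled price column from the dplyr variant with the one from the ave solution shown further below. This is only a sketch: the data frame is a hand-typed copy of the example, and the names res_dplyr, res_ave and expected are made up for illustration.

```r
library(dplyr)
library(zoo)

# Small illustrative input with the same shape as the example data
df <- data.frame(
  houseID = rep(1:3, each = 5),
  year = rep(1995:1999, times = 3),
  price = c(NA, 100, NA, 120, NA, NA, NA, NA, 30, NA, NA, 44, NA, NA, NA)
)

# dplyr variant: fill within each house, then drop the grouping
res_dplyr <- df %>%
  group_by(houseID) %>%
  mutate(price = na.locf0(price)) %>%
  ungroup()

# Base R variant using ave
res_ave <- transform(df, price = ave(price, houseID, FUN = na.locf0))

# Prices expected from the desired output in the question
expected <- c(NA, 100, 100, 120, 120, NA, NA, NA, 30, 30, NA, 44, 44, 44, 44)

all.equal(res_dplyr$price, expected)
all.equal(res_ave$price, expected)
```

Comparing only the price column sidesteps the class difference between the tibble returned by dplyr and the plain data frame returned by transform.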
Another possibility is to combine the by solution (shown further below) with dplyr:
df %>% by(df$houseID, na.locf2) %>% bind_rows
by
library(zoo)
do.call(rbind, by(df, df$houseID, na.locf2))
ave
library(zoo)
transform(df, price = ave(price, houseID, FUN = na.locf0))
data.table
library(data.table)
library(zoo)
data.table(df)[, na.locf2(.SD), by = houseID]
zoo
This solution uses zoo alone. It returns a wide rather than long result:
library(zoo)
z <- read.zoo(df, index = 2, split = 1, FUN = identity)
na.locf2(z)
giving:
1 2 3
1995 NA NA NA
1996 100 NA 44
1997 100 NA 44
1998 120 30 44
1999 120 30 44
This solution could be combined with dplyr like this:
library(dplyr)
library(zoo)
df %>% read.zoo(index = 2, split = 1, FUN = identity) %>% na.locf2
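If the long layout from the question is needed after the wide zoo fill, the result can be reshaped by hand. The sketch below is one possible way to do that with coredata and index from zoo. It is not part of the original answer, the names filled, m and long are made up, and it assumes the columns of the wide object are named after the house IDs.

```r
library(zoo)

# Small illustrative input with the same shape as the example data
df <- data.frame(
  houseID = rep(1:3, each = 5),
  year = rep(1995:1999, times = 3),
  price = c(NA, 100, NA, 120, NA, NA, NA, NA, 30, NA, NA, 44, NA, NA, NA)
)

# Wide zoo object: one row per year, one column per house
z <- read.zoo(df, index.column = 2, split = 1, FUN = identity)

# Carry the last observation forward within each column, keeping leading NAs
filled <- na.locf(z, na.rm = FALSE)

# Rebuild a long data frame; as.vector() walks the matrix column by column,
# so each house ID is repeated once per year
m <- coredata(filled)
long <- data.frame(
  houseID = rep(as.integer(colnames(m)), each = nrow(m)),
  year = rep(index(filled), times = ncol(m)),
  price = as.vector(m)
)
```

The resulting long data frame has one row per house and year, in the same order as the desired output in the question.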
Input
Here is the input used for the examples above:
df <- structure(list(houseID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 3L, 3L), year = c(1995L, 1996L, 1997L, 1998L,
1999L, 1995L, 1996L, 1997L, 1998L, 1999L, 1995L, 1996L, 1997L,
1998L, 1999L), price = c(NA, 100L, NA, 120L, NA, NA, NA, NA,
30L, NA, NA, 44L, NA, NA, NA)), .Names = c("houseID", "year",
"price"), class = "data.frame", row.names = c(NA, -15L))
REVISED: Re-arranged and added more solutions. Revised the dplyr/zoo solution to conform to the latest changes in dplyr. Applied fixes and factored out na.locf2 from all solutions.