Replace missing values (NA) with most recent non-NA by group

Problem description

I would like to solve the following problem with dplyr, preferably with one of the window functions. I have a data frame with houses and buying prices. The following is an example:

houseID      year    price 
1            1995    NA
1            1996    100
1            1997    NA
1            1998    120
1            1999    NA
2            1995    NA
2            1996    NA
2            1997    NA
2            1998    30
2            1999    NA
3            1995    NA
3            1996    44
3            1997    NA
3            1998    NA
3            1999    NA

I would like to make a data frame like this:

houseID      year    price 
1            1995    NA
1            1996    100
1            1997    100
1            1998    120
1            1999    120
2            1995    NA
2            1996    NA
2            1997    NA
2            1998    30
2            1999    30
3            1995    NA
3            1996    44
3            1997    44
3            1998    44
3            1999    44

Here are some data in the right format:

# Number of houses
N <- 15

# Data frame: 10 yearly rows per house; about 85% of prices are NA
df <- data.frame(houseID = rep(1:N, each = 10),
                 year = 1995:2004,
                 price = ifelse(runif(10 * N) > 0.15, NA, exp(rnorm(10 * N))))

Is there a dplyr way to do that?

Recommended answer

These all use na.locf from the zoo package. Note that na.locf0 (also defined in zoo) is like na.locf except that it defaults to na.rm = FALSE and requires a single vector argument. na.locf2, defined in the first solution, is also used in some of the others.
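The difference between the two defaults matters inside grouped operations, where the result must have the same length as the input. A minimal sketch, using a made-up vector `x` that mirrors house 1 in the question:

```r
library(zoo)

# Illustrative vector (same pattern as house 1 in the example data)
x <- c(NA, 100, NA, 120, NA)

dropped <- na.locf(x)                 # default na.rm = TRUE drops the leading NA
kept    <- na.locf(x, na.rm = FALSE)  # leading NA is kept, length is preserved
kept0   <- na.locf0(x)                # na.locf0 keeps the leading NA by default

length(dropped)  # 4
length(kept)     # 5
```

Because `dropped` is shorter than `x`, the default call is not suitable where a same-length result is needed, which is why the solutions use either `na.locf2` or `na.locf0`.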

dplyr

library(dplyr)
library(zoo)

na.locf2 <- function(x) na.locf(x, na.rm = FALSE)
df %>% group_by(houseID) %>% do(na.locf2(.)) %>% ungroup

giving:

Source: local data frame [15 x 3]
Groups: houseID

   houseID year price
1        1 1995    NA
2        1 1996   100
3        1 1997   100
4        1 1998   120
5        1 1999   120
6        2 1995    NA
7        2 1996    NA
8        2 1997    NA
9        2 1998    30
10       2 1999    30
11       3 1995    NA
12       3 1996    44
13       3 1997    44
14       3 1998    44
15       3 1999    44

A variation of this is:

df %>% group_by(houseID) %>% mutate(price = na.locf0(price)) %>% ungroup
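For comparison, a similar result can be obtained without zoo by using `fill()` from the tidyr package. This is not part of the original answer; it is a sketch that assumes tidyr is installed, and the small `df` below is a cut-down copy of the example input. The `arrange()` step is an added precaution, since any last-observation-carried-forward fill depends on the rows being in year order within each house.

```r
library(dplyr)
library(tidyr)

# Cut-down copy of the example input (houses 1 and 2 only)
df <- data.frame(
  houseID = rep(1:2, each = 5),
  year    = rep(1995:1999, times = 2),
  price   = c(NA, 100, NA, 120, NA, NA, NA, NA, 30, NA)
)

filled <- df %>%
  arrange(houseID, year) %>%  # make sure years are in order within each house
  group_by(houseID) %>%
  fill(price) %>%             # default direction is "down", applied per group
  ungroup()
```

Leading NAs in each group are left as NA, matching the desired output in the question.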

The other solutions below give very similar output, so we won't repeat it except where the format differs substantially.

Another possibility is to combine the by solution (shown further below) with dplyr:

df %>% by(df$houseID, na.locf2) %>% bind_rows

by

library(zoo)

do.call(rbind, by(df, df$houseID, na.locf2))

ave

library(zoo)

transform(df, price = ave(price, houseID, FUN = na.locf0))
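To make clear what is being computed per group, here is a package-free sketch of last-observation-carried-forward combined with `ave`. The helper name `locf` is hypothetical (it is not from zoo or the original answer), and the small `df` is a cut-down copy of the example input.

```r
# Hypothetical helper: carry the most recent non-NA value forward,
# leaving leading NAs untouched.
locf <- function(x) {
  idx <- cumsum(!is.na(x))        # how many non-NA values seen so far (0 = none yet)
  c(NA, x[!is.na(x)])[idx + 1]    # position 1 holds NA for the "none yet" case
}

# Cut-down copy of the example input (houses 1 and 2 only)
df <- data.frame(
  houseID = rep(1:2, each = 5),
  year    = rep(1995:1999, times = 2),
  price   = c(NA, 100, NA, 120, NA, NA, NA, NA, 30, NA)
)

# ave() applies locf separately within each houseID
out <- transform(df, price = ave(price, houseID, FUN = locf))
```

In practice `na.locf0` is the more robust choice; the sketch only illustrates the idea of indexing the non-NA values by a running count.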

data.table

library(data.table)
library(zoo)

data.table(df)[, na.locf2(.SD), by = houseID]
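As an aside that is not part of the original answer, recent versions of data.table (1.12.4 and later, to the best of my knowledge) provide their own `nafill()` for numeric columns, which avoids the zoo dependency. A minimal sketch on a cut-down copy of the example input:

```r
library(data.table)

# Cut-down copy of the example input (houses 1 and 2 only)
dt <- data.table(
  houseID = rep(1:2, each = 5),
  year    = rep(1995:1999, times = 2),
  price   = c(NA, 100, NA, 120, NA, NA, NA, NA, 30, NA)
)

# Fill price by reference within each house; leading NAs stay NA
dt[, price := nafill(price, type = "locf"), by = houseID]
```

Unlike the `na.locf2(.SD)` version, this touches only the `price` column and modifies `dt` in place.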

zoo

This solution uses zoo alone. It returns a wide rather than long result:

library(zoo)

z <- read.zoo(df, index = 2, split = 1, FUN = identity)
na.locf2(z)

giving:

       1  2  3
1995  NA NA NA
1996 100 NA 44
1997 100 NA 44
1998 120 30 44
1999 120 30 44

This solution could be combined with dplyr like this:

library(dplyr)
library(zoo)

df %>% read.zoo(index = 2, split = 1, FUN = identity) %>% na.locf2

Input

Here is the input used for the examples above:

df <- structure(list(houseID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
  2L, 3L, 3L, 3L, 3L, 3L), year = c(1995L, 1996L, 1997L, 1998L, 
  1999L, 1995L, 1996L, 1997L, 1998L, 1999L, 1995L, 1996L, 1997L, 
  1998L, 1999L), price = c(NA, 100L, NA, 120L, NA, NA, NA, NA, 
  30L, NA, NA, 44L, NA, NA, NA)), .Names = c("houseID", "year", 
  "price"), class = "data.frame", row.names = c(NA, -15L))

REVISED: Re-arranged and added more solutions. Revised the dplyr/zoo solution to conform to the latest changes in dplyr. Applied fixes and factored out na.locf2 from all solutions.
