如何用中位数填充NA? [英] How to fill NA with median?
问题描述
示例数据:
set.seed(1)
df <- data.frame(years=sort(rep(2005:2010, 12)),
months=1:12,
value=c(rnorm(60),NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA))
head(df)
years months value
1 2005 1 -0.6264538
2 2005 2 0.1836433
3 2005 3 -0.8356286
4 2005 4 1.5952808
5 2005 5 0.3295078
6 2005 6 -0.8204684
请告诉我,如何将 df$value 中的 NA 替换为其他月份的中位数?值"必须包含同月所有先前值的中值.也就是说,如果当前月份是 5 月,则值"必须包含 5 月份之前所有值的中值.
Tell me please, how i can replace NA in df$value to median of others months? "value" must contain the median of value of all previous values for the same month. That is, if current month is May, "value" must contain the median value for all previous values of the month of May.
推荐答案
或 with ave
df <- data.frame(years=sort(rep(2005:2010, 12)),
months=1:12,
value=c(rnorm(60),NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA))
df$value[is.na(df$value)] <- with(df, ave(value, months,
FUN = function(x) median(x, na.rm = TRUE)))[is.na(df$value)]
<小时>
既然有这么多答案,让我们看看哪个最快.
Since there are so many answers let's see which is fastest.
plyr2 <- function(df){
medDF <- ddply(df,.(months),summarize,median=median(value,na.rm=TRUE))
df$value[is.na(df$value)] <- medDF$median[match(df$months,medDF$months)][is.na(df$value)]
df
}
library(plyr)
library(data.table)
DT <- data.table(df)
setkey(DT, months)
benchmark(ave = df$value[is.na(df$value)] <-
with(df, ave(value, months,
FUN = function(x) median(x, na.rm = TRUE)))[is.na(df$value)],
tapply = df$value[61:72] <-
with(df, tapply(value, months, median, na.rm=TRUE)),
sapply = df[61:72, 3] <- sapply(split(df[1:60, 3], df[1:60, 2]), median),
plyr = ddply(df, .(months), transform,
value=ifelse(is.na(value), median(value, na.rm=TRUE), value)),
plyr2 = plyr2(df),
data.table = DT[,value := ifelse(is.na(value), median(value, na.rm=TRUE), value), by=months],
order = "elapsed")
test replications elapsed relative user.self sys.self user.child sys.child
3 sapply 100 0.209 1.000000 0.196 0.000 0 0
1 ave 100 0.260 1.244019 0.244 0.000 0 0
6 data.table 100 0.271 1.296651 0.264 0.000 0 0
2 tapply 100 0.271 1.296651 0.256 0.000 0 0
5 plyr2 100 1.675 8.014354 1.612 0.004 0 0
4 plyr 100 2.075 9.928230 2.004 0.000 0 0
我敢打赌 data.table 是最快的.
I would have bet that data.table was the fastest.
[ Matthew Dowle ] 这里计时的任务最多需要 0.02 秒 (2.075/100).data.table
认为这无关紧要.尝试将 replications
设置为 1
并增加数据大小.或者计时 3 次运行中最快的也是一个普遍的经验法则.在这些链接中进行更详细的讨论:
[ Matthew Dowle ] The task being timed here takes at most 0.02 seconds (2.075/100). data.table
considers that insignificant. Try setting replications
to 1
and increasing the data size, instead. Or timing the fastest of 3 runs is also a common rule of thumb. More verbose discussion in these links :
- Evidence that data.table isn't always fastest
- Benchmarks in Averaging column values for specific sections of data corresponding to other column values
- London R presentation, June 2012 (slide 21 headed "Other")
- A transform by group benchmark in an extreme case
这篇关于如何用中位数填充NA?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!