How to replace NA with mean by group / subset?
Question
I have a dataframe with the lengths and widths of various arthropods from the guts of salamanders. Because some guts had thousands of certain prey items, I only measured a subset of each prey type. I now want to replace each unmeasured individual with the mean length and width for that prey. I want to keep the dataframe and just add imputed columns (length2, width2). The main reason is that each row also has columns with data on the date and location the salamander was collected. I could fill in the NA with a random selection of the measured individuals but for the sake of argument let's assume I just want to replace each NA with the mean.
For example, imagine I have a dataframe that looks something like this:
id taxa length width
101 collembola 2.1 0.9
102 mite 0.9 0.7
103 mite 1.1 0.8
104 collembola NA NA
105 collembola 1.5 0.5
106 mite NA NA
In reality I have more columns, about 25 different taxa, and a total of ~30,000 prey items. It seems like the plyr package might be ideal for this, but I just can't figure out how to do it. I'm not very R or programming savvy, but I'm trying to learn.
Not that I know what I'm doing but I'll try to create a small dataset to play with if it helps.
exampleDF <- data.frame(
  id = 1:100,
  taxa = c(rep("collembola", 50), rep("mite", 25), rep("ant", 25)),
  # NA is left unquoted so that length and width stay numeric
  length = c(rnorm(40, 1, 0.5), rep(NA, 10), rnorm(20, 0.8, 0.1), rep(NA, 5),
             rnorm(20, 2.5, 0.5), rep(NA, 5)),
  width = c(rnorm(40, 0.5, 0.25), rep(NA, 10), rnorm(20, 0.3, 0.01), rep(NA, 5),
            rnorm(20, 1, 0.1), rep(NA, 5)))
Here are a few things I've tried (that haven't worked):
# mean imputation to recode NA in length and width with means
# (could do random imputation but unnecessary here)
mean.imp <- function(x) {
missing <- is.na(x)
n.missing <-sum(missing)
x.obs <- x[!missing]
imputed <- x
imputed[missing] <- mean(x.obs)
return (imputed)
}
mean.imp(exampleDF[exampleDF$taxa == "collembola", "length"])
n.taxa <- length(unique(exampleDF$taxa))
for(i in 1:n.taxa) {
mean.imp(exampleDF[exampleDF$taxa == unique(exampleDF$taxa[i]), "length"])
} # no way to get back into dataframe in proper places, try plyr?
Another attempt:
imp.mean <- function(x) {
a <- mean(x, na.rm = TRUE)
return (ifelse (is.na(x) == TRUE , a, x))
} # tried but not sure how to use this in ddply
Diet2 <- ddply(exampleDF, .(taxa), transform, length2 = function(x) {
a <- mean(exampleDF$length, na.rm = TRUE)
return (ifelse (is.na(exampleDF$length) == TRUE , a, exampleDF$length))
})
Any suggestions?
Recommended answer
Not my own technique; I saw it on the boards a while back:
dat <- read.table(text = "id taxa length width
101 collembola 2.1 0.9
102 mite 0.9 0.7
103 mite 1.1 0.8
104 collembola NA NA
105 collembola 1.5 0.5
106 mite NA NA", header=TRUE)
library(plyr)
impute.mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))
dat2 <- ddply(dat, ~ taxa, transform, length = impute.mean(length),
width = impute.mean(width))
dat2[order(dat2$id), ] #plyr orders by group so we have to reorder
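The question asks to keep the original columns and add imputed ones (length2, width2). One way to do that without reordering rows is base R's ave(), which applies a function within each group and returns the values in the original row order. The following is a minimal sketch, not part of the original answer; it assumes the same toy data and reuses the impute.mean helper defined above.

```r
# Same toy data as above
dat <- read.table(text = "id taxa length width
101 collembola 2.1 0.9
102 mite 0.9 0.7
103 mite 1.1 0.8
104 collembola NA NA
105 collembola 1.5 0.5
106 mite NA NA", header = TRUE)

impute.mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

# ave() splits the vector by taxa, applies FUN to each piece,
# and puts the results back in the original positions
dat$length2 <- ave(dat$length, dat$taxa, FUN = impute.mean)
dat$width2  <- ave(dat$width,  dat$taxa, FUN = impute.mean)
dat
```

The original length and width columns still contain their NA values, so measured and imputed individuals remain distinguishable.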
Edit: A non-plyr approach with a for loop:
for (i in which(sapply(dat, is.numeric))) {
for (j in which(is.na(dat[, i]))) {
dat[j, i] <- mean(dat[dat[, "taxa"] == dat[j, "taxa"], i], na.rm = TRUE)
}
}
Edit: Many moons later, here are data.table and dplyr approaches:
data.table
library(data.table)
setDT(dat)
dat[, length := impute.mean(length), by = taxa][,
width := impute.mean(width), by = taxa]
dplyr
library(dplyr)
dat %>%
group_by(taxa) %>%
mutate(
length = impute.mean(length),
width = impute.mean(width)
)
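With many measurement columns, listing each one in mutate() gets repetitive. A possible variant, assuming dplyr 1.0 or later, uses across() to apply impute.mean to several columns at once and to write the results to new columns (length2, width2) through the .names argument. This is a sketch on the same toy data, not part of the original answer.

```r
library(dplyr)

dat <- read.table(text = "id taxa length width
101 collembola 2.1 0.9
102 mite 0.9 0.7
103 mite 1.1 0.8
104 collembola NA NA
105 collembola 1.5 0.5
106 mite NA NA", header = TRUE)

impute.mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

dat2 <- dat %>%
  group_by(taxa) %>%
  # "{.col}2" names each new column after its source column plus a "2" suffix
  mutate(across(c(length, width), impute.mean, .names = "{.col}2")) %>%
  ungroup()  # drop the grouping so later operations act on the whole table
dat2
```

Unlike ddply, group_by() followed by mutate() keeps the rows in their original order, so no reordering by id is needed afterwards.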