R: remove multiple rows based on missing values in fewer rows
Problem description
I have an R data frame with data from multiple subjects, each tested several times. To perform statistics on the set, there is a factor for subject ("id") and a row for each observation (given by factor "session"). I.e.
print(allData)
id session measure
1 1 7.6
2 1 4.5
3 1 5.5
1 2 7.1
2 2 NA
3 2 4.9
In the above example, is there a simple way to remove all rows with id==2, given that the "measure" column contains NA in one of the rows where id==2?
More generally, since I actually have a lot of measures (columns) and four sessions (rows) for each subject, is there an elegant way to remove all rows with a given level of the "id" factor, given that (at least) one of the rows with this "id"-level contains NA in a column?
I have the intuition that there could be a built-in function that solves this problem more elegantly than my current solution:
# Which columns to check for NAs
probeColumns <- c('measure1', 'measure4') # Etc...
# A vector of all values of "id" that occur in rows with an NA in any of the probeColumns
idsWithNAs <- allData[!complete.cases(allData[probeColumns]), "id"]
# Keep only the rows whose id is not in idsWithNAs
cleanedData <- allData[!allData$id %in% idsWithNAs, ]
Thanks, /Jonas
You can use the ddply function from the plyr package to 1) subset your data by id, 2) apply a function that will return NULL if the sub data.frame contains NA in the columns of your choice, or the data.frame itself otherwise, and 3) concatenate everything back into a data.frame.
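To make the three steps concrete, here is a minimal base-R sketch of the same split-apply-combine idea, without plyr. The small data set and its values are made up for illustration (they are not from the question), and the names pieces and keep are hypothetical helpers.

```r
# Small deterministic data set (illustrative values only, assumed for this sketch)
allData <- data.frame(id       = rep(1:3, times = 2),
                      session  = rep(1:2, each = 3),
                      measure1 = c(7.6, 4.5, 5.5, 7.1, NA, 4.9),
                      measure4 = c(1, 2, 3, 4, 5, 6))
probeColumns <- c("measure1", "measure4")

# 1) Split the data frame into one piece per id
pieces <- split(allData, allData$id)

# 2) Flag the pieces whose probe columns contain no NA
keep <- vapply(pieces,
               function(df) !any(is.na(df[probeColumns])),
               logical(1))

# 3) Bind the surviving pieces back into a single data frame
cleanedData <- do.call(rbind, pieces[keep])
rownames(cleanedData) <- NULL
cleanedData
```

In this sketch id 2 is dropped entirely because one of its rows has NA in measure1. Note that do.call(rbind, ...) returns NULL if every id is dropped, so real code may want to handle that case.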
allData <- data.frame(id = rep(1:4, 3),
                      session = rep(1:3, each = 4),
                      measure1 = sample(c(NA, 1:11)),
                      measure2 = sample(c(NA, 1:11)),
                      measure3 = sample(c(NA, 1:11)),
                      measure4 = sample(c(NA, 1:11)))
allData
# id session measure1 measure2 measure3 measure4
# 1 1 1 3 7 10 6
# 2 2 1 4 4 9 9
# 3 3 1 6 6 7 10
# 4 4 1 1 5 2 3
# 5 1 2 NA NA 5 11
# 6 2 2 7 10 6 5
# 7 3 2 9 8 4 2
# 8 4 2 2 9 1 7
# 9 1 3 5 1 3 8
# 10 2 3 8 3 8 1
# 11 3 3 11 11 11 4
# 12 4 3 10 2 NA NA
# Which columns to check for NA's in
probeColumns = c('measure1','measure4')
library(plyr)
ddply(allData, "id",
      function(df) if (any(is.na(df[, probeColumns]))) NULL else df)
# id session measure1 measure2 measure3 measure4
# 1 2 1 4 4 9 9
# 2 2 2 7 10 6 5
# 3 2 3 8 3 8 1
# 4 3 1 6 6 7 10
# 5 3 2 9 8 4 2
# 6 3 3 11 11 11 4
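If you would rather avoid the plyr dependency, one possible base-R alternative is to combine complete.cases with ave, which applies a function within each group and returns a vector as long as its input. The sketch below reuses the measure1 and measure4 values from the printed example above; the names rowHasNA and idHasNA are hypothetical.

```r
# measure1 and measure4 copied from the printed example above (other columns omitted)
allData <- data.frame(id       = rep(1:4, times = 3),
                      session  = rep(1:3, each = 4),
                      measure1 = c(3, 4, 6, 1, NA, 7, 9, 2, 5, 8, 11, 10),
                      measure4 = c(6, 9, 10, 3, 11, 5, 2, 7, 8, 1, 4, NA))
probeColumns <- c("measure1", "measure4")

# TRUE for every row that has an NA in at least one probe column
rowHasNA <- !complete.cases(allData[probeColumns])

# Spread that flag to all rows sharing the same id:
# TRUE if any row of that id has an NA in the probe columns
idHasNA <- ave(rowHasNA, allData$id, FUN = any)

# Keep only the rows whose id has no NA anywhere in the probe columns
cleanedData <- allData[!idHasNA, ]
cleanedData
```

This keeps the rows for ids 2 and 3, like the ddply call. Unlike ddply, it preserves the original row order and row names rather than grouping the result by id.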