折叠的行,其中有些全为NA,其他行与某些NA不相交 [英] Collapsing rows where some are all NA, others are disjoint with some NAs
问题描述
我有一个简单的数据框,
I have a simple dataframe as such:
ID Col1 Col2 Col3 Col4
1 NA NA NA NA
1 5 10 NA NA
1 NA NA 15 20
2 NA NA NA NA
2 25 30 NA NA
2 NA NA 35 40
我想这样重新格式化:
ID Col1 Col2 Col3 Col4
1 5 10 15 20
2 25 30 35 40
(请注意:实际数据集具有数千行,其值来自生物学数据-除了NA
是不相交的,NA
并不遵循简单的模式,是的,正好有3个每个ID
行).
(please note: the real data set has thousands of rows and the values are from biological data -- the NA
s follow no simple pattern, except that the NA
s are disjoint, and yes there are exactly 3 rows for each ID
).
第一个步骤:删除仅包含NA
值的行.
STEP ONE: get rid of rows that have only NA
values.
表面上看这很简单,但是我遇到了一些问题.
On the surface this looked simple, but I've run across some problems.
complete.cases(DF)
返回所有FALSE
,所以我不能像DF[complete.cases(DF),]
那样使用它真正删除具有所有NA
的行.这是因为所有行均包含至少一个NA
.
complete.cases(DF)
returns all FALSE
, so I can't really use this to remove the rows with all NA
s, as in DF[complete.cases(DF),]
. This is because all rows contain at least one NA
.
由于NA
要传播自己,因此使用is.na
的其他方案由于相同的原因而失败.
Since NA
s want to propagate themselves, other schemes using is.na
fail for the same reason.
第二步:将剩余的两行合为一体.
STEP TWO: collapse the remaining two rows into one.
正在考虑使用类似aggregate
的方法来实现此目的,但是必须有一种比
Thinking about using something like aggregate
to pull this off, but there has got to be an easier way than this, which doesn't work at all.
谢谢您的建议.
推荐答案
尝试
library(dplyr)
DF %>% group_by(ID) %>% summarise_each(funs(sum(., na.rm = TRUE)))
要说明某一列具有某个ID
的所有NAs
的情况,我们需要sum_NA()
函数,如果所有都是NAs
,则该函数将返回NA
To account for the case in which one column has all NAs
for a certain ID
, we need sum_NA()
function which returns NA
if all are NAs
txt <- "ID Col1 Col2 Col3 Col4
1 NA NA NA NA
1 5 10 NA NA
1 NA NA 15 20
2 NA NA NA NA
2 NA 30 NA NA
2 NA NA 35 40"
DF <- read.table(text = txt, header = TRUE)
# original code
DF %>%
group_by(ID) %>%
summarise_each(funs(sum(., na.rm = TRUE)))
# `summarise_each()` is deprecated.
# Use `summarise_all()`, `summarise_at()` or `summarise_if()` instead.
# To map `funs` over all variables, use `summarise_all()`
# A tibble: 2 x 5
ID Col1 Col2 Col3 Col4
<int> <int> <int> <int> <int>
1 1 5 10 15 20
2 2 0 30 35 40
sum_NA <- function(x) {if (all(is.na(x))) x[NA_integer_] else sum(x, na.rm = TRUE)}
DF %>%
group_by(ID) %>%
summarise_all(funs(sum_NA))
DF %>%
group_by(ID) %>%
summarise_if(is.numeric, funs(sum_NA))
# A tibble: 2 x 5
ID Col1 Col2 Col3 Col4
<int> <int> <int> <int> <int>
1 1 5 10 15 20
2 2 NA 30 35 40
这篇关于折叠的行,其中有些全为NA,其他行与某些NA不相交的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!