折叠的行,其中有些全为NA,其他行与某些NA不相交 [英] Collapsing rows where some are all NA, others are disjoint with some NAs

查看:71
本文介绍了折叠的行,其中有些全为NA,其他行与某些NA不相交的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个简单的数据框,

I have a simple dataframe as such:

ID    Col1    Col2    Col3    Col4
1     NA      NA      NA      NA  
1     5       10      NA      NA
1     NA      NA      15      20
2     NA      NA      NA      NA  
2     25      30      NA      NA
2     NA      NA      35      40 

我想这样重新格式化:

ID    Col1    Col2    Col3    Col4
1     5       10      15      20
2     25      30      35      40

(请注意:实际数据集具有数千行,其值来自生物学数据-除了NA是不相交的,NA并不遵循简单的模式,是的,正好有3个每个ID行).

(please note: the real data set has thousands of rows and the values are from biological data -- the NAs follow no simple pattern, except that the NAs are disjoint, and yes there are exactly 3 rows for each ID).

第一个步骤:删除仅包含NA值的行.

STEP ONE: get rid of rows that have only NA values.

表面上看这很简单,但是我遇到了一些问题.

On the surface this looked simple, but I've run across some problems.

complete.cases(DF)返回所有FALSE,所以我不能像DF[complete.cases(DF),]那样使用它真正删除具有所有NA的行.这是因为所有行均包含至少一个NA.

complete.cases(DF) returns all FALSE, so I can't really use this to remove the rows with all NAs, as in DF[complete.cases(DF),]. This is because all rows contain at least one NA.

由于NA要传播自己,因此使用is.na的其他方案由于相同的原因而失败.

Since NAs want to propagate themselves, other schemes using is.na fail for the same reason.

第二步:将剩余的两行合为一体.

STEP TWO: collapse the remaining two rows into one.

正在考虑使用类似aggregate的方法来实现此目的,但是必须有一种比

Thinking about using something like aggregate to pull this off, but there has got to be an easier way than this, which doesn't work at all.

谢谢您的建议.

推荐答案

尝试

library(dplyr)
DF %>% group_by(ID) %>% summarise_each(funs(sum(., na.rm = TRUE))) 


要说明某一列具有某个ID的所有NAs的情况,我们需要sum_NA()函数,如果所有都是NAs,则该函数将返回NA


To account for the case in which one column has all NAs for a certain ID, we need sum_NA() function which returns NA if all are NAs

txt <- "ID    Col1    Col2    Col3    Col4
        1     NA      NA      NA      NA
        1     5       10      NA      NA
        1     NA      NA      15      20
        2     NA      NA      NA      NA
        2     NA      30      NA      NA
        2     NA      NA      35      40"
DF <- read.table(text = txt, header = TRUE)

# original code
DF %>% 
  group_by(ID) %>% 
  summarise_each(funs(sum(., na.rm = TRUE)))

# `summarise_each()` is deprecated.
# Use `summarise_all()`, `summarise_at()` or `summarise_if()` instead.
# To map `funs` over all variables, use `summarise_all()`
# A tibble: 2 x 5
     ID  Col1  Col2  Col3  Col4
  <int> <int> <int> <int> <int>
1     1     5    10    15    20
2     2     0    30    35    40

sum_NA <- function(x) {if (all(is.na(x))) x[NA_integer_] else sum(x, na.rm = TRUE)}

DF %>%
  group_by(ID) %>%
  summarise_all(funs(sum_NA))

DF %>%
  group_by(ID) %>%
  summarise_if(is.numeric, funs(sum_NA))

# A tibble: 2 x 5
     ID  Col1  Col2  Col3  Col4
  <int> <int> <int> <int> <int>
1     1     5    10    15    20
2     2    NA    30    35    40

这篇关于折叠的行,其中有些全为NA,其他行与某些NA不相交的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆