na.rm doesn't work when using more than one group in R


Question

In the post "select group before certain observations separated by grouping var in R with NA control", adding na.rm = T works when there is only one group. But consider new data with three groups:

data=structure(list(add = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "x", class = "factor"), 
    x1 = c(0L, 2L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 
    1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 3L, 0L, 0L, 
    0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L), add1 = c(514L, 514L, 
    514L, 514L, 514L, 514L, 514L, 514L, 514L, 514L, 514L, 514L, 
    514L, 514L, 514L, 514L, 514L, 514L, 514L, 514L, 514L, 514L, 
    514L, 514L, 514L, 514L, 514L, 514L, 514L, 514L, 514L, 514L, 
    514L, 514L, 514L, 514L, 514L, 514L, 514L, 514L, 514L, 514L, 
    514L, 514L, 514L, 514L, 514L, 514L, 514L, 514L, 514L, 514L
    ), group = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("female", 
    "male"), class = "factor"), add2 = c(2018L, 2018L, 2018L, 
    2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 
    2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 
    2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 
    2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 
    2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 
    2018L, 2018L, 2018L, 2018L)), .Names = c("add", "x1", "add1", 
"group", "add2"), class = "data.frame", row.names = c(NA, -52L
))

So when I run this code:

library(tidyverse)
library( data.table)
data %>%  
  group_by(add,add1,add2) %>%                                          
  mutate(group2 = rleid(group)) %>% 
  group_by(add,add1,add2, group, group2) %>%
  mutate(MEAN = mean(x1[group=="male" & group2==1], na.rm = T),      ## extra code here ##    
         Q25 = quantile(x1[group=="male" & group2==1], 0.25, na.rm = T)) %>%  ## extra code here ##
  group_by(add,add1,add2) %>%                                           
  mutate(x1 = ifelse(group=="male" & group2==3 & x1 > unique(Q25[!is.na(Q25)]), unique(MEAN[!is.na(MEAN)]), x1))%>%
  ungroup() %>%
  select(-group2) %>%
  data.frame()

I get this error:

Error in mutate_impl(.data, dots) : 
  Column `x1` must be length 24 (the group size) or one, not 0

PS. I have provided only one example to show the data structure, because there are 1000 groups. I can't find the group that causes the error.

How can I fix this error?

Recommended answer

If I understand correctly, the error is caused by a first male section (group2 == 1L) in which all x1 values are NA.
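To see why an all-NA section can end in a zero-length result, here is a minimal base-R sketch. The vectors are made up for illustration and are not taken from the OP's data; the sketch only mimics the `unique(...[!is.na(...)])` pattern from the question's code.

```r
# Hypothetical all-NA section, standing in for the first male section
x1_first_male <- c(NA_real_, NA_real_)

# With na.rm = TRUE nothing is left to summarise
m <- mean(x1_first_male, na.rm = TRUE)            # NaN
q <- quantile(x1_first_male, 0.25, na.rm = TRUE)  # NA

# Dropping the missing statistics leaves zero-length vectors
q_valid <- unique(q[!is.na(q)])
m_valid <- unique(m[!is.na(m)])

# Values of a later male section (made up for illustration)
x1_later <- c(7, 1)

# Comparing against a zero-length vector yields a zero-length condition,
# so ifelse() returns a zero-length result as well
res <- ifelse(x1_later > q_valid, m_valid, x1_later)
length(res)
```

A zero-length replacement is what mutate() rejects when it expects a vector of the group size or of length one.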

IMHO, a cleaner approach is to compute the statistics for all groups first, as suggested here, and then to use a non-equi join to update the affected rows in the second male section, as suggested here.

library( data.table)
grp_stats <- setDT(data)[, group2 :=rleid(group), by = .(add, add1, add2)][
  group2 == 1L & group == "male", 
  .(group2 = 3L, mean = mean(x1, na.rm = TRUE), q25 = quantile(x1, 0.25, na.rm = TRUE)), 
  by = .(add, add1, add2)] 
grp_stats 

   add add1 add2 group2 mean  q25
1:   x  514 2018      3  1.5 1.25
2:   y  515 2018      3  NaN   NA
3:   z  516 2018      3  2.0 2.00

The groups that produce invalid statistics (NaN or NA) can be clearly identified. It's up to the OP whether to remove the affected groups from the dataset.
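With 1000 groups, scanning the printed table by eye is impractical. One possible way to list the affected groups is to filter the statistics on a missing q25. This is only a sketch: the table below is copied by hand from the grp_stats output printed above, and the name bad_groups is made up for illustration.

```r
library(data.table)

# Hand-copied from the printed grp_stats output (illustrative stand-in)
grp_stats <- data.table(
  add    = c("x", "y", "z"),
  add1   = c(514L, 515L, 516L),
  add2   = 2018L,
  group2 = 3L,
  mean   = c(1.5, NaN, 2.0),
  q25    = c(1.25, NA, 2.0)
)

# Groups whose first male section had no usable x1 values
bad_groups <- grp_stats[is.na(q25), .(add, add1, add2)]
bad_groups
```

The resulting key columns could then be used to inspect or drop those groups, if that is what the analysis calls for.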

However, for the subsequent join we can leave them in, as they will not have any effect.

The column group2 with the constant value 3 has already been added to the group statistics to simplify the subsequent update in a non-equi join:

data[, x1 := as.numeric(x1)][
  grp_stats, on = .(group2, add, add1, add2, x1 > q25), x1 := mean][]
data

    add  x1 add1  group add2 group2
 1:   x 1.0  514   male 2018      1
 2:   x 2.0  514   male 2018      1
 3:   x  NA  514 female 2018      2
 4:   x  NA  514 female 2018      2
 5:   x 1.5  514   male 2018      3
 6:   x 1.0  514   male 2018      3
 7:   y  NA  515   male 2018      1
 8:   y  NA  515   male 2018      1
 9:   y  NA  515 female 2018      2
10:   y  NA  515 female 2018      2
11:   y 7.0  515   male 2018      3
12:   y 1.0  515   male 2018      3
13:   z 2.0  516   male 2018      1
14:   z  NA  516   male 2018      1
15:   z  NA  516 female 2018      2
16:   z  NA  516 female 2018      2
17:   z 2.0  516   male 2018      3
18:   z 1.0  516   male 2018      3

Note that rows 5 and 17 have been updated, while the rows of the second group (y), which produced the invalid statistics, haven't been touched.

x1 is coerced to type numeric before joining to match the type of the result returned by mean().
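The type mismatch can be checked in base R. The vector below is a made-up stand-in for an integer x1 column; the sketch only shows that mean() returns a double even for integer input.

```r
# Illustrative integer vector standing in for the x1 column
x1 <- c(1L, 2L, NA, 7L)

type_before <- typeof(x1)                      # integer storage
type_mean   <- typeof(mean(x1, na.rm = TRUE))  # mean() returns a double
type_after  <- typeof(as.numeric(x1))          # double after coercion

c(before = type_before, mean = type_mean, after = type_after)
```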

Here is a sample dataset consisting of three groups. In the second group, all x1 values of the first male section are NA.

data <- data.table::fread("
add x1 add1  group add2
x    1  514   male 2018
x    2  514   male 2018
x   NA  514 female 2018
x   NA  514 female 2018
x    7  514   male 2018
x    1  514   male 2018
y   NA  515   male 2018
y   NA  515   male 2018
y   NA  515 female 2018
y   NA  515 female 2018
y    7  515   male 2018
y    1  515   male 2018
z    2  516   male 2018
z   NA  516   male 2018
z   NA  516 female 2018
z   NA  516 female 2018
z    7  516   male 2018
z    1  516   male 2018
")

Verify error message is caused by an all-NA first male group

When the above sample dataset is piped into the OP's code, we can reproduce the error message:

library(dplyr)
data %>% 
  group_by(add,add1,add2) %>%                                          
  mutate(group2 = rleid(group)) %>% 
  group_by(add,add1,add2, group, group2) %>%
  mutate(MEAN = mean(x1[group=="male" & group2==1], na.rm = T),      ## extra code here ##    
         Q25 = quantile(x1[group=="male" & group2==1], 0.25, na.rm = T)) %>%  ## extra code here ##
  group_by(add,add1,add2) %>%                                           
  mutate(x1 = ifelse(group=="male" & group2==3 & x1 > unique(Q25[!is.na(Q25)]), unique(MEAN[!is.na(MEAN)]), x1))%>%
  ungroup() %>%
  select(-group2) %>%
  data.frame()

Error in mutate_impl(.data, dots) :
Column x1 must be length 6 (the group size) or one, not 0
