How do I get the median of multiple columns in R with conditions (according to another column)

Question

I'm a beginner in R and I would like to know how to do the following task:

I want to replace the missing values in every column of my dataset with that column's median. However, the median should be computed within a category that is given by another column. My dataset is as follows:

structure(list(Country = structure(1:5, .Label = c("Afghanistan", 
"Albania", "Algeria", "Andorra", "Angola"), class = "factor"), 
    CountryID = 1:5, Continent = c(1L, 2L, 3L, 2L, 3L), Adolescent.fertility.rate.... = c(151L, 
    27L, 6L, NA, 146L), Adult.literacy.rate.... = c(28, 98.7, 
    69.9, NA, 67.4)), class = "data.frame", row.names = c(NA, 
-5L))

So for each of the columns, I want to replace the missing values with the median of the values for that row's continent.

Recommended answer

We can use dplyr::mutate_at to replace the NAs in each column (except Continent and the non-numeric column Country) with the median for its Continent group:

df <- structure(list(Country = structure(1:5, .Label = c("Afghanistan",  "Albania", "Algeria", "Andorra", "Angola"), class = "factor"), 
               CountryID = 1:5, Continent = c(1L, 2L, 3L, 2L, 3L),
               Adolescent.fertility.rate.... = c(151L, 27L, 6L, NA, 146L),
               Adult.literacy.rate.... = c(28, 98.7, 69.9, NA, 67.4)), class = "data.frame", row.names = c(NA, -5L))

library(dplyr)
df %>%
  group_by(Continent) %>% 
  mutate_at(vars(-group_cols(), -Country), ~ifelse(is.na(.), median(., na.rm = TRUE), .)) %>% 
  ungroup()

Returns:

  # A tibble: 5 x 5
    Country     CountryID Continent Adolescent.fertility.rate.... Adult.literacy.rate....
    <fct>           <int>     <int>                         <int>                   <dbl>
  1 Afghanistan         1         1                           151                    28  
  2 Albania             2         2                            27                    98.7
  3 Algeria             3         3                             6                    69.9
  4 Andorra             4         2                            27                    98.7
  5 Angola              5         3                           146                    67.4
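
To see where the values filled in for Andorra come from, the per-continent medians can be computed on their own. This is only a sketch: the data is rebuilt with data.frame() for brevity, the result name `medians` is made up for illustration, and across() assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Same data as in the question, rebuilt with data.frame() for brevity
df <- data.frame(
  Country = factor(c("Afghanistan", "Albania", "Algeria", "Andorra", "Angola")),
  CountryID = 1:5,
  Continent = c(1L, 2L, 3L, 2L, 3L),
  Adolescent.fertility.rate.... = c(151L, 27L, 6L, NA, 146L),
  Adult.literacy.rate.... = c(28, 98.7, 69.9, NA, 67.4)
)

# One row per continent, holding the median of each indicator column.
# as.numeric() keeps the result type the same for every group.
medians <- df %>%
  group_by(Continent) %>%
  summarise(across(
    c(Adolescent.fertility.rate...., Adult.literacy.rate....),
    ~ median(as.numeric(.x), na.rm = TRUE)
  ))

print(medians)
```

Continent 2 contains Albania and Andorra, and only Albania has observed values, so its medians are simply Albania's values. Those are the numbers that end up in Andorra's row.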

Explanation: First we group the data.frame df by Continent. Then we mutate all columns except the grouping column (and Country, which is not numeric) in the following way: where is.na is TRUE, we replace the value with the median, and because the data is grouped, that is the median for the Continent group; where the value is not NA, it is kept as it is. Finally, we call ungroup for good measure to get back a 'normal' tibble.
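
The same group-then-fill idea can also be written with across(), which superseded the mutate_at family in dplyr 1.0.0. The version below is a sketch rather than a drop-in replacement: `fill_with_median` and `filled` are hypothetical names introduced here, the data is rebuilt with data.frame(), and the helper converts each column to double so that every group returns the same type. Unlike the answer above, it also leaves CountryID out of the selection.

```r
library(dplyr)

# Same data as in the question, rebuilt with data.frame() for brevity
df <- data.frame(
  Country = factor(c("Afghanistan", "Albania", "Algeria", "Andorra", "Angola")),
  CountryID = 1:5,
  Continent = c(1L, 2L, 3L, 2L, 3L),
  Adolescent.fertility.rate.... = c(151L, 27L, 6L, NA, 146L),
  Adult.literacy.rate.... = c(28, 98.7, 69.9, NA, 67.4)
)

# Hypothetical helper: replace NAs in x with the median of the observed values.
# Called inside a grouped mutate(), x is the column restricted to one group.
fill_with_median <- function(x) {
  x <- as.numeric(x)                    # same type in every group
  coalesce(x, median(x, na.rm = TRUE))  # keep x where present, else the median
}

filled <- df %>%
  group_by(Continent) %>%
  # Grouping columns are skipped by across(); CountryID is excluded explicitly
  mutate(across(where(is.numeric) & !CountryID, fill_with_median)) %>%
  ungroup()

print(filled)
```

If a continent has no observed values at all for a column, the median is itself NA, so the missing values in that group stay missing.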
