How to replace outliers with NA using a specified range of values in R?


Question

I have climate data and I'm trying to replace outliers with NA. I'm not using boxplot(x)$out because I have a fixed range of valid values for each variable, and anything outside that range should count as an outlier.

temp_range <- c(-15, 45)
wind_range <- c(0, 15)
humidity_range <- c(0, 100)

My data frame looks like this:

[Image: df with outliers]

(I highlighted the values that should be replaced with NA according to the ranges.)

So outliers in temp1 and temp2 must be replaced with NA according to temp_range, outliers in wind according to wind_range, and outliers in humidity according to humidity_range.

This is what I have so far:

df <- read.csv2("http://pastebin.com/raw/vwqBu2M5", stringsAsFactors = FALSE)

df[,2:5] = apply(df[,2:5], 2, function(x) as.numeric(x))

#Ranges
temp_range <- c(-15, 45)
wind_range <- c(0, 15)
humidity_range <- c(0, 100)

#Function to detect outlier
in_interval <- function(x, interval){
  stopifnot(length(interval) == 2L)
  interval[1] <= x & x <= interval[2]
}


#Replace outliers according to temp_range
cols <- c('temp1', 'temp2')
df[, cols] <- lapply(df[, cols], function(x) {

  x[in_interval(x, temp_range)==FALSE] <- NA
  x
})

I'm repeating the last part of the code (the replacement) for every range. Is there a way to simplify it so I can avoid all that repetition?

One last thing: with a single column, say cols <- c('wind'), the same code throws a warning and replaces the whole wind column with a constant.

Warning message:
In `[<-.data.frame`(`*tmp*`, , cols, value = list(23.88, 23.93,  :
  provided 10 variables to replace 1 variables

Any suggestions?

Recommended answer

To do it more dynamically, use a dictionary: a data frame that associates each variable with its lower and upper outlier bounds.

Here I create it in R, but it would be more practical to keep it in a CSV file so you can edit it easily.
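For the CSV variant mentioned above, a minimal sketch might look like the following. The file name and contents are placeholders (the example writes a temporary file only so that it runs on its own); the column names match the dictionary used in this answer.

```r
# Sketch only: the file path and its contents are placeholders.
dict_path <- tempfile(fileext = ".csv")
writeLines(c(
  "variable,out_low,out_high",
  "temp1,-15,45",
  "temp2,-15,45",
  "wind,0,15",
  "humidity,0,100"
), dict_path)

# Read the bounds table; keep `variable` as character rather than factor
df_dict <- read.csv(dict_path, stringsAsFactors = FALSE)
str(df_dict)
```

The replacement loop works the same way whether df_dict is read from a file like this or built directly in R, as in the code that follows.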

df <- read.csv2("http://pastebin.com/raw/vwqBu2M5", stringsAsFactors = FALSE)

df[,2:5] = apply(df[,2:5], 2, function(x) as.numeric(x))


df_dict <- data.frame(variable = c("temp1", "temp2", "wind", "humidity"), 
                       out_low = c(-15, -15, 0, 0), 
                       out_high =c(45, 45, 15, 100))

for (var in df_dict$variable) {
  # Look up the bounds for this variable once
  bounds <- df_dict[df_dict$variable == var, ]
  # Values outside [out_low, out_high] become NA
  df[[var]][df[[var]] < bounds$out_low | df[[var]] > bounds$out_high] <- NA
}
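On the single-column warning from the question: df[, cols] with one column name drops to a plain vector, so lapply() iterates over the individual values and returns a list with one element per row, which is then assigned to a single column. Indexing with df[cols] (or df[, cols, drop = FALSE]) keeps a data frame. The sketch below reuses the question's in_interval() helper with a named list of ranges; the names `ranges` and `replace_out_of_range` and the small sample data frame are illustrative assumptions, not part of the original post.

```r
# Sketch only: `ranges` and `replace_out_of_range` are illustrative names,
# and this small data frame stands in for the pastebin data.
df <- data.frame(
  temp1    = c(23.9, 99.0, -20.0),
  temp2    = c(24.1, 18.5, 46.0),
  wind     = c(3.2, 16.0, -1.0),
  humidity = c(55, 101, 80)
)

# Valid interval for each column, keyed by column name
ranges <- list(
  temp1    = c(-15, 45),
  temp2    = c(-15, 45),
  wind     = c(0, 15),
  humidity = c(0, 100)
)

# TRUE where x lies inside the closed interval (same helper as the question)
in_interval <- function(x, interval) {
  stopifnot(length(interval) == 2L)
  interval[1] <= x & x <= interval[2]
}

# Replace out-of-range values with NA in the given columns.
# df[cols] (list-style indexing) always returns a data frame, even for a
# single column, so lapply() iterates over columns rather than over values.
replace_out_of_range <- function(df, cols, interval) {
  df[cols] <- lapply(df[cols], function(x) {
    x[!in_interval(x, interval)] <- NA
    x
  })
  df
}

# One pass per column, each with its own range
for (col in names(ranges)) {
  df <- replace_out_of_range(df, col, ranges[[col]])
}
```

Because the helper takes a vector of column names, several columns sharing a range can also be handled in one call, for example replace_out_of_range(df, c("temp1", "temp2"), c(-15, 45)).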
