Removing dataframe outliers in R with `boxplot.stats`

Question

I'm relatively new at R, so please bear with me.

I'm using the Ames dataset (full description of dataset here; link to dataset download here).

I'm trying to create a subset data frame that will allow me to run a linear regression analysis, and I'm trying to remove the outliers using the boxplot.stats function. I created a frame that will include my samples using the following code:

regressionFrame <- data.frame(subset(ames_housing_data[,c('SalePrice','GrLivArea','LotArea')] , BldgType == '1Fam'))

My next objective was to remove the outliers, so I tried to subset using a which() function:

regressionFrame <- regressionFrame[which(regressionFrame$GrLivArea != boxplot.stats(regressionFrame$GrLivArea)$out),]

Unfortunately, that produced the

longer object length is not a multiple of shorter object length

error. Does anyone know a better way to approach this, ideally using the which() subsetting function? I'm assuming it would include some form of lapply(), but for the life of me I can't figure out how. (I figure I can always learn fancier methods later, but this is the one I'm going for right now since I already understand it.)

Recommended answer

Nice use of boxplot.stats.

You cannot SAFELY test with != if boxplot.stats returns more than one outlier in $out. An analogy is 1:5 != 1:3. You probably want !(1:5 %in% 1:3) instead.

regressionFrame <- subset(regressionFrame,
                          subset = !(GrLivArea %in% boxplot.stats(GrLivArea)$out))

What I mean by SAFELY is that 1:5 != 1:3 is the wrong test but at least gives a warning, whereas 1:6 != 1:3 is just as wrong and gives no warning at all. The warning comes from the recycling rule. In the latter case, 1:3 can be recycled to the same length as 1:6 (that is, the length of 1:6 is a multiple of the length of 1:3), so you end up testing 1:6 != c(1:3, 1:3).
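
To make the silent case concrete, here is a small sketch. The vectors a and b are made up for illustration; they are chosen so that the length of a is a multiple of the length of b, which means R recycles b without any warning.

```r
a <- c(5, 1, 6, 2, 7, 3)
b <- c(1, 2, 3)

# `!=` compares position by position; b is silently recycled to
# c(1, 2, 3, 1, 2, 3), so this is NOT a membership test
neq <- a != b

# `%in%` asks, for each element of a, whether it occurs anywhere in b
not_in <- !(a %in% b)

print(neq)        # TRUE TRUE TRUE TRUE TRUE FALSE
print(not_in)     # TRUE FALSE TRUE FALSE TRUE FALSE

print(a[neq])     # 5 1 6 2 7  (1 and 2 wrongly kept)
print(a[not_in])  # 5 6 7
```

Only the %in% version removes every value of b from a; the != version keeps or drops values depending on where they happen to sit.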

A simple example:

x <- c(1:10/10, 101, 102, 103)  ## has three outliers: 101, 102 and 103
out <- boxplot.stats(x)$out  ## `boxplot.stats` has picked them out
x[x != out]  ## this gives a warning and wrong result
x[!(x %in% out)]  ## this removes them from x
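
Since the question asked for a which()-based version, the same idea can be sketched on a data frame. The data frame df and all of its values are invented here for illustration; only the column names are borrowed from the question.

```r
# Toy data: values invented for illustration, not taken from the Ames dataset
df <- data.frame(
  GrLivArea = c(1000, 1100, 1200, 1300, 1400, 1500, 9000),
  SalePrice = c(150, 160, 170, 180, 190, 200, 210)
)

# Values that boxplot.stats flags as outliers in this column
out <- boxplot.stats(df$GrLivArea)$out

# which() turns the logical membership test into the row indices to keep
keep <- which(!(df$GrLivArea %in% out))
cleaned <- df[keep, ]

print(out)
print(cleaned)
```

The which() call is optional in this sketch: indexing with the logical vector directly, df[!(df$GrLivArea %in% out), ], selects the same rows.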
