How to remove outliers from a dataset

Question

I've got some multivariate data on beauty vs. age. The ages range from 20 to 40 in steps of 2 (20, 22, 24, ..., 40), and each record has an age and a beauty rating from 1 to 5. When I draw boxplots of this data (age on the X-axis, beauty rating on the Y-axis), some outliers are plotted outside the whiskers of each box.

I want to remove these outliers from the data frame itself, but I'm not sure how R calculates outliers for its box plots. Below is an example of what my data might look like.

Recommended answer

OK, you should apply something like this to your dataset. Don't overwrite the original and save it, or you'll destroy your data! And, btw, you should (almost) never remove outliers from your data:

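remove_outliers <- function(x, na.rm = TRUE, ...) {
  # First and third quartiles; extra arguments are passed on to quantile()
  qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
  # Whisker length: 1.5 times the interquartile range
  H <- 1.5 * IQR(x, na.rm = na.rm)
  # Work on a copy so the input stays untouched
  y <- x
  # Mark values beyond the fences as NA instead of dropping them
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}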

To see it in action:

set.seed(1)
x <- rnorm(100)
x <- c(-10, x, 10)
y <- remove_outliers(x)
## png()
par(mfrow = c(1, 2))
boxplot(x)
boxplot(y)
## dev.off()
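
On the question of how R decides which points are outliers: boxplot() gets them from boxplot.stats(), which by default flags values lying more than 1.5 times the box length beyond the hinges. The hinges come from fivenum() and can differ slightly from the quantile() quartiles used above, so the two approaches may disagree on borderline points. A minimal sketch on the same kind of simulated x as the demo (the name flagged is only for illustration):

```r
set.seed(1)
x <- c(-10, rnorm(100), 10)

# boxplot.stats() is the helper behind boxplot(); its `out` component
# holds the values drawn beyond the whiskers (default coef = 1.5).
flagged <- boxplot.stats(x)$out

# Blank out the flagged values in a copy, leaving x itself untouched.
y <- x
y[x %in% flagged] <- NA

print(flagged)
```

Using boxplot.stats() directly should keep the removed points in line with what the plot actually shows.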

And once again, you should never do this on your own, outliers are just meant to be! =)

I added na.rm = TRUE as default.

Removed quantile function, added subscripting, hence made the function faster! =)
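
Since the question has one box per age, the function probably needs to be applied within each age group rather than to the whole column at once. One possible way is ave(), sketched below. The data frame is made up for illustration, and the column names age, beauty and beauty_clean are assumptions, not taken from the original data:

```r
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}

# Made-up example data; `age` and `beauty` are assumed column names.
df <- data.frame(
  age    = rep(c(20, 22), each = 8),
  beauty = c(4, 4, 5, 4, 5, 4, 5, 1,   # the 1 sits far below the age-20 group
             2, 3, 3, 2, 3, 2, 3, 3)
)

# ave() applies FUN within each age group and returns a vector
# aligned with the original rows.
df$beauty_clean <- ave(df$beauty, df$age, FUN = remove_outliers)

# Keep the original df intact; build a filtered copy without the flagged rows.
clean <- df[!is.na(df$beauty_clean), ]
```

Writing the result to a new column and a new data frame follows the advice above not to replace the original data.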
