R: outlier cleaning for each column in a dataframe by using quantiles 0.05 and 0.95

Views: 819  Published: 2017/3/26 1:41:51  Tags: function r scaling dataframe outliers


Question

I am an R novice. Before feeding my sample into a random forest, I want to clean outliers and then rescale every column to the range 0 to 1.

g <- c(1000, 60, 50, 60, 50, 40, 50, 60, 70, 60, 40, 70, 50, 60, 50, 70, 10)

If I simply scale this to 0-1, the single large value squashes everything else towards 0:

> round((g - min(g))/abs(max(g) - min(g)),1)

 [1] 1.0 0.1 0.0 0.1 0.0 0.0 0.0 0.1 0.1 0.1 0.0 0.1 0.0 0.1 0.0 0.1 0.0

So my idea is, for each column, to replace every value greater than the 0.95 quantile with the next value below the 0.95 quantile, and likewise to replace every value smaller than the 0.05 quantile with the next value above the 0.05 quantile.

So the pre-scaled result would be:

g <- c(70, 60, 50, 60, 50, 40, 50, 60, 70, 60, 40, 70, 50, 60, 50, 70, 40)  # first and last values replaced

and scaled:

> round((g - min(g))/abs(max(g) - min(g)),1)

 [1] 1.0 0.7 0.3 0.7 0.3 0.0 0.3 0.7 1.0 0.7 0.0 1.0 0.3 0.7 0.3 1.0 0.0
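The replacement rule described above can be sketched for a single vector roughly as follows. This is only one possible reading of the rule: the helper names `replace_tails` and `rescale01` are made up for illustration, and the sketch assumes that values exactly equal to a quantile are left unchanged.

```r
# Hypothetical helper: values outside the [lower, upper] quantiles are
# replaced by the nearest observed value that lies inside that range.
replace_tails <- function(x, lower = 0.05, upper = 0.95) {
  q <- quantile(x, probs = c(lower, upper), na.rm = TRUE, names = FALSE)
  ok <- !is.na(x)
  inner <- x[ok & x >= q[1] & x <= q[2]]   # values kept as they are
  if (length(inner) == 0) return(x)        # nothing inside the range: leave x alone
  x[ok & x > q[2]] <- max(inner)           # upper tail -> largest inner value
  x[ok & x < q[1]] <- min(inner)           # lower tail -> smallest inner value
  x
}

# Min-max scaling to [0, 1], same formula as in the question
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

g <- c(1000, 60, 50, 60, 50, 40, 50, 60, 70, 60, 40, 70, 50, 60, 50, 70, 10)
g_clean  <- replace_tails(g)
g_scaled <- round(rescale01(g_clean), 1)
```

For the vector `g` from the question, `g_clean` has 70 in place of 1000 and 40 in place of 10, and `g_scaled` matches the scaled output shown above. Note that `rescale01` divides by zero when all values are equal, which this sketch does not guard against.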

I need this formula for a whole dataframe, so the functional implementation within R should be something like:

> apply(c, 2, function(x) x[x < quantile(x, 0.95)] <- max(x[x, ... max without the quantile(x, 0.95))

Can anyone help?

As an aside: if a function already exists that does this job directly, please let me know. I have already looked at cut and cut2. cut fails because the breaks are not unique; cut2 would work, but it only returns string values or the mean value, and I need a numeric vector scaled to 0-1.
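For comparison, a closely related but different rule can be written with base R's `pmin` and `pmax`: it clamps each value to the quantile values themselves rather than to the nearest observed value, so it is not the rule described in the question. The name `clamp_to_quantiles` is hypothetical, and this is only a sketch.

```r
# Hypothetical helper: clamp x to its [lower, upper] quantile values.
# Differs from the question's rule, which uses the nearest observed value.
clamp_to_quantiles <- function(x, lower = 0.05, upper = 0.95) {
  q <- quantile(x, probs = c(lower, upper), names = FALSE)
  pmin(pmax(x, q[1]), q[2])   # raise to at least q[1], cap at q[2]
}

g <- c(1000, 60, 50, 60, 50, 40, 50, 60, 70, 60, 40, 70, 50, 60, 50, 70, 10)
g_clamped <- clamp_to_quantiles(g)
range(g_clamped)              # the new minimum and maximum are the two quantiles
```

With only 17 values, the default `quantile` interpolation puts the 0.95 quantile well above 70, so the clamped maximum still dominates a subsequent 0-1 scaling. That may be one reason to prefer the nearest-observed-value rule on small samples.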

Test data:

a <- c(100, 6, 5, 6, 5, 4, 5, 6, 7, 6, 4, 7, 5, 6, 5, 7, 1)
b <- c(1000, 60, 50, 60, 50, 40, 50, 60, 70, 60, 40, 70, 50, 60, 50, 70, 10)
c <- cbind(a, b)
c <- as.data.frame(c)
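Applied column by column to the test data, the idea from the question might look like the sketch below. The function name `clean_and_scale` and the data frame name `dat` are illustrative (the question calls the data frame `c`, which shadows the base function `c()`), and the sketch assumes numeric columns without missing values.

```r
a <- c(100, 6, 5, 6, 5, 4, 5, 6, 7, 6, 4, 7, 5, 6, 5, 7, 1)
b <- c(1000, 60, 50, 60, 50, 40, 50, 60, 70, 60, 40, 70, 50, 60, 50, 70, 10)
dat <- data.frame(a, b)   # same content as the question's `c`

# Hypothetical helper: replace the tails with the nearest inner value,
# then min-max scale the result to [0, 1].
clean_and_scale <- function(x, lower = 0.05, upper = 0.95) {
  q <- quantile(x, probs = c(lower, upper), names = FALSE)
  inner <- x[x >= q[1] & x <= q[2]]
  x[x > q[2]] <- max(inner)
  x[x < q[1]] <- min(inner)
  (x - min(x)) / (max(x) - min(x))
}

# lapply() visits each column; assigning into dat_scaled[] keeps the
# data frame shape and the column names.
dat_scaled <- dat
dat_scaled[] <- lapply(dat, clean_and_scale)
round(dat_scaled, 1)
```

Using `lapply` with `dat_scaled[] <-` keeps the result a data frame, whereas `apply(dat, 2, clean_and_scale)` would return a matrix. Since `b` is just `a` multiplied by 10, both columns end up identical after scaling.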

Regards, and thanks for your help,

Rainer

Answer

Please don't do this. This is not a good strategy for dealing with outliers - particularly since it's unlikely that 10% of your data are outliers!
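The point about 10% can be illustrated with a small simulation. The sketch below is an illustration added here, not part of the original answer: it draws data with no planted outliers (the seed and sample size are arbitrary) and counts how many values the 0.05/0.95 rule would still overwrite.

```r
set.seed(1)                      # arbitrary seed, for reproducibility only
x <- rnorm(1000)                 # simulated data with no planted outliers

q <- quantile(x, probs = c(0.05, 0.95), names = FALSE)
flagged <- x < q[1] | x > q[2]   # values the quantile rule would overwrite
mean(flagged)                    # share of the data treated as outliers
```

By construction the rule flags roughly 10% of every column, whether or not any of those values are unusual.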
