R: outlier cleaning for each column in a dataframe by using quantiles 0.05 and 0.95
Problem description
I am an R novice. I want to do some outlier cleaning and overall scaling from 0 to 1 before feeding the sample into a random forest.
g<-c(1000,60,50,60,50,40,50,60,70,60,40,70,50,60,50,70,10)
If I do a simple scaling from 0 to 1, the result would be:
> round((g - min(g))/abs(max(g) - min(g)),1)
[1] 1.0 0.1 0.0 0.1 0.0 0.0 0.0 0.1 0.1 0.1 0.0 0.1 0.0 0.1 0.0 0.1 0.0
So my idea is, for each column, to replace the values that are greater than the 0.95 quantile with the next value smaller than the 0.95 quantile, and likewise to replace the values that are smaller than the 0.05 quantile with the next value larger than the 0.05 quantile.
So the pre-scaled result would be:
g<-c(70,60,50,60,50,40,50,60,70,60,40,70,50,60,50,70,40)  # first value (1000 -> 70) and last value (10 -> 40) replaced
and scaled:
> round((g - min(g))/abs(max(g) - min(g)),1)
[1] 1.0 0.7 0.3 0.7 0.3 0.0 0.3 0.7 1.0 0.7 0.0 1.0 0.3 0.7 0.3 1.0 0.0
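For reference, the replacement step described above can be sketched for the single vector g as follows. This is only one possible reading of the idea: it assumes R's default quantile definition (type 7) and takes "next value" to mean the nearest observed value that lies inside the two cut-offs. The names lo, hi, inside and g_clean are made up for this illustration.

```r
g <- c(1000, 60, 50, 60, 50, 40, 50, 60, 70, 60, 40, 70, 50, 60, 50, 70, 10)

# Cut-offs from R's default quantile definition (type 7); an assumption,
# other quantile types can give different cut-offs on small samples.
lo <- quantile(g, 0.05, names = FALSE)
hi <- quantile(g, 0.95, names = FALSE)

# Observed values that lie inside the cut-offs.
inside <- g[g >= lo & g <= hi]

# Replace everything outside the cut-offs by the nearest inside value.
g_clean <- g
g_clean[g < lo] <- min(inside)
g_clean[g > hi] <- max(inside)

# Scale to the range 0 to 1, as in the formula used above.
g_scaled <- (g_clean - min(g_clean)) / (max(g_clean) - min(g_clean))
round(g_scaled, 1)
```

With this data only 1000 and 10 fall outside the cut-offs, so g_clean should match the pre-scaled vector shown above.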
I need this formula for a whole dataframe, so the functional implementation within R should be something like:
> apply(c, 2, function(x) x[x<quantile(x, 0.95)]<-max(x[x, ... max without the quantile(x, 0.95))
Can anyone help?
As an aside: if a function exists that does this job directly, please let me know. I already checked out cut and cut2. cut fails because of non-unique breaks; cut2 would work, but only gives back string values or the mean value, and I need a numeric vector from 0 to 1.
For testing:
a<-c(100,6,5,6,5,4,5,6,7,6,4,7,5,6,5,7,1)
b<-c(1000,60,50,60,50,40,50,60,70,60,40,70,50,60,50,70,10)
c<-cbind(a,b)
c<-as.data.frame(c)
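A minimal sketch of how the per-column version might look on this test data, under the same assumptions (default type 7 quantiles, replacement by the nearest observed value inside the cut-offs). The helper names clamp_to_quantiles and rescale01 are hypothetical, not functions from any package, and the data frame is called df here rather than c so that the base function c() is not masked.

```r
# Hypothetical helper: replace values outside the [p_lo, p_hi] quantiles
# with the nearest observed value inside that range.
# Assumes at least one observation lies between the two cut-offs.
clamp_to_quantiles <- function(x, p_lo = 0.05, p_hi = 0.95) {
  cuts <- quantile(x, c(p_lo, p_hi), names = FALSE, na.rm = TRUE)
  inside <- x[!is.na(x) & x >= cuts[1] & x <= cuts[2]]
  x[!is.na(x) & x < cuts[1]] <- min(inside)
  x[!is.na(x) & x > cuts[2]] <- max(inside)
  x
}

# Hypothetical helper: linear scaling to the range 0 to 1.
# A constant column would divide by zero and return NaN.
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

a <- c(100, 6, 5, 6, 5, 4, 5, 6, 7, 6, 4, 7, 5, 6, 5, 7, 1)
b <- c(1000, 60, 50, 60, 50, 40, 50, 60, 70, 60, 40, 70, 50, 60, 50, 70, 10)
df <- data.frame(a, b)

# lapply() over the columns keeps df a data frame; apply() would first
# coerce it to a matrix.
df[] <- lapply(df, function(col) rescale01(clamp_to_quantiles(col)))
round(df, 1)
```

Assigning into df[] keeps the column names and the data frame class, which is why lapply() is used here instead of the apply() call attempted above.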
Regards and thanks for help,
Rainer
Please don't do this. This is not a good strategy for dealing with outliers - particularly since it's unlikely that 10% of your data are outliers!