Difference between R.scale() and sklearn.preprocessing.scale()


Problem description


I am currently moving my data analysis from R to Python. When scaling a dataset in R, I would use R.scale(), which to my understanding does the following: (x - mean(x)) / sd(x)

To replace that function I tried to use sklearn.preprocessing.scale(). From my understanding of the description, it does the same thing. Nonetheless, I ran a small test file and found that the two methods have different return values. Apparently the standard deviations are not the same... Is someone able to explain why the standard deviations "deviate" from one another?

MWE:

# import packages
from sklearn import preprocessing
import numpy
import rpy2.robjects.numpy2ri
rpy2.robjects.numpy2ri.activate()
# Set up R namespaces
R = rpy2.robjects.r


np1 = numpy.array([[1.0, 2.0], [3.0, 1.0]])
print("Numpy array:")
print(np1)

print("Scaled numpy array through R.scale()")
print(R.scale(np1))
print("-------")
print("Scaled numpy array through preprocessing.scale()")
print(preprocessing.scale(np1, axis=0, with_mean=True, with_std=True))
scaler = preprocessing.StandardScaler()
scaler.fit(np1)
print("Mean of preprocessing.scale():")
print(scaler.mean_)
print("Std of preprocessing.scale():")
print(scaler.std_)  # newer scikit-learn releases name this scaler.scale_

Output:

Solution

It seems to have to do with how standard deviation is calculated.

>>> import numpy as np
>>> a = np.array([[1, 2],[3, 1]])
>>> np.std(a, axis=0)
array([ 1. ,  0.5])
>>> np.std(a, axis=0, ddof=1)
array([ 1.41421356,  0.70710678])

From the numpy.std documentation:

ddof : int, optional

Means Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements. By default ddof is zero.
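To make that divisor concrete, here is a minimal sketch computing both variants by hand for the first column of the example array (the values 1 and 3):

```python
import numpy as np

x = np.array([1.0, 3.0])  # first column of the example array
n = x.size                # N = 2

# ddof=0: divide by N (population standard deviation, NumPy's default)
pop_std = np.sqrt(((x - x.mean()) ** 2).sum() / n)
# ddof=1: divide by N - 1 (sample standard deviation, what R's sd() uses)
sample_std = np.sqrt(((x - x.mean()) ** 2).sum() / (n - 1))

print(pop_std)     # 1.0
print(sample_std)  # ~1.41421356
```

These match np.std(x) and np.std(x, ddof=1), and explain the two arrays shown above.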

Apparently, R.scale() uses ddof=1, but sklearn.preprocessing.StandardScaler() uses ddof=0.
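Given that, R.scale()'s output can be reproduced in pure NumPy by centering and then dividing by the sample standard deviation (ddof=1); a short sketch using the array from the question:

```python
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 1.0]])

# R-style scaling: center, then divide by the sample std (divisor N - 1)
r_style = (a - a.mean(axis=0)) / a.std(axis=0, ddof=1)
# sklearn-style scaling: same centering, population std (divisor N)
sk_style = (a - a.mean(axis=0)) / a.std(axis=0, ddof=0)

print(r_style)   # matches R.scale(np1)
print(sk_style)  # matches preprocessing.scale(np1)
```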

EDIT: (To explain how to use alternate ddof)

There doesn't seem to be a straightforward way to calculate the std with an alternate ddof without accessing the variables of the StandardScaler() object itself.

sc = StandardScaler()
sc.fit(data)
# Now, sc.mean_ and sc.std_ are the mean and standard deviation of the data
# Replace the sc.std_ value with the std calculated using numpy
sc.std_ = numpy.std(data, axis=0, ddof=1)
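Note that current scikit-learn releases store the fitted divisor in scale_ rather than std_ (the exact attribute name depends on your installed version, so treat that as an assumption); the same patching trick would then look like:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[1.0, 2.0], [3.0, 1.0]])

sc = StandardScaler()
sc.fit(data)
# transform() computes (X - sc.mean_) / sc.scale_, so overwriting
# sc.scale_ with the ddof=1 standard deviation mimics R.scale()
sc.scale_ = np.std(data, axis=0, ddof=1)

print(sc.transform(data))
```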
