R and Python Give Different Results (Median, IQR, Mean, and STD)


Question

I am doing feature scaling on my data, and R and Python are giving me different answers for the scaling. They disagree on several summary statistics:

Median: NumPy gives 14.948499999999999 with this code: np.percentile(X[:, 0], 50, interpolation = 'midpoint'). The statistics module in Python's standard library gives the same answer with statistics.median(X[:, 0]). R, on the other hand, gives 14.9632 with median(X[, 1]). Interestingly, the summary() function in R gives 14.960 as the median.

A similar difference occurs when computing the mean of the same data. R gives 13.10936 using the built-in mean() function, while both NumPy and the Python statistics module give 13.097945407088607.

The same thing happens when computing the standard deviation. R gives 7.390328, and NumPy (with ddof = 1) gives 7.3927612774052083. With ddof = 0, NumPy gives 7.3927565984408936.

The IQR also differs. The built-in IQR() function in R gives 12.3468. NumPy, with np.percentile(X[:, 0], 75) - np.percentile(X[:, 0], 25), gives 12.358700000000002.

What is going on here? Why do Python and R give different results for every statistic? It may help to know that my data has 795066 rows and is held as an np.array() in Python. The same data is held as a matrix in R.

Recommended answer

tl;dr: there are a few potential differences in algorithms even for such simple summary statistics, but given that you're seeing differences across the board, even in relatively simple computations such as the median, I think it is more likely that the values are getting truncated, modified, or losing precision somehow in the transfer between platforms.

(This is more of an extended comment than an answer, but it was getting awkwardly long.)

You're unlikely to get much further without a reproducible example. There are various ways to create examples that test hypotheses about the differences, but it's better to do so yourself rather than making answerers do it.

How are you transferring data to/from Python/R? Is there some rounding in the representation used for the transfer? (What do you get for max/min, each of which should be a single stored value with no floating-point computation involved? What if you drop one value to get an odd-length vector and then take the median?)
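The checks suggested above can be sketched roughly as follows. This is only an illustration: transfer_fingerprint is a hypothetical helper name, the four numbers are made up, and the two-decimal rounding merely stands in for whatever lossy export step might be at fault.

```python
import statistics

def transfer_fingerprint(values):
    """Summaries that pick out stored values without doing arithmetic on them."""
    ordered = sorted(values)
    # Drop the largest value if needed so the count is odd; the median is then
    # a single element rather than an average of two.
    odd = ordered[:-1] if len(ordered) % 2 == 0 else ordered
    return {
        "n": len(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "odd_length_median": statistics.median(odd),
    }

# Made-up data; the rounding below simulates a lossy export (an assumption).
original = [14.9632, 3.14159, 27.5, 8.25]
exported = [round(v, 2) for v in original]

before = transfer_fingerprint(original)
after = transfer_fingerprint(exported)
changed = {key for key in before if before[key] != after[key]}
print(changed)
```

If the same fingerprint computed on both platforms differs in any field, the data themselves differ, and no choice of algorithm will reconcile the statistics.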

Medians: I was originally going to say that this could come from different ways of defining quantile interpolation for an even-length vector, but the definition of the median is somewhat simpler than that of general quantiles, so I'm not sure. The differences you're reporting seem far too big to be driven by floating-point computation in this case (since the computation is just an average of two values of similar magnitude).

IQRs: similarly, there are different possible definitions of percentiles/quantiles: see ?quantile in R.
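To make the point about quantile definitions concrete, here is a small sketch of two of them written from scratch. The quantile function and its method names are my own; "linear" is meant to mirror the default in both R (type 7) and NumPy, and "midpoint" the NumPy option used in the question. The data are made up.

```python
import math
import statistics

def quantile(values, p, method="linear"):
    """Sample quantile at probability p, using the 0-based position h = (n - 1) * p."""
    ordered = sorted(values)
    h = (len(ordered) - 1) * p
    lo, hi = math.floor(h), math.ceil(h)
    if method == "linear":
        # Interpolate between the two neighbouring order statistics.
        return ordered[lo] + (h - lo) * (ordered[hi] - ordered[lo])
    if method == "midpoint":
        # Average the two neighbouring order statistics, ignoring how close h is to either.
        return (ordered[lo] + ordered[hi]) / 2
    raise ValueError(f"unknown method: {method}")

data = [1, 2, 4, 8, 16, 32]

iqr_linear = quantile(data, 0.75) - quantile(data, 0.25)
iqr_midpoint = quantile(data, 0.75, "midpoint") - quantile(data, 0.25, "midpoint")
median_linear = quantile(data, 0.5)
median_midpoint = quantile(data, 0.5, "midpoint")

print(iqr_linear, iqr_midpoint)          # the two definitions disagree on the IQR
print(median_linear, median_midpoint)    # but agree on the median
```

On this toy vector the two definitions give different IQRs but the same median, which also equals statistics.median(data). That is consistent with the remark above that the choice of definition can explain an IQR gap more readily than a median gap.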

median() vs summary(): R's summary() reports values at reduced precision (often useful for a quick overview); this is a common source of confusion.
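As a rough illustration of that reduced precision, the sketch below rounds to a fixed number of significant digits. The helper name to_significant is made up, and the default of four digits is my understanding of what summary() prints with R's default options, not something stated in the question.

```python
def to_significant(x, digits=4):
    """Round x to the given number of significant digits."""
    return float(f"{x:.{digits}g}")

# The median and mean that R reported in the question.
print(to_significant(14.9632))
print(to_significant(13.10936))
```

Rounding 14.9632 this way gives 14.96, which matches the 14.960 from summary() up to the trailing zero, so that particular discrepancy is a display effect rather than a different median.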

mean/sd: there are some possible subtleties in the algorithm here. For example, R uses extended precision internally to reduce instability; I don't know whether Python does. However, this shouldn't make as big a difference as you're seeing unless the data are a bit weird:

 > x <- rnorm(1000000,mean=0,sd=1)
 > mean(x)
 [1] 0.001386724
 > sum(x)/length(x)
 [1] 0.001386724
 > mean(x)-sum(x)/length(x)
 [1] -1.734723e-18
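A comparable check can be run on the Python side. The sketch below uses only the standard library and simulated data (seed and sample size are arbitrary), comparing a plain running sum with math.fsum, which tracks the rounding error of the partial sums.

```python
import math
import random

random.seed(0)  # arbitrary seed, only so the run is repeatable
x = [random.gauss(0.0, 1.0) for _ in range(100_000)]

naive_mean = sum(x) / len(x)        # plain accumulation
exact_mean = math.fsum(x) / len(x)  # error-tracking accumulation

difference = abs(naive_mean - exact_mean)
print(difference)
```

For well-behaved data like these, the two means differ by something many orders of magnitude smaller than the 0.01 gap reported in the question, if they differ at all.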

Similarly, there are more- and less-stable ways to compute a variance/standard deviation.
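Two such ways can be sketched side by side. Neither function is claimed to be what R or NumPy actually runs; the function names and the shifted toy data are mine, chosen so that the unstable formula visibly breaks.

```python
import math

def variance_one_pass(values):
    """Shortcut formula (sum(x^2) - (sum x)^2 / n) / (n - 1); prone to cancellation."""
    n = len(values)
    total = sum(values)
    total_sq = sum(v * v for v in values)
    return (total_sq - total * total / n) / (n - 1)

def variance_two_pass(values):
    """Compute the mean first, then sum squared deviations from it."""
    n = len(values)
    mean = math.fsum(values) / n
    return math.fsum((v - mean) ** 2 for v in values) / (n - 1)

small = [4.0, 7.0, 13.0, 16.0]         # sample variance is 30
shifted = [1e9 + v for v in small]     # same spread, huge offset

print(variance_one_pass(small), variance_two_pass(small))
print(variance_one_pass(shifted), variance_two_pass(shifted))
```

Both versions agree on the small values, but once the offset is added the one-pass formula subtracts two numbers near 4e18 and loses the answer, while the two-pass version still returns 30.0. Data with a large mean relative to their spread are the "a bit weird" case where the choice of algorithm starts to matter; the values in the question (mean about 13, standard deviation about 7.4) are not of that kind.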
