Correcting "summary" in R with appropriate # of digits of precision


Problem description


A simple question on a simple seemingly innocent function: summary.

Until I saw results for Min and Max that were outside the range of my data, I was unaware that summary has a digits argument to specify precision of the output results. My question is about how to address this in a clean, universal manner.

Here is an example of the issue:

set.seed(0)
vals    <- 1 + 10 * 1:50000
df      <- cbind(rnorm(10000),sample(vals, 10000), runif(10000))

Applying summary and range, we get the following output - notice the discrepancy in the range values versus the Min and Max:

    > apply(df, 2, summary)

                [,1]   [,2]      [,3]
    Min.    -3.703000     11 6.791e-05
    1st Qu. -0.668500 122800 2.498e-01
    Median   0.009778 248000 5.014e-01
    Mean     0.010450 248800 5.001e-01
    3rd Qu.  0.688800 374000 7.502e-01
    Max.     3.568000 499900 9.999e-01

    > apply(df, 2, range)
            [,1]   [,2]         [,3]
    [1,] -3.703236     11 6.790622e-05
    [2,]  3.568101 499931 9.998686e-01

Seeing erroneous ranges in summary is a little disconcerting, so I looked at the digits option, but this is simply the standard notation for formatting output. Also note: Every single quantile other than Min shows a value that does not exist in the data set (this is why I put a 1 + in the definition for vals), nor would one see these quantiles in most standard quantile calculations, even allowing for differences in midpoint selection. (When I saw this in the original data, I wondered how I had lost a value of 1 from everything!)
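The mismatch in the second column can be reproduced by hand. This is a minimal sketch, assuming that summary rounds each statistic to 4 significant digits in the same way signif() does (the exact rounding step depends on the R version); the input values are the unrounded ones reported elsewhere in this question.

```r
# Unrounded statistics for the second column, as reported in the question.
exact <- c(min = 11, q1 = 122798.5, median = 247971, q3 = 374031, max = 499931)

# Assumption: the default precision is 4 significant digits, i.e.
# max(3, getOption("digits") - 3) with the global option left at 7.
rounded <- signif(exact, 4)
print(rounded)

# TRUE marks a statistic that rounding has moved away from its exact value.
print(rounded != exact)
```

Only the minimum survives unchanged, because 11 has fewer than 4 significant digits to begin with.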

There is a difference between explicable computational behavior (i.e. formatting and precision) and statistically motivated expectations (values identified as quantiles actually lying within the range of the dataset). Since we can't change the expectations, we need to change the behavior of the code, or at least improve it.

The question: Is there some more appropriate way to set the output to be sure of the range, other than setting it to a large value, e.g. digits = 16? Is 16 even the most appropriate universal default? Using 16 digits seems to be the best guarantee of precision for double floats, though it seems the output will not actually have 16 digits (the output still seems to be truncated to 8 or 9 digits).
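One way to sidestep the choice of a digits value is to compute the same six statistics without any rounding and decide on the precision only when printing. The sketch below is not part of summary; exact_summary is a hypothetical helper name, and it assumes the default quantile type is acceptable.

```r
# Hypothetical helper: the same six statistics as summary(), with no rounding step.
exact_summary <- function(x) {
  q <- stats::quantile(x, names = FALSE)  # 0%, 25%, 50%, 75%, 100%
  stats::setNames(
    c(q[1:3], mean(x), q[4:5]),
    c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", "Max.")
  )
}

# Data built as in the question (the random draws may differ across R versions).
set.seed(0)
vals <- 1 + 10 * 1:50000
df   <- cbind(rnorm(10000), sample(vals, 10000), runif(10000))

s <- apply(df, 2, exact_summary)
print(s, digits = 10)  # precision is chosen at print time only

# Min. and Max. agree exactly with range(), whatever the print precision.
stopifnot(all(s[c("Min.", "Max."), ] == apply(df, 2, range)))
```

Because the stored values are never rounded, changing the digits passed to print() alters only the display.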


Update 1: As @BrianDiggs has noted, via the links, the behavior is documented, but unexpected. To clarify my issue, relative to the answers on the link provided by Brian (excepting the answer by Brian himself): it's not that the behavior is undocumented, but it's flatly wrong to denote as Min and Max values which are not Min and Max. A documented function that gives incorrect output in its default settings needs to be used with non-default settings (or should not be used). (Maybe one could argue whether "Min" and "Max" should be renamed as "Approximate Min" and "Approximate Max", but let's not go there.)

Update 2: As @DWin has noted, summary() takes as its default max(3, getOption("digits") - 3). I'd previously erred in saying the default was 3. What's interesting about this is that this implies two ways to set the behavior of the output. If we use both, the behavior gets weird:

> options(digits = 20)
> apply(df, 2, summary, digits = 10)

                             [,1]                  [,2]                      [,3]
Min.    -3.7032358429999998605808     11.00000000000000 6.7906221370000004927e-05
1st Qu. -0.6684710537000000396546 122798.50000000000000 2.4977348059999998631e-01
Median   0.0097783099960000001427 247971.00000000000000 5.0137970539999998643e-01
Mean     0.0104475229200000005458 248776.38699999998789 5.0011818200000002221e-01
3rd Qu.  0.6887842181000000119084 374031.00000000000000 7.5024240300000000214e-01
Max.     3.5681007909999999938577 499931.00000000000000 9.9986864070000003313e-01

Notice that this now has 20 digits of output, even though the argument passed specifies 10 digits of precision. If we set the global option for digits to be some "sane" value like 16, we still end up with issues if we provide summary with an argument of 10.
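The long tails of digits are consistent with two separate steps: the value is first rounded to 10 significant digits, and the rounded double is then displayed with 20. A small sketch, assuming the rounding behaves like signif() and using format() in place of the global option so that nothing global is changed:

```r
# A value already rounded to 10 significant digits, as summary(..., digits = 10)
# is assumed to have done in the question.
x <- signif(3.56810079123, 10)

short <- format(x, digits = 10)  # display with 10 significant digits
long  <- format(x, digits = 20)  # display with 20, like options(digits = 20)

print(short)
print(long)
```

The extra digits in the longer string are the binary floating-point representation of the rounded number, not additional information about the data.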

I believe the documentation is incomplete, and Brian Diggs has pointed out other issues with it in his thoughtful answer in the link to R-help.

Despite these wrinkles, the question remains open, but maybe it can't be answered. I suspect that the best result is simply to leave the global digits option as-is (though I am a little disturbed by the implications of the above behavior) and instead pass a value of 16 to summary. It isn't immediately obvious where the output precision is specified, but this interaction of 4 values - the global option (and the global option - 3), the passed value, and a hard-coded value of 12 in summary.data.frame looks like (have meRcy on my soul for saying this) a hack.
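If the global option is to be left alone, one possibility is to change it only for the duration of a single print call. This is a sketch under that assumption; print_with_digits is a hypothetical helper, and how summary itself treats its digits argument still depends on the R version in use.

```r
# Hypothetical helper: print x with a temporary value of the global digits option.
print_with_digits <- function(x, digits = 16) {
  old <- options(digits = digits)    # options() returns the previous settings
  on.exit(options(old), add = TRUE)  # restore them even if print() fails
  print(x)
  invisible(x)
}

before <- getOption("digits")
x <- c(11, 122798.5, 247971, 374031, 499931)
print_with_digits(summary(x, digits = 16), digits = 16)

# The global option is back to its previous value afterwards.
stopifnot(identical(getOption("digits"), before))
```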

Update 3: I'm accepting DWin's answer - it led to me understanding how this sausage is made. Seeing what is going on, I don't think there's a way to do what I ask, without rewriting summary.

Solution

The default for summary.data.frame is not digits=3, but rather:

   ... max(3, getOption("digits") - 3)  # set in the argument list
> getOption("digits")    # the default setting
[1] 7
> options(digits=10)
> summary(df)
       V1                    V2                 V3              
 Min.   :-3.70323584   Min.   :    11.0   Min.   :6.790622e-05  
 1st Qu.:-0.66847105   1st Qu.:122798.5   1st Qu.:2.497735e-01  
 Median : 0.00977831   Median :247971.0   Median :5.013797e-01  
 Mean   : 0.01044752   Mean   :248776.4   Mean   :5.001182e-01  
 3rd Qu.: 0.68878422   3rd Qu.:374031.0   3rd Qu.:7.502424e-01  
 Max.   : 3.56810079   Max.   :499931.0   Max.   :9.998686e-01  
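The arithmetic behind that default can be checked directly. In this short sketch, summary_default_digits is a hypothetical name that simply evaluates the expression quoted above for a given value of the global option.

```r
# Hypothetical helper: evaluate the quoted default for a given global digits value.
summary_default_digits <- function(global_digits) max(3, global_digits - 3)

d7  <- summary_default_digits(7)   # global default of 7
d10 <- summary_default_digits(10)  # after options(digits = 10)
d3  <- summary_default_digits(3)   # the result never drops below 3

print(c(d7, d10, d3))
```

With the global option at 7 the expression gives 4, and after options(digits=10) it gives 7.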
