How to use the Box-Cox power transformation in R


Problem description


I need to transform some data into a 'normal shape' and I read that Box-Cox can identify the exponent to use to transform the data.

From what I understood

car::boxCoxVariable(y)

is used for response variables in linear models, and

MASS::boxcox(object)

for a formula or fitted model object. So, because my data are the variable of a dataframe, the only function I found I could use is:

car::powerTransform(dataframe$variable, family="bcPower")

Is that correct? Or am I missing something?

The second question is about what to do after I obtain the

Estimated transformation parameters
dataframe$variable
0.6394806

Should I simply multiply the variable by this value? I did so:

aaa = 0.6394806
dataframe$variable2 = (dataframe$variable)*aaa

and then I ran the Shapiro-Wilk test for normality, but again my data don't seem to follow a normal distribution:

shapiro.test(dataframe$variable2)
data:  dataframe$variable2
W = 0.97508, p-value < 2.2e-16

Solution

Box and Cox (1964) suggested a family of transformations designed to reduce non-normality of the errors in a linear model. It turns out that in doing this, it often reduces non-linearity as well.

Here is a nice summary of the original work and all the work that's been done since: http://www.ime.usp.br/~abe/lista/pdfm9cJKUmFZp.pdf

You will notice, however, that the log-likelihood function governing the selection of the lambda power transform depends on the residual sum of squares of an underlying model (no LaTeX on SO -- see the reference), so no transformation can be applied without a model.
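For readers without the reference at hand, the standard textbook form of the one-parameter transformation and its profile log-likelihood can be written as below. This is a sketch in common notation, not copied from the linked summary: RSS(λ) denotes the residual sum of squares from regressing the transformed response on the predictors, n is the number of observations, and the log-likelihood is given up to an additive constant.

```latex
y_i^{(\lambda)} =
\begin{cases}
  \dfrac{y_i^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\[6pt]
  \log y_i, & \lambda = 0,
\end{cases}
\qquad
\ell(\lambda) = -\frac{n}{2}\,\log\!\left(\frac{\mathrm{RSS}(\lambda)}{n}\right)
  + (\lambda - 1)\sum_{i=1}^{n} \log y_i
```

The RSS(λ) term is what ties the choice of lambda to a fitted model: changing the predictors changes the residuals, and therefore the lambda that maximizes the log-likelihood.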

A typical application is as follows:

library(MASS)

# generate some data
set.seed(1)
n <- 100
x <- runif(n, 1, 5)
y <- x^3 + rnorm(n)

# run a linear model
m <- lm(y ~ x)

# run the box-cox transformation
bc <- boxcox(y ~ x)

(lambda <- bc$x[which.max(bc$y)])
[1] 0.4242424

# NOTE: this helper shares its name with car::powerTransform and masks it if car is attached
powerTransform <- function(y, lambda1, lambda2 = NULL, method = "boxcox") {

  boxcoxTrans <- function(x, lam1, lam2 = NULL) {

    # if we set lambda2 to zero, it becomes the one parameter transformation
    lam2 <- if (is.null(lam2)) 0 else lam2

    # operate on the argument x rather than reaching out to y in the enclosing scope
    if (lam1 == 0) {
      log(x + lam2)
    } else {
      (((x + lam2)^lam1) - 1) / lam1
    }
  }

  switch(method,
         boxcox = boxcoxTrans(y, lambda1, lambda2),
         tukey = y^lambda1
  )
}


# re-run with transformation
mnew <- lm(powerTransform(y, lambda) ~ x)

# QQ-plot
op <- par(pty = "s", mfrow = c(1, 2))
qqnorm(m$residuals); qqline(m$residuals)
qqnorm(mnew$residuals); qqline(mnew$residuals)
par(op)
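
For the situation in the question (a single column of a data frame, no predictors), one option is to treat an intercept-only model as the underlying model and then apply the power transformation, rather than multiplying by lambda. The sketch below assumes simulated data standing in for dataframe$variable; the helper bc_transform and the column names variable_bc and variable_scaled are hypothetical names introduced for illustration.

```r
library(MASS)

set.seed(42)
# Hypothetical stand-in for dataframe$variable: strictly positive, right-skewed
dataframe <- data.frame(variable = rnorm(500, mean = 5, sd = 1)^2)

# Intercept-only model: the simplest "underlying model" for a single variable
bc1 <- boxcox(variable ~ 1, data = dataframe, plotit = FALSE)
lambda_hat <- bc1$x[which.max(bc1$y)]

# Hypothetical helper: one-parameter Box-Cox transform (log when lambda is 0)
bc_transform <- function(v, lambda) {
  if (abs(lambda) < 1e-8) log(v) else (v^lambda - 1) / lambda
}

dataframe$variable_bc     <- bc_transform(dataframe$variable, lambda_hat)
dataframe$variable_scaled <- dataframe$variable * lambda_hat  # what the question did

# Shapiro-Wilk W statistic for each version of the variable
w_raw    <- shapiro.test(dataframe$variable)$statistic
w_scaled <- shapiro.test(dataframe$variable_scaled)$statistic
w_bc     <- shapiro.test(dataframe$variable_bc)$statistic
c(raw = unname(w_raw), scaled = unname(w_scaled), boxcox = unname(w_bc))
```

Multiplying by a nonzero constant only rescales the data, and the Shapiro-Wilk statistic does not depend on location or scale, so the scaled column gives the same W as the raw one. Only the power transformation changes the shape of the distribution.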

As you can see, this is no magic bullet: only some data can be transformed effectively (an estimated lambda below -2 or above 2 is usually a sign that you should not be using the method). As with any statistical method, use it with caution.
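
One way to exercise that caution is to check how precisely lambda is determined before committing to a transformation. The sketch below reuses the simulated data from the example and computes an approximate 95% interval for lambda from the profile log-likelihood, using the usual chi-square cutoff (the same kind of interval that boxcox marks on its plot). The variable names lambda_hat, cutoff and ci are introduced here for illustration.

```r
library(MASS)

# Same simulated data as in the example above
set.seed(1)
n <- 100
x <- runif(n, 1, 5)
y <- x^3 + rnorm(n)

# Profile log-likelihood on a fine grid, without drawing the plot
bc <- boxcox(y ~ x, lambda = seq(-2, 2, by = 0.01), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]

# Approximate 95% interval: lambdas whose log-likelihood lies within
# qchisq(0.95, 1) / 2 of the maximum
cutoff <- max(bc$y) - qchisq(0.95, df = 1) / 2
ci <- range(bc$x[bc$y >= cutoff])

lambda_hat
ci
```

If the interval contains 1, the data give little evidence that any transformation is needed. If it contains a convenient value such as 0 or 0.5, using the log or square root is often easier to interpret than the exact estimate.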

To use the two parameter Box-Cox transformation, use the geoR package to find the lambdas:

library("geoR")
# boxcoxfit takes the data to transform first and the covariates (xmat) second
bc2 <- boxcoxfit(y, x, lambda2 = TRUE)

lambda1 <- bc2$lambda[1]
lambda2 <- bc2$lambda[2]
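
Once both parameters are available, the two-parameter transformation shifts the data by lambda2 before applying the power, which is what makes it usable for data containing zeros or negative values. A minimal base-R sketch follows; the helper bc2_transform is a hypothetical name, and the lambda values passed to it are illustrative rather than estimated from any data.

```r
# Hypothetical helper: two-parameter Box-Cox transform
# ((v + lambda2)^lambda1 - 1) / lambda1, or log(v + lambda2) when lambda1 is 0
bc2_transform <- function(v, lambda1, lambda2 = 0) {
  shifted <- v + lambda2
  stopifnot(all(shifted > 0))  # the shifted data must be strictly positive
  if (abs(lambda1) < 1e-8) log(shifted) else (shifted^lambda1 - 1) / lambda1
}

v <- c(0, 3, 8)  # contains a zero, so a log or negative power would fail without a shift

res_sqrt <- bc2_transform(v, lambda1 = 0.5, lambda2 = 1)  # 0 2 4
res_log  <- bc2_transform(v, lambda1 = 0,   lambda2 = 1)  # same as log(v + 1)
```

With lambda2 = 0 the helper reduces to the one-parameter form, which matches the behaviour of the powerTransform function defined earlier when its lambda2 argument is left as NULL.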

EDIT: Fixed the conflation of the Tukey and Box-Cox implementations pointed out by @Yui-Shiuan.
