h2o.glm does not match glm in R for linear regressions

Problem Description


I have been working with H2O.ai (version 3.10.3.6) in combination with R.

I am struggling to replicate the results from glm with h2o.glm. I would expect exactly the same result (evaluated, in this case, in terms of mean square error), but I am seeing much worse accuracy with h2o. Since my model is Gaussian, I would expect both cases to be ordinary least squares (or maximum likelihood) regressions.

Here is my example:

train <- model.matrix(~., training_df)
test <- model.matrix(~., testing_df)

model1 <- glm(response ~., data=data.frame(train))
yhat1 <- predict(model1 , newdata=data.frame(test))
mse1 <- mean((testing_df$response - yhat1)^2) #5299.128

h2o_training <- as.h2o(train)[-1,]
h2o_testing <- as.h2o(test)[-1,]

model2 <- h2o.glm(x = 2:dim(h2o_training)[2], y = 1,
                  training_frame = h2o_training,
                  family = "gaussian", alpha = 0)

yhat2 <- h2o.predict(model2, h2o_testing)
yhat2 <- as.numeric(as.data.frame(yhat2)[,1])
mse2 <- mean((testing_df$response - yhat2)^2) #8791.334

The MSE is roughly 60% higher for the h2o model. Is my hypothesis that glm ≈ h2o.glm wrong? I will try to provide an example dataset ASAP (the training dataset is confidential, and it has 350,000 rows x 350 columns).

An extra question: for some reason, as.h2o adds an extra row full of NAs, so that h2o_training and h2o_testing have an additional row. Removing it (as I do here: as.h2o(train)[-1,]) before building the model does not affect the regression performance. There are no NA values passed to either glm or h2o.glm; i.e. the training matrices do not have NA values.
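
A minimal sketch, assuming the standard h2o R helpers h2o.nacnt and h2o.na_omit, of how that extra row can be located and dropped without hard-coding the row index:

library(h2o)
h2o.init(nthreads = -1)

h2o_training <- as.h2o(train)

nrow(train)              # rows in the original R matrix
nrow(h2o_training)       # rows in the uploaded frame; reportedly one more
h2o.nacnt(h2o_training)  # per-column NA counts, to confirm where the NAs sit

# Drop every row containing an NA instead of assuming it is the first row
h2o_training <- h2o.na_omit(h2o_training)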

Solution

There are a few arguments you need to set in order to get H2O's GLM to match R's GLM, because by default they do not fit the same model: h2o.glm applies elastic-net regularization with an automatically chosen lambda, while R's glm fits an unpenalized model. Setting lambda = 0 turns the penalty off, and the remaining arguments below align the solver and the handling of collinear columns with R. Here is an example of what you need to set to get identical results:

library(h2o)
h2o.init(nthreads = -1)

path <- system.file("extdata", "prostate.csv", package = "h2o")
train <- h2o.importFile(path)

# Run GLM of VOL ~ CAPSULE + AGE + RACE + PSA + GLEASON
x <- setdiff(colnames(train), c("ID", "DPROS", "DCAPS", "VOL"))

# Train H2O GLM (designed to match R)
h2o_glmfit <- h2o.glm(y = "VOL", 
                      x = x, 
                      training_frame = train, 
                      family = "gaussian",
                      lambda = 0,
                      remove_collinear_columns = TRUE,
                      compute_p_values = TRUE,
                      solver = "IRLSM")

# Train an R GLM
r_glmfit <- glm(VOL ~ CAPSULE + AGE + RACE + PSA + GLEASON, 
                data = as.data.frame(train)) 

Here are the coefs (they match):

> h2o.coef(h2o_glmfit)
  Intercept     CAPSULE         AGE 
-4.35605671 -4.29056573  0.29789896 
       RACE         PSA     GLEASON 
 4.35567076  0.04945783 -0.51260829 

> coef(r_glmfit)
(Intercept)     CAPSULE         AGE 
-4.35605671 -4.29056573  0.29789896 
       RACE         PSA     GLEASON 
 4.35567076  0.04945783 -0.51260829 

I've added a JIRA ticket to add this info to the docs.
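
For completeness, here is a sketch of the question's original call with the same fix applied (variable names follow the question's code; the key change is lambda = 0, which turns the penalty off):

model2 <- h2o.glm(x = 2:dim(h2o_training)[2], y = 1,
                  training_frame = h2o_training,
                  family = "gaussian",
                  lambda = 0)   # with lambda = 0 the alpha mixing parameter no longer matters

yhat2 <- as.numeric(as.data.frame(h2o.predict(model2, h2o_testing))[,1])
mse2 <- mean((testing_df$response - yhat2)^2)  # should now be much closer to mse1

If an exact coefficient match with R is needed, the remove_collinear_columns = TRUE, compute_p_values = TRUE, and solver = "IRLSM" arguments shown above apply to this call as well.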
