RevoScaleR: rxPredict, the number of parameters does not match the number of variables

Problem description

I have used Microsoft's "Data Science End to End Walkthrough" to set myself up with R Server, and their example works perfectly.

The example (New York taxi data) uses non-categorical variables (e.g. distance, taxi fare, etc.) to predict a categorical variable (1 or 0 for whether or not a tip was paid).

I am trying to predict a similar binary output using categorical variables as an input, using linear regression (the rxLinMod function), and am coming up with an error.

The error says that the number of parameters does not match the number of variables; however, it looks to me like the number of variables is actually the number of levels within each factor (variable).

To replicate

Create a table called example in SQL Server:

USE [my_database];
SET ANSI_NULLS ON;
SET QUOTED_IDENTIFIER ON;
CREATE TABLE [dbo].[example](
    [Person] [nvarchar](max) NULL,
    [City] [nvarchar](max) NULL,
    [Bin] [integer] NULL
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY];

Put data into it:

insert into [dbo].[example] values ('John','London',0);
insert into [dbo].[example] values ('Paul','New York',0);
insert into [dbo].[example] values ('George','Liverpool',1);
insert into [dbo].[example] values ('Ringo','Paris',1);
insert into [dbo].[example] values ('John','Sydney',1);
insert into [dbo].[example] values ('Paul','Mexico City',1);
insert into [dbo].[example] values ('George','London',1);
insert into [dbo].[example] values ('Ringo','New York',1);
insert into [dbo].[example] values ('John','Liverpool',1);
insert into [dbo].[example] values ('Paul','Paris',0);
insert into [dbo].[example] values ('George','Sydney',0);
insert into [dbo].[example] values ('Ringo','Mexico City',0);

I also use a SQL function which returns variables in table format, as that appears to be what the Microsoft example requires. Create the function formatAsTable:

USE [my_database];
GO
SET ANSI_NULLS ON;
SET QUOTED_IDENTIFIER ON;
GO
-- CREATE FUNCTION must be the first statement in its batch, hence the GO separators
CREATE FUNCTION [dbo].[formatAsTable] (
@City nvarchar(max)='',
@Person nvarchar(max)='')
RETURNS TABLE
AS
  RETURN
  (
  -- Add the SELECT statement with parameter references here
  SELECT
    @City AS City,
    @Person AS Person
  );

We now have a table with two categorical variables - Person, and City.

Let's start predicting. In R, run the following:

library(RevoScaleR)
# Set up the database connection
connStr <- "Driver=SQL Server;Server=<servername>;Database=<dbname>;Uid=<uid>;Pwd=<password>"
sqlShareDir <- paste("C:\\AllShare\\",Sys.getenv("USERNAME"),sep="")
sqlWait <- TRUE
sqlConsoleOutput <- FALSE
cc <- RxInSqlServer(connectionString = connStr, shareDir = sqlShareDir, 
                    wait = sqlWait, consoleOutput = sqlConsoleOutput)
rxSetComputeContext(cc)
# Set the SQL query that fetches our data
sampleDataQuery <- "SELECT * from [dbo].[example] "
# Set up the data source
inDataSource <- RxSqlServerData(sqlQuery = sampleDataQuery, connectionString = connStr,
                                colClasses = c(City = "factor", Bin = "logical", Person = "factor"),
                                rowsPerRead = 500)

Now, set up the linear regression model.

isWonObj <- rxLinMod(Bin ~ City+Person,data = inDataSource)

View the model object:

isWonObj

Notice it looks like this:

...
Total independent variables: 11 (Including number dropped: 3)
...

Coefficients:
                           Bin
(Intercept)       6.666667e-01
City=London      -1.666667e-01
City=New York     4.450074e-16
City=Liverpool    3.333333e-01
City=Paris        4.720871e-16
City=Sydney      -1.666667e-01
City=Mexico City       Dropped
Person=John      -1.489756e-16
Person=Paul      -3.333333e-01
Person=George          Dropped
Person=Ringo           Dropped

It says there are 11 variables, which is fine, as this is the sum of levels in the factors.
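
To make that count concrete, here is a minimal base R sketch (no RevoScaleR or SQL Server needed). It rebuilds the 12 example rows as a local data frame, which is an assumption made purely for illustration, and adds the intercept to the number of levels in each factor.

```r
# Rebuild the example table locally (illustrative stand-in for [dbo].[example]).
example <- data.frame(
  Person = rep(c("John", "Paul", "George", "Ringo"), times = 3),
  City   = rep(c("London", "New York", "Liverpool", "Paris", "Sydney", "Mexico City"), times = 2),
  Bin    = c(0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0),
  stringsAsFactors = TRUE
)

nCity   <- nlevels(example$City)    # distinct cities
nPerson <- nlevels(example$Person)  # distinct persons

# One coefficient slot for the intercept plus one per factor level.
total <- 1 + nCity + nPerson
print(c(City = nCity, Person = nPerson, total = total))
```

The total lines up with the 11 rows in the coefficient table above (the intercept, six cities and four persons), dropped ones included.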

Now, when I try to predict the Bin value based on City and Person, I get an error.

First I format the City and Person I want to predict for as a table. Then, I predict using this as an input.

sq<-"SELECT City, Person FROM [dbo].[formatAsTable]('London','George')"
pred<-RxSqlServerData(sqlQuery = sq,connectionString = connStr
                      , colClasses = c(City = "factor",Person="factor"))

If you check the pred object, it looks as expected:

> head(pred)
    City Person
1 London George

Now when I try to predict, I get an error.

scoredOutput <- RxSqlServerData(
  connectionString = connStr,
  table = "binaryOutput"
)

rxPredict(modelObject = isWonObj, data = pred, outData = scoredOutput, 
          predVarNames = "Score", type = "response", writeModelVars = FALSE, overwrite = TRUE,checkFactorLevels = FALSE)

The error says:

INTERNAL ERROR: In rxPredict, the number of parameters does not match the number of  variables: 3 vs. 11. 

I can see where the 11 comes from, but I have only supplied 2 values to the predict query - so I can't see where the 3 comes from, or why there is a problem.
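
One plausible reading of the "3", offered as a guess rather than anything confirmed by the RevoScaleR documentation: a one-row data source read with colClasses = "factor" only ever sees one level per factor, so counting the same way as for the model gives 1 + 1 + 1. The base R sketch below mimics that with a local data frame; the variable names are made up for illustration.

```r
# Levels the model saw during training (copied from the example data).
trainCityLevels   <- c("London", "New York", "Liverpool", "Paris", "Sydney", "Mexico City")
trainPersonLevels <- c("John", "Paul", "George", "Ringo")

# A one-row prediction set whose factors are built from that row alone,
# so each factor has exactly one level.
pred <- data.frame(City = factor("London"), Person = factor("George"))

# Count intercept + levels on both sides (assumed counting rule, see above).
predParams  <- 1 + nlevels(pred$City) + nlevels(pred$Person)
trainParams <- 1 + length(trainCityLevels) + length(trainPersonLevels)
print(c(prediction = predParams, training = trainParams))
```

If that reading is right, the mismatch is about factor levels in the prediction data, not about how many values were passed to the query.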

Thank you for your help!

Recommended answer

While only setting the factor levels (... levels(predictionData$fac) <- levels(trainingData$fac) ...) avoids the error, it also leads to the model using wrong factor indices, which can be seen if writeModelVars is set to TRUE. Setting colInfo in RxSqlServerData for my factor with almost 10,000 levels resulted in an application hang, although the query was passed to SQL Server correctly. I changed my strategy to loading the data into a data frame without any factors and then applying rxFactors to it:

rxSetComputeContext("local")

sqlPredictQueryDS <- RxSqlServerData(connectionString = sqlConnString, sqlQuery = sqlQuery, stringsAsFactors = FALSE)

predictQueryDS = rxImport(sqlPredictQueryDS)

if ("Artikelnummer" %in% colnames(predictQueryDS)) { predictQueryDS <- rxFactors(predictQueryDS, factorInfo = list(Artikelnummer = list(levels = allItems))) }

In addition to setting the needed factor levels, rxFactors also reorders the factor indices. I'm not saying the solution with colInfo is wrong; maybe it just doesn't work for factors with "too many" levels.
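
The difference between relabelling and remapping can be seen in plain base R. The sketch below is only an illustration of the index problem: it uses factor() as a stand-in for rxFactors, and the City values are borrowed from the question, not from my own data.

```r
# Levels in the order the model was trained on (taken from the question's data).
trainLevels <- c("London", "New York", "Liverpool", "Paris", "Sydney", "Mexico City")

# New data: its own levels are sorted alphabetically, i.e. "London", "Paris".
newCity <- factor(c("Paris", "London"))

# Approach 1: overwrite the levels. The integer codes stay as they were,
# so code 2 ("Paris") now points at the second training level.
relabelled <- newCity
levels(relabelled) <- trainLevels
print(as.character(relabelled))

# Approach 2: rebuild the factor from its labels against the training levels,
# so every value gets the index the model expects.
remapped <- factor(as.character(newCity), levels = trainLevels)
print(as.character(remapped))
print(as.integer(remapped))
```

With the first approach the error goes away but "Paris" is silently scored as another city; the second approach keeps the labels and fixes the indices.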
