How to extract model hyper-parameters from spark.ml in PySpark?


Question

I'm tinkering with some cross-validation code from the PySpark documentation, and trying to get PySpark to tell me what model was selected:

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.mllib.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

dataset = sqlContext.createDataFrame(
    [(Vectors.dense([0.0]), 0.0),
     (Vectors.dense([0.4]), 1.0),
     (Vectors.dense([0.5]), 0.0),
     (Vectors.dense([0.6]), 1.0),
     (Vectors.dense([1.0]), 1.0)] * 10,
    ["features", "label"])
lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 0.0001]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset)

Running this in the PySpark shell, I can get the logistic regression model's coefficients, but I can't seem to find the value of lr.regParam selected by the cross-validation procedure. Any ideas?

In [3]: cvModel.bestModel.coefficients
Out[3]: DenseVector([3.1573])

In [4]: cvModel.bestModel.explainParams()
Out[4]: ''

In [5]: cvModel.bestModel.extractParamMap()
Out[5]: {}

In [15]: cvModel.params
Out[15]: []

In [36]: cvModel.bestModel.params
Out[36]: []

Solution

Ran into this problem as well. For some reason (I don't know why) you have to read the values through the underlying Java object rather than the Python wrapper. So just do this:

from pyspark.ml.tuning import TrainValidationSplit, ParamGridBuilder, CrossValidator
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(metricName="mae")
lr = LinearRegression()
grid = ParamGridBuilder() \
    .addGrid(lr.maxIter, [500]) \
    .addGrid(lr.regParam, [0]) \
    .addGrid(lr.elasticNetParam, [1]) \
    .build()
lr_cv = CrossValidator(estimator=lr, estimatorParamMaps=grid,
                       evaluator=evaluator, numFolds=3)
lrModel = lr_cv.fit(your_training_set_here)
bestModel = lrModel.bestModel

Printing out the parameters you want:

>>> print('Best Param (regParam):', bestModel._java_obj.getRegParam())
0
>>> print('Best Param (maxIter):', bestModel._java_obj.getMaxIter())
500
>>> print('Best Param (elasticNetParam):', bestModel._java_obj.getElasticNetParam())
1

The same workaround applies to other methods, such as extractParamMap(). They should fix this soon.
