Does SparkML Cross Validation Only Work With a "label" Column?


Question

When I run the cross-validation example with a dataset whose label is in a column not named "label", I get an IllegalArgumentException on Spark 3.1.1. Why?

The code below has been modified to rename the "label" column to "target", and labelCol has been set to "target" for the logistic regression model. This version throws the exception, while leaving everything at "label" works fine.

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0),
    (4, "b spark who", 1.0),
    (5, "g d a y", 0.0),
    (6, "spark fly", 1.0),
    (7, "was mapreduce", 0.0),
    (8, "e spark program", 1.0),
    (9, "a e c l", 0.0),
    (10, "spark compile", 1.0),
    (11, "hadoop software", 0.0)
], ["id", "text", "target"]) # try switching between "target" and "label"

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")

lr = LogisticRegression(maxIter=10, labelCol="target")  # try switching between "target" and "label"

pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

paramGrid = ParamGridBuilder() \
    .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .build()

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=2)  


cvModel = crossval.fit(training)

Is this the expected behavior?

Recommended answer

You need to provide the label column to BinaryClassificationEvaluator too. Its labelCol parameter defaults to "label", so the evaluator looks for a column that no longer exists in the renamed dataset. So if you replace the line

evaluator=BinaryClassificationEvaluator(),

with

evaluator=BinaryClassificationEvaluator(labelCol="target"),

it should work fine.

You can find more about this in the documentation.
