Issues with Logistic Regression for multiclass classification using PySpark


Problem Description

I am trying to use Logistic Regression to classify a dataset whose feature column contains SparseVectors:

# imported library from ML
from pyspark.ml.feature import HashingTF
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression

print(type(trainingData))   # to check the dataset type
print(trainingData.take(2)) # to inspect the data
lr = LogisticRegression(labelCol="label", featuresCol="features",
                        maxIter=maximumIteration, regParam=regParamValue)
pipeline = Pipeline(stages=[lr])
# Train model
model = pipeline.fit(trainingData)

I got the following error:

<class 'pyspark.sql.dataframe.DataFrame'>
[Row(label=2.0, features=SparseVector(2000, {51: 1.0, 160: 1.0, 341: 1.0, 417: 1.0, 561: 1.0, 656: 1.0, 863: 1.0, 939: 1.0, 1021: 1.0, 1324: 1.0, 1433: 1.0, 1573: 1.0, 1604: 1.0, 1720: 1.0})), Row(label=3.0, features=SparseVector(2000, {24: 1.0, 51: 2.0, 119: 1.0, 167: 1.0, 182: 1.0, 190: 1.0, 195: 1.0, 285: 1.0, 432: 1.0, 539: 1.0, 571: 1.0, 630: 1.0, 638: 1.0, 656: 1.0, 660: 2.0, 751: 1.0, 785: 1.0, 794: 1.0, 801: 1.0, 823: 1.0, 893: 1.0, 900: 1.0, 915: 1.0, 956: 1.0, 966: 1.0, 1025: 1.0, 1029: 1.0, 1035: 1.0, 1038: 1.0, 1093: 1.0, 1115: 2.0, 1147: 1.0, 1206: 1.0, 1252: 1.0, 1261: 1.0, 1262: 1.0, 1268: 1.0, 1304: 1.0, 1351: 1.0, 1378: 1.0, 1423: 1.0, 1437: 1.0, 1441: 1.0, 1530: 1.0, 1534: 1.0, 1556: 1.0, 1562: 1.0, 1604: 1.0, 1711: 1.0, 1737: 1.0, 1750: 1.0, 1776: 1.0, 1858: 1.0, 1865: 1.0, 1923: 1.0, 1926: 1.0, 1959: 1.0, 1999: 1.0}))]
16/08/25 19:14:07 ERROR org.apache.spark.ml.classification.LogisticRegression: Currently, LogisticRegression with ElasticNet in ML package only supports binary classification. Found 5 in the input dataset.
Traceback (most recent call last):
  File "/home/LR/test.py", line 260, in <module>
    accuracy = TrainLRCModel(trainData, testData)
  File "/home/LR/test.py", line 211, in TrainLRCModel
    model = pipeline.fit(trainingData)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/pipeline.py", line 69, in fit
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/pipeline.py", line 213, in _fit
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/pipeline.py", line 69, in fit
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", line 133, in _fit
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", line 130, in _fit_java
  File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 45, in deco
  File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o207.fit.
: org.apache.spark.SparkException: Currently, LogisticRegression with ElasticNet in ML package only supports binary classification. Found 5 in the input dataset.
        at org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:290)
        at org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:159)
        at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
        at org.apache.spark.ml.Predictor.fit(Predictor.scala:71)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
        at py4j.Gateway.invoke(Gateway.java:259)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:209)
        at java.lang.Thread.run(Thread.java:745)

Case 2: I searched for a possible alternative and found that LogisticRegressionWithLBFGS supports multi-class classification, so I tried the following:

#imported library
from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionModel, LogisticRegressionWithSGD
print(type(trainingData)) # to check the dataset type
print(trainingData.take(2)) # To see the data
model = LogisticRegressionWithLBFGS.train(trainingData, numClasses=5)
print(type(model))

I got the following error:

<class 'pyspark.sql.dataframe.DataFrame'>
[Row(label=3.0, features=SparseVector(2000, {24: 1.0, 51: 2.0, 119: 1.0, 167: 1.0, 182: 1.0, 190: 1.0, 195: 1.0, 285: 1.0, 432: 1.0, 539: 1.0, 571: 1.0, 630: 1.0, 638: 1.0, 656: 1.0, 660: 2.0, 751: 1.0, 785: 1.0, 794: 1.0, 801: 1.0, 823: 1.0, 893: 1.0, 900: 1.0, 915: 1.0, 956: 1.0, 966: 1.0, 1025: 1.0, 1029: 1.0, 1035: 1.0, 1038: 1.0, 1093: 1.0, 1115: 2.0, 1147: 1.0, 1206: 1.0, 1252: 1.0, 1261: 1.0, 1262: 1.0, 1268: 1.0, 1304: 1.0, 1351: 1.0, 1378: 1.0, 1423: 1.0, 1437: 1.0, 1441: 1.0, 1530: 1.0, 1534: 1.0, 1556: 1.0, 1562: 1.0, 1604: 1.0, 1711: 1.0, 1737: 1.0, 1750: 1.0, 1776: 1.0, 1858: 1.0, 1865: 1.0, 1923: 1.0, 1926: 1.0, 1959: 1.0, 1999: 1.0})), Row(label=5.0, features=SparseVector(2000, {103: 1.0, 310: 1.0, 601: 1.0, 817: 1.0, 866: 1.0, 940: 1.0, 1023: 1.0, 1118: 1.0, 1339: 1.0, 1447: 1.0, 1634: 1.0, 1776: 1.0}))]
Traceback (most recent call last):
  File "/home/LR/test.py", line 260, in <module>
    accuracy = TrainLRCModel(trainData, testData)
  File "/home/LR/test.py", line 230, in TrainLRCModel
    model = LogisticRegressionWithLBFGS.train(trainingData, numClasses=5)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/mllib/classification.py", line 382, in train
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/mllib/regression.py", line 206, in _regression_train_wrapper
TypeError: data should be an RDD of LabeledPoint, but got <class 'pyspark.sql.types.Row'>

I then tried to convert the dataset into an RDD of LabeledPoint as follows, i.e. case 3:

#imported libraries
from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionModel, LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint

print(type(trainingData))
print(trainingData.take(2))
trainingData = trainingData.map(lambda row:[LabeledPoint(row.label,row.features)])
print('type of trainingData')
print(type(trainingData))
print(trainingData.take(2))
model = LogisticRegressionWithLBFGS.train(trainingData, numClasses=5)
print(type(model))

I got the following error:

<class 'pyspark.sql.dataframe.DataFrame'>
[Row(label=2.0, features=SparseVector(2000, {51: 1.0, 160: 1.0, 341: 1.0, 417: 1.0, 561: 1.0, 656: 1.0, 863: 1.0, 939: 1.0, 1021: 1.0, 1324: 1.0, 1433: 1.0, 1573: 1.0, 1604: 1.0, 1720: 1.0})), Row(label=3.0, features=SparseVector(2000, {24: 1.0, 51: 2.0, 119: 1.0, 167: 1.0, 182: 1.0, 190: 1.0, 195: 1.0, 285: 1.0, 432: 1.0, 539: 1.0, 571: 1.0, 630: 1.0, 638: 1.0, 656: 1.0, 660: 2.0, 751: 1.0, 785: 1.0, 794: 1.0, 801: 1.0, 823: 1.0, 893: 1.0, 900: 1.0, 915: 1.0, 956: 1.0, 966: 1.0, 1025: 1.0, 1029: 1.0, 1035: 1.0, 1038: 1.0, 1093: 1.0, 1115: 2.0, 1147: 1.0, 1206: 1.0, 1252: 1.0, 1261: 1.0, 1262: 1.0, 1268: 1.0, 1304: 1.0, 1351: 1.0, 1378: 1.0, 1423: 1.0, 1437: 1.0, 1441: 1.0, 1530: 1.0, 1534: 1.0, 1556: 1.0, 1562: 1.0, 1604: 1.0, 1711: 1.0, 1737: 1.0, 1750: 1.0, 1776: 1.0, 1858: 1.0, 1865: 1.0, 1923: 1.0, 1926: 1.0, 1959: 1.0, 1999: 1.0}))]
type of trainingData
<class 'pyspark.rdd.PipelinedRDD'>
[[LabeledPoint(2.0, (2000,[51,160,341,417,561,656,863,939,1021,1324,1433,1573,1604,1720],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0]))], [LabeledPoint(3.0, (2000,[24,51,119,167,182,190,195,285,432,539,571,630,638,656,660,751,785,794,801,823,893,900,915,956,966,1025,1029,1035,1038,1093,1115,1147,1206,1252,1261,1262,1268,1304,1351,1378,1423,1437,1441,1530,1534,1556,1562,1604,1711,1737,1750,1776,1858,1865,1923,1926,1959,1999],[1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0]))]]
Traceback (most recent call last):
  File "/home/LR/test.py", line 260, in <module>
    accuracy = TrainLRCModel(trainData, testData)
  File "/home/LR/test.py", line 230, in TrainLRCModel
    model = LogisticRegressionWithLBFGS.train(trainingData, numClasses=5)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/mllib/classification.py", line 381, in train
AttributeError: 'list' object has no attribute 'features'

Can someone please suggest where I am going wrong? I want to use Logistic Regression in PySpark for multi-class classification.

Currently I am using Spark version 1.6.2 and Python version 2.7.9 on Google Cloud.

Thanks in advance for your help.

Answer

Case 1: There is nothing strange here; as the error message says, LogisticRegression does not support multi-class classification, as clearly stated in the documentation.

Case 2: Here you have switched from ML to MLlib, which does not work with dataframes but needs the input as an RDD of LabeledPoint (documentation), hence the error message is again expected.

Case 3: Here is where things get interesting. First, you should remove the brackets from your map function, i.e. it should be

trainingData = trainingData.map(lambda row: LabeledPoint(row.label, row.features)) # no brackets after "row:"
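The brackets matter because `lambda row: [LabeledPoint(...)]` wraps every LabeledPoint in a one-element list, so the training routine later sees plain lists, which is exactly what the `AttributeError: 'list' object has no attribute 'features'` in case 3 complains about. A Spark-free sketch of the difference (using a hypothetical stand-in class, not the real LabeledPoint):

```python
# Minimal stand-in for LabeledPoint, just to show the effect of the brackets.
class Point(object):
    def __init__(self, label, features):
        self.label = label
        self.features = features

rows = [Point(2.0, [1.0, 0.0]), Point(3.0, [0.0, 1.0])]

# With brackets: each element becomes a one-item list, as in the failing code.
wrapped = [[Point(r.label, r.features)] for r in rows]
# Without brackets: each element is the Point itself.
unwrapped = [Point(r.label, r.features) for r in rows]

print(hasattr(wrapped[0], "features"))    # False - a list has no .features
print(hasattr(unwrapped[0], "features"))  # True
```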

Nevertheless, judging from the code snippets you have provided, you will most probably now get a different error:

model = LogisticRegressionWithLBFGS.train(trainingData, numClasses=5)
[...]
: org.apache.spark.SparkException: Input validation failed.

Here is what is happening (it took me some time to figure it out), using some dummy data (it's always a good idea to provide sample data with your question):

# 3-class classification
data = sc.parallelize([
     LabeledPoint(3.0, SparseVector(100,[10, 98],[1.0, 1.0])),
     LabeledPoint(1.0, SparseVector(100,[1, 22],[1.0, 1.0])),
     LabeledPoint(2.0, SparseVector(100,[36, 54],[1.0, 1.0]))
])

lrm = LogisticRegressionWithLBFGS.train(data, iterations=10, numClasses=3) # throws exception
[...]
: org.apache.spark.SparkException: Input validation failed.

The problem is that your labels must start from 0 (this is nowhere documented - you have to dig into the Scala source code to see that this is the case!); so, mapping the labels in my dummy data above from (1.0, 2.0, 3.0) to (0.0, 1.0, 2.0), we finally get:

# 3-class classification
data = sc.parallelize([
     LabeledPoint(2.0, SparseVector(100,[10, 98],[1.0, 1.0])),
     LabeledPoint(0.0, SparseVector(100,[1, 22],[1.0, 1.0])),
     LabeledPoint(1.0, SparseVector(100,[36, 54],[1.0, 1.0]))
])

lrm = LogisticRegressionWithLBFGS.train(data, iterations=10, numClasses=3) # no error now

Judging from your numClasses=5 argument, as well as from the label=5.0 in one of your printed records, your code most probably suffers from the same issue. Change your labels to the range [0.0, 4.0] and you should be fine.
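A minimal sketch of that remapping (shift_label is a hypothetical helper name; in the actual pipeline the subtraction would go inside the map that builds the LabeledPoints, e.g. `lambda row: LabeledPoint(row.label - 1.0, row.features)`, assuming the original labels run 1.0 to 5.0):

```python
# With numClasses=5, MLlib expects labels in {0.0, 1.0, 2.0, 3.0, 4.0};
# if the data is labelled 1.0-5.0, shift every label down by one.
def shift_label(label, offset=1.0):
    return label - offset

print([shift_label(l) for l in [1.0, 2.0, 3.0, 4.0, 5.0]])
# [0.0, 1.0, 2.0, 3.0, 4.0]
```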

(I suggest that you delete the other, identical question you have opened here, to reduce clutter...)

