pyspark.sql.utils.IllegalArgumentException: u'Field "features";不存在.' [英] pyspark.sql.utils.IllegalArgumentException: u'Field "features" does not exist.'

查看:26
本文介绍了pyspark.sql.utils.IllegalArgumentException: u'Field "features";不存在.'的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试执行随机森林分类器并使用交叉验证评估模型.我与 pySpark 一起工作.输入的 CSV 文件以 Spark DataFrame 格式加载.但是我在构建模型时遇到了一个问题.

I am trying to execute Random Forest Classifier and evaluate the model using Cross Validation. I work with pySpark. The input CSV file is loaded as Spark DataFrame format. But I face a issue while constructing the model.

下面是代码.

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.mllib.evaluation import BinaryClassificationMetrics
sc = SparkContext()
sqlContext = SQLContext(sc)
trainingData =(sqlContext.read
         .format("com.databricks.spark.csv")
         .option("header", "true")
         .option("inferSchema", "true")
         .load("/PATH/CSVFile"))
numFolds = 10 
rf = RandomForestClassifier(numTrees=100, maxDepth=5, maxBins=5, labelCol="V5409",featuresCol="features",seed=42)
evaluator = MulticlassClassificationEvaluator().setLabelCol("V5409").setPredictionCol("prediction").setMetricName("accuracy")
paramGrid = ParamGridBuilder().build()

pipeline = Pipeline(stages=[rf])
paramGrid=ParamGridBuilder().build()
crossval = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=paramGrid,
    evaluator=evaluator,
    numFolds=numFolds)
model = crossval.fit(trainingData)
print accuracy

我收到以下错误

Traceback (most recent call last):
  File "SparkDF.py", line 41, in <module>
    model = crossval.fit(trainingData)
  File "/usr/local/spark-2.1.1/python/pyspark/ml/base.py", line 64, in fit
    return self._fit(dataset)
  File "/usr/local/spark-2.1.1/python/pyspark/ml/tuning.py", line 236, in _fit
    model = est.fit(train, epm[j])
  File "/usr/local/spark-2.1.1/python/pyspark/ml/base.py", line 64, in fit
    return self._fit(dataset)
  File "/usr/local/spark-2.1.1/python/pyspark/ml/pipeline.py", line 108, in _fit
    model = stage.fit(dataset)
  File "/usr/local/spark-2.1.1/python/pyspark/ml/base.py", line 64, in fit
    return self._fit(dataset)
  File "/usr/local/spark-2.1.1/python/pyspark/ml/wrapper.py", line 236, in _fit
    java_model = self._fit_java(dataset)
  File "/usr/local/spark-2.1.1/python/pyspark/ml/wrapper.py", line 233, in _fit_java
    return self._java_obj.fit(dataset._jdf)
  File "/home/hadoopuser/anaconda2/lib/python2.7/site-packages/py4j/java_gateway.py", line 1160, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/local/spark-2.1.1/python/pyspark/sql/utils.py", line 79, in deco
    raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: u'Field "features" does not exist.'
hadoopuser@rackserver-PowerEdge-R220:~/workspace/RandomForest_CV$ 

请帮我解决 pySpark 中的这个问题.谢谢.

Please help me out to solve this issue in pySpark. Thank You.

我在这里展示数据集的详细信息.不,我没有专门的功能栏.下面是 trainingData.take(5) 的输出,它显示了数据集的前 5 行.

I am showing the details of dataset here. No I don't have features column specifically. Below is the output of trainingData.take(5) which displays first 5 rows of dataset.

[行(V4366=0.0, V4460=0.232, V4916=-0.017, V1495=-0.104, V1639=0.005, V1967=-0.008, V3049=0.177, V37645, V37640=-0.104, V37645=-0.104=., V5409=0), Row(V4366=0.0, V4460=0.111, V4916=-0.003, V1495=-0.137, V1639=0.001, V1967=-0.01, V3049=0.7869, V-3049=0.7867,2.57=0.467=0.0, V5409=0), 行(V4366=0.0, V4460=-0.391, V4916=-0.003, V1495=-0.155, V1639=-0.006, V1967=-0.019, V.3060=-0.396, V.3046=-0.0030.189, V524=0.001, V5409=0), 行(V4366=0.0, V4460=0.098, V4916=-0.012, V1495=-0.108, V1639=0.005, V19062,703,707=-0.097=-0.926, V524=0.002, V5409=0), 行(V4366=0.0, V4460=0.026, V4916=-0.004, V1495=-0.139, V1639=0.003, V19046=-V19046=-3-3046=-30.208, V3869=-0.782, V524=0.001, V5409=0)]

[Row(V4366=0.0, V4460=0.232, V4916=-0.017, V1495=-0.104, V1639=0.005, V1967=-0.008, V3049=0.177, V3746=-0.675, V3869=-3.451, V524=0.004, V5409=0), Row(V4366=0.0, V4460=0.111, V4916=-0.003, V1495=-0.137, V1639=0.001, V1967=-0.01, V3049=0.01, V3746=-0.867, V3869=-2.759, V524=0.0, V5409=0), Row(V4366=0.0, V4460=-0.391, V4916=-0.003, V1495=-0.155, V1639=-0.006, V1967=-0.019, V3049=-0.706, V3746=0.166, V3869=0.189, V524=0.001, V5409=0), Row(V4366=0.0, V4460=0.098, V4916=-0.012, V1495=-0.108, V1639=0.005, V1967=-0.002, V3049=0.033, V3746=-0.787, V3869=-0.926, V524=0.002, V5409=0), Row(V4366=0.0, V4460=0.026, V4916=-0.004, V1495=-0.139, V1639=0.003, V1967=-0.006, V3049=-0.045, V3746=-0.208, V3869=-0.782, V524=0.001, V5409=0)]

其中 V433 到 V524 是特征.V5409是类标签.

where V433 to V524 are features. V5409 is the class label.

推荐答案

Spark 数据帧不像在 Spark ML 中那样使用;您的所有特征都需要是单个列中的向量,通常命名为features.以下是使用您提供的 5 行作为示例的方法:

Spark dataframes are not used like that in Spark ML; all your features need to be vectors in a single column, usually named features. Here is how you can do it using the 5 rows you have provided as an example:

spark.version
# u'2.2.0'

from pyspark.sql import Row
from pyspark.ml.linalg import Vectors

# your sample data:
temp_df = spark.createDataFrame([Row(V4366=0.0, V4460=0.232, V4916=-0.017, V1495=-0.104, V1639=0.005, V1967=-0.008, V3049=0.177, V3746=-0.675, V3869=-3.451, V524=0.004, V5409=0), Row(V4366=0.0, V4460=0.111, V4916=-0.003, V1495=-0.137, V1639=0.001, V1967=-0.01, V3049=0.01, V3746=-0.867, V3869=-2.759, V524=0.0, V5409=0), Row(V4366=0.0, V4460=-0.391, V4916=-0.003, V1495=-0.155, V1639=-0.006, V1967=-0.019, V3049=-0.706, V3746=0.166, V3869=0.189, V524=0.001, V5409=0), Row(V4366=0.0, V4460=0.098, V4916=-0.012, V1495=-0.108, V1639=0.005, V1967=-0.002, V3049=0.033, V3746=-0.787, V3869=-0.926, V524=0.002, V5409=0), Row(V4366=0.0, V4460=0.026, V4916=-0.004, V1495=-0.139, V1639=0.003, V1967=-0.006, V3049=-0.045, V3746=-0.208, V3869=-0.782, V524=0.001, V5409=0)])

trainingData=temp_df.rdd.map(lambda x:(Vectors.dense(x[0:-1]), x[-1])).toDF(["features", "label"])
trainingData.show()
# +--------------------+-----+ 
# |            features|label|
# +--------------------+-----+
# |[-0.104,0.005,-0....|    0| 
# |[-0.137,0.001,-0....|    0|
# |[-0.155,-0.006,-0...|    0|
# |[-0.108,0.005,-0....|    0|
# |[-0.139,0.003,-0....|    0|
# +--------------------+-----+

之后,您的管道应该运行良好(我假设您确实具有多类分类,因为您的样本仅包含 0 作为标签),只需更改 rf 中的标签列和evaluator 如下:

after which, your pipeline should run fine (I am assuming that indeed you have multi-class classification, since your sample contains only 0's as labels) with only changing the label column in your rf and evaluator as follows:

rf = RandomForestClassifier(numTrees=100, maxDepth=5, maxBins=5, labelCol="label",featuresCol="features",seed=42)
evaluator = MulticlassClassificationEvaluator().setLabelCol("label").setPredictionCol("prediction").setMetricName("accuracy")

最后,打印精度 将不起作用 - 您需要 model.avgMetrics 代替.

Finally, print accuracy will not work - you'll need model.avgMetrics instead.

这篇关于pyspark.sql.utils.IllegalArgumentException: u'Field "features";不存在.'的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆