'SparkSession' object has no attribute 'serializer' when evaluating a classifier in Pyspark
Problem description
I am using Apache Spark in batch mode. I have set up an entire pipeline that transforms text into TFIDF vectors and then predicts a boolean class using logistic regression:
# Chain previously created feature transformers, indexers and regression in a Pipeline
pipeline = Pipeline(stages=[tokenizer, hashingTF, idf,
                            labelIndexer, featureIndexer, lr])
#Fit the full model to the training data
model = pipeline.fit(trainingData)
#Predict test data
predictions = model.transform(testData)
I can examine predictions, which is a Spark dataframe, and it is what I expect it to be.
Next, I want to see a confusion matrix, so I convert the scores and labels to an RDD and pass that to BinaryClassificationMetrics():
predictionAndLabels = predictions.select('prediction','label').rdd
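One caveat worth flagging (an assumption based on the pyspark.mllib API, not something stated in the question): BinaryClassificationMetrics expects an RDD of (score, label) pairs of floats, while .select(...).rdd yields Row objects, so a mapping step such as predictionAndLabels.map(lambda row: (float(row[0]), float(row[1]))) is often added. Since a Row behaves like a tuple under positional indexing, the mapping itself can be sketched locally with plain tuples standing in for Rows:

```python
# Illustrative sketch only: plain tuples stand in for pyspark Row objects,
# which also support positional indexing like row[0], row[1].
to_float_pair = lambda row: (float(row[0]), float(row[1]))

rows = [(1.0, 1), (0.0, 0), (1.0, 0)]  # hypothetical collected (prediction, label) rows
pairs = [to_float_pair(r) for r in rows]
print(pairs)
```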
Finally, I pass that to BinaryClassificationMetrics:
metrics = BinaryClassificationMetrics(predictionAndLabels) #this errors out
Here is the error:
AttributeError: 'SparkSession' object has no attribute 'serializer'
This error is not helpful, and searching for it turns up a broad spectrum of issues. The only thing I've found that seems similar is this post, which has no answers: How to resolve error "AttributeError: 'SparkSession' object has no attribute 'serializer'"?
Thanks for any assistance!
Answer
For posterity's sake, here's what I did to fix this. When I initiated the Spark session and the SQL context, I was doing this, which is not right:
sc = SparkSession.builder.appName('App Name').master("local[*]").getOrCreate()
sqlContext = SQLContext(sc)
The problem was resolved by doing this instead:
sc = SparkSession.builder.appName('App Name').master("local[*]").getOrCreate()
sqlContext = SQLContext(sparkContext=sc.sparkContext, sparkSession=sc)
I'm not sure why that needed to be explicit, and would welcome clarification from the community if someone knows.
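A note on the original goal: the question asks for a confusion matrix, but BinaryClassificationMetrics mainly exposes threshold metrics such as areaUnderROC and areaUnderPR (MulticlassMetrics is the pyspark.mllib class that has a confusionMatrix method). As a minimal, pyspark-free sketch of what that tally computes, assuming the (prediction, label) pairs have been collected locally and both values are 0.0/1.0 floats:

```python
from collections import Counter

# Hypothetical collected (prediction, label) pairs; the data here is
# illustrative, not taken from the post.
pairs = [(1.0, 1.0), (1.0, 0.0), (0.0, 0.0), (0.0, 0.0), (1.0, 1.0)]

# Tally each (prediction, label) combination into the four confusion cells.
counts = Counter(pairs)
tp = counts[(1.0, 1.0)]  # predicted positive, actually positive
fp = counts[(1.0, 0.0)]  # predicted positive, actually negative
tn = counts[(0.0, 0.0)]  # predicted negative, actually negative
fn = counts[(0.0, 1.0)]  # predicted negative, actually positive
print(tp, fp, tn, fn)
```

For a full Spark dataframe this counting would of course be done distributed (e.g. with groupBy on the two columns) rather than after a collect; the sketch only shows the bookkeeping.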