Error with training logistic regression model on Apache Spark. SPARK-5063
Problem description
I am trying to build a Logistic Regression model with Apache Spark. Here is the code.
parsedData = raw_data.map(mapper) # mapper is a function that generates a pair of label and feature vector as a LabeledPoint object
featureVectors = parsedData.map(lambda point: point.features) # get feature vectors from the parsed data
scaler = StandardScaler(True, True).fit(featureVectors) # this creates a standardization model to scale the features
scaledData = parsedData.map(lambda lp: LabeledPoint(lp.label, scaler.transform(lp.features))) # transform the features to zero mean and unit standard deviation
modelScaledSGD = LogisticRegressionWithSGD.train(scaledData, iterations = 10)
But I get this error:
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
I am not sure how to work around this. Any help would be greatly appreciated.
Recommended answer
The problem you see is pretty much the same as the one I've described in "How to use Java/Scala function from an action or a transformation?". To transform your data you have to call a Scala function, and it requires access to the SparkContext, hence the error you see.
The standard way to handle this is to process only the required part of your data and then zip the results.
labels = parsedData.map(lambda point: point.label)
featuresTransformed = scaler.transform(featureVectors)
scaledData = (labels
.zip(featuresTransformed)
.map(lambda p: LabeledPoint(p[0], p[1])))
modelScaledSGD = LogisticRegressionWithSGD.train(...)
If you don't plan to implement your own methods based on MLlib components, it could be easier to use the high-level ML API.
Edit
There are two possible problems here.
- At this point LogisticRegressionWithSGD supports only binomial classification (thanks to eliasah for pointing that out). If you need multiclass classification you can replace it with LogisticRegressionWithLBFGS.
- StandardScaler supports only dense vectors, so it has limited applications.