Error with training logistic regression model on Apache Spark. SPARK-5063


Problem description

I am trying to build a Logistic Regression model with Apache Spark. Here is the code.

from pyspark.mllib.feature import StandardScaler
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithSGD

parsedData = raw_data.map(mapper)  # mapper is a function that generates a (label, feature vector) pair as a LabeledPoint object
featureVectors = parsedData.map(lambda point: point.features)  # get feature vectors from parsed data
scaler = StandardScaler(True, True).fit(featureVectors)  # this creates a standardization model to scale the features
scaledData = parsedData.map(lambda lp: LabeledPoint(lp.label, scaler.transform(lp.features)))  # transform the features to zero mean and unit std deviation
modelScaledSGD = LogisticRegressionWithSGD.train(scaledData, iterations=10)

But I am getting this error:

Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that runs on workers. For more information, see SPARK-5063.

I am not sure how to work around this. Any help would be greatly appreciated.

Recommended answer

The problem you see is pretty much the same as the one I've described in How to use Java/Scala function from an action or a transformation? To transform, you have to call a Scala function, and it requires access to the SparkContext, hence the error you see.

The standard way to handle this is to process only the required part of your data and then zip the results.

labels = parsedData.map(lambda point: point.label)
featuresTransformed = scaler.transform(featureVectors)  # transform the whole RDD on the driver side

scaledData = (labels
    .zip(featuresTransformed)
    .map(lambda p: LabeledPoint(p[0], p[1])))

modelScaledSGD = LogisticRegressionWithSGD.train(...)

If you don't plan to implement your own methods based on MLlib components, it could be easier to use the high-level ML API.

Edit:

There are two possible problems here.

  1. At this point LogisticRegressionWithSGD supports only binomial classification (thanks to eliasah for pointing that out). If you need multiclass classification you can replace it with LogisticRegressionWithLBFGS.
  2. StandardScaler supports only dense vectors, so it has limited applications.
