星火不利用所有核心同时运行LinearRegressionwithSGD [英] Spark not utilizing all the core while running LinearRegressionwithSGD

查看：590 发布时间：2016/5/22 15:27:56 apache-spark apache-spark-mllib

本文介绍了星火不利用所有核心同时运行LinearRegressionwithSGD的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我我的本地机器（16G，8个CPU内核）上运行的火花。我试图在训练300MB大小的数据线性回归模型。我检查了CPU的统计数据，也是程序运行时，它只是执行一个线程。
文档说，他们已经实现了SGD的分布式版本。
的http://spark.apache.org/docs/latest/mllib-linear-methods.html#implementation-developer

 从pyspark.mllib.regression进口LabeledPoint，LinearRegressionWithSGD，LinearRegressionModel
从pyspark进口SparkContext
高清parsePoint（线）：
  值= [浮动（X）在line.replace X（''，''）.split（''）]
  返回LabeledPoint（值[0]，值[1：]）SC = SparkContext（本地，线性注册简单）
数据= sc.textFile（/家庭/ guptap / Dropbox的/ spark_opt /的test.txt）
data.cache（）
parsedData = Data.Map中（parsePoint）
模型= LinearRegressionWithSGD.train（parsedData）valuesAnd preDS = parsedData.map（拉姆达号码：（p.label，型号为predict（p.features）））
MSE = valuesAnd preds.map（拉姆达（V，P）：（ⅴ -  P）** 2）。降低（拉姆达X，Y：X + Y）/ valuesAnd preds.count（）
打印（均方误差=+ STR（MSE））
model.save（SCmyModelPath）
sameModel = LinearRegressionModel.load（SC，myModelPath）

解决方案

我想你想要做的是明确规定核心与当地的上下文中使用的数量。你可以从注释中看到<一个href=\"https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkConf.scala#L77\"相对=nofollow>这里，本地（这是你在做什么）在一个线程中实例化上下文，而 本地[4]将有4个内核上运行。我相信你也可以使用本地[*]在您的系统上的所有内核上运行。

I am running Spark on my local machine (16G,8 cpu cores). I was trying to train linear regression model on dataset of size 300MB. I checked the cpu statistics and also the programs running, it just executes one thread. The documentation says they have implemented distributed version of SGD. http://spark.apache.org/docs/latest/mllib-linear-methods.html#implementation-developer

from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD, LinearRegressionModel
from pyspark import SparkContext


def parsePoint(line):
  values = [float(x) for x in line.replace(',', ' ').split(' ')]
  return LabeledPoint(values[0], values[1:])

sc = SparkContext("local", "Linear Reg Simple")
data = sc.textFile("/home/guptap/Dropbox/spark_opt/test.txt")
data.cache()
parsedData = data.map(parsePoint)


model = LinearRegressionWithSGD.train(parsedData)

valuesAndPreds = parsedData.map(lambda p: (p.label,model.predict(p.features)))
MSE = valuesAndPreds.map(lambda (v, p): (v - p)**2).reduce(lambda x, y: x + y) / valuesAndPreds.count()
print("Mean Squared Error = " + str(MSE))


model.save(sc, "myModelPath")
sameModel = LinearRegressionModel.load(sc, "myModelPath")

解决方案

I think what you want to do is explicitly state the number of cores to use with the local context. As you can see from the comments here, "local" (which is what you're doing) instantiates a context on one thread whereas "local[4]" will run with 4 cores. I believe you can also use "local[*]" to run on all cores on your system.

这篇关于星火不利用所有核心同时运行LinearRegressionwithSGD的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

星火不利用所有核心同时运行LinearRegressionwithSGD [英] Spark not utilizing all the core while running LinearRegressionwithSGD

问题描述

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

星火不利用所有核心同时运行LinearRegressionwithSGD [英] Spark not utilizing all the core while running LinearRegressionwithSGD

问题描述

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭