星火不利用所有核心同时运行LinearRegressionwithSGD [英] Spark not utilizing all the core while running LinearRegressionwithSGD

查看:590
本文介绍了星火不利用所有核心同时运行LinearRegressionwithSGD的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我我的本地机器(16G,8个CPU内核)上运行的火花。我试图在训练300MB大小的数据线性回归模型。我检查了CPU的统计数据,也是程序运行时,它只是执行一个线程。
文档说,他们已经实现了SGD的分布式版本。
http://spark.apache.org/docs/latest/mllib-linear-methods.html#implementation-developer

 从pyspark.mllib.regression进口LabeledPoint,LinearRegressionWithSGD,LinearRegressionModel
从pyspark进口SparkContext
高清parsePoint(线):
  值= [浮动(X)在line.replace X('','').split('')]
  返回LabeledPoint(值[0],值[1:])SC = SparkContext(本地,线性注册简单)
数据= sc.textFile(/家庭/ guptap / Dropbox的/ spark_opt /的test.txt)
data.cache()
parsedData = Data.Map中(parsePoint)
模型= LinearRegressionWithSGD.train(parsedData)values​​And preDS = parsedData.map(拉姆达号码:(p.label,型号为predict(p.features)))
MSE = values​​And preds.map(拉姆达(V,P):(ⅴ - P)** 2)。降低(拉姆达X,Y:X + Y)/ values​​And preds.count()
打印(均方误差=+ STR(MSE))
model.save(SCmyModelPath)
sameModel = LinearRegressionModel.load(SC,myModelPath)


解决方案

我想你想要做的是明确规定核心与当地的上下文中使用的数量。你可以从注释中看到<一个href=\"https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkConf.scala#L77\"相对=nofollow>这里,本地(这是你在做什么)在一个线程中实例化上下文,而 本地[4]将有4个内核上运行。我相信你也可以使用本地[*]在您的系统上的所有内核上运行。

I am running Spark on my local machine (16G,8 cpu cores). I was trying to train linear regression model on dataset of size 300MB. I checked the cpu statistics and also the programs running, it just executes one thread. The documentation says they have implemented distributed version of SGD. http://spark.apache.org/docs/latest/mllib-linear-methods.html#implementation-developer

from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD, LinearRegressionModel
from pyspark import SparkContext


def parsePoint(line):
  values = [float(x) for x in line.replace(',', ' ').split(' ')]
  return LabeledPoint(values[0], values[1:])

sc = SparkContext("local", "Linear Reg Simple")
data = sc.textFile("/home/guptap/Dropbox/spark_opt/test.txt")
data.cache()
parsedData = data.map(parsePoint)


model = LinearRegressionWithSGD.train(parsedData)

valuesAndPreds = parsedData.map(lambda p: (p.label,model.predict(p.features)))
MSE = valuesAndPreds.map(lambda (v, p): (v - p)**2).reduce(lambda x, y: x + y) / valuesAndPreds.count()
print("Mean Squared Error = " + str(MSE))


model.save(sc, "myModelPath")
sameModel = LinearRegressionModel.load(sc, "myModelPath")

解决方案

I think what you want to do is explicitly state the number of cores to use with the local context. As you can see from the comments here, "local" (which is what you're doing) instantiates a context on one thread whereas "local[4]" will run with 4 cores. I believe you can also use "local[*]" to run on all cores on your system.

这篇关于星火不利用所有核心同时运行LinearRegressionwithSGD的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆