星火不利用所有核心同时运行LinearRegressionwithSGD [英] Spark not utilizing all the core while running LinearRegressionwithSGD
问题描述
我我的本地机器(16G,8个CPU内核)上运行的火花。我试图在训练300MB大小的数据线性回归模型。我检查了CPU的统计数据,也是程序运行时,它只是执行一个线程。
文档说,他们已经实现了SGD的分布式版本。
的http://spark.apache.org/docs/latest/mllib-linear-methods.html#implementation-developer
从pyspark.mllib.regression进口LabeledPoint,LinearRegressionWithSGD,LinearRegressionModel
从pyspark进口SparkContext
高清parsePoint(线):
值= [浮动(X)在line.replace X('','').split('')]
返回LabeledPoint(值[0],值[1:])SC = SparkContext(本地,线性注册简单)
数据= sc.textFile(/家庭/ guptap / Dropbox的/ spark_opt /的test.txt)
data.cache()
parsedData = Data.Map中(parsePoint)
模型= LinearRegressionWithSGD.train(parsedData)valuesAnd preDS = parsedData.map(拉姆达号码:(p.label,型号为predict(p.features)))
MSE = valuesAnd preds.map(拉姆达(V,P):(ⅴ - P)** 2)。降低(拉姆达X,Y:X + Y)/ valuesAnd preds.count()
打印(均方误差=+ STR(MSE))
model.save(SCmyModelPath)
sameModel = LinearRegressionModel.load(SC,myModelPath)
我想你想要做的是明确规定核心与当地的上下文中使用的数量。你可以从注释中看到<一个href=\"https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkConf.scala#L77\"相对=nofollow>这里,本地
(这是你在做什么)在一个线程中实例化上下文,而 本地[4]
将有4个内核上运行。我相信你也可以使用本地[*]
在您的系统上的所有内核上运行。
I am running Spark on my local machine (16G,8 cpu cores). I was trying to train linear regression model on dataset of size 300MB. I checked the cpu statistics and also the programs running, it just executes one thread. The documentation says they have implemented distributed version of SGD. http://spark.apache.org/docs/latest/mllib-linear-methods.html#implementation-developer
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD, LinearRegressionModel
from pyspark import SparkContext
def parsePoint(line):
values = [float(x) for x in line.replace(',', ' ').split(' ')]
return LabeledPoint(values[0], values[1:])
sc = SparkContext("local", "Linear Reg Simple")
data = sc.textFile("/home/guptap/Dropbox/spark_opt/test.txt")
data.cache()
parsedData = data.map(parsePoint)
model = LinearRegressionWithSGD.train(parsedData)
valuesAndPreds = parsedData.map(lambda p: (p.label,model.predict(p.features)))
MSE = valuesAndPreds.map(lambda (v, p): (v - p)**2).reduce(lambda x, y: x + y) / valuesAndPreds.count()
print("Mean Squared Error = " + str(MSE))
model.save(sc, "myModelPath")
sameModel = LinearRegressionModel.load(sc, "myModelPath")
I think what you want to do is explicitly state the number of cores to use with the local context. As you can see from the comments here, "local"
(which is what you're doing) instantiates a context on one thread whereas "local[4]"
will run with 4 cores. I believe you can also use "local[*]"
to run on all cores on your system.
这篇关于星火不利用所有核心同时运行LinearRegressionwithSGD的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!