How to run this code on Spark Cluster mode
Question
I want to run my code on a cluster. My code:
import java.util.Properties
import edu.stanford.nlp.ling.CoreAnnotations._
import edu.stanford.nlp.pipeline._
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.JavaConversions._
import scala.collection.mutable.ArrayBuffer
object Pre2 {

  def plainTextToLemmas(text: String, pipeline: StanfordCoreNLP): Seq[String] = {
    val doc = new Annotation(text)
    pipeline.annotate(doc)
    val lemmas = new ArrayBuffer[String]()
    val sentences = doc.get(classOf[SentencesAnnotation])
    for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation])) {
      val lemma = token.get(classOf[LemmaAnnotation])
      if (lemma.length > 0) {
        lemmas += lemma.toLowerCase
      }
    }
    lemmas
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local")
      .setAppName("pre2")
    val sc = new SparkContext(conf)
    val plainText = sc.textFile("data/in.txt")
    // Build one CoreNLP pipeline per partition rather than per record
    val lemmatized = plainText.mapPartitions(p => {
      val props = new Properties()
      props.put("annotators", "tokenize, ssplit, pos, lemma")
      val pipeline = new StanfordCoreNLP(props)
      p.map(q => plainTextToLemmas(q, pipeline))
    })
    // mkString joins all lemmas with spaces; the original l.head + l.tail.mkString(" ")
    // dropped the space after the first lemma and crashed on empty sequences
    val lemmatized1 = lemmatized.map(l => l.mkString(" "))
    val lemmatized2 = lemmatized1.filter(_.nonEmpty)
    lemmatized2.coalesce(1).saveAsTextFile("data/out.txt")
  }
}
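One thing worth checking in the code above: setMaster("local") in the SparkConf takes precedence over the --master flag passed to spark-submit, so the job will run locally on the driver machine even when submitted to the cluster. A minimal sketch of the usual fix (leaving the master out of the code so the command line decides) could look like this:

```scala
// Sketch: omit setMaster so spark-submit's --master flag decides where the job runs.
import org.apache.spark.{SparkConf, SparkContext}

object Pre2 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("pre2") // no setMaster here
    val sc = new SparkContext(conf)
    // ... rest of the job unchanged ...
    sc.stop()
  }
}
```

With the master hard-coded only on the command line, the same jar can be tested locally with --master local[*] and deployed to the standalone cluster without recompiling.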
And the cluster specs:
2 nodes
each node has: 60g RAM
each node has: 48 cores
shared disk
I installed Spark on this cluster; one of the nodes acts as both master and worker, and the other node is a worker.
When I run my code with this command in a terminal:
./bin/spark-submit --master spark://192.168.1.20:7077 --class Main --deploy-mode cluster code/Pre2.jar
it shows:
15/08/19 15:27:21 WARN RestSubmissionClient: Unable to connect to server spark://192.168.1.20:7077.
Warning: Master endpoint spark://192.168.1.20:7077 was not a REST server. Falling back to legacy submission gateway instead.
15/08/19 15:27:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Driver successfully submitted as driver-20150819152724-0002
... waiting before polling master for driver state
... polling master for driver state
State of driver-20150819152724-0002 is RUNNING
Driver running on 192.168.1.19:33485 (worker-20150819115013-192.168.1.19-33485)
How can I run the above code on a Spark standalone cluster?
Answer
Make sure you check out the WebUI on port 8080. In your example it would be 192.168.1.20:8080.
If you are running it in Spark standalone cluster mode, try it without --deploy-mode cluster, and hard-code your executor memory by adding --executor-memory 60g.
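Putting the answer's suggestions together, the submit command would look roughly like this. This is a sketch based on the question's paths, not a verified command: the jar path code/Pre2.jar is taken from the question, and note that the question passed --class Main although the object in the code is named Pre2.

```shell
# Client mode (the default when --deploy-mode is omitted), with executor memory pinned:
./bin/spark-submit \
  --master spark://192.168.1.20:7077 \
  --class Pre2 \
  --executor-memory 60g \
  code/Pre2.jar
```

In client mode the driver runs in the submitting terminal, so the relative paths data/in.txt and data/out.txt resolve on that machine (and on the shared disk the cluster uses), and stdout/stderr appear directly in the terminal instead of in the worker's driver logs.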