How to run this code on Spark Cluster mode

Problem description

I want to run my code on a cluster. My code:

import java.util.Properties

import edu.stanford.nlp.ling.CoreAnnotations._
import edu.stanford.nlp.pipeline._
import org.apache.spark.{SparkConf, SparkContext}

import scala.collection.JavaConversions._
import scala.collection.mutable.ArrayBuffer

object Pre2 {

  def plainTextToLemmas(text: String, pipeline: StanfordCoreNLP): Seq[String] = {
    val doc = new Annotation(text)
    pipeline.annotate(doc)
    val lemmas = new ArrayBuffer[String]()
    val sentences = doc.get(classOf[SentencesAnnotation])
    for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation])) {
      val lemma = token.get(classOf[LemmaAnnotation])
      if (lemma.length > 0) {
        lemmas += lemma.toLowerCase
      }
    }
    lemmas
  }

  def main(args: Array[String]): Unit = {

    // Note: setMaster("local") hard-codes local mode and takes precedence
    // over any --master flag passed to spark-submit.
    val conf = new SparkConf()
      .setMaster("local")
      .setAppName("pre2")

    val sc = new SparkContext(conf)
    val plainText = sc.textFile("data/in.txt")
    // Build one CoreNLP pipeline per partition so it is not serialized per record.
    val lemmatized = plainText.mapPartitions(p => {
      val props = new Properties()
      props.put("annotators", "tokenize, ssplit, pos, lemma")
      val pipeline = new StanfordCoreNLP(props)
      p.map(q => plainTextToLemmas(q, pipeline))
    })
    // Join each line's lemmas with spaces; mkString is also safe for empty
    // lines, unlike l.head + l.tail.mkString(" "), which throws on an empty Seq.
    val lemmatized1 = lemmatized.map(l => l.mkString(" "))
    val lemmatized2 = lemmatized1.filter(_.nonEmpty)
    lemmatized2.coalesce(1).saveAsTextFile("data/out.txt")
  }
}

And the cluster specs:

2 nodes

each node has: 60 GB RAM

each node has: 48 cores

shared disk

I installed Spark on this cluster; one of the nodes acts as both master and worker, and the other node is a worker.

When I run my code with this command in the terminal:

./bin/spark-submit --master spark://192.168.1.20:7077 --class Main --deploy-mode cluster code/Pre2.jar

it shows:

15/08/19 15:27:21 WARN RestSubmissionClient: Unable to connect to server spark://192.168.1.20:7077.
Warning: Master endpoint spark://192.168.1.20:7077 was not a REST server. Falling back to legacy submission gateway instead.
15/08/19 15:27:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Driver successfully submitted as driver-20150819152724-0002
... waiting before polling master for driver state
... polling master for driver state
State of driver-20150819152724-0002 is RUNNING
Driver running on 192.168.1.19:33485 (worker-20150819115013-192.168.1.19-33485)

How can I run the above code on a Spark standalone cluster?

Recommended answer

Make sure you check the WebUI on port 8080. In your example it would be 192.168.1.20:8080.

If you are running it in Spark standalone cluster mode, try it without --deploy-mode cluster, and hard-code your nodes' memory by adding --executor-memory 60g.
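For example, the submission could then look like this (a sketch, with two assumptions that go beyond the answer as given: the main class is Pre2, matching the object name in the posted code rather than the Main in the original command, and the hard-coded .setMaster("local") is removed from the code so that the --master flag actually takes effect):

./bin/spark-submit --master spark://192.168.1.20:7077 --class Pre2 --executor-memory 60g code/Pre2.jar

The corresponding change in the Scala code would be to build the SparkConf without a master, for example:

// Sketch: omit setMaster so spark-submit's --master flag decides where the job runs.
val conf = new SparkConf()
  .setAppName("pre2")
val sc = new SparkContext(conf)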
