Low efficiency of Spark's slaves


Problem description

我是火花新手,我需要建议...

I'm a spark newbie and I need advice...

I'm testing Spark's capabilities on a standalone cluster consisting of one master machine (8 CPUs, 16 GB RAM), which runs start-master.sh and launches the application, and two slave machines (each 8 CPUs, 16 GB RAM), which run spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT and should be used for distributed computation. The cluster starts successfully: the slaves connect to the master and the application runs correctly.

The problem is that I don't see any performance gain compared with running the application on a single local machine... As before, all CPUs of the master machine are fully loaded while the CPUs of the slave machines are almost idle. Maybe somebody can name the typical situations in which this happens? I understand that I have given little information about my problem and configuration, but I don't know where to start or what is really important in this case.

Sample code:

SparkConf conf = new SparkConf()
        .setAppName("myapplication")
        .setMaster("spark://mastermachine:7077")
        .setJars(new String[] {"target/myapplication-0.0.1-SNAPSHOT-driver.jar"})
        .set("spark.home", "path_to_spark");
JavaSparkContext sc = new JavaSparkContext(conf);

...

// "Document" - is my data structure that keeps text from file

List<Document> doclist = fileNameList.stream().parallel().flatMap(docName -> Arrays.asList(getDoc(docName)).stream()).collect(Collectors.toList());

JavaRDD<Document> rdd = sc.parallelize(doclist);

Set<String> words = rdd.collect().stream().parallel().flatMap(doc -> doc.getWords().stream()).collect(Collectors.toSet());

sc.stop();

That is, I'm extracting the words from the text files here and collecting them into a Set.

Recommended answer

Given the provided code, it's clear that you are not seeing any parallel distributed processing because Spark is basically not being used.

collect, in Spark lingo, is an action that retrieves the distributed RDD data to the driver, so when doing this:

// parallelize the document list to distribute over the Spark cluster
JavaRDD<Document> rdd = sc.parallelize(doclist);
// get all documents back to the driver
List<Document> docs = rdd.collect();

After issuing collect, the collection distributed in the previous step is brought back to the driver. All further operations happen on the driver and you won't see any distributed computation happening.

A correct Spark version of that code (shown here in Scala) would be:

val docList = ???
val rdd = sc.parallelize(docList)
val wordsRdd = rdd.flatMap(doc => doc.getWords)

In this example, the flatMap function will be distributed as tasks over the cluster and executed on the workers.
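
Since the question uses the Java API, here is a rough Java equivalent of the same idea: a minimal sketch reusing sc and doclist from the question's code. It assumes the Spark 1.x Java API, where the flatMap lambda returns an Iterable; on Spark 2.x and later it would have to return an iterator, e.g. doc.getWords().iterator().

import java.util.HashSet;
import java.util.Set;
import org.apache.spark.api.java.JavaRDD;

// distribute the documents over the cluster
JavaRDD<Document> rdd = sc.parallelize(doclist);
// the flatMap transformation runs as tasks on the executors, not on the driver
JavaRDD<String> wordsRdd = rdd.flatMap(doc -> doc.getWords());
// deduplicate on the cluster, then bring only the distinct words back
Set<String> words = new HashSet<>(wordsRdd.distinct().collect());

Only the final collect moves data to the driver, and by that point it is just the deduplicated words rather than the full documents.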
