Spark foreachPartition runs only on master


Problem description


I have a Dataproc cluster with one master and 4 workers. I have this Spark job:

JavaRDD&lt;Signal&gt; rdd_data = javaSparkContext.parallelize(my_data, 8);

rdd_data.foreachPartition(partitionOfRecords -> {
    long count = 0;  // the Java API passes each partition as an Iterator<Signal>
    while (partitionOfRecords.hasNext()) { partitionOfRecords.next(); count++; }
    System.out.println("Items in partition-" + count);
});


Here my_data is an array with about 1000 elements. The job starts on the cluster in the right way and returns correct data, but it runs only on the master and not on the workers. I use Dataproc image 1.4 on every machine in the cluster.


Can anybody help me understand why this job runs only on the master?

Recommended answer

There are two points of interest here:

  1. The println call inside foreachPartition will print the expected results only when the executor is on the same node as the client running the Spark program. This happens because println writes to the stdout stream under the hood, which is accessible only on the machine where it runs, so messages from different nodes cannot be propagated to the client program.
  2. When you set master to local[1], you force Spark to run locally using one thread, so Spark and the client program share the same stdout stream and you are able to see the program output. That also means the driver and the executor are the same node.
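One way to actually see per-partition counts on the driver is to return them instead of printing them: swap foreachPartition for mapPartitions and collect() the counts on the driver (the Spark call would look roughly like rdd_data.mapPartitions(PartitionCounts::countPartition).collect(), assuming the rdd_data from the question). The per-partition function itself only works on a plain java.util.Iterator, so it can be sketched without a cluster; the partition list in main below is a hypothetical stand-in for one partition of my_data:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class PartitionCounts {
    // Body that would be passed to rdd_data.mapPartitions(...):
    // it consumes the partition's Iterator and yields a single count,
    // which Spark ships back to the driver when collect() is called.
    static Iterator<Long> countPartition(Iterator<?> partitionOfRecords) {
        long n = 0;
        while (partitionOfRecords.hasNext()) {
            partitionOfRecords.next();
            n++;
        }
        List<Long> out = new ArrayList<>();
        out.add(n);
        return out.iterator();
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for one partition of my_data:
        List<Integer> partition = List.of(1, 2, 3);
        System.out.println("Items in partition-" + countPartition(partition.iterator()).next());
        // prints "Items in partition-3"
    }
}
```

Relatedly, per point 2 above: leaving setMaster out of the SparkConf lets the value supplied by spark-submit (YARN on Dataproc) take effect, whereas hardcoding local[1] pins the driver and executor to a single node.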

