Spark partitioning/cluster enforcing


Problem Description

I will be using a large number of files structured as follows:

/day/hour-min.txt.gz

with a total of 14 days. I will use a cluster of 90 nodes/workers.

I am reading everything with wholeTextFiles(), as it is the only way that allows me to split the data appropriately. All the computations will be done on a per-minute basis (so basically per file), with a few reduce steps at the end. There are roughly 20,000 files; how do I partition them efficiently? Do I let Spark decide?

Ideally, I think each node should receive entire files; does Spark do that by default? Can I enforce it? How?
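For reference, here is a minimal Scala sketch of the setup described above; the HDFS glob, the minPartitions hint, and the line-counting logic are illustrative placeholders rather than details taken from the question:

import org.apache.spark.{SparkConf, SparkContext}

object PerMinuteJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("per-minute-files"))

    // Each record is a (filePath, fileContent) pair. The glob and the
    // minPartitions hint below are placeholders, not values from the question.
    val files = sc.wholeTextFiles("hdfs:///data/*/*.txt.gz", minPartitions = 1000)

    // Per-minute (i.e. per-file) computation: placeholder logic that just
    // counts the lines of each file, keyed by its path.
    val perMinute = files.mapValues(content => content.split("\n").length.toLong)

    // A final reduce step across all minutes, e.g. a global total.
    val total = perMinute.values.reduce(_ + _)
    println(s"Total lines across all files: $total")

    sc.stop()
  }
}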

Recommended Answer

I think each node should receive entire files; does Spark do that by default?

Yes. WholeTextFileRDD (what you get back from sc.wholeTextFiles) has its own WholeTextFileInputFormat, which reads each whole file as a single record, so you're covered. If your Spark executors and datanodes are co-located, you can also expect node-local (NODE_LOCAL) data locality. (You can check this in the Spark UI once your application is running.)
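A quick way to confirm this on a running application (the path below is hypothetical) is to inspect the RDD directly; per-task locality levels such as NODE_LOCAL appear in the Stages tab of the Spark UI:

// Assuming an existing SparkContext `sc`, e.g. in spark-shell.
// Each element of the RDD is one (path, whole-file content) pair,
// so a single minute file is never split across records.
val files = sc.wholeTextFiles("hdfs:///data/day01/*.txt.gz")

// How many partitions Spark actually created for these files.
println(s"Partitions: ${files.partitions.length}")

// Sanity check: the first record carries the full content of exactly one file.
val (path, content) = files.first()
println(s"$path -> ${content.length} characters")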

A word of caution from the note in the Spark documentation for sc.wholeTextFiles:

Small files are preferred, large file is also allowable, but may cause bad performance.
