Spark partitioning/cluster enforcing

Question
I will be using a large amount of files structured as follows:
/day/hour-min.txt.gz
with a total of 14 days. I will use a cluster of 90 nodes/workers.
I am reading everything with wholeTextFiles() as it is the only way that allows me to split the data appropriately. All the computations will be done on a per-minute basis (so basically per file), with a few reduce steps at the end. There are roughly 20,000 files; how do I partition them efficiently? Do I let Spark decide?
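To make the setup concrete, here is a minimal PySpark sketch of the described pipeline: key each `(path, content)` record from `wholeTextFiles()` by the `(day, hour, minute)` encoded in the `/day/hour-min.txt.gz` path, compute per-file statistics, then reduce. The names `minute_stats` and `run`, the line-count stand-in for the real per-minute computation, and the example root path are all assumptions for illustration, not part of the original question.

```python
import re

def path_to_key(path):
    """Extract (day, hour, minute) from a path shaped like '/day/hour-min.txt.gz'."""
    m = re.search(r"/(\d+)/(\d+)-(\d+)\.txt\.gz$", path)
    if m is None:
        raise ValueError("unexpected path: " + path)
    day, hour, minute = (int(g) for g in m.groups())
    return day, hour, minute

def minute_stats(text):
    # Hypothetical per-minute computation; here just a line count.
    return text.count("\n")

def run(sc, root, num_partitions=270):
    # minPartitions is only a hint to Spark; each file still stays whole,
    # because WholeTextFileInputFormat emits one record per file.
    rdd = sc.wholeTextFiles(root, minPartitions=num_partitions)
    per_minute = rdd.map(lambda kv: (path_to_key(kv[0]), minute_stats(kv[1])))
    return per_minute.reduceByKey(lambda a, b: a + b)  # the final reduce steps
```

The key point is that no single file is ever split across partitions here; `minPartitions` only influences how the ~20,000 whole-file records are grouped into splits.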
Ideally, I think each node should receive entire files; does Spark do that by default? Can I enforce it? How?
Answer

I think each node should receive entire files; does Spark do that by default?
Yes, given that WholeTextFileRDD (what you get back from sc.wholeTextFiles) has its own WholeTextFileInputFormat to read each whole file as a single record, you're covered. If your Spark executors and datanodes are co-located, you can also expect node-local data locality. (You can check this in the Spark UI once your application is running.)
A word of caution from the note in the Spark documentation for sc.wholeTextFiles:
Small files are preferred, large file is also allowable, but may cause bad performance.