Spark partitioning/cluster enforcing

Question
I will be using a large amount of files structured as follows:
/day/hour-min.txt.gz
with a total of 14 days. I will use a cluster of 90 nodes/workers.
I am reading everything with wholeTextFiles() as it is the only way that allows me to split the data appropriately. All the computations will be done on a per-minute basis (so basically per file), with a few reduce steps at the end. There are roughly 20,000 files; how do I partition them efficiently? Do I let Spark decide?
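To make the setup concrete, here is a minimal PySpark sketch of the described pipeline: key each `(path, content)` record from `wholeTextFiles()` by the `(day, hour, minute)` encoded in the `/day/hour-min.txt.gz` path, compute per-file statistics, then reduce. The names `minute_stats` and `run`, the line-count stand-in for the real per-minute computation, and the example root path are all assumptions for illustration, not part of the original question.

```python
import re

def path_to_key(path):
    """Extract (day, hour, minute) from a path shaped like '/day/hour-min.txt.gz'."""
    m = re.search(r"/(\d+)/(\d+)-(\d+)\.txt\.gz$", path)
    if m is None:
        raise ValueError("unexpected path: " + path)
    day, hour, minute = (int(g) for g in m.groups())
    return day, hour, minute

def minute_stats(text):
    # Hypothetical per-minute computation; here just a line count.
    return text.count("\n")

def run(sc, root, num_partitions=270):
    # minPartitions is only a hint to Spark; each file still stays whole,
    # because WholeTextFileInputFormat emits one record per file.
    rdd = sc.wholeTextFiles(root, minPartitions=num_partitions)
    per_minute = rdd.map(lambda kv: (path_to_key(kv[0]), minute_stats(kv[1])))
    return per_minute.reduceByKey(lambda a, b: a + b)  # the final reduce steps
```

The key point is that no single file is ever split across partitions here; `minPartitions` only influences how the ~20,000 whole-file records are grouped into splits.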
Ideally, I think each node should receive entire files; does Spark do that by default? Can I enforce it? How?
Answer

I think each node should receive entire files; does Spark do that by default?
Yes, given that WholeTextFileRDD (what you get back from sc.wholeTextFiles) has its own WholeTextFileInputFormat to read each whole file as a single record, you're covered. If your Spark executors and datanodes are co-located, you can also expect node-local data locality. (You can check this in the Spark UI once your application is running.)
A word of caution from the note in the Spark documentation for sc.wholeTextFiles:
Small files are preferred, large file is also allowable, but may cause bad performance.