How to split parquet files into many partitions in Spark?

Problem description

So I have just one parquet file that I'm reading with Spark (using the SQL stuff), and I'd like it to be processed with 100 partitions. I've tried setting spark.default.parallelism to 100, and we have also tried changing the compression of the parquet to none (from gzip). No matter what we do, the first stage of the Spark job has only a single partition (once a shuffle occurs it gets repartitioned into 100, and thereafter things are obviously much, much faster).

Now, according to a few sources (like the one below), parquet should be splittable (even when using gzip!), so I'm super confused and would love some advice.

https://www.safaribooksonline.com/library/view/hadoop-application-architectures/9781491910313/ch01.html

I'm using Spark 1.0.0, and apparently the default value for spark.sql.shuffle.partitions is 200, so it can't be that. In fact, all the defaults for parallelism are much greater than 1, so I don't understand what's going on.
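
For reference, here's a minimal sketch of the behaviour, assuming a Spark 1.x spark-shell (sc is the shell's predefined SparkContext; the file path is made up):

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

val table = sqlContext.parquetFile("the-table.parquet")  // hypothetical single gzip-compressed parquet file
table.partitions.length                                  // 1 -- the whole file is read as a single partition

val spread = table.repartition(100)                      // after an explicit shuffle there are 100 partitions
spread.partitions.length                                 // 100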

Recommended answer

Maybe your parquet file only takes up one HDFS block. Create a big parquet file that has many HDFS blocks and load it:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)  // parquetFile is a method of SQLContext, not SparkContext
val k = sqlContext.parquetFile("the-big-table.parquet")
k.partitions.length

You'll see the same number of partitions as HDFS blocks. This worked fine for me (Spark 1.1.0).
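
As an illustration (a sketch only, assuming a Spark 1.1.x spark-shell; the record shape, row count, and output path are all made up), one way to get such a file is to write data out through many partitions and then reload it:

case class Record(id: Long, payload: String)

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD                 // implicit conversion RDD[Record] -> SchemaRDD

// 10 million rows across 100 partitions; adjust the count so the output
// comfortably exceeds one HDFS block after compression
val rows = sc.parallelize(1L to 10000000L, 100).map(i => Record(i, s"payload-$i"))
rows.saveAsParquetFile("the-big-table.parquet")   // written as one parquet part-file per partition

val k = sqlContext.parquetFile("the-big-table.parquet")
k.partitions.length                               // roughly one partition per HDFS block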
