Spark - How to do computation on N partitions and then write to 1 file

Problem description

I would like to do a computation on many partitions, to benefit from the parallelism, and then write my results to a single file, probably a parquet file. The workflow I tried in PySpark 1.6.0 was something like:

# Read the source parquet file into a DataFrame
data_df = sqlContext.read.load('my_parquet_file')
# Apply a per-row transformation and rebuild a DataFrame with columns c1 and c2
mapped_df = sqlContext.createDataFrame(data_df.map(lambda row: changeRow(row)), ['c1', 'c2'])
# Collapse to a single partition so the output is written as one file
coalesced_df = mapped_df.coalesce(1)
coalesced_df.write.parquet('new_parquet_file')

but it appears from looking at Spark's web UI that all of the work, including the map part, is happening on a single thread.

Is there a way to tweak this so that the map happens on many partitions while the write happens only on 1? The only thing I've tried that I think worked was putting a mapped_df.count() between the map and the coalesce, but that doesn't feel like a satisfying way of doing it.
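For reference, the workaround described above amounts to something like the following sketch, reusing the names from the snippet above (whether the extra action reliably preserves the parallel map is only the asker's observation):

# Extra action intended to force the map to run as its own parallel job
mapped_df.count()
coalesced_df = mapped_df.coalesce(1)
coalesced_df.write.parquet('new_parquet_file')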

Recommended answer

You want to use "repartition(1)" instead of "coalesce(1)". The issue is that "repartition" will happily do shuffling to accomplish its ends, while "coalesce" will not.

"Coalesce"比"repartition"要有效得多,但是必须谨慎使用,否则并行性最终会受到严重限制.所有合并"的分区合并成一个特定的结果分区,必须驻留在同一节点上. "coalesce(1)"调用需要单个结果分区,因此"mapped_df"的所有分区都必须位于单个节点上.为了实现这一目标,Spark鞋拔将"mapped_df"放到一个分区中.

"Coalesce" is much more efficient than "repartition", but has to be used carefully, or parallelism will end up being severely constrained as you have experienced. All the partitions "coalesce" merges into a particular result partition have to reside on the same node. The "coalesce(1)" call demands a single result partition, so all partitions of "mapped_df" need to reside on a single node. To make that true, Spark shoehorns "mapped_df" into a single partition.
