Spark - How to do computation on N partitions and then write to 1 file
Question
I would like to run a computation across many partitions, to benefit from the parallelism, and then write my results to a single file, probably a Parquet file. The workflow I tried in PySpark 1.6.0 was something like:
# read, transform row by row (changeRow is my own function), then try to write one file
data_df = sqlContext.read.load('my_parquet_file')
mapped_df = sqlContext.createDataFrame(data_df.map(lambda row: changeRow(row)), ['c1', 'c2'])
coalesced_df = mapped_df.coalesce(1)
coalesced_df.write.parquet('new_parquet_file')
But it appears from looking at Spark's web UI that all of the work, including the map part, is happening on a single thread.
Is there a way to tweak this so that the map happens on many partitions while the write happens on only one? The only thing I've tried that seemed to work was putting a mapped_df.count() between the map and the coalesce, but that doesn't feel like a satisfying way of doing it.
Answer
You want to use "repartition(1)" instead of "coalesce(1)". The issue is that "repartition" will happily shuffle data to accomplish its ends, while "coalesce" will not.
"Coalesce"比"repartition"要有效得多,但是必须谨慎使用,否则并行性最终会受到严重限制.所有合并"的分区合并成一个特定的结果分区,必须驻留在同一节点上. "coalesce(1)"调用需要单个结果分区,因此"mapped_df"的所有分区都必须位于单个节点上.为了实现这一目标,Spark鞋拔将"mapped_df"放到一个分区中.
"Coalesce" is much more efficient than "repartition", but has to be used carefully, or parallelism will end up being severely constrained as you have experienced. All the partitions "coalesce" merges into a particular result partition have to reside on the same node. The "coalesce(1)" call demands a single result partition, so all partitions of "mapped_df" need to reside on a single node. To make that true, Spark shoehorns "mapped_df" into a single partition.