Spark - How to do computation on N partitions and then write to 1 file

Problem description

I would like to do a computation on many partitions, to benefit from the parallelism, and then write my results to a single file, probably a parquet file. The workflow I tried in PySpark 1.6.0 was something like:

# Read the source parquet file into a DataFrame
data_df = sqlContext.read.load('my_parquet_file')
# Apply a per-row transformation and rebuild a DataFrame with columns c1 and c2
mapped_df = sqlContext.createDataFrame(data_df.map(lambda row: changeRow(row)), ['c1', 'c2'])
# Collapse to a single partition so the output is written as one file
coalesced_df = mapped_df.coalesce(1)
coalesced_df.write.parquet('new_parquet_file')

but it appears from looking at Spark's web UI that all of the work, including the map part, is happening on a single thread.

Is there a way to tweak this so that the map happens on many partitions while the write happens only on 1? The only thing I've tried that I think worked was putting a mapped_df.count() between the map and the coalesce, but that doesn't feel like a satisfying way of doing it.
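For reference, the workaround described above amounts to something like the following sketch, reusing the names from the snippet above (whether the extra action reliably preserves the parallel map is only the asker's observation):

# Extra action intended to force the map to run as its own parallel job
mapped_df.count()
coalesced_df = mapped_df.coalesce(1)
coalesced_df.write.parquet('new_parquet_file')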

Recommended answer

You want to use "repartition(1)" instead of "coalesce(1)". The issue is that "repartition" will happily do shuffling to accomplish its ends, while "coalesce" will not.

"Coalesce"比"repartition"要有效得多,但是必须谨慎使用,否则并行性最终会受到严重限制.所有合并"的分区合并成一个特定的结果分区,必须驻留在同一节点上. "coalesce(1)"调用需要单个结果分区,因此"mapped_df"的所有分区都必须位于单个节点上.为了实现这一目标,Spark鞋拔将"mapped_df"放到一个分区中.

"Coalesce" is much more efficient than "repartition", but has to be used carefully, or parallelism will end up being severely constrained as you have experienced. All the partitions "coalesce" merges into a particular result partition have to reside on the same node. The "coalesce(1)" call demands a single result partition, so all partitions of "mapped_df" need to reside on a single node. To make that true, Spark shoehorns "mapped_df" into a single partition.
