PySpark Number of Output Files


Problem Description

I am a Spark newbie. I have a simple PySpark script. It reads a JSON file, flattens it, and writes it to an S3 location as a compressed Parquet file.

The read and transformation steps run very fast and use 50 executors (which I set in the conf). But the write stage takes a long time and writes only one large file (480MB).

How is the number of files saved decided? Can the write operation be sped up somehow?

Thanks, Ram.

Solution

The number of output files is equal to the number of partitions of the RDD being saved. In this sample, the RDD is repartitioned to control the number of output files.

Try:

repartition(numPartitions) - Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network.

>>> dataRDD = sc.textFile("/user/cloudera/sqoop_import/orders")
>>> dataRDD.repartition(2).saveAsTextFile("/user/cloudera/sqoop_import/orders_test")


The number of files output is the same as the number of partitions of the RDD.

$ hadoop fs -ls /user/cloudera/sqoop_import/orders_test
Found 3 items
-rw-r--r--   1 cloudera cloudera          0 2016-12-28 12:52 /user/cloudera/sqoop_import/orders_test/_SUCCESS
-rw-r--r--   1 cloudera cloudera    1499519 2016-12-28 12:52 /user/cloudera/sqoop_import/orders_test/part-00000
-rw-r--r--   1 cloudera cloudera    1500425 2016-12-28 12:52 /user/cloudera/sqoop_import/orders_test/part-00001

Also check the following: coalesce(numPartitions)

Source 1 | Source 2
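
For completeness, a minimal sketch of coalesce; the output path is a hypothetical example, and sc is assumed to be an existing SparkContext, as in the snippets above.

dataRDD = sc.textFile("/user/cloudera/sqoop_import/orders")
# coalesce(numPartitions) reduces the number of partitions without a full shuffle,
# which is cheaper than repartition() when you only need fewer output files.
dataRDD.coalesce(2).saveAsTextFile("/user/cloudera/sqoop_import/orders_coalesced")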

Update:

The textFile method also takes an optional second argument for controlling the number of partitions of the file. By default, Spark creates one partition for each block of the file (blocks being 64MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Note that you cannot have fewer partitions than blocks.
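
A minimal sketch of that second argument; the value 4 is only illustrative.

dataRDD = sc.textFile("/user/cloudera/sqoop_import/orders", 4)
# Prints at least 4, depending on how the input splits into blocks.
print(dataRDD.getNumPartitions())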

However, this is only the minimum number of partitions, so the exact count is not guaranteed.

So if you want to control the partitioning on read, you should use this:

dataRDD=sc.textFile("/user/cloudera/sqoop_import/orders").repartition(2)
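
Applied to the original question (a Parquet write to S3), the same idea looks like this with the DataFrame API. A minimal sketch: the bucket paths, the omitted flattening step, and the choice of 10 partitions are illustrative assumptions, not taken from the original script.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("flatten-and-write").getOrCreate()

# Hypothetical input path; the actual flattening logic is omitted here.
df = spark.read.json("s3a://my-bucket/input/data.json")

# Repartitioning before the write controls the number of output part-files:
# e.g. 10 partitions -> roughly 10 files of ~48MB each for a 480MB dataset.
df.repartition(10).write.mode("overwrite").parquet("s3a://my-bucket/output/flattened/")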

